The matching logic is a complex set of algorithms that have been developed and refined by CareEvolution over the past 15+ years. These algorithms have been validated to ensure appropriate sensitivity and specificity against production data in a variety of settings.
To determine whether two given records are a match, the BMPI performs several steps:
These steps are discussed in the following sections.
The BMPI is designed to minimize false positives (where two records are erroneously linked when they truly represent different individuals) by only linking records when there is high confidence of a match. This is because false positives are generally considered to have a greater impact than false negatives (where two records from the same individual are not linked). For clinicians, dealing with incomplete data is the norm; dealing with wrong data is more likely to lead to adverse clinical outcomes.
The BMPI performs some pre-processing of demographic fields to account for common variations among input sources. This includes:
The BMPI will also ignore certain known invalid values and placeholders, treating them as missing fields. See Ignored Values for details.
Before storing a record, the BMPI performs an irreversible hash operation on the demographic data. The BMPI only stores and performs match operations with the hashed values, not the original values. This is the “blinded” nature of the BMPI.
The CareEvolution BMPI uses only FIPS-compliant hash functions (HMAC/SHA256), with a unique client-specific key to ensure privacy and guard against brute-force or dictionary attacks. For more information, see How the BMPI Secures Data.
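As a rough illustration of keyed hashing, the sketch below blinds a value with HMAC-SHA256 from Python's standard library. The function name `blind`, the sample keys, and the trim-and-lowercase normalization are assumptions made for this example; they are not taken from the BMPI implementation.

```python
import hashlib
import hmac


def blind(value: str, client_key: bytes) -> str:
    """Return a keyed, one-way hash of a demographic value as a hex string."""
    # Assumed normalization, for illustration only.
    normalized = value.strip().lower()
    return hmac.new(client_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical keys; a real deployment would use a secret, client-specific key.
key_a = b"hypothetical-client-key-A"
key_b = b"hypothetical-client-key-B"

same_key = blind("Daniel", key_a) == blind(" daniel", key_a)
different_key = blind("Daniel", key_a) == blind("Daniel", key_b)

print(same_key)       # True: same value and key give the same hash
print(different_key)  # False: a different key gives an unrelated hash
```

Because the key is part of the hash, someone who obtains the hashed values but not the key cannot confirm a guess by hashing a dictionary of common names.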
For example, consider the first names of several records. Although hashed values are actually alphanumeric (e.g., “9f86d081884c…”), emoji are used in these examples for clarity.
| | Record A | Record B | Record C |
|---|---|---|---|
| Original Value | Daniel | Gio | Dan |
| Known Variants | Dan, Danny | | Daniel, Danny |
| Hashed Value | 🍎 | 🍇 | 🍑 |
| Hashed Variants | 🍑, 🍐 | | 🍎, 🍐 |
The BMPI never stores the original values or variants, only the hashed ones. Yet even with the hashed values, the BMPI can determine that Records A and C potentially have matching first names (A’s original value and C’s variant are both 🍎), while Record B has no values in common with either A or C. The BMPI will consider every available field in this manner before reaching its final conclusion about whether two records are a match.
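The comparison in this example can be sketched in a few lines. This is a simplified illustration, not the BMPI's code: the helper names (`blind`, `blinded_name`, `share_a_value`), the sample key, and the way variants are supplied are all assumptions.

```python
import hashlib
import hmac

KEY = b"hypothetical-client-key"  # assumed key, for illustration only


def blind(value: str) -> str:
    """Keyed one-way hash of a single value."""
    return hmac.new(KEY, value.lower().encode("utf-8"), hashlib.sha256).hexdigest()


def blinded_name(original: str, variants: list[str]) -> dict:
    """Keep only hashes: one for the original value, one per known variant."""
    return {"value": blind(original), "variants": {blind(v) for v in variants}}


def share_a_value(a: dict, b: dict) -> bool:
    """True if any hash (original or variant) appears in both records."""
    return bool(({a["value"]} | a["variants"]) & ({b["value"]} | b["variants"]))


record_a = blinded_name("Daniel", ["Dan", "Danny"])
record_b = blinded_name("Gio", [])
record_c = blinded_name("Dan", ["Daniel", "Danny"])

print(share_a_value(record_a, record_c))  # True
print(share_a_value(record_a, record_b))  # False
```

Only hashes are compared, so the overlap between A and C is found without either original name being available at comparison time.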
When determining if two records are a match, the BMPI will compare the hashed values of every available demographic field. For each field individually, it will consider:
Then the BMPI will look at the overall status of all the fields combined to reach a final conclusion as to whether the records match.
The matching algorithm uses a combination of deterministic matching (the weighting system) and probabilistic matching (the statistical analysis of the overall confidence) to determine the final match status.
When the BMPI compares two demographic fields, it determines how closely each one matches. This gives a similarity status, which is one of:
Similarity Status | Example |
---|---|
Exact Match | Both records have a middle name of “Abigail.” |
Approximate Match | One record has a middle name of “Abby” and the other has “Abigail.” |
Data Missing | One record has a middle name of “A” and the other has no middle name. |
Disagreement | One record has a middle name of “A” and the other has a middle name of “T.” |
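One way to picture the four statuses is as a small classification function. The sketch below is an assumption about how such a classification could work: each field is either missing (`None`) or a pair of a hashed value and a set of hashed variants, and the short string tokens stand in for real hashes.

```python
from enum import Enum
from typing import Optional


class Similarity(Enum):
    EXACT_MATCH = "Exact Match"
    APPROXIMATE_MATCH = "Approximate Match"
    DATA_MISSING = "Data Missing"
    DISAGREEMENT = "Disagreement"


# A field is None when missing, else (hashed value, hashed variants).
Field = Optional[tuple[str, frozenset]]


def compare(a: Field, b: Field) -> Similarity:
    """Classify one pair of fields (hypothetical logic, for illustration)."""
    if a is None or b is None:
        return Similarity.DATA_MISSING
    a_value, a_variants = a
    b_value, b_variants = b
    if a_value == b_value:
        return Similarity.EXACT_MATCH
    if ({a_value} | a_variants) & ({b_value} | b_variants):
        return Similarity.APPROXIMATE_MATCH
    return Similarity.DISAGREEMENT


# Tokens such as "h:abigail" stand in for hashed values.
abigail = ("h:abigail", frozenset({"h:abby"}))
abby = ("h:abby", frozenset({"h:abigail"}))
a_initial = ("h:a", frozenset())
t_initial = ("h:t", frozenset())

print(compare(abigail, abigail).value)      # Exact Match
print(compare(abby, abigail).value)         # Approximate Match
print(compare(a_initial, None).value)       # Data Missing
print(compare(a_initial, t_initial).value)  # Disagreement
```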
The individual match status results are then combined into an overall score. When doing so, some demographics are weighted more heavily than others.
Demographic | Weight |
---|---|
Government Identifier (SSN, Medicaid ID) | High |
First and Last Name, Date of Birth, Gender, Health Card ID, Address | Medium |
Middle Name, Email | Low |
Agreement or disagreement in a high-weight field influences the match decision more than agreement or disagreement in a low-weight field.
The BMPI takes into account several factors when weighting demographics:
Once the BMPI has established the similarity and weight of each individual field, it looks at the entire record as a whole to determine whether the two records match.
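To make the idea of combining similarity and weight concrete, here is a deliberately simple additive sketch. The numeric weights, the per-status factors, and the additive formula are all invented for illustration; the actual BMPI weights and statistical analysis are not published here, and this sketch ignores any special rules the BMPI applies.

```python
# Hypothetical numeric values for the High / Medium / Low weight classes.
WEIGHTS = {"High": 3.0, "Medium": 2.0, "Low": 1.0}

# Weight class per field, following the weight table above.
FIELD_WEIGHT = {
    "ssn": "High",
    "medicaid_id": "High",
    "first_name": "Medium",
    "last_name": "Medium",
    "date_of_birth": "Medium",
    "gender": "Medium",
    "health_card_id": "Medium",
    "address": "Medium",
    "middle_name": "Low",
    "email": "Low",
}

# Hypothetical contribution of each similarity status.
FACTOR = {
    "Exact Match": 1.0,
    "Approximate Match": 0.5,
    "Data Missing": 0.0,   # missing data neither helps nor hurts
    "Disagreement": -1.0,
}


def overall_score(statuses: dict[str, str]) -> float:
    """Sum weight * factor over every compared field."""
    return sum(WEIGHTS[FIELD_WEIGHT[field]] * FACTOR[status]
               for field, status in statuses.items())


conflicting = {
    "first_name": "Disagreement",
    "last_name": "Disagreement",
    "gender": "Disagreement",
    "ssn": "Exact Match",
}
corroborated = {
    "first_name": "Exact Match",
    "middle_name": "Exact Match",
    "last_name": "Disagreement",
    "date_of_birth": "Exact Match",
    "medicaid_id": "Exact Match",
}

print(overall_score(conflicting))   # -3.0
print(overall_score(corroborated))  # 6.0
```

Even in this toy model, a single agreeing high-weight field does not outweigh several disagreeing medium-weight fields, while one disagreement can be absorbed when most other fields agree.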
Imagine the two records as a mosaic, with the size of each block representing the weight of the field and the color representing its hashed value. The BMPI considers the overall similarity of the images. Note that the block sizes are merely illustrative, and should not be considered an actual representation of BMPI weights.
Some things to notice:
For more information about how different combinations of demographics may perform, see Demographic Combinations.
The following examples illustrate how the BMPI considers the above factors in determining a match.
Consider two records:
First Name | Last Name | Gender | SSN |
---|---|---|---|
John | Smith | Male | 123-45-6789 |
Mary | Johnson | Female | 123-45-6789 |
Disagreement | Disagreement | Disagreement | Exact Match |
This would not be considered a match. The disagreement in several medium-weight fields (first and last name, gender) overrides the exactly matching SSN.
Consider a pair of records:
First Name | Middle Name | Last Name | Date of Birth | Medicaid ID |
---|---|---|---|---|
Mary | Jane | Johnson | 12/1/1956 | 123456 |
Mary | Jane | Smith | 12/1/1956 | 123456 |
Exact Match | Exact Match | Disagreement | Exact Match | Exact Match |
This would be considered a match. The exact match on the high-weight Medicaid ID, coupled with agreement in first name, middle name, and date of birth, is enough to overcome the disagreement in last name (possibly due to marriage).
Consider this record pair:
First Name | Last Name | Date of Birth | SSN | Health Card ID |
---|---|---|---|---|
John | Smith | 12/1/1956 | 123-45-6789 | A1234 |
Jon | Smith | 12/1/1956 | Not Provided | A1234 |
Approximate Match | Exact Match | Exact Match | Data Missing | Exact Match |
This would be considered a match. Notice that the absence of an SSN in the second record is not considered a disagreement. The preponderance of exact matches across medium-weight fields is enough to constitute a match even without the SSN.
Consider this record pair:
First Name | Last Name | Date of Birth | Gender | Address |
---|---|---|---|---|
John | Smith | 12/1/1956 | Male | 123 Main Street, Anytown MI, 12345 |
John | Smith | 12/1/1956 | Male | 123 Main Street, Anytown MI, 12345 |
Exact Match | Exact Match | Exact Match | Exact Match | Exact Match |
This would be considered a match even though no high-weight identifiers are provided. Name, date of birth, and gender are insufficient for a match on their own, but exact matches among multiple fields can provide enough corroboration to make a match.
The BMPI will ignore certain values that are commonly entered into records as placeholders when real information is not available. For example:
This is not an all-inclusive list, merely examples.
Values like these are not effective for linking records. They are unlikely to match other records from the same individual, and may provide erroneous matching weight towards records with similar placeholders from different individuals. As such, the BMPI will simply ignore these values and treat them as if the field were missing.
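Treating placeholders as missing can be sketched as a small cleaning step that runs before any comparison. The placeholder set below is hypothetical and chosen only for illustration; it is not the BMPI's actual list of ignored values.

```python
from typing import Optional

# Hypothetical placeholder values, for illustration only.
PLACEHOLDERS = {"unknown", "n/a", "none", "000-00-0000"}


def clean(value: Optional[str]) -> Optional[str]:
    """Return None for empty or placeholder input, else the trimmed value."""
    if value is None:
        return None
    stripped = value.strip()
    if not stripped or stripped.lower() in PLACEHOLDERS:
        return None
    return stripped


print(clean("UNKNOWN"))      # None
print(clean("000-00-0000"))  # None
print(clean(" Abigail "))    # Abigail
```

Returning `None` means a placeholder is handled exactly like a field that was never supplied, so it adds no weight for or against a match.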
One specific case deserves special mention in the matching algorithm: The BMPI does not consider name, date of birth, and gender sufficient for a match. While it may seem unlikely for two individuals to share exactly the same first and last name, date of birth, and gender, it actually happens with some frequency in large population sets.
For example, in Perspectives on Patient Matching: Approaches, Findings, and Challenges the authors report:
“The Social Security Administration’s death master file contains more than 20,000 distinct males named “William Smith,” 5 of whom were born on January 20, 1920 (Social Security Administration, 2001). If name, gender, and date of birth are the only matching criteria used, these individuals may be falsely linked.”
Another report prepared for the Office of the National Coordinator for Health Information Technology found that:
“Many regions experience a high number of individuals who share the exact name and birthdate, leading to the need for additional identifying attributes to be used when matching patient records.”
Additional factors that can cause shared name/date of birth/gender include:
For these reasons, you must always supplement name/DOB/gender with at least one other field that specifically identifies the individual.
Non-specific fields like zip code or city are insufficient supplements to name/DOB/gender. The distribution of names is highly influenced by cultural factors, and is neither uniform nor random. An ethnic neighborhood in a big city may have several individuals with the same name, some of whom may even share a birthday.
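The rule can be expressed as a simple guard: agreement on name, date of birth, gender, and non-specific location fields is never enough by itself. The field names and the `NON_SPECIFIC` set in this sketch are assumptions for illustration, not the BMPI's internal representation.

```python
# Fields that do not identify a specific individual by themselves
# (hypothetical grouping, based on the rule described above).
NON_SPECIFIC = {
    "first_name", "last_name", "date_of_birth", "gender",
    "city", "state", "zip",
}


def has_specific_corroboration(agreeing_fields: set[str]) -> bool:
    """True if at least one agreeing field lies outside the non-specific set."""
    return bool(agreeing_fields - NON_SPECIFIC)


name_dob_gender = {"first_name", "last_name", "date_of_birth", "gender"}

print(has_specific_corroboration(name_dob_gender))                     # False
print(has_specific_corroboration(name_dob_gender | {"zip"}))           # False
print(has_specific_corroboration(name_dob_gender | {"medicaid_id"}))   # True
```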
If the BMPI receives two different records from the same source with matching demographics, it will link them.
In an ideal world, each source system would maintain only one record for a given individual. In reality, incomplete data and poor matching algorithms within EMR systems make duplicate records a pervasive problem.
The BMPI considers the source system to be the authority for its own records. Every time you upload data for a given source system/identifier, the BMPI will completely replace the existing record with the new data, then re-evaluate the validity of any existing record links. The response will include information about any updated matches.
This is true even if the new data is very different from the original. For example, CommunityClinic record A12345 may have originally been uploaded as “John Smith, male, born 1/1/1952” and then updated to “Jane Jones, female, born 2/2/1980”. The BMPI will overwrite John’s data with Jane’s, and records previously linked to John’s will no longer match the updated record (though they may still match each other).
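The replace-then-re-evaluate behavior can be sketched with a dictionary keyed by source system and identifier. Everything here is a simplified assumption: `upload`, `toy_match`, and the record layout are hypothetical, the tokens stand in for hashed values, and the toy matcher (equal SSN tokens) is a stand-in for the real matching logic.

```python
Key = tuple[str, str]  # (source system, source identifier)
records: dict[Key, dict] = {}


def toy_match(a: dict, b: dict) -> bool:
    """Stand-in matcher: two records match if their SSN tokens are equal."""
    return a.get("ssn") is not None and a.get("ssn") == b.get("ssn")


def upload(source: str, identifier: str, demographics: dict) -> set[Key]:
    """Replace the stored record entirely, then recompute its links."""
    key = (source, identifier)
    records[key] = dict(demographics)  # full replacement, no merging
    return {other for other, rec in records.items()
            if other != key and toy_match(records[key], rec)}


upload("OtherHospital", "X1", {"first_name": "John", "ssn": "token-111"})
first = upload("CommunityClinic", "A12345", {"first_name": "John", "ssn": "token-111"})
second = upload("CommunityClinic", "A12345", {"first_name": "Jane", "ssn": "token-222"})

print(first)   # {('OtherHospital', 'X1')}
print(second)  # set()
```

After the second upload, nothing of the earlier data remains under `("CommunityClinic", "A12345")`, and the link to the other record is dropped because the replacement no longer matches it.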
Below are some common combinations of demographics and their expected matching sensitivity.
| Identifiers | Sensitivity | Notes |
|---|---|---|
| | | While it may seem unlikely for two individuals to share exactly the same first and last name, date of birth, and gender, it actually happens with some frequency. This data alone is insufficient for a match. For more information, see Name/Gender/DOB Matching. |
| | | As in the previous case, name/gender/DOB is not sufficient for a match. The addition of a non-specific identifier doesn't help. For more information, see Name/Gender/DOB Matching. |
| | | MRNs can be useful in certain intra-hospital situations, but their matching value is limited because they are generally not common across systems. |
| | | This combination will give decent results, but can have sensitivity issues when people move or change telephone numbers. It can also be sensitive to name changes due to marriage. Certain communal addresses (rehab centers, skilled nursing, homeless shelters, etc.) can create ambiguity. |
| | | This combination gives good results, though insurance identifiers can change and Medicaid ID is not always available. |
| | | This combination gives excellent results. Providing additional fields will improve results further, compensating for any gaps in individual records. |
The BMPI takes into account many factors when weighting demographics. The table below summarizes the identifiers based on their specificity, availability, and stability:
Identifier | Specificity | Availability | Stability | Weight |
---|---|---|---|---|
SSN | High | Medium | High | |
SSN (last 4 digits only) | Medium | High | High | |
Medicaid ID | High | Low | High | |
Health Card ID | High | Medium | Medium | |
First Name | Low | High | High | [1] |
Last Name | Low | High | Medium | [1] |
Date of Birth | Low | High | High | [1] |
Street | Medium | High | Medium | [2] |
City | Low | High | Medium | [2] |
State | Low | High | Medium | [2] |
Zip | Low | High | Medium | [2] |
Gender | Low | High | High | |
Home Phone | Medium | Medium | Medium | |
Cell Phone | High | Medium | Medium | |
Race | Low | Medium | High | |
Middle Name | Low | Medium | Medium | |
Maiden Name | Low | Low | Medium | |
Email | Medium | Low | Medium | |
MRN | High | High | Low | |