Understanding Matching

The matching logic is a complex set of algorithms that have been developed and refined by CareEvolution over the past 15+ years. These algorithms have been validated to ensure appropriate sensitivity and specificity against production data in a variety of settings.

Overview

To determine whether two given records are a match, the BMPI performs several steps:

  1. Account for common variations in demographics.
  2. Perform an irreversible hash on the demographics.
  3. Perform a weighted comparison of demographic fields to establish whether the records are a match.

These steps are discussed in the following sections.

The BMPI is designed to minimize false positives (where two records are erroneously linked when they truly represent different individuals) by only linking records when there is high confidence of a match. This is because false positives are generally considered to have a greater impact than false negatives (where two records from the same individual are not linked). For clinicians, dealing with incomplete data is the norm; dealing with wrong data is more likely to lead to adverse clinical outcomes.

Standardizing and Variations

The BMPI performs some pre-processing of demographic fields to account for common variations among input sources. This includes:

  1. Transforming data to a common format, such as normalizing case and using standard formats. For example: “John” and “JOHN” would both be normalized to “john”; “888-555-1234” and “(888)555-1234” would both be normalized to “8885551234”.
  2. Determining common variations, such as nicknames, typographical errors, and transpositions. For example: “Abby” for “Abigail”; a birthday of May 1st erroneously entered as 2001-01-05 (month and day transposed); an SSN that’s off by 1 digit.
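
The two pre-processing steps above can be sketched in Python roughly as follows. This is only an illustration: the function names and the tiny nickname table are hypothetical, and the BMPI's actual normalization rules and variant lists are not described in this document.

```python
import re
from datetime import date

# Hypothetical nickname table, for illustration only.
NICKNAMES = {"abigail": {"abby"}, "daniel": {"dan", "danny"}}


def normalize_name(value):
    """Trim and lowercase, so "John" and "JOHN" both become "john"."""
    return value.strip().lower()


def normalize_phone(value):
    """Keep digits only, so "(888)555-1234" becomes "8885551234"."""
    return re.sub(r"\D", "", value)


def name_variants(value):
    """Return the other names in the same nickname group as the given name."""
    name = normalize_name(value)
    for formal, nicknames in NICKNAMES.items():
        group = {formal} | nicknames
        if name in group:
            return group - {name}
    return set()


def date_variants(value):
    """Return the month/day transposition of a date, when one exists."""
    if value.day <= 12 and value.day != value.month:
        return {date(value.year, value.day, value.month)}
    return set()
```

With these helpers, `name_variants("Abby")` yields `{"abigail"}`, and a date of May 1, 2001 yields January 5, 2001 as a transposition candidate.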

The BMPI will also ignore certain known invalid values and placeholders, treating them as missing fields. See Ignored Values for details.

Hashing

Before storing a record, the BMPI performs an irreversible hash operation on the demographic data. The BMPI only stores and performs match operations with the hashed values, not the original values. This is the “blinded” nature of the BMPI.

The CareEvolution BMPI uses only FIPS-compliant hash functions (HMAC/SHA256), with a unique client-specific key to ensure privacy and guard against brute force or dictionary attacks. For more information, see How the BMPI Secures Data.
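
As a rough illustration of a keyed, irreversible hash, the following Python sketch applies HMAC with SHA-256 using the standard library. The `blind` function name, the example keys, and the choice of UTF-8 encoding are assumptions made for this example; how the BMPI manages keys and encodes values is not covered here.

```python
import hashlib
import hmac


def blind(value, client_key):
    """Return the HMAC-SHA256 digest of a normalized value as a hex string."""
    return hmac.new(client_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Placeholder keys for illustration; real keys would be secret and client-specific.
key_one = b"example-key-for-client-one"
key_two = b"example-key-for-client-two"

same_key_a = blind("john", key_one)
same_key_b = blind("john", key_one)
other_key = blind("john", key_two)
```

The same value hashed with the same key always produces the same digest, which is what allows hashed values to be compared. A different key produces an unrelated digest, so a dictionary of common names hashed without the key is of no use to an attacker.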

For example, consider the first names of several records. Although hashed values are actually alphanumeric (e.g., “9f86d081884c…”), emoji are used in these examples for clarity.

 | Record A | Record B | Record C
Original Value | Daniel | Gio | Dan
Known Variants | Dan, Danny | | Daniel, Danny
Hashed Value | 🍎 | 🍇 | 🍑
Hashed Variants | 🍑, 🍐 | | 🍎, 🍐

The BMPI never stores the original values or variants, only the hashed ones. Yet even with the hashed values, the BMPI can determine that Records A and C potentially have matching first names (A’s original value and C’s variant are both 🍎), while Record B has no values in common with either A or C. The BMPI will consider every available field in this manner before reaching its final conclusion about whether two records are a match.
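
The comparison described above can be sketched as a set intersection over hashed values. In this Python example the key, the `blind` and `blinded_name` helpers, and the record layout are all hypothetical; it only shows that overlap can be detected without access to the original names.

```python
import hashlib
import hmac

KEY = b"example-client-key"  # placeholder key for illustration


def blind(value):
    """Keyed, irreversible hash of a normalized value."""
    return hmac.new(KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


def blinded_name(original, variants=()):
    """Hash a name and its known variants; only the hashes are kept."""
    return {"value": blind(original), "variants": {blind(v) for v in variants}}


def names_overlap(a, b):
    """True when the two hashed names share any value or variant."""
    return bool(({a["value"]} | a["variants"]) & ({b["value"]} | b["variants"]))


record_a = blinded_name("daniel", ["dan", "danny"])
record_b = blinded_name("gio")
record_c = blinded_name("dan", ["daniel", "danny"])

a_and_c = names_overlap(record_a, record_c)  # True: shared hashes
a_and_b = names_overlap(record_a, record_b)  # False: nothing in common
```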

Demographic Weighting and Match Confidence

When determining if two records are a match, the BMPI will compare the hashed values of every available demographic field. For each field individually, it will consider:

  • How similar the values are.
  • The field’s relative importance (or “weight”) in determining a match (e.g., name is more important than phone number).

Then the BMPI will look at the overall status of all the fields combined to reach a final conclusion as to whether the records match.

The matching algorithm uses a combination of deterministic matching (the weighting system) and probabilistic matching (the statistical analysis of the overall confidence) to determine the final match status.

Match Similarity

When the BMPI compares a demographic field across two records, it determines how closely the two values match. This gives a similarity status, which is one of:

Similarity Status | Example
Exact Match | Both records have a middle name of “Abigail.”
Approximate Match | One record has a middle name of “Abby” and the other has “Abigail.”
Data Missing | One record has a middle name of “A” and the other has no middle name.
Disagreement | One record has a middle name of “A” and the other has a middle name of “T.”
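
One way to picture how a similarity status could be derived from hashed values and hashed variants is the Python sketch below. The `Similarity` enum, the `compare_field` function, and the short strings standing in for hashes are hypothetical; the BMPI's actual comparison rules may differ.

```python
from enum import Enum


class Similarity(Enum):
    EXACT = "Exact Match"
    APPROXIMATE = "Approximate Match"
    MISSING = "Data Missing"
    DISAGREEMENT = "Disagreement"


def compare_field(a, b):
    """Compare two hashed fields; each is None or a dict of value and variants."""
    if a is None or b is None:
        return Similarity.MISSING
    if a["value"] == b["value"]:
        return Similarity.EXACT
    # A shared variant (for example, a nickname) counts as an approximate match.
    if ({a["value"]} | a["variants"]) & ({b["value"]} | b["variants"]):
        return Similarity.APPROXIMATE
    return Similarity.DISAGREEMENT


# Short opaque tokens stand in for real hashed values.
abigail = {"value": "h1", "variants": {"h2"}}
abby = {"value": "h2", "variants": {"h1"}}
initial_a = {"value": "h3", "variants": set()}
initial_t = {"value": "h4", "variants": set()}
```

Here `compare_field(abigail, abby)` gives an approximate match, while `compare_field(initial_a, None)` is treated as missing data rather than as a disagreement.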

Match Weight

The individual match status results are then combined into an overall score. When doing so, some demographics are weighted more heavily than others.

Demographic | Weight
Government Identifier (SSN, Medicaid ID) | High
First and Last Name, Date of Birth, Gender, Health Card ID, Address | Medium
Middle Name, Email | Low

Agreement or disagreement in a high-weight field says more about whether two records match than agreement or disagreement in a low-weight field.

The BMPI takes into account several factors when weighting demographics:

  • Specificity - Many people can share the same name or birthdate (or even both), but government IDs like SSN or Medicaid ID are intended to uniquely identify an individual. This is not a guarantee, of course. Government IDs can be incorrect due to dummy/placeholder SSNs, typographic errors, fraud, cases where a spouse or parent’s SSN was mistakenly added to a record for insurance reasons, etc. Nonetheless, records with the same SSN are likely to be from the same individual, while records with different SSNs are likely to be different people.
  • Reliability - Middle names are not as reliable as first/last name, due to common recording errors or cultural naming variations.
  • Stability - Some demographics are anticipated to change repeatedly over a person’s lifetime. Address and phone number can change often, and are commonly shared among different individuals (such as families or even homeless shelters). Even names can change due to marriage or legal name changes.

Overall Match Decision

Once the BMPI has established the similarity and weight of each individual field, it looks at the entire record as a whole to determine whether the two records match.

Imagine the two records as a mosaic, with the size of each block representing the weight of the field and the color representing its hashed value. The BMPI considers the overall similarity of the images. Note that the block sizes are merely illustrative, and should not be considered an actual representation of BMPI weights.

Some things to notice:

  • The record on the left has more fields to work with, giving it more opportunities to match with the record on the right.
  • All fields are considered. Address gives important corroboration to help conclude that these are indeed the same person.
  • The images do not need to match exactly. It is the overall similarity (based on probability analysis) that determines whether two records match.

For more information about how different combinations of demographics may perform, see Demographic Combinations.

Examples

The following examples illustrate how the BMPI considers the above factors in determining a match.

Match Example 1

Consider two records:

First Name | Last Name | Gender | SSN
John | Smith | Male | 123-45-6789
Mary | Johnson | Female | 123-45-6789
Disagreement | Disagreement | Disagreement | Exact Match

This would not be considered a match. The disagreement in several medium-weight fields (first and last name, gender) overrides the exactly matching SSN.

Match Example 2

Consider a pair of records:

First Name | Middle Name | Last Name | Date of Birth | Medicaid ID
Mary | Jane | Johnson | 12/1/1956 | 123456
Mary | Jane | Smith | 12/1/1956 | 123456
Exact Match | Exact Match | Disagreement | Exact Match | Exact Match

This would be considered a match. The exact match on the high-weight Medicaid ID, coupled with the agreement in first name, middle name, and date of birth, is enough to overcome the disagreement in last name (possibly due to marriage).

Match Example 3

Consider this record pair:

First Name | Last Name | Date of Birth | SSN | Health Card ID
John | Smith | 12/1/1956 | 123-45-6789 | A1234
Jon | Smith | 12/1/1956 | Not Provided | A1234
Approximate Match | Exact Match | Exact Match | No Disagreement | Exact Match

This would be considered a match. Notice that the absence of an SSN in the second record is not considered a disagreement. The preponderance of exact matches across medium-weight fields is enough to constitute a match even without the SSN.

Match Example 4

Consider this record pair:

First Name | Last Name | Gender | Date of Birth | Address
John | Smith | Male | 12/1/1956 | 123 Main Street, Anytown MI, 12345
John | Smith | Male | 12/1/1956 | 123 Main Street, Anytown MI, 12345
Exact Match | Exact Match | Exact Match | Exact Match | Exact Match

This would be considered a match even though no high-weight identifiers are provided. Name, date of birth, and gender are insufficient for a match on their own, but exact matches among multiple fields can provide enough corroboration to make a match.
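
To make the interplay of similarity and weight concrete, the Python sketch below scores the four examples with a deliberately simplified rule. Everything numeric here is an assumption: the values standing in for High, Medium, and Low, the similarity factors, the threshold, and the corroboration rule are invented for illustration and are not the BMPI's actual weights or its probabilistic analysis.

```python
# Hypothetical numeric stand-ins for the High / Medium / Low tiers.
HIGH, MEDIUM, LOW = 5, 2, 1
WEIGHTS = {
    "ssn": HIGH, "medicaid_id": HIGH,
    "first_name": MEDIUM, "last_name": MEDIUM, "date_of_birth": MEDIUM,
    "gender": MEDIUM, "health_card_id": MEDIUM, "address": MEDIUM,
    "middle_name": LOW, "email": LOW,
}
# Hypothetical contribution of each similarity status; missing data is neutral.
FACTORS = {"exact": 1.0, "approximate": 0.5, "missing": 0.0, "disagreement": -1.0}
# Names, date of birth, and gender cannot produce a match on their own.
NAME_DOB_GENDER = {"first_name", "middle_name", "last_name", "date_of_birth", "gender"}
THRESHOLD = 6.0  # assumed cut-off for this sketch


def score(statuses):
    """Sum each field's weight scaled by its similarity status."""
    return sum(WEIGHTS[field] * FACTORS[status] for field, status in statuses.items())


def is_match(statuses):
    """Require a high enough score plus agreement outside name/DOB/gender."""
    corroborated = any(
        status in ("exact", "approximate") and field not in NAME_DOB_GENDER
        for field, status in statuses.items()
    )
    return corroborated and score(statuses) >= THRESHOLD


example_1 = {"first_name": "disagreement", "last_name": "disagreement",
             "gender": "disagreement", "ssn": "exact"}
example_2 = {"first_name": "exact", "middle_name": "exact",
             "last_name": "disagreement", "date_of_birth": "exact",
             "medicaid_id": "exact"}
example_3 = {"first_name": "approximate", "last_name": "exact",
             "date_of_birth": "exact", "ssn": "missing", "health_card_id": "exact"}
example_4 = {"first_name": "exact", "last_name": "exact", "gender": "exact",
             "date_of_birth": "exact", "address": "exact"}

results = [is_match(e) for e in (example_1, example_2, example_3, example_4)]
```

Under these assumed numbers, Example 1 is rejected and Examples 2 through 4 are accepted, in line with the outcomes described above. The same four name, date of birth, and gender agreements without the address would be rejected by the corroboration rule.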

Invalid/Ignored Values

The BMPI will ignore certain values that are commonly entered into records as placeholders when real information is not available. For example:

  • SSNs like 123-45-6789 or 111-11-1111, as well as invalid SSNs such as those beginning with 000 or 999.
  • MRNs that are all a single digit (like 1111 or 2222222222).
  • DOBs like 1/1/1900.
  • First names like “Baby” or “Baby Boy.”
  • Phone numbers like (111)111-1111.

This is not an all-inclusive list, merely examples.

Values like these are not effective for linking records. They are unlikely to match other records from the same individual, and may provide erroneous matching weight towards records with similar placeholders from different individuals. As such, the BMPI will simply ignore these values and treat them as if the field were missing.
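
A simplified Python sketch of this filtering is shown below. The field names, the `is_placeholder` and `drop_placeholders` functions, and the specific checks are assumptions based only on the examples listed above; the BMPI's real list of ignored values is more extensive.

```python
import re


def _digits(value):
    """Strip everything except digits."""
    return re.sub(r"\D", "", value)


def is_placeholder(field, value):
    """Return True for values resembling the placeholder examples above."""
    text = value.strip().lower()
    if field == "ssn":
        digits = _digits(text)
        return digits in {"123456789", "111111111"} or digits.startswith(("000", "999"))
    if field == "mrn":
        # All characters are the same digit, e.g. 1111 or 2222222222.
        return text.isdigit() and len(set(text)) == 1
    if field == "date_of_birth":
        return text in {"1/1/1900", "1900-01-01"}
    if field == "first_name":
        return text in {"baby", "baby boy"}
    if field == "phone":
        return _digits(text) == "1111111111"
    return False


def drop_placeholders(record):
    """Treat placeholder values as if the field were missing."""
    return {f: v for f, v in record.items() if v and not is_placeholder(f, v)}


cleaned = drop_placeholders({
    "first_name": "Baby Boy",
    "last_name": "Smith",
    "ssn": "999-12-3456",
    "mrn": "2222222222",
    "date_of_birth": "1/1/1900",
    "phone": "(111)111-1111",
})
```

In this example only the last name survives, so the record would be compared as if every other field had been left blank.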

Name/DOB/Gender Matching

One specific case deserves special mention in the matching algorithm: The BMPI does not consider name, date of birth, and gender sufficient for a match. While it may seem unlikely for two individuals to share exactly the same first and last name, date of birth, and gender, it actually happens with some frequency in large population sets.

For example, in Perspectives on Patient Matching: Approaches, Findings, and Challenges the authors report:

“The Social Security Administration’s death master file contains more than 20,000 distinct males named “William Smith,” 5 of whom were born on January 20, 1920 (Social Security Administration, 2001). If name, gender, and date of birth are the only matching criteria used, these individuals may be falsely linked.”

Another report prepared for the Office of the National Coordinator for Health Information Technology found that:

“Many regions experience a high number of individuals who share the exact name and birthdate, leading to the need for additional identifying attributes to be used when matching patient records.”

Additional factors that can cause shared name/date of birth/gender include:

  • Naming trends (such as the prevalence of individuals born on 12/25 named “Noelle” or “Holly”, or the popularity of certain names each year).
  • Family members with similar demographics (e.g., twins with similar names).
  • Cultural factors (such as certain names being extremely common in some healthcare populations).

For these reasons, you must always supplement name/DOB/gender with at least one other field that specifically identifies the individual.

Non-specific fields like zip code or city are insufficient supplements to name/DOB/gender. The distribution of names is highly influenced by cultural factors, and is neither uniform nor random. An ethnic neighborhood in a big city may have several individuals with the same name, some of whom may even share a birthday.
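
A data submitter could check this requirement before uploading a record. The Python sketch below is one hypothetical way to do so: the field names and the choice of which fields count as identifying are assumptions made for this example, not a list published by the BMPI.

```python
# Assumed set of fields treated as identifying supplements in this sketch.
IDENTIFYING_FIELDS = {
    "ssn", "medicaid_id", "health_card_id",
    "full_address", "home_phone", "cell_phone",
}


def has_identifying_supplement(record):
    """True when the record carries at least one populated identifying field."""
    provided = {field for field, value in record.items() if value}
    return bool(provided & IDENTIFYING_FIELDS)


core = {"first_name": "William", "last_name": "Smith",
        "date_of_birth": "1920-01-20", "gender": "M"}

core_only = has_identifying_supplement(core)
with_zip = has_identifying_supplement({**core, "zip": "12345"})
with_ssn = has_identifying_supplement({**core, "ssn": "234-56-7890"})
```

Name, date of birth, and gender alone fail the check, and adding only a zip code does not change that. Adding a populated identifier such as an SSN passes it.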

Linking Within the Same Source (Duplicate Records)

If the BMPI receives two different records from the same source with matching demographics, it will link them.

In an ideal world, each source system would maintain only one record for a given individual. In reality, incomplete data and poor matching algorithms within EMR systems make duplicate records a pervasive problem.

Overwriting Records

The BMPI considers the source system to be the authority for its own records. Every time you upload data for a given source system/identifier, the BMPI will completely replace the existing record with the new data, then re-evaluate the validity of any existing record links. The response will include information about any updated matches.

This is true even if the new data is very different from the original. For example, CommunityClinic record A12345 may have originally been uploaded as “John Smith, male, born 1/1/1952” and then updated to “Jane Jones, female, born 2/2/1980”. The BMPI will overwrite John’s data with Jane’s, and records previously linked to John’s will no longer match the updated record (though they may still match each other).
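
The overwrite behavior can be pictured with a small in-memory model. This Python sketch is purely illustrative: the `RecordStore` class, its `upload` method, and the toy equality-based matcher are hypothetical, and it uses plain values where the BMPI would hold only hashed ones.

```python
class RecordStore:
    """Toy model: one record per (source, identifier), replaced on every upload."""

    def __init__(self, is_match):
        self.records = {}
        self.is_match = is_match  # caller-supplied comparison function

    def upload(self, source, identifier, demographics):
        key = (source, identifier)
        # Replace the stored record completely; nothing is merged from the old one.
        self.records[key] = dict(demographics)
        # Re-evaluate links against every other stored record.
        return sorted(
            other_key
            for other_key, other in self.records.items()
            if other_key != key and self.is_match(self.records[key], other)
        )


john = {"first_name": "John", "last_name": "Smith", "gender": "male", "dob": "1952-01-01"}
jane = {"first_name": "Jane", "last_name": "Jones", "gender": "female", "dob": "1980-02-02"}

store = RecordStore(is_match=lambda a, b: a == b)  # toy matcher for the sketch
store.upload("CommunityClinic", "A12345", john)
links_before = store.upload("Hospital", "H1", john)
links_after = store.upload("CommunityClinic", "A12345", jane)
```

After the second upload for A12345, the stored record contains only Jane's data and no longer links to the hospital record that matched John.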

Demographic Combinations

Below are some common combinations of demographics and their expected matching sensitivity.

Identifiers | Sensitivity | Notes
First Name, Last Name, Gender, Date of Birth | | While it may seem unlikely for two individuals to share exactly the same first and last name, date of birth, and gender, it actually happens with some frequency. This data alone is insufficient for a match. For more information, see Name/DOB/Gender Matching.
First/Last/Gender/DOB; a non-specific identifier (e.g., City or Zip Code) | | As in the previous case, name/gender/DOB is not sufficient for a match. The addition of a non-specific identifier doesn't help. For more information, see Name/DOB/Gender Matching.
First/Last/Gender/DOB; MRN | | MRNs can be useful in certain intra-hospital situations, but their matching value is limited because they are generally not common across systems.
First/Last/Gender/DOB; Complete Street Address (Street/City/State/Zip); Phone Number (Home or Cell) | | This combination will give decent results, but can have sensitivity issues when people move or change telephone numbers. It can also be sensitive to name changes due to marriage. Certain communal addresses (rehab centers, skilled nursing, homeless shelters, etc.) can create ambiguity.
First/Last/Gender/DOB; Government or Insurance ID (SSN/Health Card/Medicaid) | | This combination gives good results, though insurance identifiers can change and Medicaid ID is not always available.
First/Last/Gender/DOB; Government or Insurance ID (SSN/Health Card/Medicaid); Complete Street Address (Street/City/State/Zip); Phone Number (Home or Cell) | | This combination gives excellent results. Providing additional fields will improve results further, compensating for any gaps in individual records.

Identifier Match Weights

The BMPI takes into account many factors when weighting demographics. The table below summarizes the identifiers based on:

  • Specificity - How likely it is that no two individuals share the same identifier.
  • Availability - How likely it is that this identifier is available in any given source record.
  • Stability - How likely it is that the value will be the same across multiple sources and records for the same individual.
  • Weight - The overall weight of this field in the matching algorithm.

Identifier | Specificity | Availability | Stability | Weight
SSN | High | Medium | High |
SSN (last 4 digits only) | Medium | High | High |
Medicaid ID | High | Low | High |
Health Card ID | High | Medium | Medium |
First Name | Low | High | High | [1]
Last Name | Low | High | Medium | [1]
Date of Birth | Low | High | High | [1]
Street | Medium | High | Medium | [2]
City | Low | High | Medium | [2]
State | Low | High | Medium | [2]
Zip | Low | High | Medium | [2]
Gender | Low | High | High |
Home Phone | Medium | Medium | Medium |
Cell Phone | High | Medium | Medium |
Race | Low | Medium | High |
Middle Name | Low | Medium | Medium |
Maiden Name | Low | Low | Medium |
Email | Medium | Low | Medium |
MRN | High | High | Low |

Notes:

  1. First name, last name, and date of birth are not very specific on their own, and are not sufficient for a match. However, together they are a core part of the match algorithm.
  2. Individual components of the address do not have a high match weight on their own, but as a group they are useful.