Understanding Matching

The matching logic is a complex set of algorithms that have been developed and refined by CareEvolution over the past 15+ years. These algorithms have been validated to ensure appropriate sensitivity and specificity against production data in a variety of settings.

Contents

Overview
Standardizing and Variations
Hashing
Demographic Weighting and Match Confidence
Examples
Invalid/Ignored Values
Name/DOB/Gender Matching
Linking Within the Same Source (Duplicate Records)
Overwriting Records
Demographic Combinations
Identifier Match Weights

Overview

To determine whether two given records are a match, the Identity service performs several steps:

Account for common variations in demographics.
Perform an irreversible hash on the demographics.
Perform a weighted comparison of demographic fields to establish whether the records are a match.

These steps are discussed in the following sections.

The Identity service is designed to minimize false positives (where two records are erroneously linked when they truly represent different individuals) by only linking records when there is high confidence of a match. This is because false positives are generally considered to have a greater impact than false negatives (where two records from the same individual are not linked). For clinicians, dealing with incomplete data is the norm; dealing with wrong data is more likely to lead to adverse clinical outcomes.

Standardizing and Variations

The Identity service performs some pre-processing of demographic fields to account for common variations among input sources. The first thing it does is normalize data to a standard format. For example, it will account for the following (not an exhaustive list):

Different capitalization (John vs JOHN)
Symbols in names (O’Hare vs Ohare)
Phone number formats (888-555-1234 vs (888)555-1234)
ZIP code lengths (5-digit ZIP codes vs ZIP+4)
Social Security Numbers (with/without hyphens, last-4 vs full numbers)
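
As a rough illustration of the normalization listed above, the sketch below reduces each field to a canonical form so that formatting differences disappear before any comparison. The function names and the exact rules are assumptions made for this example; they are not the Identity service's actual implementation.

```python
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Uppercase the name and keep letters only (drops apostrophes, spaces, hyphens)."""
    # Decompose accented characters so the base ASCII letter survives the filter below.
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if ch.isascii() and ch.isalpha()).upper()


def normalize_phone(phone: str) -> str:
    """Keep digits only, so 888-555-1234 and (888)555-1234 compare equal."""
    return re.sub(r"\D", "", phone)


def normalize_zip(zip_code: str) -> str:
    """Reduce ZIP+4 to the 5-digit ZIP code."""
    return re.sub(r"\D", "", zip_code)[:5]


def normalize_ssn(ssn: str) -> str:
    """Strip hyphens and any other non-digit characters."""
    return re.sub(r"\D", "", ssn)


def ssn_last4(ssn: str) -> str:
    """Last four digits, usable when a source only supplies a partial SSN."""
    return normalize_ssn(ssn)[-4:]
```

With these helpers, "John" and "JOHN" both normalize to the same name, and a full SSN can still be compared with a last-4 value through `ssn_last4`.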

The service also determines common variations. For example (not an exhaustive list):

Nicknames (Abby for Abigail or John for Jon)
Transpositions (a birthday of May 1st, 2001-05-01, erroneously entered as 2001-01-05)
Typos (a SSN that’s off by 1 digit)

Note
Variants can influence whether two records are considered a match, but are not given as much weight as exact matches.
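
The variations described above can be pictured as small generator functions that produce alternative values for a field. This is a minimal sketch: the nickname table is a hypothetical stand-in (a real one would be far larger), and the transposition and typo rules are assumptions chosen to mirror the examples in the list.

```python
from datetime import date

# Hypothetical nickname table, for illustration only.
NICKNAMES = {
    "ABIGAIL": {"ABBY"},
    "ABBY": {"ABIGAIL"},
    "JOHN": {"JON"},
    "JON": {"JOHN"},
}


def name_variants(name: str) -> set:
    """Return known nicknames for a name (empty set when none are known)."""
    return set(NICKNAMES.get(name.upper(), set()))


def dob_variants(dob: date) -> set:
    """Return the month/day transposition when it is a valid, different date."""
    if dob.day <= 12 and dob.day != dob.month:
        # Any month number (1-12) is a valid day in every month.
        return {date(dob.year, dob.day, dob.month)}
    return set()


def ssn_variants(ssn: str) -> set:
    """Return every digit string that differs from ssn in exactly one position."""
    variants = set()
    for i, original in enumerate(ssn):
        for digit in "0123456789":
            if digit != original:
                variants.add(ssn[:i] + digit + ssn[i + 1:])
    return variants
```

Variants produced this way would be compared alongside the original value, but given less weight than an exact match.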

Known invalid values and placeholders are treated as missing fields. See Invalid/Ignored Values for details.

Hashing

Before storing a record, the Identity service performs an irreversible hash operation on the demographic data. The Identity service only stores and performs match operations with the hashed values, not the original values. This makes the Identity service “blinded.”

The Identity service uses only FIPS-compliant hash functions (HMAC/SHA256), with a unique client-specific key to ensure privacy and guard against brute-force or dictionary attacks. For more information, see How Identity Secures Data.

Note
Source system and identifier are not hashed, and will be retrieved in later operations. See Record Identifiers for details.
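
A keyed hash of this kind can be sketched with Python's standard `hmac` and `hashlib` modules. The `blind` function name and the key values are hypothetical; how the service actually manages keys and encodes values is not described here.

```python
import hashlib
import hmac


def blind(value: str, client_key: bytes) -> str:
    """Return the hex HMAC-SHA256 digest of an already-normalized value."""
    return hmac.new(client_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical keys, for illustration only.
key_a = b"hypothetical-key-for-client-a"
key_b = b"hypothetical-key-for-client-b"

same_client_1 = blind("DANIEL", key_a)
same_client_2 = blind("DANIEL", key_a)
other_client = blind("DANIEL", key_b)
```

The same value hashed with the same key always gives the same digest, which is what makes matching on hashed values possible. A different key gives an unrelated digest, so a dictionary of hashed names built without the client's key is of no use to an attacker.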

For example, consider the first names of several records. Although hashed values are actually alphanumeric (e.g., “9f86d081884c…”), emoji are used in these examples for clarity.

	Record A	Record B	Record C
Original Value	Daniel	Gio	Dan
Known Variants	Dan, Danny		Daniel, Danny
Hashed Value	🍎	🍇	🍑
Hashed Variants	🍑, 🍐		🍎, 🍐

The Identity service never stores the original values or variants, only the hashed ones. Yet even with the hashed values, it can determine that Records A and C potentially have matching first names (A’s original value and C’s variant are both 🍎), while Record B has no values in common with either A or C. The Identity service will consider every available field in this manner before reaching its final conclusion about whether two records are a match.
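
The table above can be reproduced with real digests in place of emoji. In this sketch each record's first name becomes a set of hashed tokens, and two records potentially match on that field when their sets overlap. The key, the helper names, and the use of a plain set intersection are assumptions made for illustration. The sketch does not distinguish exact from approximate matches.

```python
import hashlib
import hmac

# Hypothetical key standing in for the real client-specific key.
CLIENT_KEY = b"hypothetical-client-key"


def blind(value: str) -> str:
    """Hex HMAC-SHA256 digest of a normalized value."""
    return hmac.new(CLIENT_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


def blinded_tokens(value, variants=()):
    """Hash the value and its variants; only these tokens would be stored."""
    return {blind(v) for v in (value, *variants)}


def share_a_token(x, y):
    """True when two blinded fields have at least one hashed token in common."""
    return not x.isdisjoint(y)


record_a = blinded_tokens("DANIEL", ["DAN", "DANNY"])
record_b = blinded_tokens("GIO")
record_c = blinded_tokens("DAN", ["DANIEL", "DANNY"])
```

Here `share_a_token(record_a, record_c)` is true while both comparisons involving `record_b` are false, even though no original name is available at comparison time.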

Demographic Weighting and Match Confidence

When determining if two records are a match, the Identity service will compare the hashed values of every available demographic field. For each field individually, it will consider:

How similar the values are.
The field’s relative importance (or “weight”) in determining a match (e.g., name is more important than phone number).

Then the Identity service looks at the overall status of all the fields combined to reach a final conclusion as to whether the records match.

The matching algorithm uses a combination of deterministic matching (the weighting system) and probabilistic matching (the statistical analysis of the overall confidence) to determine the final match status.

Note
The Identity service works best when you provide it all available demographics. No single field is sufficient for a match on its own; all demographics contribute to the final decision about whether two records match.

Match Similarity

When the Identity service compares two demographic fields, it determines how closely each one matches. This gives a similarity status, which is one of:

Similarity Status	Example
Exact Match	Both records have a middle name of “Abigail.”
Approximate Match	One record has a middle name of “Abby” and the other has “Abigail.”
Data Missing	One record has a middle name of “A” and the other has no middle name.
Disagreement	One record has a middle name of “A” and the other has a middle name of “T.”
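
The four statuses in the table can be expressed as a small classification function. This is a sketch under assumptions: the `Field` type, the rule that a value found among the other side's variants counts as approximate, and the placeholder strings such as "h(ABIGAIL)" (standing in for hashed values) are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Similarity(Enum):
    EXACT = "Exact Match"
    APPROXIMATE = "Approximate Match"
    MISSING = "Data Missing"
    DISAGREEMENT = "Disagreement"


@dataclass(frozen=True)
class Field:
    value: str                        # hashed value
    variants: frozenset = frozenset() # hashed variants


def similarity(a: Optional[Field], b: Optional[Field]) -> Similarity:
    """Classify how closely two (possibly absent) fields agree."""
    if a is None or b is None:
        return Similarity.MISSING
    if a.value == b.value:
        return Similarity.EXACT
    if a.value in b.variants or b.value in a.variants:
        return Similarity.APPROXIMATE
    return Similarity.DISAGREEMENT


abigail = Field("h(ABIGAIL)", frozenset({"h(ABBY)"}))
abby = Field("h(ABBY)", frozenset({"h(ABIGAIL)"}))
```

Note that a missing value is reported as its own status rather than as a disagreement, which matters when the statuses are later combined.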

Match Weight

The individual match status results are then combined into an overall score. When doing so, some demographics are weighted more heavily than others.

Demographic	Weight
Government Identifier (SSN, Medicaid ID)	High
First and Last Name, Date of Birth, Gender, Health Card ID, Address	Medium
Middle Name, Email	Low

Agreement or disagreement in a high-weight field has more influence on the match decision than agreement or disagreement in a low-weight field.

Note
No single field is sufficient for matching on its own. Even medium and low weight fields contribute to the overall chance of matching. For a more detailed breakdown of identifier weights and the rationale behind them, see Identifier Match Weights.

The Identity service takes into account several factors when weighting demographics:

Specificity - Many people can share the same name or birthdate (or even both), but government IDs like SSN or Medicaid ID are intended to uniquely identify an individual. This is not a guarantee, of course. Government IDs can be incorrect due to dummy/placeholder SSNs, typographic errors, fraud, cases where a spouse or parent’s SSN was mistakenly added to a record for insurance reasons, etc. Nonetheless, records with the same SSN are likely to be from the same individual, while records with different SSNs are likely to be different people.
Reliability - Middle names are not as reliable as first/last name, due to common recording errors or cultural naming variations.
Stability - Some demographics are anticipated to change repeatedly over a person’s lifetime. Address and phone number can change often, and are commonly shared among different individuals (such as families or even homeless shelters). Even names can change due to marriage or legal name changes.

Overall Match Decision

Once the Identity service has established the similarity and weight of each individual field, it looks at the entire record as a whole to determine whether the two records match.

Imagine the two records as a mosaic, with the size of each block representing the weight of the field and the color representing its hashed value. The Identity service considers the overall similarity of the images. Note that the block sizes are merely illustrative, and should not be considered an accurate representation of the weights used.

Some things to notice:

The record on the left has more fields to work with, giving it more opportunities to match with the record on the right.
All fields are considered. Even the small address block gives important corroboration to help conclude that these are indeed the same person.
The images do not need to match exactly. Rather, it is the overall similarity (based on probability analysis) that determines whether two records match.

For more information about how different combinations of demographics may perform, see Demographic Combinations.

Examples

The following examples illustrate how the Identity service considers the above factors in determining a match.

Note
Although the tables below use raw values (like “John Smith”) for clarity, remember that the Identity service does not store or use raw values in its matching algorithm. It does all comparisons using the blinded, hashed values.

Match Example 1

Consider two records:

First Name	Last Name	Gender	SSN
John	Smith	Male	123-45-6789
Mary	Johnson	Female	123-45-6789
Disagreement	Disagreement	Disagreement	Exact Match

This would not be considered a match. The disagreement in several medium-weight fields (first and last name, gender) overrides the exactly matching SSN.

Match Example 2

Consider a pair of records:

First Name	Middle Name	Last Name	Date of Birth	Medicaid ID
Mary	Jane	Johnson	12/1/1956	123456
Mary	Jane	Smith	12/1/1956	123456
Exact Match	Exact Match	Disagreement	Exact Match	Exact Match

This would be considered a match. The high-weight Medicaid ID, coupled with the agreement in other fields (first name, middle name, and date of birth), is enough to overcome the disagreement in last name (possibly due to marriage).

Match Example 3

Consider this record pair:

First Name	Last Name	Date of Birth	SSN	Health Card ID
John	Smith	12/1/1956	123-45-6789	A1234
Jon	Smith	12/1/1956	Not Provided	A1234
Approximate Match	Exact Match	Exact Match	No Disagreement	Exact Match

This would be considered a match. Notice that the absence of a SSN in the second record is not considered a disagreement. The preponderance of exact matches across medium-weight fields is enough to constitute a match even without the SSN.

Match Example 4

Consider this record pair:

First Name	Last Name	Gender	Date of Birth	Address
John	Smith	Male	12/1/1956	123 Main Street, Anytown MI, 12345
John	Smith	Male	12/1/1956	123 Main Street, Anytown MI, 12345
Exact Match	Exact Match	Exact Match	Exact Match	Exact Match

This would be considered a match even though no high-weight identifiers are provided. Name, date of birth, and gender are insufficient for a match on their own, but exact matches among multiple fields can provide enough corroboration to make a match.
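
To make the four examples concrete, the toy scorer below assigns invented numbers to the High/Medium/Low tiers and adds them up. Everything numeric here is an assumption (the weights 3/2/1, the status factors, the threshold of 6), as is the guard that requires agreement on at least one field beyond name, date of birth, and gender. The real algorithm also uses probabilistic analysis, which this sketch does not attempt to model.

```python
# Hypothetical numeric weights; the documentation only gives High/Medium/Low tiers.
WEIGHTS = {
    "ssn": 3, "medicaid_id": 3,
    "first_name": 2, "last_name": 2, "date_of_birth": 2, "gender": 2,
    "health_card_id": 2, "address": 2,
    "middle_name": 1, "email": 1,
}
# Assumed share of a field's weight contributed by each similarity status.
FACTORS = {"exact": 1.0, "approximate": 0.5, "missing": 0.0, "disagreement": -1.0}
THRESHOLD = 6  # assumed cut-off
CORE = {"first_name", "last_name", "date_of_birth", "gender"}


def score(comparison):
    """Sum weight * factor over every compared field."""
    return sum(WEIGHTS[f] * FACTORS[s] for f, s in comparison.items())


def is_match(comparison):
    """Require corroboration beyond name/DOB/gender and a score at or above the threshold."""
    agreeing = {f for f, s in comparison.items() if s in ("exact", "approximate")}
    has_corroboration = bool(agreeing - CORE)
    return has_corroboration and score(comparison) >= THRESHOLD


example_1 = {"first_name": "disagreement", "last_name": "disagreement",
             "gender": "disagreement", "ssn": "exact"}
example_2 = {"first_name": "exact", "middle_name": "exact", "last_name": "disagreement",
             "date_of_birth": "exact", "medicaid_id": "exact"}
example_3 = {"first_name": "approximate", "last_name": "exact", "date_of_birth": "exact",
             "ssn": "missing", "health_card_id": "exact"}
example_4 = {"first_name": "exact", "last_name": "exact", "gender": "exact",
             "date_of_birth": "exact", "address": "exact"}
name_dob_gender_only = {"first_name": "exact", "last_name": "exact",
                        "gender": "exact", "date_of_birth": "exact"}
```

With these invented numbers, Example 1 is rejected and Examples 2 through 4 are accepted, while exact agreement on name, date of birth, and gender alone is rejected by the corroboration guard.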

Invalid/Ignored Values

The Identity service will ignore certain values that are commonly used as placeholders when real information is not available. For example:

SSNs like 123-45-6789 or 111-11-1111. Also invalid SSNs, such as those beginning with 000 or 999.
MRNs that are all a single digit (like 1111 or 2222222222).
DOBs like 1/1/1900.
First names like “Baby” or “Baby Boy.”
Phone numbers like (111)111-1111.

This is not an all-inclusive list, merely examples.

Values like these are not effective for linking records. They are unlikely to match other records from the same individual, and may provide erroneous matching weight towards records with similar placeholders from different individuals. These values will be ignored, treated as if the field were missing.
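
A filter of this kind might look like the following sketch, which returns `None` for a placeholder so that the field is handled as missing. The placeholder sets contain only the examples listed above, and the function names are hypothetical; the service's actual list is more extensive.

```python
import re

# Only the examples given in this section; not a complete list.
PLACEHOLDER_SSNS = {"123456789", "111111111"}
PLACEHOLDER_FIRST_NAMES = {"BABY", "BABY BOY"}
INVALID_SSN_PREFIXES = ("000", "999")


def clean_ssn(ssn):
    """Return the SSN digits, or None for an empty, placeholder, or invalid value."""
    digits = re.sub(r"\D", "", ssn or "")
    if not digits or digits in PLACEHOLDER_SSNS:
        return None
    if digits[:3] in INVALID_SSN_PREFIXES:
        return None
    return digits


def clean_mrn(mrn):
    """Return the MRN, or None when it is empty or one digit repeated."""
    value = (mrn or "").strip()
    if not value or (value.isdigit() and len(set(value)) == 1):
        return None
    return value


def clean_first_name(name):
    """Return the uppercased name, or None for an empty or placeholder name."""
    value = (name or "").strip().upper()
    if not value or value in PLACEHOLDER_FIRST_NAMES:
        return None
    return value
```

Because a filtered field comes back as `None`, it contributes neither agreement nor disagreement when two records are compared.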

Name/DOB/Gender Matching

One specific case deserves special mention in the matching algorithm: The Identity service does not consider name, date of birth, and gender alone sufficient for a match. While it may seem unlikely for two individuals to share exactly the same first and last name, date of birth, and gender, it actually happens with some frequency in large population sets.

For example, in Perspectives on Patient Matching: Approaches, Findings, and Challenges the authors report:

“The Social Security Administration’s death master file contains more than 20,000 distinct males named “William Smith,” 5 of whom were born on January 20, 1920 (Social Security Administration, 2001). If name, gender, and date of birth are the only matching criteria used, these individuals may be falsely linked.”

Another report prepared for the Office of the National Coordinator for Health Information Technology found that:

“Many regions experience a high number of individuals who share the exact name and birthdate, leading to the need for additional identifying attributes to be used when matching patient records.”

Additional factors that can cause shared name/date of birth/gender include:

Naming trends (such as the prevalence of individuals born on 12/25 named “Noelle” or “Holly”, or the popularity of certain names each year).
Family members with similar demographics (e.g., twins with similar names).
Cultural factors (such as certain names being extremely common in some healthcare populations).

For these reasons, you must always supplement name/DOB/gender with at least one other field that specifically identifies the individual.

Non-specific fields like zip code or city are insufficient supplements to name/DOB/gender. The distribution of names is highly influenced by cultural factors, and is neither uniform nor random. An ethnic neighborhood in a big city may have several individuals with the same name, some of whom may even share a birthday.
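
One way to see how often this happens in your own data is to count how many source records share the same name, date of birth, and gender. The sketch below does this for a list of dictionaries. The field names and the sample records are invented for illustration and do not come from any real dataset.

```python
from collections import Counter


def shared_demographic_groups(records):
    """Return each name/DOB/gender key that appears on more than one record."""
    counts = Counter(
        (r["first"].upper(), r["last"].upper(), r["dob"], r["gender"])
        for r in records
    )
    return {key: n for key, n in counts.items() if n > 1}


# Invented sample records; "person" marks which individual each record belongs to.
sample = [
    {"person": 1, "first": "William", "last": "Smith", "dob": "1920-01-20", "gender": "M"},
    {"person": 2, "first": "William", "last": "Smith", "dob": "1920-01-20", "gender": "M"},
    {"person": 3, "first": "Mary", "last": "Johnson", "dob": "1956-12-01", "gender": "F"},
]

collisions = shared_demographic_groups(sample)
```

Any key returned here covers records that would look identical on name, date of birth, and gender, so only an additional specific field can tell whether they belong to one individual or several.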

Linking Within the Same Source (Duplicate Records)

If the Identity service receives two different records from the same source with matching demographics, it will link them.

In an ideal world, each source system would maintain only one record for a given individual. In reality, incomplete data and poor matching algorithms within EMR systems make duplicate records a common occurrence.

Overwriting Records

The Identity service considers the source system to be the authority for its own records. Every time you upload data for a given source system/identifier, it will completely replace the existing record with the new data, then re-evaluate the validity of any existing record links. The response will include information about any updated matches.

This is true even if the new data is very different from the original. For example, CommunityClinic record A12345 may have originally been uploaded as “John Smith, male, born 1/1/1952” and then updated to “Jane Jones, female, born 2/2/1980”. The Identity service will overwrite John’s data with Jane’s, and records previously linked to John’s will no longer match the updated record (though they may still match each other).
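
The replacement behavior can be pictured as a store keyed by source system and identifier, where each upload overwrites the whole record rather than merging fields. This is a simplified sketch with hypothetical names (`RecordStore`, `upload`, `get`); it does not model the re-evaluation of record links that follows an upload.

```python
class RecordStore:
    """Toy store in which the source system is the authority for its own records."""

    def __init__(self):
        self._records = {}

    def upload(self, source, identifier, demographics):
        """Replace the record for (source, identifier) and return the previous one, if any."""
        key = (source, identifier)
        previous = self._records.get(key)
        self._records[key] = dict(demographics)  # full replacement, no field-level merge
        return previous

    def get(self, source, identifier):
        return self._records.get((source, identifier))


store = RecordStore()
store.upload("CommunityClinic", "A12345",
             {"first": "John", "last": "Smith", "gender": "male", "dob": "1952-01-01",
              "phone": "8885551234"})
replaced = store.upload("CommunityClinic", "A12345",
                        {"first": "Jane", "last": "Jones", "gender": "female", "dob": "1980-02-02"})
current = store.get("CommunityClinic", "A12345")
```

After the second upload, `current` holds only Jane's fields. The phone number from the first upload is gone because nothing is carried over from the earlier version.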

Demographic Combinations

Below are some common combinations of demographics and their expected matching sensitivity.

Note
First Name + Last Name + Date of Birth (DOB) + Gender is considered foundational information for all combinations, but is insufficient for a match on its own.

Identifiers	Sensitivity	Notes
First Name + Last Name + Gender + Date of Birth		While it may seem unlikely for two individuals to share exactly the same first and last name, date of birth, and gender, it actually happens with some frequency. This data alone is insufficient for a match. For more information, see Name/DOB/Gender Matching.
First/Last/Gender/DOB + a non-specific identifier (e.g., City or Zip Code)		As in the previous case, the combination of name/gender/DOB alone is not sufficient for a match. The addition of a non-specific identifier doesn't help. For more information, see Name/DOB/Gender Matching.
First/Last/Gender/DOB + MRN		MRNs can be useful in certain intra-hospital situations, but their matching value is limited because they are generally not common across systems.
First/Last/Gender/DOB + Complete Street Address (Street/City/State/Zip) + Phone Number (Home or Cell)		This combination will give decent results, but can have sensitivity issues when people move or change telephone numbers. It can also be sensitive to name changes due to marriage. Certain communal addresses (rehab centers, skilled nursing, homeless shelters, etc.) can create ambiguity.
First/Last/Gender/DOB + Government or Insurance ID (SSN/Health Card/Medicaid)		This combination gives good results, though insurance identifiers can change and Medicaid ID is not always available.
First/Last/Gender/DOB + Government or Insurance ID (SSN/Health Card/Medicaid) + Complete Street Address (Street/City/State/Zip) + Phone Number (Home or Cell)		This combination gives excellent results. Providing additional fields will improve results further, compensating for any gaps in individual records.

Identifier Match Weights

The Identity service takes into account many factors when weighting demographics. The table below summarizes the identifiers based on:

Specificity - How likely it is that no two individuals share the same identifier.
Availability - How likely it is that this identifier is available in any given source record.
Stability - How likely it is that the value will be the same across multiple sources and records for the same individual.
Weight - The overall weight of this field in the matching algorithm.

Note
No single field is sufficient for matching on its own. Even medium and low weight fields contribute to the overall chance of matching.

Identifier	Specificity	Availability	Stability	Weight
SSN	High	Medium	High
SSN (last 4 digits only)	Medium	High	High
Medicaid ID	High	Low	High
Health Card ID	High	Medium	Medium
First Name	Low	High	High	[1]
Last Name	Low	High	Medium	[1]
Date of Birth	Low	High	High	[1]
Street	Medium	High	Medium	[2]
City	Low	High	Medium	[2]
State	Low	High	Medium	[2]
Zip	Low	High	Medium	[2]
Gender	Low	High	High
Home Phone	Medium	Medium	Medium
Cell Phone	High	Medium	Medium
Race	Low	Medium	High
Middle Name	Low	Medium	Medium
Maiden Name	Low	Low	Medium
Email	Medium	Low	Medium
MRN	High	High	Low

Notes:

[1] First name, last name, and date of birth are not very specific on their own, and are not sufficient for a match. However, together they are a core part of the match algorithm.
[2] Individual components of the address do not have a high match weight on their own, but as a group they are useful.