How Identity Secures Data

All demographic fields are irreversibly hashed upon being received by the Identity service. This secures identifying information by making it irretrievable in subsequent operations. In this article, you will learn more about how the hashing process works, and how Identity protects your data.

What is Hashing?

Most people are familiar with encryption. The original text is encoded into an unrecognizable string of gibberish, called the ciphertext. With the proper decryption key, you can retrieve the original text.

While encryption algorithms can be highly secure, they are all reversible by design. Anyone who holds the proper key, including a malicious attacker who cracks it, can get back the original text.

The Identity service uses a secure, keyed hash, rather than encryption, to safeguard its data. Unlike encryption, hashing is a one-way, irreversible process. Even with the secret key, there is no operation that converts a hash back into the original data.

For example, consider these two hashes:

Input --> Hash
John  --> a8cfcd74832004951b4408cdb0a5dbcd8c7e52d43f7fe244bf720582e05241da
Jon   --> 5f39b51ae9a4dacbb8d9538229d726bfb7e1a03633e37d64598c32989a8c1277

Changing just a single letter of the input results in a completely different hash.

Hashing is analogous to cooking. If you use the exact same recipe with the exact same ingredients and follow it perfectly each time, you will always get the same food. However, there’s no way to un-cook the food to get back its original ingredients. Similarly, the hash of “John” will always be “a8cfcd748320049…”, but there’s no way to get from “a8cfcd748320049…” back to “John”.
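
The two properties above (the same input always gives the same hash, and a one-letter change gives an unrelated hash) can be checked with a few lines of Python. This is a minimal sketch using the standard library's unkeyed SHA-256; the helper name sha256_hex is an assumption made for illustration and is not part of the Identity service.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the unkeyed SHA-256 digest of text as 64 hex characters."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


john = sha256_hex("John")
jon = sha256_hex("Jon")

# Same recipe, same ingredients, same result: hashing is deterministic.
print(john == sha256_hex("John"))

# Count the hex positions at which the two digests disagree.
differing = sum(a != b for a, b in zip(john, jon))
print(f"{differing} of {len(john)} hex characters differ")
```

Note that the code contains no function for going from a digest back to a name; the only way to learn what a digest represents is to hash a guess and compare.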

How are Hashes Used?

The Identity service stores and performs match operations on only the hashed values, never the original values. This makes the service “blinded.”

For example, consider the first names of several records. Although hashed values are actually alphanumeric (e.g., “9f86d081884c…”), thinking of them as emoji can make the example easier to follow.

  • Record A’s hashed first name: “a8cfcd7…” (🍎)
  • Record B’s hashed first name: “b125956…” (🍐)
  • Record C’s hashed first name: “a8cfcd7…” (🍎)

You can tell that A and C have the same first name (🍎) without knowing that they’re both “John”. Similarly, you can tell that their names are different from B (🍎 vs. 🍐) without knowing that B is “Dan”. In this way, the Identity service is able to match demographic values across records without actually knowing the values.
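
The matching described above can be sketched in Python. This is an illustration only: the key, the record IDs, and the blind helper are hypothetical, and the sketch assumes a keyed HMAC/SHA256 hash rather than showing the service's actual implementation.

```python
import hashlib
import hmac
from collections import defaultdict

# Hypothetical key for illustration only; a real key would be random and secret.
SECRET_KEY = b"example-key-not-for-production"


def blind(value: str) -> str:
    """Return the HMAC/SHA256 hex digest of a demographic value."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Values are hashed once, on receipt; only the hashes are kept.
stored = {
    "A": blind("John"),
    "B": blind("Dan"),
    "C": blind("John"),
}

# Matching compares hashes only; the original names are never consulted.
groups = defaultdict(list)
for record_id, hashed_name in stored.items():
    groups[hashed_name].append(record_id)

matches = [ids for ids in groups.values() if len(ids) > 1]
print(matches)  # [['A', 'C']]
```

The grouping step sees only the stored hex strings, so it finds that A and C match without any access to the name "John".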

Hash Security

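Not all hashing algorithms are created equal. The Identity service uses the highly secure HMAC/SHA256, a keyed hash built on the SHA256 algorithm. Both are published standards of the National Institute of Standards and Technology (NIST): HMAC is specified in FIPS 198-1 and SHA256 in FIPS 180-4, and both are approved for use in cryptographic modules validated under the Federal Information Processing Standards (FIPS 140).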

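SHA256 has strong pre-image resistance (given a hash, it is computationally infeasible to find an input that produces it) and strong collision resistance (it is computationally infeasible to find two different inputs that produce the same hash). These properties underpin its irreversibility and its reliability for matching.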

For added security, the Identity service uses a client-specific secret key in its hashing algorithm.

In a basic (no-key) SHA256 hash, the hashed value of “John” is always the same:

John --> (no key) --> a8cfcd74832004951b4408cdb0a5dbcd8c7e52d43f7fe244bf720582e05241da

Even though you can’t get directly from “a8cfcd748320049…” back to “John”, anyone can compute the hashes of common first names in advance and look for them in the data, which could reveal some of the original values.

With a client-specific secret key, the hashed value of “John” is different for each client:

John --> (client 1 key) --> 8db6b4da0d673d21d66a0f565229c061f4ade17fd95c34e06ca5c174a99da454
John --> (client 2 key) --> f822e2c093717180bf56df10c8d0bd1692f6a74684e4548294e69a728252fc4e

Without the key, an attacker cannot compute in advance what the hash of “John” should be. This guards against dictionary attacks and rainbow tables, two common attacks against stolen hashed data.
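
The difference between the unkeyed and keyed cases can be sketched with Python's standard hmac module. The client keys and the list of common names below are hypothetical values chosen for illustration; real keys would be randomly generated secrets.

```python
import hashlib
import hmac


def keyed_hash(key: bytes, value: str) -> str:
    """Return the HMAC/SHA256 hex digest of value under the given key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical per-client keys, for illustration only.
client_1_key = b"client-1-example-key"
client_2_key = b"client-2-example-key"

h1 = keyed_hash(client_1_key, "John")
h2 = keyed_hash(client_2_key, "John")
print(h1 != h2)  # True: same name, different hash for each client

# An attacker's precomputed table of unkeyed hashes for common first names.
common_names = ["John", "Jon", "Dan", "Mary"]
lookup = {hashlib.sha256(n.encode("utf-8")).hexdigest(): n for n in common_names}

unkeyed = hashlib.sha256(b"John").hexdigest()
print(lookup.get(unkeyed))  # John: the unkeyed hash is found in the table
print(lookup.get(h1))       # None: the keyed hash is not
```

To build a useful table against keyed hashes, the attacker would first need the client's secret key, and a table built for one client would be useless against another.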

Is Hashing Sufficient for PII/PHI?

NIST recognizes hashing as a common way to safeguard and de-identify personally identifiable information (PII). In its Guide to Protecting the Confidentiality of Personally Identifiable Information (PII), NIST notes:

“A common de-identification technique for obscuring PII is to use a one-way cryptographic function, also known as a hash function, on the PII.

De-identified information can be assigned a PII confidentiality impact level of low, as long as the following are both true:

  • The re-identification algorithm, code, or pseudonym is maintained in a separate system, with appropriate controls in place to prevent unauthorized access to the re-identification information.
  • The data elements are not linkable, via public records or other reasonably available external records, in order to re-identify the data.”

As described above, the use of a strong hashing algorithm with a client-specific secret key prevents re-identification as called for by the NIST guidelines.

The NIST guide for De-Identifying Government Datasets includes “hashing with a keyed hash, such as a Hash-based Message Authentication Code (HMAC)…for example, SHA-256 HMAC with a 256-bit randomly generated key” (the algorithm used by the Identity service) as an acceptable mechanism for removing identifiable information from a data set. Further, CareEvolution’s robust security system implements the procedural and administrative safeguards NIST recommends to ensure that the secret keys are properly selected and secured.

The HIPAA Guidance Regarding Methods for De-identification of Protected Health Information acknowledges cryptographic hashing as a pathway for de-identifying data via its “expert determination” method:

“For clarification, our guidance is similar to that provided by the National Institutes of Standards and Technology…The re-identification provision in §164.514(c) does not preclude the transformation of PHI into values derived by cryptographic hash functions using the expert determination method, provided the keys associated with such functions are not disclosed, including to the recipients of the de-identified information.”

Local Hashing Service

Clients seeking even more security can use Privacy-Preserving (Blinded) Mode to hash demographics with a local hashing service before they are even sent.

The local hashing service uses the same secure hashing algorithm as the Identity service, but you control the secret key. In the industry, this is known as a Bring Your Own Key (BYOK) or Customer-Supplied Encryption Keys (CSEK) model.
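
As a rough sketch of what hashing before sending looks like, the Python below generates a 256-bit random key locally and hashes each demographic field with HMAC/SHA256. The field names, the hash_demographics helper, and the shape of the payload are assumptions made for illustration; they do not describe the local hashing service's actual interface.

```python
import hashlib
import hmac
import secrets
from typing import Dict

# A 256-bit random key, generated and held by the client (never sent anywhere).
local_key = secrets.token_bytes(32)


def hash_demographics(record: Dict[str, str], key: bytes) -> Dict[str, str]:
    """Replace every field value with its HMAC/SHA256 hex digest."""
    return {
        field: hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
        for field, value in record.items()
    }


# Hypothetical field names and values, for illustration only.
record = {
    "first_name": "John",
    "last_name": "Smith",
    "date_of_birth": "1980-01-01",
}

# Only this hashed payload would leave the client's environment.
payload = hash_demographics(record, local_key)
for field, hashed in payload.items():
    print(field, hashed)
```

Because the key never leaves the client, the recipient of the payload can match records on the hashed values but cannot recompute the hash for a guessed name.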