While many organizations mask the identities of customers, consumers, or patients for analytics projects, combinations of data elements can still re-identify an individual. Data elements that are each de-identified on their own but that identify an individual when combined are called toxic combinations. They are common in data lakes, which take in a diverse mix of data sources and data types: structured, semi-structured, and unstructured. Data in motion can also be a blind spot for many companies, given that most organizations don’t know what data enters and leaves the organization every day.
Toxic combinations are usually unintentional: data elements are brought together for analysis, and the result allows unauthorized re-identification of individuals. An example is a dataset that contains the date of birth, ZIP code, and gender of each individual. A widely cited estimate by Latanya Sweeney, based on 1990 US census data, is that this combination alone uniquely identifies about 87% of the US population. The rates may be lower for de-identified health or legal data, but organizations must still exercise due diligence and due care to protect the privacy of individuals whose data is used for analytics.
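
One way to make this risk concrete is to count how many records in a dataset are unique on a given combination of fields. The sketch below is a minimal illustration in Python, not a complete privacy assessment. The field names (`dob`, `zip`, `gender`), the helper functions, and the four sample records are all hypothetical and invented for this example.

```python
from collections import Counter


def group_sizes(records, fields):
    """Count how many records share each combination of values for `fields`."""
    return Counter(tuple(record[f] for f in fields) for record in records)


def unique_fraction(records, fields):
    """Fraction of records that are the only one with their combination of values."""
    if not records:
        return 0.0
    sizes = group_sizes(records, fields)
    unique_records = sum(1 for size in sizes.values() if size == 1)
    return unique_records / len(records)


def smallest_group(records, fields):
    """Size of the smallest group; a value of 1 means someone stands alone."""
    sizes = group_sizes(records, fields)
    return min(sizes.values()) if sizes else 0


# Hypothetical, made-up records with direct identifiers already removed.
records = [
    {"dob": "1980-01-15", "zip": "02139", "gender": "F"},
    {"dob": "1980-01-15", "zip": "10001", "gender": "M"},
    {"dob": "1975-06-30", "zip": "02139", "gender": "M"},
    {"dob": "1975-06-30", "zip": "10001", "gender": "F"},
]

# Each field on its own leaves every record in a group of two.
for field in ("dob", "zip", "gender"):
    print(field, unique_fraction(records, [field]))

# Combined, the same fields single out every record.
print("combined", unique_fraction(records, ["dob", "zip", "gender"]))
```

In this toy dataset no single field isolates anyone, yet the three fields together isolate every record. Running the same kind of count on the joined output of a data lake, rather than on each source separately, is one way to notice a toxic combination before the data is released for analytics.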