Privacy Best Practices - Generating, Using, and Sharing Synthetic Data

Amid rising global concerns about data privacy, many emerging technologies have evolved to strike a balance between privacy and access. Access to sensitive data is often governed by internal policies and safeguarded by regulations such as the GDPR, HIPAA, and the CPRA.

For many organizations, such regulations can restrict access to data critical for important research and analysis. On the other hand, these regulations are essential to protect the sensitive information of individuals, which interested parties might otherwise exploit for commercial gain.

In such a situation, one solution that has presented itself as a practical compromise is synthetic data. By taking actual data as a starting point and creating a data model, computers can generate new datasets that are entirely synthetic yet retain their value for research and analysis. While this solves many problems and can make data analysis accessible to more organizations, there are still specific privacy considerations to keep in mind while generating, using, and sharing synthetic data.

Let us look closely at synthetic data and how to maintain privacy when dealing with it.

What is Synthetic Data?

In a nutshell, synthetic data is artificially created using models derived from real data sources. Several methods can be employed to create it, the most widely used being de-identification, machine learning, statistical modeling, and computer simulations.

This data can serve multiple purposes, including training machine learning models, testing software, and creating visualizations.

The starting point for generating synthetic data is real data, which can include personally identifiable information. However, instead of using that data straightaway for analysis, the data is instead used to create a data model. This model can then generate sets of synthetic data with the same properties as the actual data in terms of statistical and analytical significance while being virtually non-identifiable.
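As a minimal sketch of this pipeline, the snippet below fits a deliberately simple statistical model (a mean vector and covariance matrix) to a stand-in "real" dataset and samples fresh synthetic records from it. The columns, distributions, and sample sizes are illustrative assumptions, not from any real source.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for sensitive real data: 500 records of (age, income).
# In practice this would be the governed source dataset.
real = rng.multivariate_normal(
    mean=[40.0, 55_000.0],
    cov=[[90.0, 1_500.0], [1_500.0, 4_000_000.0]],
    size=500,
)

# Step 1: derive a data model from the real data (here, just its
# mean and covariance -- a minimal statistical model).
model_mean = real.mean(axis=0)
model_cov = np.cov(real, rowvar=False)

# Step 2: generate entirely new records from the model. No synthetic
# row is copied from a real row, yet aggregate statistics carry over.
synthetic = rng.multivariate_normal(model_mean, model_cov, size=500)

print(synthetic.shape)  # (500, 2)
print(np.allclose(synthetic.mean(axis=0), model_mean, rtol=0.1))
```

Real generators use far richer models (GANs, copulas, Bayesian networks), but the two-step structure — fit a model, then sample from it — is the same.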

This synthetic data can then be used in many of the same ways real data is used for analysis. In fact, synthetic data has several advantages over real-world data. It can be created in large quantities, which can be helpful for training machine learning models. It can also be created to match specific requirements, such as having a particular distribution of values or being free of certain biases. Moreover, synthetic data can be used to create data sets that would be difficult or impossible to collect from real-world sources.

Most of all, this opens the doors for organizations to create and share more data, create more analytical studies, and provide the many benefits therein without compromising privacy.

Suggested Read: Protect PII and Sensitive Data with Data Tokenization

Privacy Concerns Regarding Synthetic Data

To understand the privacy implications of synthetic data and the legal liabilities that might warrant consideration, it is essential first to develop an understanding of all relevant laws and regulations, especially the GDPR. Recital 26 of the GDPR and the Article 29 Data Protection Working Party’s Opinion on Anonymisation Techniques are essential documents in this regard, as they clearly spell out the scope and legal thresholds of anonymized and pseudonymized data and of identifiability.

According to this understanding, privacy concerns regarding synthetic data can be classified into three broad categories –

1.   Since the source data is real-world data containing sensitive and personally identifiable information, there are legal and compliance-related requirements pertaining to its use and analysis.

2.   The synthetic data derived from this real-world data must also be reliable and complete for any further analysis to be considered valid.

3.   Finally, since synthetic data generation might not be possible to carry out in-house, organizations might have to involve third parties for the actual modeling and generation process, which entails sharing real-world data with those third parties, another act that falls under the compliance umbrella.

While synthetic data falls under the umbrella of non-identifiable data, specific legal implications still remain with the generation, use, and sharing of this data. The most important points to keep in mind include –

●     The creation and use of a clear data governance process for the entire synthetic data workflow

●     Making sure that the data modeling and generation process falls in line with all relevant privacy regulations

●     Studying the end product, the generated synthetic data, to ensure that it is non-identifiable and, therefore, complies with data protection regulations.

●     Building a system to log and study changes and updates made to synthetic data over time to ensure round-the-clock compliance

●     A clear and transparent channel of communication with all customers and stakeholders about the process of using and sharing synthetic data

●     Following every applicable legal and ethical standard while handling the original real-world data used for synthetic data generation

Must Read: Removing PII from AI Training Data to Reduce Privacy Risks

Synthetic Data Privacy Best Practices

When it comes to data privacy best practices regarding synthetic data, the ideal scenario is for organizations to have a data privacy plan in place even before the generation process begins. Planning in advance helps remove regulatory bottlenecks, protects the organization from exposure and legal issues, and helps establish a clean, transparent image.

In order to generate, use, and share synthetic datasets that are truly privacy-preserving, there are a few factors that should be worked into the synthetic data plan from the outset. Here are some important considerations.

Before and During Generation

Removing personally identifiable information from your original dataset even before using it to train your synthetic data generation model provides an extra layer of data privacy. Similarly, training the generation model on small batches of data with intermittent regularization helps ensure that the model does not simply memorize its inputs. Another technique, known as differential privacy, involves introducing noise during the training process to create a mathematical basis for privacy.
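To make the differential-privacy idea concrete, here is a hedged sketch of its basic building block, the classic Laplace mechanism for releasing a single numeric query. Real training pipelines (e.g. DP-SGD) apply calibrated noise to gradients rather than to one statistic, and the sensitivity and epsilon values below are illustrative assumptions.

```python
import math
import random

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Return true_value plus Laplace(0, sensitivity/epsilon) noise,
    the textbook epsilon-differentially-private numeric release."""
    scale = sensitivity / epsilon
    # Inverse-transform sampling of the Laplace distribution.
    u = rng.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return true_value - scale * sign * math.log(1.0 - 2.0 * abs(u))

# Example: privately release a count. A count has sensitivity 1,
# since adding or removing one person changes it by at most 1.
rng = random.Random(0)
true_count = 1000
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5, rng=rng)
print(round(noisy_count, 2))  # close to 1000, but never exactly the raw value
```

Smaller epsilon means a larger noise scale and therefore stronger privacy at the cost of accuracy; choosing that trade-off is the core design decision when applying differential privacy.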

Avoid Reconstruction Attacks

While synthetic data needs to be very close, in terms of statistical impact, to the original real-world data, caution is required because such synthetic data might fall victim to reconstruction attacks. A synthetic dataset that preserves all statistics of the original data can leak privacy: attackers can use the dataset to reverse-engineer the personally identifiable information in the source data using reconstruction techniques. Therefore, the generation process needs to strike a balance between preserving enough statistical similarity to the actual data and introducing enough difference to remain private.

Calculate Membership Inference Score

When a synthetic data generation model has been fully trained on its source dataset and requires no further training, it can start generating synthetic results on previously unanalyzed data without needing further access to the original data. This leaves the system vulnerable to a membership inference attack, through which attackers can start forming a picture of the original data by analyzing newly generated data and working backward. Models that generate a lot of outlier records are especially vulnerable to this attack. Evaluating the model in this regard and arriving at a membership inference score can help mitigate the issue. Models with higher scores are more protected from this type of attack.
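The article does not define how such a score is computed, but one simple distance-based proxy is sketched below; the function names and the "higher is safer" convention are assumptions for illustration. The intuition: if known training members sit systematically closer to the synthetic output than fresh non-members do, membership is inferable.

```python
import math

def nearest_distance(record, dataset):
    """Euclidean distance from `record` to its closest row in `dataset`."""
    return min(math.dist(record, other) for other in dataset)

def membership_inference_score(members, non_members, synthetic, threshold):
    """Illustrative proxy: a distance-threshold attacker guesses "member"
    whenever a record lies within `threshold` of some synthetic row.
    Returns 1.0 (no measurable leakage) down to 0.0 (membership fully
    inferable) -- higher is safer under this assumed convention."""
    hits = sum(nearest_distance(r, synthetic) < threshold for r in members)
    false_alarms = sum(nearest_distance(r, synthetic) < threshold for r in non_members)
    advantage = hits / len(members) - false_alarms / len(non_members)
    return 1.0 - max(advantage, 0.0)

# A model that nearly memorized its training members scores poorly...
members = [(0.0, 0.0), (1.0, 1.0)]
non_members = [(5.0, 5.0), (6.0, 6.0)]
leaky_synth = [(0.01, 0.0), (1.0, 1.01)]
print(membership_inference_score(members, non_members, leaky_synth, 0.5))  # 0.0

# ...while one whose output is far from everyone scores well.
safe_synth = [(3.0, -3.0), (-3.0, 3.0)]
print(membership_inference_score(members, non_members, safe_synth, 0.5))  # 1.0
```

Production tools use more rigorous attack models, but the evaluation loop is the same: simulate the attacker, measure their advantage, and only release synthetic data from models where that advantage is negligible.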

Exact Match and Neighboring Match Tests

These are reasonably basic tests that should be used with all forms of synthetic data. First, an exact match test examines if any actual data records can be found within the generated synthetic data, with the ideal score being zero. Similarly, neighboring match tests try to establish the percentage of synthetic data points that are similar to the real data to such a high degree as to compromise privacy.
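Both checks are straightforward to sketch. In the version below, the function names and the Euclidean-radius definition of a "neighboring" record are illustrative assumptions; real implementations would scale features and pick the radius per column.

```python
import math

def exact_match_count(real_rows, synthetic_rows):
    """Number of synthetic rows that reproduce a real row verbatim (ideal: 0)."""
    real_set = set(map(tuple, real_rows))
    return sum(tuple(row) in real_set for row in synthetic_rows)

def neighboring_match_rate(real_rows, synthetic_rows, radius):
    """Fraction of synthetic rows lying within `radius` of some real row,
    a rough gauge of records similar enough to compromise privacy."""
    def near_real(row):
        return any(math.dist(row, r) <= radius for r in real_rows)
    return sum(near_real(row) for row in synthetic_rows) / len(synthetic_rows)

real = [(34, 72_000.0), (51, 91_000.0)]
synthetic = [(34, 72_000.0),   # verbatim leak of a real record
             (40, 60_000.0)]   # genuinely new record

print(exact_match_count(real, synthetic))                     # 1 -> fails the test
print(neighboring_match_rate(real, synthetic, radius=500.0))  # 0.5
```

Any nonzero exact match count should block release, while the acceptable neighboring match rate is a policy decision that depends on the sensitivity of the source data.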

Zero Loopholes in Handling Real Data

As with any scenario that involves working with real data containing personally identifiable information, there should be no loopholes in the synthetic data generation process that expose the real data to attacks. Data handling should be carried out with all relevant compliance requirements in mind, and stringent data security measures should be implemented. All personal data used in the original dataset should be collected, stored, and processed in full compliance with all applicable privacy regulations, including the GDPR.

Final Thoughts

While understanding the currently applicable regulations and creating a finely tuned process for generating, using, and sharing synthetic data is an excellent starting point, it is also important to remember that this is a young, dynamic, and rapidly growing field where new developments can happen at any time. It is crucial for organizations to stay abreast of these developments and keep finding ways to improve and evolve their synthetic data systems so they remain on top of data privacy regulations. Proactive mitigation of potential future risks helps ensure maximum utility from synthetic data for years to come.


Rahul Sharma

Content Writer

Rahul Sharma graduated from Delhi University with a bachelor’s degree in computer science and is a highly experienced technical writer who has been creating content for technology companies for the last 12 years.
