Until 2023, India’s data privacy landscape was largely unregulated – businesses didn’t have to worry about how they processed and stored data. Sensitive customer data like Personally Identifiable Information (PII) could travel around the world in 80 days and land back at its source without violating a single regulation.
While the unregulated digital space was a boon for data-dependent businesses, it was a bane for customer privacy. Customers had little or no control over their sensitive data – if a business failed to enforce strong security protocols, that data was at risk of being accessed by malicious actors.
Law and order in India’s data privacy landscape
These unsafe data practices are changing now that India’s Digital Personal Data Protection (DPDP) Act is knocking on businesses’ doors with hefty penalties for violations. The regulation, enacted in August 2023, is India’s most comprehensive data privacy law, governing how personal data is collected, processed, and stored.
One of the key provisions of DPDP concerns data residency: the physical location where data is stored. Under the law, the government can restrict the transfer of Indian customers’ personal data to notified countries outside India’s borders.
Since its enactment, businesses such as market research/intelligence and customer behavior analytics platforms that use generative AI models like GPT-4.1 to process data are finding themselves in a pickle.
GPT-4.1 and DPDP’s data residency provisions
OpenAI’s latest model, GPT-4.1, has gained the attention of market research and intelligence agencies with its long-context comprehension capabilities. Its ability to contextualize and process large documents makes it suitable for extensive data analysis.
All that’s great, but there’s a catch – GPT-4.1 runs on cloud infrastructure hosted in the U.S., which means data resides outside India. This is troubling news for agencies using the model to process Indian customers’ data.
On one hand, using an inferior model can compromise the quality of output. On the other hand, violating DPDP is an inevitable path to penalties, loss of customer trust, and irreparable damage to brand reputation.
Navigating DPDP’s data residency law – without breaking it
If your business processes Indian customers’ data using GPT-4.1, it could come onto the radar of the Data Protection Board of India (DPBI).
So, how can you get out of this sticky situation while complying with DPDP and ensuring data privacy?
One solution that has worked for startups and enterprises alike is embedding data privacy and security directly into GenAI applications and LLM pipelines. A growing number of AI privacy tools have entered the market, driven by heightened regulatory scrutiny and a rising number of privacy breaches.
These privacy tools work in one of two ways:
Method 1: Block and tackle
The first method identifies sensitive data like Personally Identifiable Information (PII) and Protected Health Information (PHI) and then blocks or masks it before it reaches the model. This prevents sensitive data from leaking into the permanent, non-erasable LLM memory.
Blocking sensitive data before it enters the LLM system eliminates the risk of it being misused or sold by malicious actors, even if they manage to breach the system. Such privacy tools help companies stay compliant and fend off AI-targeted attacks.
However, fully blocking or masking data has a number of pitfalls that hurt productivity. Here are some of the trade-offs:
- Loss of context: Masking, anonymizing, or blocking data strips away its semantic meaning. This degrades the language model’s performance – resulting in hallucinations or reduced accuracy.
- False positives and negatives: AI privacy tools rely on techniques like pattern matching to identify PII from context. The downside is that 100% identification accuracy is nearly impossible, especially for unstructured inputs. Misidentification produces false positives (flagging non-sensitive data as sensitive) and false negatives (missing sensitive data altogether), as the sketch after this list illustrates.
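To make these pitfalls concrete, here is a minimal sketch of Method 1 built on regex pattern matching. The patterns and examples are assumptions for illustration, not any vendor’s actual identification engine:

```python
import re

# Illustrative regex detectors for Method 1 - assumptions for this sketch,
# not a production-grade identification engine.
PII_PATTERNS = {
    "CREDIT_CARD": re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\b\d{10}\b"),
}

def block_or_mask(prompt: str) -> str:
    """Replace anything matching a PII pattern with a generic placeholder."""
    for label, pattern in PII_PATTERNS.items():
        prompt = pattern.sub(f"[{label} REDACTED]", prompt)
    return prompt

# Masking works when the data fits a known pattern...
print(block_or_mask("John Doe's card is 4567-8912-3456-7890."))
# -> John Doe's card is [CREDIT_CARD REDACTED].

# ...but a false negative slips through when the same data is unstructured:
print(block_or_mask("John Doe's card starts four five six seven eight..."))
# -> unchanged, so the card digits (and the name) reach the LLM anyway.

# Note the loss of context too: downstream, the model only sees
# "[CREDIT_CARD REDACTED]" with no usable value to reason about.
```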
Method 2: Identify and enable
The second method is similar to the first but approaches privacy differently. Here, you identify sensitive data, mask it using secure tokens, and preserve the overall data context and format. Replacing sensitive data with tokens maintains confidentiality while enabling the AI to generate more accurate outputs from relevant data.
Using tokens instead of outright masking or anonymization allows AI models to generate accurate output without ever accessing the original dataset. This keeps sensitive data usable across enterprise use cases without compromising its confidentiality.
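Here is a minimal sketch of how such type-preserving tokenization can work. The in-memory vault and helper names are assumptions for illustration, not Protecto’s actual implementation:

```python
import secrets
import string

# Hypothetical in-memory vault standing in for a secure, persistent
# token store: tagged token -> original value.
vault: dict[str, str] = {}

def tokenize(value: str, pii_type: str) -> str:
    """Swap a sensitive value for a random token wrapped in a type tag,
    so the model still knows what *kind* of data it is looking at."""
    if pii_type == "CRD":
        # Preserve the shape of a card number: a random 16-digit string.
        token = "".join(secrets.choice(string.digits) for _ in range(16))
    else:
        # Preserve the rough shape of a name: random letters, spaces kept.
        token = "".join(
            secrets.choice(string.ascii_lowercase) if ch.isalpha() else ch
            for ch in value
        )
    tagged = f"<{pii_type}>{token}</{pii_type}>"
    vault[tagged] = value
    return tagged

def detokenize(text: str) -> str:
    """Restore the original values in the model's output."""
    for tagged, original in vault.items():
        text = text.replace(tagged, original)
    return text

prompt = (
    f"{tokenize('John Michael Doe', 'PER')}'s credit card number is "
    f"{tokenize('4567-8912-3456-7890', 'CRD')}."
)
print(prompt)       # e.g. <PER>wjqa bnvkpcz rtl</PER>'s credit card number is ...
print(detokenize(prompt))  # round-trips back to the original text
```

Because the token keeps its type tag and the rough shape of the original value, the prompt still reads naturally to the model, while `detokenize()` restores the real values once the response comes back.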
For example, AI data guardrails like Protecto use a powerful sensitive data identification engine and smart tokenization techniques to help Indian banks adopt GPT and other LLMs. Unlike regular tokenization, which compromises accuracy, smart tokenization adds context to the masked data.
Let’s understand the differences using a simple example.
| | Without Protecto | With Protecto |
|---|---|---|
| How the AI/LLM sees the data | Without data guardrails, the AI/LLM sees PII like this: “John Michael Doe’s credit card number is 4567-8912-3456-7890.” | With data guardrails, the AI/LLM sees data like this: “`<PER>hsbd eidhf sjfbr</PER>`’s credit card number is `<CRD>4984020658256988804</CRD>`.” |
| How does this method work? | Raw data is fed directly into the OpenAI models. Once added, it permanently stays in the system and cannot be erased or modified. | Raw data is replaced with random values; the name and card details are completely different. Output accuracy is not compromised, because smart tokenization gives the AI sufficient context: the value inside `<PER>` is a person’s name, and the value inside `<CRD>` is a credit card number. |
| PII leakage issues | The data contains two pieces of PII: a name and a credit card number. If malicious actors break into the LLM system, they can identify the data owner by combining the two. Moreover, submitting PII to LLMs without privacy safeguards creates the risk of it being stored or used for future model training. | The data contains no PII. Even if malicious actors gain access to the LLM database, they cannot identify the real individual behind the information. |
| What does this mean for businesses? | Feeding data to LLMs in this format breaches privacy, puts security at risk, and violates regulations like DPDP and ISO 42001. Businesses risk penalties, loss of customer trust, and damage to brand reputation. | Feeding data in smart-tokenized format does not breach privacy, risk security, or violate AI regulations. Businesses stay compliant while using AI to its full potential. |
Using smart tokens allows AI to generate the same quality of output as it would without privacy tools. Banks can use AI to its fullest potential while ensuring that PII never leaves the country – staying fully compliant with DPDP’s data residency provisions.
Protecto is a data security and privacy tool for GenAI/LLM-based use cases. When users enter a prompt, Protecto uses APIs to identify and mask sensitive information in the prompt before sending it to OpenAI. The masked data fed to the AI remains usable for training AI and RAG models.
Once the GenAI model has processed the masked data, Protecto unmasks the results and shares them with the user. This not only keeps your data from leaving your country or region but also ensures that your sensitive data never reaches any public LLM.
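For a sense of what that round trip looks like, here is a hypothetical sketch reusing the `tokenize()`/`detokenize()` helpers from the earlier example; Protecto’s actual APIs will differ, and `mask_prompt()` is an assumed wrapper that scans a prompt and tokenizes whatever PII it finds:

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def safe_completion(raw_prompt: str) -> str:
    # 1. Mask PII locally before anything crosses the border.
    #    mask_prompt() is an assumed wrapper that detects PII in the
    #    prompt and replaces it via tokenize() from the earlier sketch.
    masked_prompt = mask_prompt(raw_prompt)

    # 2. Only tokenized, PII-free text reaches the U.S.-hosted model.
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": masked_prompt}],
    )

    # 3. Unmask the answer locally so the user sees the real values,
    #    using detokenize() from the earlier sketch.
    return detokenize(response.choices[0].message.content)
```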
In addition, Protecto’s sensitive data identification accuracy is among the highest in the industry. This significantly reduces the risk of false positives and negatives, ensuring stronger protection of sensitive data.
We help you comply without compromising quality
DPDP is India’s first comprehensive standalone legal framework governing personal data. Businesses handling PII/PHI are suddenly required to change their existing data security practices. Since compliance is not optional, they need to rethink how they process customers’ data.
DPDP, however, is not a simple regulation. Given that Indian businesses are complying with a dedicated data regulation for the first time, challenges and uncertainties abound. Resolving the lack of clarity over what constitutes sensitive data, adopting privacy-by-design principles, and closing compliance gaps all require expertise and significant system redesign.
Data guardrail tools like Protecto leverage intelligent tokenization to maintain the machine readability of sanitized data, add an extra layer of security, and let you harness data-driven insights while using AI to its full capability. This way, you can secure sensitive data and comply with AI regulations without compromising output quality.
Start a free trial to see how Protecto can help your business across custom use cases or just schedule a demo today.