For the first two years of Protecto, we did what most teams in this space do. We built our own Named Entity Recognition models.
It made sense at the time. NER models had been the industry standard for identifying sensitive data. With enough training data, tuning, and handwritten rules, you could get solid performance on structured PII: names, emails, phone numbers, Social Security numbers, and account identifiers.
But over time, we ran into a wall.
Not because NER was wrong. Because the problem had changed.
NER was the right solution for the previous generation of data problems. We used it, learned from it, and are grateful for what it taught us about the problem space.
But enterprise AI has changed the shape of the problem. Data is messier. Context matters more. Sensitivity depends on who is asking, why they are asking, and what they plan to do with the answer.
The Data Changed First
Enterprise data today does not look like the data NER was designed for.
It is not clean. It is not labeled. It is not even consistently written.
Here is a synthetic version of a snippet from a customer support system we were asked to protect:
“John D. called about his claim #4456. He mentioned his SSN ends in 7821 and wants a callback at 415-332-1298.”
Read that as a human, and you immediately see the sensitive data: a name, a claim number, a partial SSN, and a phone number spread across a casually written note.
Our NER model? It caught the phone number and the name. It missed the partial SSN and the claim number. It had no idea this was an insurance context where the combination of these fragments is a compliance risk.
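To make that failure mode concrete, here is a minimal pattern-based stand-in for this kind of pipeline. It is an illustrative sketch, not our production model, but it shows why anything outside an anticipated format slips through:

```python
import re

# Pattern-based detection, standing in for a classic NER/rules pipeline.
PATTERNS = {
    "NAME": re.compile(r"\b[A-Z][a-z]+ [A-Z]\."),   # "John D."
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),  # full US phone format
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),    # full-format SSNs only
}

note = ("John D. called about his claim #4456. He mentioned his SSN "
        "ends in 7821 and wants a callback at 415-332-1298.")

for label, pattern in PATTERNS.items():
    for match in pattern.finditer(note):
        print(label, match.group())

# Prints only NAME "John D." and PHONE "415-332-1298". The partial SSN
# ("ends in 7821") and the claim number never match, because no pattern
# anticipates that phrasing. Every miss means another handwritten rule.
```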
This was not an outlier. Enterprise data is messy in ways that are hard to appreciate until you are knee-deep in it:
- Typos and shorthand that would make a spell-checker cry (“govt approvel pendng for pt xfer”)
- Mixed languages in the same prompt (“El paciente Juan Reyes tiene cita el martes. Please update his insurance ID and confirm the copay amount before the visit.”)
- Context scattered across multiple lines where no single sentence contains enough information to classify on its own
- Implied meaning where the sensitivity is not in what is said, but in what it means
We kept improving it. More training data. More rules. More entity types. Every time we fixed one case, another emerged. Each iteration was expensive.
That was the first signal that something fundamental needed to change.
Then the Definition of “Sensitive” Changed
Here is what really broke the model.
It was no longer just about PII.
Over the past 18 months, the requests from enterprise customers shifted dramatically. They no longer just asked us to “find all the Social Security numbers.” They started asking things like:
- “Flag anything that reveals our pricing strategy to external partners.”
- “Detect internal risk signals in customer escalation threads.”
- “Prevent confidential instructions from leaking into third-party AI agents.”
- “Mask data differently depending on whether the viewer is a contractor or a full-time employee.”
This is not a list of entities. This is a question of context and intent.
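The last request in that list is telling: the same data needs different treatment for different viewers. That is a policy lookup, not an entity match. As a hypothetical sketch (field names and roles invented for illustration), the policy side starts to look like this:

```python
# Hypothetical policy fragment: the same field is treated differently
# depending on who is viewing it. Names and roles are illustrative.
MASKING_POLICY = {
    "pricing_margin": {
        "full_time_employee": "show",
        "contractor": "mask",
        "external_partner": "block",
    },
    "customer_ssn": {
        "full_time_employee": "mask",
        "contractor": "block",
        "external_partner": "block",
    },
}

def action_for(field: str, viewer_role: str) -> str:
    """Resolve the action for a field/viewer pair; default to deny."""
    return MASKING_POLICY.get(field, {}).get(viewer_role, "block")

print(action_for("pricing_margin", "contractor"))  # -> "mask"
```

No entity model, however accurate, can answer this question on its own, because the answer depends on the viewer, not the tokens.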
Let me give you a concrete example that changed how I think about this problem.
Consider this sentence:
“Let’s align the offer with Tier 1 partners at the 22% margin we discussed.”
There is zero PII here. No names, no emails, no phone numbers. A traditional NER system gives this a clean bill of health.
But if this sentence appears in an email thread that is about to be shared with one of those Tier 1 partners? It just leaked your pricing strategy.
Now consider this sentence in an internal strategy document shared only with the VP of Sales. Same sentence. Completely fine.
The sensitivity is not in the tokens. It is in who is reading them, why, and where.
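You can see the shift in the interface itself. Here is a sketch, with invented fields and a deliberately toy rule, of what a context-aware classifier has to accept as input:

```python
from dataclasses import dataclass

@dataclass
class RequestContext:
    text: str
    audience: str      # e.g. "internal", "tier1_partner"
    channel: str       # e.g. "email", "strategy_doc"
    viewer_role: str   # e.g. "vp_sales", "contractor"

def is_sensitive(ctx: RequestContext) -> bool:
    # Toy rule: the same tokens flip from safe to sensitive
    # the moment they are headed outside the company.
    mentions_margin = "margin" in ctx.text.lower()
    return mentions_margin and ctx.audience != "internal"

sentence = "Let's align the offer with Tier 1 partners at the 22% margin we discussed."
print(is_sensitive(RequestContext(sentence, "tier1_partner", "email", "account_exec")))  # True
print(is_sensitive(RequestContext(sentence, "internal", "strategy_doc", "vp_sales")))    # False
```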
NER models classify tokens. They do not reason about intent, audience, or policy. Asking NER to solve this problem is like asking a spell-checker to evaluate the quality of your argument. It is the wrong tool for the job.
That was the second signal.
Large Language Models Changed the Equation
Around the same time, large language models crossed a threshold that mattered for our use case.
They were not perfect. But they could do something fundamentally different from NER: they could understand.
Go back to that messy customer support note:
“John D. called about his claim #4456. He mentioned his SSN ends in 7821 and wants a callback at 415-332-1298.”
When we fed this through an LLM-based classifier, it correctly inferred:
- A person is present (“John D.”)
- The domain context is insurance (claim reference)
- There are sensitive identifiers: a partial SSN (“ends in 7821”) and a claim number
- There is contact information: a phone number
It did this without custom training data for this format. It understood that a partial SSN combined with a name and claim number is a higher risk than any one of those elements alone. It understood because understanding language is what these models do.
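For illustration, the call looks roughly like this. `call_llm` is a stand-in for whatever model endpoint you use (including one deployed in your own infrastructure), and the prompt and output schema are simplified assumptions, not our production prompt:

```python
import json

PROMPT = """Classify sensitive information in the text below.
Return JSON with: domain, people, identifiers, contact_info,
and an overall_risk of low/medium/high. Consider combinations:
fragments that are safe alone may be risky together.

Text: {text}"""

def classify(text: str, call_llm) -> dict:
    raw = call_llm(PROMPT.format(text=text), temperature=0)
    return json.loads(raw)

# Expected shape of the result for the support note above (abridged):
# {"domain": "insurance",
#  "people": ["John D."],
#  "identifiers": ["claim #4456", "partial SSN ending 7821"],
#  "contact_info": ["415-332-1298"],
#  "overall_risk": "high"}
```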
Here is another example that made the case internally at Protecto.
A financial services client sent us this Slack message from their operations channel:
“hey @team just fyi the Thompson account is flagged again, same issue as Q3. Lisa is handling but we might need to loop in compliance if the exposure is above the 2M threshold”
There are three distinct pieces of sensitive information in this message:
- Personal information (“the Thompson account,” “Lisa is handling”)
- Regulatory signal (“flagged again,” “compliance,” “exposure”)
- Financial details (“2M threshold”)
Our old NER model caught “Thompson” and “Lisa” as names. Two obvious entities, nothing more. It missed the actual risk signal entirely, because regulatory flags and financial thresholds are not “entities” in the NER sense. They are contextual signals that only become sensitive when you understand the full picture.
The LLM caught all three.
We realized we had a choice. Keep investing engineering effort into building and maintaining our own NER models. Or leverage the reasoning capabilities of large language models, deployed within the customer’s own infrastructure, for what they are uniquely good at: understanding context.
We chose the second path.
But Language Models Alone Are Not Enough
I want to be honest about this, because the “just use an LLM” crowd glosses over real problems.
Moving to language models did not solve the problem. It changed the problem.
These models are powerful, but they are not deterministic:
- Run the same prompt twice, and you might get different classifications
- They have no concept of your enterprise’s specific policies
- They can be confidently wrong, which is risky in compliance workflows
- Their internal safety controls over-flag sarcasm and under-flag subtle financial signals
- Picking the right model requires significant experimentation to balance performance, speed, and output variability
- You need proper evals to track drift of the agentic classification over time
Enterprises do not need “usually right.” They need predictable, explainable, policy-driven behavior. A CISO does not want to hear “the model sometimes catches sensitive data.” That is not how compliance works.
So the real engineering challenge became: How do you take a probabilistic reasoning engine and make it behave like a reliable enterprise component?
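One pattern we found useful, sketched here in simplified form: pin the temperature, validate every response against a schema, and require agreement across runs before trusting a label. The keys and thresholds below are illustrative:

```python
import json
from collections import Counter

REQUIRED_KEYS = {"domain", "identifiers", "overall_risk"}

def reliable_classify(text: str, call_llm, runs: int = 3) -> dict:
    results = []
    for _ in range(runs):
        parsed = json.loads(call_llm(text, temperature=0))
        if REQUIRED_KEYS <= parsed.keys():  # schema check before counting
            results.append(parsed)
    if not results:
        raise ValueError("no valid classification produced")
    # Majority vote on the risk label; disagreement is surfaced, not hidden.
    votes = Counter(r["overall_risk"] for r in results)
    label, count = votes.most_common(1)[0]
    if count <= runs // 2:
        raise ValueError(f"classifier unstable across runs: {dict(votes)}")
    return next(r for r in results if r["overall_risk"] == label)
```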
From Detection to Agentic Data Classification
This is where our architecture fundamentally evolved.
We stopped thinking of the problem as “PII detection.” A single detection pass was not enough for the kind of sensitive data enterprises needed to identify and control.
We started building a system that can reason through messy input, understand sensitivity in context, apply enterprise policy, validate the result, and produce outputs that downstream systems can reliably use.
We call this agentic data classification.
By agentic, we mean the classification process is no longer a single model prediction. It is a structured workflow. The system examines the input, reasons about the domain and context, identifies sensitive signals, applies policy, checks the output for consistency, and then decides what action should be taken.
The language model is the reasoning engine. The system around it is what makes it predictable, controllable, and usable in production. It can also be deployed within the customer’s own infrastructure, so sensitive data does not need to leave their environment.
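Condensed to code, the workflow has this shape. Every function body here is a toy stand-in for a much richer component, so read it as structure, not implementation:

```python
def infer_domain(text: str) -> str:
    # Stage 1: examine the input and infer domain/context (stubbed).
    return "insurance" if "claim" in text.lower() else "general"

def detect_signals(text: str, domain: str) -> list[dict]:
    # Stage 2: the LLM reasoning step in production; a stub here.
    return [{"type": "partial_ssn", "span": "ends in 7821"}] if "SSN" in text else []

def apply_policy(signal: dict, viewer_role: str) -> str:
    # Stage 3: enterprise policy decides the action per signal.
    return "show" if viewer_role == "compliance_officer" else "mask"

def agentic_classify(text: str, viewer_role: str) -> dict:
    domain = infer_domain(text)
    signals = detect_signals(text, domain)
    actions = [apply_policy(s, viewer_role) for s in signals]
    assert len(actions) == len(signals)  # Stage 4: consistency check
    # Stage 5: emit a result downstream systems can act on.
    return {"domain": domain, "signals": signals, "actions": actions}

note = "He called about his claim #4456 and said his SSN ends in 7821."
print(agentic_classify(note, "contractor"))
# {'domain': 'insurance', 'signals': [...], 'actions': ['mask']}
```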
Where We Are Now
Today, we still use models to identify sensitive data. But we do not rely on any single model, and we certainly do not rely on NER alone.
We use language models for understanding, wrapped in a structured system that ensures consistency, control, and compliance. No customer data leaves their infrastructure. The engineering effort that used to go into training and maintaining custom NER models now goes into:
| Old focus | New focus |
| --- | --- |
| Training custom NER models | Building agentic orchestration around language models |
| Writing entity-specific rules | Building context-aware classification logic |
| Fixing edge cases one by one | Making outputs predictable and auditable |
| Supporting one data type at a time | Scaling across enterprise workflows |
| Focusing on entity lists | Enforcing policy and access control |
The result is not just better detection. It is a system that enterprises can actually deploy in production, with the confidence that it will behave consistently and the transparency to prove it.
Final Thought
When the problem moves beyond detection, you need understanding, policy enforcement, and a system that ties it all together.
That is why we moved to agentic data classification. And that is where we believe the industry is headed.