For much of security history, one metric dominated: recall.
Recall means: of all the sensitive data that exists, how much did you catch?
If there are 100 pieces of PII in a document and your system finds 95, your recall is 95 percent.
This made sense in the old security world. If a firewall missed a real threat, the company had a serious problem. If it blocked something safe, someone could investigate and fix it. So the industry learned a simple habit: catch as much as possible, even if it creates extra noise.
Data loss prevention inherited the same mindset. Scan for Social Security numbers. Scan for credit cards. Scan for PII. If something looks sensitive, flag it, quarantine it, or block it.
In that world, a false positive was often a review problem. Someone checked the alert. Someone cleared the queue. The system could afford to be noisy.
Then enterprises started feeding sensitive data into AI models.
And the old scorecard stopped being enough.
The cost of a false positive changed
In traditional security, flagged data often stops somewhere. A person can review it before anything else happens.
In an AI workflow, the data usually keeps moving. You detect the PII, you mask it, and the masked text flows straight into the model. The model reasons over what you gave it. It answers questions, drafts responses, makes decisions, and may even call tools.
That means every masking mistake becomes an input mistake.
Mask too little and you leak sensitive data. Mask too much and you damage the prompt. There is no analyst sitting in the middle to fix it. The model just trusts what you hand it.
That is why precision becomes just as important as recall.
Precision means: of everything the system flagged as sensitive, how much was actually sensitive?
Here is a simple example.
A support assistant receives this ticket:
“Customer says the April invoice is wrong.”
Here, April is a month. But it can also be a person’s name, so a recall-heavy detector may mask it like this:
“Customer says the <PERSON> invoice is wrong.”
Now the model does not know which invoice the customer is asking about.
Nothing was protected. The task was broken.
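A minimal sketch of how this failure can happen, assuming a detector that matches capitalized tokens against a name list and ignores context. The name list and the `naive_mask` function are hypothetical and exist only for illustration:

```python
import re

# Hypothetical first-name list. The point is that a lookup like this
# has no way to tell a month from a person.
FIRST_NAMES = {"April", "Jordan", "Chase", "Will"}

def naive_mask(text: str) -> str:
    """Mask every capitalized token found in the name list, ignoring context."""
    def replace(match: re.Match) -> str:
        word = match.group(0)
        return "<PERSON>" if word in FIRST_NAMES else word
    return re.sub(r"\b[A-Z][a-z]+\b", replace, text)

ticket = "Customer says the April invoice is wrong."
masked = naive_mask(ticket)
print(masked)  # Customer says the <PERSON> invoice is wrong.
```

The detector never misses a real "April" who is a person, so its recall on names looks strong. The cost shows up only when you count how many of its flags were wrong.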
This is why AI data protection cannot look only at recall. A good system needs recall to catch sensitive data, precision to avoid masking useful context, and F1 to show whether both are healthy at the same time.
F1 is the harmonic mean of precision and recall. That matters because it punishes imbalance. You cannot get a strong F1 score by pushing recall to the top and ignoring precision.
For example, a detector with 99 percent recall and 20 percent precision lands around 0.33 F1.
That is the math proving the problem.
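The arithmetic can be checked in a few lines. This is a small sketch of the standard F1 formula; the 90/90 detector is an illustrative comparison point, not a measured result:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The recall-heavy detector from the text.
recall_heavy = f1_score(precision=0.20, recall=0.99)

# A simple average of the same two numbers, shown for contrast.
simple_average = (0.20 + 0.99) / 2

# An illustrative balanced detector.
balanced = f1_score(precision=0.90, recall=0.90)

print(round(recall_heavy, 2))    # 0.33
print(round(simple_average, 3))  # 0.595
print(round(balanced, 2))        # 0.9
```

A simple average would still report almost 0.60 for the recall-heavy detector. The harmonic mean pulls the score toward the weaker of the two numbers, which is why F1 exposes the imbalance.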
A recall-only scorecard would make 99 percent look excellent. But 20 percent precision means four out of every five masked spans were not sensitive at all. In an AI workflow, that is useful context being removed. The model is now reasoning over damaged input.
That is not a small false positive problem.
That is an AI quality problem.
But the usual F1 score does not catch the full problem
There is another issue most teams do not measure.
Many companies and even customers calculate F1 based on whether the sensitive span was detected. That is useful. It is much better than recall alone.
But for AI masking, it still does not catch the full problem.
Detecting sensitive data is only the first step. When you mask data for AI, you replace the original text with a label such as <PERSON>, <LOCATION>, <DATE>, <ORG>, or <MONEY>.
The model reads that label. The label becomes the meaning.
So the question is not just: did we detect the sensitive data?
The better question is: did we detect it and label it correctly?
Because if the label is wrong, the model understands the sentence wrong.
A simple example
Take this sentence from a banking workflow:
“Paris reviewed the loan for Jordan and approved it from the Madrid office.”
In this sentence, Paris is a person, Jordan is a person, and Madrid is a location.
All three are ambiguous. Paris, Jordan, and Madrid can all look like places.
A recall-only system may catch all three. So recall looks perfect.
But suppose it masks the sentence like this:
“<LOCATION 1> reviewed the loan for <LOCATION 2> and approved it from the <LOCATION 3> office.”
The sensitive data is hidden, but the meaning is gone.
The model can no longer tell who reviewed the loan, who the loan was for, or where it was approved from. Ask it “who approved the loan?” and it has nothing useful. Three different roles collapsed into one label.
Now compare that with correct entity labeling:
“<PERSON 1> reviewed the loan for <PERSON 2> and approved it from the <LOCATION 1> office.”
Same recall. Same protection. Very different outcome.
The model still understands that two people and one location are involved. It can answer questions. It can route the workflow. It can apply the right policy.
Same recall score.
Completely different AI result.
That gap is invisible if recall is the only thing you measure.
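The gap can be made visible with a small scoring sketch. It assumes entities are compared by surface text and type rather than by character offsets, which is a simplification, and the function names `span_recall` and `typed_recall` are hypothetical:

```python
# Gold annotations for the banking sentence: (surface text, entity type).
gold = [("Paris", "PERSON"), ("Jordan", "PERSON"), ("Madrid", "LOCATION")]

# Two hypothetical detector outputs that find the same three spans.
all_locations = [("Paris", "LOCATION"), ("Jordan", "LOCATION"), ("Madrid", "LOCATION")]
correct_types = [("Paris", "PERSON"), ("Jordan", "PERSON"), ("Madrid", "LOCATION")]

def span_recall(gold, predicted):
    """Fraction of gold entities whose text was detected, regardless of type."""
    found = {text for text, _ in predicted}
    return sum(text in found for text, _ in gold) / len(gold)

def typed_recall(gold, predicted):
    """Fraction of gold entities detected with the correct type as well."""
    found = set(predicted)
    return sum(pair in found for pair in gold) / len(gold)

for name, output in [("all_locations", all_locations), ("correct_types", correct_types)]:
    print(name, span_recall(gold, output), round(typed_recall(gold, output), 2))
```

Both outputs score 1.0 on span recall. Under the type-aware score, the all-location output gets credit only for Madrid, one entity out of three, while the correctly typed output stays at 1.0.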
Entity type errors are not rare
People often assume entity confusion is an edge case. It is not.
Enterprise text is full of ambiguous tokens.
Morgan can be a person or a bank. Chase can be a person or a bank. Jordan can be a person or a country. Phoenix can be a person or a city. Will can be a person or just a verb.
Numbers are even harder. A nine-digit number can be an SSN, an account ID, or a routing number. Detecting that it is sensitive is only half the job. Typing it correctly is what determines the policy.
You do not treat an SSN the same way you treat an internal account ID. If the type is wrong, the wrong rule gets applied.
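A sketch of why the type decides the outcome, assuming a simple lookup from entity type to action. The type names, the actions, and the policy table are all hypothetical:

```python
from enum import Enum

class Action(Enum):
    BLOCK = "block"
    MASK = "mask"
    ALLOW = "allow"

# Hypothetical policy table; real rules would come from your own compliance requirements.
POLICY = {
    "SSN": Action.BLOCK,
    "ROUTING_NUMBER": Action.MASK,
    "ACCOUNT_ID": Action.ALLOW,
}

def action_for(entity_type: str) -> Action:
    """Look up the action for a type; unknown types get the most restrictive action."""
    return POLICY.get(entity_type, Action.BLOCK)

# The same nine digits, typed two different ways by the detector.
value = "123456789"
as_ssn = action_for("SSN")
as_account = action_for("ACCOUNT_ID")
print(value, as_ssn.value, as_account.value)  # 123456789 block allow
```

The value never changes. Only the assigned type does, and that alone flips the result from blocked to allowed.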
Why type accuracy is a privacy control
Entity type is not a cosmetic detail. It is load-bearing.
First, it controls meaning. A model treats <PERSON> and <LOCATION> differently. If you turn a person into a location, you change what the model believes about the input.
Second, it controls unmasking. Many AI workflows mask data before the model sees it and unmask approved data later. If the system labels data incorrectly, the wrong value can come back in the wrong place.
Third, it controls policy. A person, a medical condition, a location, and an account number may all have different access rules. If the entity type is wrong, the wrong policy may be applied.
That is not a small accuracy issue.
That is a privacy and compliance issue.
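The unmasking risk can be sketched with a toy mask and unmask round trip. This assumes placeholders of the form `<TYPE n>` and an unmasking rule keyed on entity type. The `mask` and `unmask` helpers and the "locations may be unmasked" rule are hypothetical, not a description of any particular product:

```python
def mask(text, entities):
    """Replace each (surface, type) entity with a numbered placeholder.

    Returns the masked text and a vault mapping placeholder -> original value.
    """
    counters, vault = {}, {}
    for surface, entity_type in entities:
        counters[entity_type] = counters.get(entity_type, 0) + 1
        placeholder = f"<{entity_type} {counters[entity_type]}>"
        text = text.replace(surface, placeholder, 1)
        vault[placeholder] = surface
    return text, vault

def unmask(text, vault, allowed_types):
    """Restore only the placeholders whose type is approved for unmasking."""
    for placeholder, original in vault.items():
        entity_type = placeholder[1:].split(" ")[0]
        if entity_type in allowed_types:
            text = text.replace(placeholder, original)
    return text

sentence = "Paris reviewed the loan for Jordan and approved it from the Madrid office."
right = [("Paris", "PERSON"), ("Jordan", "PERSON"), ("Madrid", "LOCATION")]
wrong = [("Paris", "LOCATION"), ("Jordan", "LOCATION"), ("Madrid", "LOCATION")]

# Hypothetical rule: locations may be unmasked, people may not.
masked_right, vault_right = mask(sentence, right)
masked_wrong, vault_wrong = mask(sentence, wrong)
safe = unmask(masked_right, vault_right, {"LOCATION"})
leaked = unmask(masked_wrong, vault_wrong, {"LOCATION"})
print(safe)
print(leaked)
```

With correct types, only Madrid comes back and both people stay masked. With every entity typed as a location, the same rule restores both names, so the output is the original sentence in full.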
The metric we actually hold ourselves to
We still need recall. A missed entity can become a data leak.
We still need precision. A false positive can damage the prompt and reduce AI accuracy.
We still need F1. It gives a balanced view of recall and precision.
But for AI, we need one more layer: entity-level accuracy.
Entity-level accuracy asks: did we identify the sensitive data and label it as the right type?
A person labeled as a location should fail. A month labeled as a person should fail. An account number labeled as an SSN should fail.
Because in AI, the label is not just a label. It becomes part of the model’s understanding.
This is the stricter metric we hold ourselves to at Protecto.
We are not just recommending this as a better way to think about AI data protection. We actually run evaluations this way. We measure recall, precision, and F1, but we also evaluate whether sensitive entities are classified into the right type.
Because for AI, blocking the sensitive data is not enough.
The model needs the right protected meaning, not just a masked token.
The takeaway
The firewall era taught security teams to chase recall. That made sense when the main goal was to block threats and send alerts for review.
But AI is different.
In AI, masked data flows directly into a model. The model reasons over it. The model acts on it. The model does not know when the masking layer made a mistake.
So AI data protection needs a stricter scorecard.
It needs recall to prevent leaks. It needs precision to prevent over-masking. It needs F1 to balance both. And it needs entity-level accuracy to preserve meaning.
Because in AI, masking is not just blocking.
Masking is translation.
And a translation that turns a person into a place is wrong, no matter how high your recall is.