The Old Scorecard

For much of security history, one metric dominated: recall.

Recall means: of all the sensitive data that exists, how much did you catch? If there are 100 pieces of PII in a document and your system finds 95, your recall is 95 percent.

This made sense in the old security world.

If a firewall missed a real threat, the company had a serious problem.
If it blocked something safe, someone could investigate and fix it.

So the industry learned a simple habit: catch as much as possible, even if it creates extra noise.

Data loss prevention inherited the same mindset. Scan for Social Security numbers. Scan for credit cards. Scan for PII. If something looks sensitive, flag it, quarantine it, or block it.

In that world, a false positive was a review problem. Someone checked the alert. Someone cleared the queue. The system could afford to be noisy.

Then enterprises started feeding sensitive data into AI models. And the old scorecard stopped being enough.

The Cost of a False Positive Has Changed

In traditional security, flagged data often stops somewhere. A person can review it before anything else happens.

In an AI workflow, the data keeps moving.

You detect the PII, you mask it, and the masked text flows straight into the model. The model reasons over what you gave it. It answers questions, drafts responses, makes decisions, and may even call tools.

That means every masking mistake becomes an input mistake.

Mask too little → you leak sensitive data.
Mask too much → you damage the prompt.

There is no analyst sitting in the middle to fix it. The model just trusts what you hand it.

That is why precision becomes just as important as recall.

Precision means: of everything the system flagged as sensitive, how much was actually sensitive?
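To make the two definitions concrete, here is a minimal sketch that computes both from a set of truly sensitive spans and a set of flagged spans. The span offsets and the function name precision_recall are illustrative assumptions, not taken from any real detector.

```python
def precision_recall(gold: set, predicted: set) -> tuple[float, float]:
    """Return (precision, recall) using exact span matches."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical (start, end) character spans in one document.
gold = {(0, 5), (20, 31), (40, 52), (60, 64)}                 # 4 truly sensitive spans
predicted = {(0, 5), (20, 31), (40, 52), (70, 75), (80, 88)}  # 5 flagged spans

precision, recall = precision_recall(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.60 recall=0.75
```

Three of the five flags were real (precision), and three of the four real spans were caught (recall). The two numbers answer different questions about the same run.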

A Simple Example

A support assistant receives this ticket:

“Customer says the April invoice is wrong.”

April is a month. But it can also be a person’s name. A recall-heavy detector may mask it like this:

“Customer says the <PERSON> invoice is wrong.”

Now the model does not know which invoice the customer is asking about.

Nothing was protected. The task was broken.

Why F1 Matters (But Is Still Not Enough)

A good AI data protection system needs three things at once:

Recall: to catch sensitive data.
Precision: to avoid masking useful context.
F1: to show whether both are healthy at the same time.

F1 is the harmonic mean of precision and recall. That matters because it punishes imbalance. You cannot game F1 by pushing recall to 99 percent while ignoring precision.

For example, a detector with 99 percent recall and 20 percent precision lands around 0.33 F1.

That is the math proving the problem. A recall-only scorecard would make 99 percent look excellent. But in an AI workflow, 20 percent precision means four out of every five masked spans were not sensitive at all, so useful context is being stripped out. The model is now reasoning over damaged input.

That is not a small false positive problem. That is an AI quality problem.
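The 0.33 figure follows directly from the harmonic mean formula. The sketch below reproduces it; the balanced 90/90 detector is a hypothetical comparison point, not a number from this article.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


noisy = f1_score(precision=0.20, recall=0.99)     # the recall-heavy detector
balanced = f1_score(precision=0.90, recall=0.90)  # hypothetical comparison
print(f"noisy={noisy:.2f} balanced={balanced:.2f}")  # noisy=0.33 balanced=0.90
```

A plain average of 0.99 and 0.20 would be about 0.60, which hides the problem. The harmonic mean stays close to the weaker of the two numbers.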

But Standard F1 Still Misses Something

Here is the part most teams do not measure.

Many companies, and even customers, calculate F1 based on whether the sensitive span was detected. That is useful. It is much better than recall alone.

But for AI masking, it still does not catch the full problem.

Detecting sensitive data is only the first step. When you mask data for AI, you replace the original text with a label such as <PERSON>, <LOCATION>, <DATE>, <ORG>, or <MONEY>.

The model reads that label. The label becomes the meaning.

So the question is not just:

“Did we detect the sensitive data?”

The better question is:

“Did we detect it and label it correctly?”

Because if the label is wrong, the model understands the sentence wrong.

A Banking Example

Take this sentence from a banking workflow:

“Paris reviewed the loan for Jordan and approved it from the Madrid office.”

In this sentence:

Paris is a person.
Jordan is a person.
Madrid is a location.

All three can pass for place names. Paris and Jordan are the hard cases: here they are people, but a detector that leans on surface form will read them as a city and a country.

A recall-only system may catch all three. Recall looks perfect. But watch what happens if the labels are wrong:

Wrong Labels

“<LOCATION 1> reviewed the loan for <LOCATION 2> and approved it from the <LOCATION 3> office.”

The sensitive data is hidden, but the meaning is gone.

The model can no longer tell who reviewed the loan, who the loan was for, or where it was approved from. Ask it “who approved the loan?” and it has nothing useful. Three different roles collapsed into one label.

Correct Labels

“<PERSON 1> reviewed the loan for <PERSON 2> and approved it from the <LOCATION 1> office.”

Same recall. Same protection. Very different outcome.

The model still understands that two people and one location are involved. It can answer questions. It can route the workflow. It can apply the right policy.

Same recall score. Completely different AI result.

That gap is invisible if recall is the only thing you measure.
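The two masked sentences above can be reproduced with a small sketch. Both runs detect the same three spans; only the assigned types differ. The helper names span and mask and the numbered-placeholder scheme are assumptions made for illustration, not a description of any particular product.

```python
from collections import defaultdict


def span(text: str, word: str, entity_type: str) -> tuple[int, int, str]:
    """Locate the first occurrence of `word` and tag it with an entity type."""
    start = text.index(word)
    return (start, start + len(word), entity_type)


def mask(text: str, entities: list[tuple[int, int, str]]) -> str:
    """Replace each (start, end, type) span with a numbered <TYPE n> placeholder."""
    counters = defaultdict(int)
    pieces, cursor = [], 0
    for start, end, entity_type in sorted(entities):
        counters[entity_type] += 1
        pieces.append(text[cursor:start])
        pieces.append(f"<{entity_type} {counters[entity_type]}>")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


text = "Paris reviewed the loan for Jordan and approved it from the Madrid office."

# Same three detections, two different typings.
wrong = [span(text, w, "LOCATION") for w in ("Paris", "Jordan", "Madrid")]
right = [span(text, "Paris", "PERSON"), span(text, "Jordan", "PERSON"),
         span(text, "Madrid", "LOCATION")]

print(mask(text, wrong))
print(mask(text, right))
```

Both outputs hide every sensitive token, so a span-detection metric scores them identically. Only the second keeps the structure of the sentence.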

Entity Type Errors Are Not Rare

People often assume entity confusion is an edge case. It is not.

Enterprise text is full of ambiguous tokens:

Morgan: a person or a bank.
Chase: a person or a bank.
Jordan: a person or a country.
Phoenix: a person or a city.
Will: a person or just a verb.

Numbers are even harder. A nine-digit number can be an SSN, an account ID, or a routing number. Detecting that it is sensitive is only half the job. Typing it correctly is what determines the policy. You do not treat an SSN the same way you treat an internal account ID. If the type is wrong, the wrong rule gets applied.
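As a rough sketch of why typing drives policy, consider a lookup from entity type to handling rule. The type names, the actions, and the fallback behavior below are all hypothetical; real policies will differ by organization.

```python
# Hypothetical handling rules keyed by entity type.
POLICY_BY_TYPE = {
    "SSN": "block",
    "ACCOUNT_ID": "mask",
    "ROUTING_NUMBER": "mask",
}


def action_for(entity_type: str) -> str:
    """Look up the handling rule; unknown types fall back to the strictest action."""
    return POLICY_BY_TYPE.get(entity_type, "block")


value = "123456789"  # the same nine digits, typed two different ways
as_ssn = action_for("SSN")
as_account = action_for("ACCOUNT_ID")
print(f"{value} as SSN -> {as_ssn}; as ACCOUNT_ID -> {as_account}")
```

The digits never change. The rule that fires depends entirely on the label the detector attached to them.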

Why Type Accuracy Is a Privacy Control

Entity type is not a cosmetic detail. It is load-bearing in three different ways.

1. It Controls Meaning

A model treats <PERSON> and <LOCATION> differently. If you turn a person into a location, you change what the model believes about the input.

2. It Controls Unmasking

Many AI workflows mask data before the model sees it and unmask approved data later. If the system labels data incorrectly, the wrong value can come back in the wrong place.
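One way a wrong label can surface at unmasking time is sketched below, under the assumption that placeholders are stored in a lookup table (called a vault here) and that only approved entity types are restored. The vault layout, the unmask function, and the approval rule are illustrative assumptions.

```python
import re

PLACEHOLDER = re.compile(r"<([A-Z_]+) (\d+)>")


def unmask(text: str, vault: dict[str, str], allowed_types: set[str]) -> str:
    """Restore only placeholders whose entity type is approved for unmasking."""
    def restore(match: re.Match) -> str:
        placeholder, entity_type = match.group(0), match.group(1)
        if entity_type in allowed_types and placeholder in vault:
            return vault[placeholder]
        return placeholder  # not approved: stays masked
    return PLACEHOLDER.sub(restore, text)


# Hypothetical rule: locations may be shown again, people may not.
allowed = {"LOCATION"}

correct_vault = {"<PERSON 1>": "Paris", "<PERSON 2>": "Jordan", "<LOCATION 1>": "Madrid"}
mislabeled_vault = {"<LOCATION 1>": "Paris", "<LOCATION 2>": "Jordan", "<LOCATION 3>": "Madrid"}

safe = unmask("The loan was approved by <PERSON 1> at the <LOCATION 1> office.",
              correct_vault, allowed)
leaky = unmask("The loan was approved by <LOCATION 1> at the <LOCATION 3> office.",
               mislabeled_vault, allowed)
print(safe)   # The loan was approved by <PERSON 1> at the Madrid office.
print(leaky)  # The loan was approved by Paris at the Madrid office.
```

In the second case the unmasking rule worked exactly as written. The person's name came back because it had been filed under the wrong type.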

3. It Controls Policy

A person, a medical condition, a location, and an account number may all have different access rules. If the entity type is wrong, the wrong policy may be applied.

That is not a small accuracy issue. That is a privacy and compliance issue.

The Metric We Actually Hold Ourselves To

For AI workflows, the scorecard needs four layers, not one:

Metric	What It Asks	Why It Matters
Recall	Did we catch the sensitive data?	A miss becomes a data leak.
Precision	Were our detections actually sensitive?	False positives damage the prompt.
F1	Are recall and precision both healthy?	Balances the trade-off.
Entity-level accuracy	Did we label it as the right type?	Preserves meaning for the model.

Entity-level accuracy is the stricter metric most teams skip.

A person labeled as a location should fail.
A month labeled as a person should fail.
An account number labeled as an SSN should fail.

Because in AI, the label is not just a label. It becomes part of the model’s understanding.
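A minimal sketch of the four-layer scorecard, applied to the banking sentence with every entity labeled as a location. Entity-level accuracy is computed here as the share of correctly detected entities that also received the right type; that definition, and keying entities by surface text instead of character offsets, are simplifying assumptions for illustration.

```python
def scorecard(gold: dict[str, str], predicted: dict[str, str]) -> dict[str, float]:
    """Score detections; both dicts map an entity's text to its entity type."""
    matched = gold.keys() & predicted.keys()
    precision = len(matched) / len(predicted) if predicted else 0.0
    recall = len(matched) / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    correctly_typed = sum(gold[key] == predicted[key] for key in matched)
    type_accuracy = correctly_typed / len(matched) if matched else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "entity_type_accuracy": type_accuracy,
    }


gold = {"Paris": "PERSON", "Jordan": "PERSON", "Madrid": "LOCATION"}
predicted = {"Paris": "LOCATION", "Jordan": "LOCATION", "Madrid": "LOCATION"}

result = scorecard(gold, predicted)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Precision, recall, and F1 all come out at 1.00 for this run, while entity-level accuracy is 0.33. The first three metrics cannot see the failure.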

This is the stricter metric we hold ourselves to at Protecto. We are not just recommending this as a better way to think about AI data protection. We run our evaluations this way. We measure recall, precision, and F1, and we evaluate whether sensitive entities are classified into the right type.

Because for AI, blocking the sensitive data is not enough. The model needs the right protected meaning, not just a masked token.

The Takeaway

The firewall era taught security teams to chase recall. That made sense when the main goal was to block threats and send alerts for review.

AI is different.

In AI, masked data flows directly into a model. The model reasons over it. The model acts on it. The model does not know when the masking layer made a mistake.

So AI data protection needs a stricter scorecard:

Recall to prevent leaks.
Precision to prevent over-masking.
F1 to balance both.
Entity-level accuracy to preserve meaning.

Because in AI, masking is not just blocking. Masking is translation.

And a translation that turns a person into a place is wrong, no matter how high your recall is.

Amar Kanagaraj

Founder and CEO of Protecto

Amar Kanagaraj is the Founder and CEO of Protecto, a company focused on securing enterprise data for LLMs, AI agents, and agentic workflows. He is a second-time entrepreneur with 20+ years of experience across engineering, product, AI, go-to-market, and business leadership. Before Protecto, Amar co-founded FileCloud and helped scale it to over $10M ARR as CMO. Earlier in his career, he worked at Sun Microsystems, Booz & Company, and Microsoft Search & AI. He holds an MBA from Carnegie Mellon University and an MS in Computer Science from Louisiana State University.

‘Recall’ Was Enough for Firewalls. AI Needs a Stricter Scorecard
