When Cosine Similarity Works Great, and When It Does Not

Written by
Amar Kanagaraj
Founder and CEO of Protecto

In my last post, I explained the math behind cosine similarity.

Cosine similarity is a powerful search technique. When you are dealing with thousands or millions of chunks, it provides a fast, scalable way to find content conceptually similar to the user’s question.

That is a major breakthrough. Without vector search, modern RAG would be much harder to build.

But the mistake is pushing every retrieval problem into vector search.

That is where practical retrieval starts breaking down.

Imagine a large document that discusses roadmap planning, quarterly goals, customer retention, infrastructure costs, launch timelines, and vendor negotiations, with only one mention of “Project Falcon.”

Now someone searches:

“Project Falcon”

The chunk may not show up.

Not because “Project Falcon” is absent.

Because the whole chunk is compressed into a single vector. Roadmap, retention, infrastructure, launch planning, and vendor discussions become the main signal. “Project Falcon” becomes a tiny part of the average meaning.

This is a form of exact-match failure. The entity is present in the text, but the retrieval system still misses it because the entity’s signal gets diluted inside the larger chunk. I call this entity dilution.
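Entity dilution is easy to reproduce with a toy model. The sketch below is not a real embedding model: it assumes a chunk vector is simply the average of one-hot word vectors, and the names embed and cosine are made up for illustration. It shows how the same two-word entity scores lower as unrelated words pile up around it in the chunk.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding (assumption): the average of one-hot word vectors."""
    words = text.lower().split()
    counts = Counter(words)
    return {word: count / len(words) for word, count in counts.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

query = "project falcon"
focused_chunk = "project falcon kickoff notes"
diluted_chunk = (
    "roadmap planning quarterly goals customer retention infrastructure costs "
    "launch timelines vendor negotiations project falcon"
)

focused_score = cosine(embed(query), embed(focused_chunk))
diluted_score = cosine(embed(query), embed(diluted_chunk))
print(f"focused chunk: {focused_score:.3f}")  # 0.707
print(f"diluted chunk: {diluted_score:.3f}")  # 0.378
```

Both chunks contain the exact phrase. Only the amount of surrounding text changed. Real embedding models are far more sophisticated than word averaging, but the pressure is similar: one vector has to represent everything in the chunk.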

And it is one of the most common reasons enterprise RAG systems fail.

Why Exact-Match Failures Happen

Cosine similarity is useful, but it is still geometry. It measures the angle between two vectors in an embedding space: how closely they point in the same direction, regardless of their length.

That works well when the goal is conceptual similarity.

For example, if a user searches for “customer churn,” a document that says “users are leaving the platform” may still be a good match. The words are different, but the idea is similar. Vector search is very good at this.

But in enterprise use cases, the question often goes further.

Users do not only ask:

“What related information might help answer this?”

They also ask:

“What exact thing is being referenced?”

That is a different problem.

If the user asks for a contract ID, username, customer name, ticket number, legal clause, project codename, or policy violation tied to a named person, conceptual similarity is not enough.

These are precision problems.

Embeddings are not optimized for exact symbolic fidelity. They compress information. During compression, specific entities can get diluted inside a broader context.

That is why a chunk containing “Project Falcon” can still fail to appear when the user searches for “Project Falcon.”

The data is there. The vector representation just made the signal too weak.

Where Cosine Similarity Works Well

Cosine similarity works extremely well when the user wants related ideas.

Questions like these are good fits:

“Find similar support tickets.”

“Find documents about pricing strategy.”

“Find conversations similar to this customer complaint.”

“Find related incidents.”

In these cases, the system does not need exactness. It needs thematic closeness. Vector search helps find content even when the wording changes.

That is why semantic retrieval became so powerful. It handles paraphrasing, synonyms, tone variation, and natural language ambiguity far better than traditional keyword search.

But enterprise retrieval is not only semantic retrieval.

That is where many architectures go wrong.

Where Cosine Similarity Breaks Down

Enterprise systems often need deterministic retrieval.

For example:

  • Find every document that mentions a specific customer.
  • Retrieve a contract by contract ID.
  • Find references to an internal project codename.
  • Locate exact legal clauses.
  • Query policy violations tied to a named employee.
  • Search for a ticket number or case ID.

These are not “similarity” questions. They are exactness questions.

A vector model may understand that “John Smith” and “employee onboarding” appear in related contexts. But when the user asks, “Show me documents for John Smith specifically,” the system has to retrieve John Smith documents, not generally related onboarding content.

At enterprise scale, exact references matter.

Semantic similarity alone cannot guarantee deterministic recall.
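One way to get deterministic recall is to keep an exact-match index next to the vector index. The sketch below is a minimal illustration, not a production design. The regular expressions, the ID formats such as CONTRACT-1042, and the function names are all hypothetical. Real entity extraction would usually rely on a proper NER step or on structured source fields.

```python
import re
from collections import defaultdict

# Hypothetical patterns for this sketch; real ID formats differ per organization.
ENTITY_PATTERNS = [
    re.compile(r"\bProject [A-Z][a-z]+\b"),      # codenames such as "Project Falcon"
    re.compile(r"\b(?:TICKET|CONTRACT)-\d+\b"),  # made-up ticket and contract IDs
]

def extract_entities(text):
    """Return the lowercased entities found in a piece of text."""
    found = set()
    for pattern in ENTITY_PATTERNS:
        found.update(match.group(0).lower() for match in pattern.finditer(text))
    return found

def build_entity_index(chunks):
    """Map each entity to the set of chunk IDs that mention it."""
    index = defaultdict(set)
    for chunk_id, text in chunks.items():
        for entity in extract_entities(text):
            index[entity].add(chunk_id)
    return index

def lookup(index, entity):
    """Exact lookup: every chunk that mentions the entity, or an empty list."""
    return sorted(index.get(entity.lower(), set()))

chunks = {
    "c1": "Roadmap planning, quarterly goals, vendor negotiations, and Project Falcon.",
    "c2": "Renewal terms are defined in CONTRACT-1042.",
    "c3": "General notes on launch timelines and infrastructure costs.",
}
index = build_entity_index(chunks)
print(lookup(index, "Project Falcon"))  # ['c1']
print(lookup(index, "CONTRACT-1042"))   # ['c2']
```

The lookup does not depend on how much other text surrounds the entity. If the chunk mentions it, the chunk is returned.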

Why Hybrid Retrieval Exists

This is why production-grade retrieval systems combine multiple techniques.

Not because vector search is bad.

Because vector search answers only one question:

“What is close?”

Enterprise retrieval has to answer more:

  • “What exactly matches?”
  • “What belongs to the right scope?”
  • “What specific entity is being referenced?”
  • “How are these things connected?”
  • “What is this user allowed to see?”

These are different retrieval dimensions. Treating all of them as vector similarity creates brittle systems.

A better retrieval system is a pipeline.

i) Vector search finds conceptually related content. If the user asks, “Which customers are at risk of churn?”, the relevant document may say, “Usage has dropped, renewal confidence is low, and the account has not adopted key features.” The wording is different, but the concept is related. Vector search helps find that.

ii) Keyword search finds exact matches. If the user asks for “Project Falcon,” the system should find documents that mention Project Falcon exactly. A vector search may return general roadmap or product planning content, but the exact codename matters.

iii) Entity extraction preserves important objects. Project names, customer names, employee names, ticket IDs, contract IDs, product names, legal clauses, and regulatory terms should not be lost in a large vector. They need to be preserved as first-class retrieval signals.

iv) Metadata filtering narrows the scope. If the user asks for “Q4 renewal risks for healthcare customers in Europe,” semantic similarity is not enough. The system also needs customer segment, region, time period, and status.

v) Relationship mapping helps when the answer is spread across documents.

For example:

  • Project Delta is an on-premises product initiative.
  • Project Delta uses Component X.
  • Component X is part of Platform Y.
  • Platform Y has a security limitation.

Now the user asks:

“What Platform Y risks affect Project Delta?”

There may be no single document that directly says, “Platform Y risks affect Project Delta.”

One document describes Project Delta. Another describes Component X. Another explains that Component X depends on Platform Y. Another describes the security limitation in Platform Y.

Pure vector search may miss this because no single chunk is directly similar to the full question.

A relationship layer or knowledge graph can connect the facts: Project Delta -> uses Component X. Component X -> is part of Platform Y. Platform Y -> has a security limitation.

Now the system can retrieve the right context even when the answer spans multiple documents.

That is more than similarity search.
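The relationship layer in this example can be sketched as a small graph. This is a minimal illustration that uses the three facts from the example as hard-coded edges. The function names and the breadth-first traversal are my assumptions for the sketch, not a description of any specific knowledge graph product.

```python
from collections import deque

# Facts from the example above, stored as (subject, relation, object) edges.
EDGES = [
    ("Project Delta", "uses", "Component X"),
    ("Component X", "is part of", "Platform Y"),
    ("Platform Y", "has", "security limitation"),
]

def build_graph(edges):
    """Adjacency list: subject -> list of (relation, object)."""
    graph = {}
    for subject, relation, obj in edges:
        graph.setdefault(subject, []).append((relation, obj))
    return graph

def find_path(graph, start, goal):
    """Breadth-first search; returns the hops from start to goal, or None."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return None

graph = build_graph(EDGES)
path = find_path(graph, "Project Delta", "Platform Y")
risks = graph.get("Platform Y", [])

for subject, relation, obj in path:
    print(f"{subject} -> {relation} {obj}")
print("Platform Y facts:", risks)
```

Each hop in the path points back to the document that stated it, so the retriever can pull all of those chunks into the context, even though none of them resembles the full question on its own.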

Why Re-ranking Matters

Initial retrieval is usually about recall. It brings back a broad set of possible matches.

But the top vector match is not always the best evidence.

To understand why, it helps to see how vector search and re-ranking are different.

Vector search is built for scale. The system precomputes a vector for every chunk and stores it in an index. At query time, it creates a vector for the user’s question and quickly compares it against the stored vectors using cosine similarity.

That is fast because the document vectors are already computed.

But it also means the query and the chunk were encoded separately. The system is comparing two compressed representations.
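A minimal sketch of that split between index time and query time is below. The three-dimensional vectors are made up for illustration, since a real system would get them from an embedding model, and the VectorIndex class is hypothetical. A production index would also use approximate nearest-neighbor search rather than scanning every vector.

```python
import math

def normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)

class VectorIndex:
    """Hypothetical brute-force index: vectors are normalized once, at index time."""

    def __init__(self):
        self.entries = []  # list of (chunk_id, unit_vector)

    def add(self, chunk_id, vector):
        self.entries.append((chunk_id, normalize(vector)))

    def search(self, query_vector, k):
        """Query time: one dot product per stored vector, then keep the top k."""
        q = normalize(query_vector)
        scored = [
            (chunk_id, sum(a * b for a, b in zip(q, vector)))
            for chunk_id, vector in self.entries
        ]
        return sorted(scored, key=lambda item: item[1], reverse=True)[:k]

# Index time: done once, before any query arrives. Vectors are invented values.
index = VectorIndex()
index.add("pricing", [0.9, 0.1, 0.0])
index.add("churn", [0.1, 0.9, 0.2])
index.add("security", [0.0, 0.2, 0.9])

# Query time: the query vector is computed on its own and compared to stored vectors.
results = index.search([0.2, 0.8, 0.1], k=2)
print([chunk_id for chunk_id, _ in results])  # ['churn', 'pricing']
```

Notice that the chunk vectors never see the query. Whatever the chunk vector failed to capture at index time cannot be recovered at query time.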

A re-ranker works differently. A common approach is a cross-encoder. Instead of comparing two precomputed vectors, it looks at the query and the retrieved chunk together. It can examine the tokens in both and score whether the chunk actually helps answer the question.

In simple terms, a re-ranker asks:

“Does this retrieved chunk actually help answer this question?”

That is a different question from cosine similarity.

Cosine similarity asks:

“Is this chunk close to the query in vector space?”

Those two questions can produce very different rankings.

So why not use a re-ranker directly instead of vector search?

Because re-ranking cannot be precomputed. A cross-encoder has to read the query and each chunk together at query time, which is far more expensive than comparing precomputed vectors. That is not practical across millions of chunks.

But it is practical after the initial retrieval has already narrowed the search space.

So the flow is:

  1. Vector search, keyword search, and metadata filters retrieve a smaller candidate set, say 50 chunks.
  2. The re-ranker examines the query and each candidate chunk together.
  3. It reorders the chunks based on usefulness, not just similarity.

That is why re-ranking matters.

Vector search finds possible matches.

Re-ranking finds better evidence.

Suppose a user asks:

“What is the launch risk for Project Delta?”

Vector search may return:

  • A chunk that repeats “Project Delta launch” several times.
  • A meeting note saying the security review is blocked.
  • A support ticket showing Component X is delayed.
  • A roadmap note saying the launch is planned for June.
  • A customer email saying the on-prem requirement is mandatory.

The first chunk may have the highest cosine score because the words match closely. But the best answer may require the meeting note, the support ticket, and the customer email.

The best answer is not always the most similar text.

It is the most useful evidence.

Then the LLM can answer using the strongest retrieved context, rather than relying on chunks that merely looked closest in vector space.
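The retrieve-then-re-rank flow can be sketched with the Project Delta example. Everything here is a stand-in so the sketch runs without a model. The first stage uses word overlap instead of embeddings and omits the keyword search step for brevity. The second stage uses a hand-written heuristic where a real system would call a trained cross-encoder. The chunk texts, metadata fields, and function names are hypothetical.

```python
import re

STOP_WORDS = {"what", "is", "the", "for"}

def tokens(text):
    """Lowercased word set without a few stop words."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOP_WORDS

def first_stage_score(query, text):
    """Stand-in for cosine similarity over embeddings: Jaccard word overlap."""
    q, t = tokens(query), tokens(text)
    return len(q & t) / len(q | t) if q | t else 0.0

# Stand-in for a cross-encoder (assumption). A real re-ranker is a trained model
# that reads the query and the chunk together; this heuristic only mimics that.
RISK_TERMS = {"blocked", "delayed", "mandatory"}

def rerank_score(query, text):
    if "risk" not in tokens(query):
        return 0.0
    return float(len(tokens(text) & RISK_TERMS))

def retrieve(query, chunks, filters, k):
    """Step 1: apply metadata filters, then keep the k closest chunks."""
    scoped = [c for c in chunks
              if all(c["meta"].get(key) == value for key, value in filters.items())]
    return sorted(scoped, key=lambda c: first_stage_score(query, c["text"]),
                  reverse=True)[:k]

def rerank(query, candidates, top_n):
    """Steps 2 and 3: score each (query, chunk) pair and reorder by usefulness."""
    return sorted(candidates, key=lambda c: rerank_score(query, c["text"]),
                  reverse=True)[:top_n]

delta = {"project": "Project Delta"}
chunks = [
    {"id": "c1", "meta": delta, "text": "Project Delta launch plan. Project Delta launch checklist."},
    {"id": "c2", "meta": delta, "text": "Meeting note: the security review is blocked."},
    {"id": "c3", "meta": delta, "text": "Support ticket: Component X is delayed."},
    {"id": "c4", "meta": delta, "text": "Roadmap note: the launch is planned for June."},
    {"id": "c5", "meta": delta, "text": "Customer email: the on-prem requirement is mandatory."},
    {"id": "c6", "meta": {"project": "Project Gamma"}, "text": "Project Gamma launch risk review."},
]

query = "What is the launch risk for Project Delta?"
candidates = retrieve(query, chunks, filters=delta, k=5)
best = rerank(query, candidates, top_n=3)
print("closest chunk:", candidates[0]["id"])       # c1
print("best evidence:", [c["id"] for c in best])   # ['c2', 'c3', 'c5']
```

The chunk that repeats the query words wins the first stage, but it drops out once each candidate is scored against the question. The metadata filter also removes the Project Gamma chunk before any scoring happens, even though its wording overlaps with the query.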

The Takeaway

Cosine similarity should be treated as a strong signal in the retrieval pipeline, not as the entire retrieval system.

Enterprise retrieval is not only about finding what is close. It is also about finding what is exact, scoped, entity-aware, connected, useful, allowed, and answerable.

The best systems use vector search where it shines and combine it with keyword search, metadata, entity awareness, relationship mapping, re-ranking, and policy filtering where precision and control matter.

One critical layer we have not gone deep into yet is the policy layer.

In enterprise RAG, retrieval cannot be separated from access control. The system should not only ask, “What is the most relevant content?” It must also ask, “Is this user allowed to use this content for this task?”

That is the topic for the next post.

In Part 3, I will go deeper into why the policy layer must be part of the retrieval step, not bolted on after the answer is generated, and why GPTGuard MCP, our retrieval layer for enterprise RAG, includes a built-in policy layer alongside retrieval.

Amar Kanagaraj is the Founder and CEO of Protecto, a company focused on securing enterprise data for LLMs, AI agents, and agentic workflows. He is a second-time entrepreneur with 20+ years of experience across engineering, product, AI, go-to-market, and business leadership. Before Protecto, Amar co-founded FileCloud and helped scale it to over $10M ARR as CMO. Earlier in his career, he worked at Sun Microsystems, Booz & Company, and Microsoft Search & AI. He holds an MBA from Carnegie Mellon University and an MS in Computer Science from Louisiana State University.
