Recap: Where We Left Off
In the last post, I explained the math behind cosine similarity, why it works, and why it became the foundation of modern semantic search.
Cosine similarity is a powerful retrieval technique. When you are dealing with thousands or millions of chunks, it provides a fast, scalable way to find content conceptually similar to a user’s question. That is a major breakthrough. Without vector search, modern RAG would be much harder to build.
But here is a mistake many teams make: pushing every retrieval problem into vector search.
That is where practical retrieval starts breaking down. In this post, I will walk through where cosine similarity quietly fails in enterprise systems, why it fails, and what a production-grade retrieval pipeline actually looks like.
The Problem: Entity Dilution
Imagine a large document that discusses:
- Roadmap planning
- Quarterly goals
- Customer retention
- Infrastructure costs
- Launch timelines
- Vendor negotiations
with only one mention of “Project Falcon.”
Now someone searches:
“Project Falcon”
The chunk may not show up.
Not because “Project Falcon” is absent. It is right there in the text.
It fails because the entire chunk is compressed into a single vector. Roadmap, retention, infrastructure, launch planning, and vendor discussions become the dominant signal. “Project Falcon” becomes a tiny part of the average meaning.
This is a form of exact-match failure. The entity is present in the text, but the retrieval system still misses it because the entity’s signal gets diluted inside the larger chunk.
I call this entity dilution, and it is one of the most common reasons enterprise RAG systems fail in production.
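The effect is easy to reproduce with a toy model. The sketch below uses word-count vectors as a stand-in for a real embedding model. That is an assumption made purely for illustration, since dense embeddings behave differently in detail, and the chunk texts are invented. It shows the same two-word query scoring much lower against a chunk that buries the entity among other topics.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': word counts. A stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

query = embed("project falcon")

# A short chunk that is mostly about the entity.
focused = embed("project falcon kickoff notes")

# A long chunk that mentions the entity once among many other topics.
diluted = embed(
    "roadmap planning quarterly goals customer retention infrastructure costs "
    "launch timelines vendor negotiations project falcon"
)

focused_score = cosine(query, focused)
diluted_score = cosine(query, diluted)
print(f"focused chunk: {focused_score:.2f}")  # focused chunk: 0.71
print(f"diluted chunk: {diluted_score:.2f}")  # diluted chunk: 0.38
```

Both chunks contain the exact phrase, yet the second one scores roughly half as high. In a real index with many competing chunks, that gap can be enough to push it out of the top results.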
Why Exact-Match Failures Happen
Cosine similarity is useful, but it is still geometry. It measures the angle between two vectors in embedding space, that is, how closely they point in the same direction.
That works well when the goal is conceptual similarity.
For example, if a user searches for “customer churn,” a document that says “users are leaving the platform” may still surface as a strong match. The words are different, but the idea is similar. Vector search is excellent at this.
But in enterprise use cases, the question often goes further. Users do not only ask:
“What related information might help answer this?”
They also ask:
“What exact thing is being referenced?”
That is a different problem.
If the user asks for a contract ID, username, customer name, ticket number, legal clause, project codename, or policy violation tied to a named person, conceptual similarity is not enough. These are precision problems.
Embeddings are not optimized for exact symbolic fidelity. They compress information, and during compression, specific entities can get diluted inside a broader context.
That is why a chunk containing “Project Falcon” can still fail to appear when the user searches for “Project Falcon.” The data is there. The vector representation just made the signal too weak.
Where Cosine Similarity Works Well
Before going further, let me be clear: cosine similarity is not the problem. It is excellent when used for the right job.
It works extremely well when the user wants related ideas. Good fits include:
- “Find similar support tickets.”
- “Find documents about pricing strategy.”
- “Find conversations similar to this customer complaint.”
- “Find related incidents.”
In these cases, the system does not need exactness. It needs thematic closeness. Vector search handles paraphrasing, synonyms, tone variation, and natural language ambiguity far better than traditional keyword search ever could.
That is why semantic retrieval became so powerful. The mistake is assuming this is all enterprise retrieval needs.
Where Cosine Similarity Breaks Down
Enterprise systems often need deterministic retrieval.
For example:
- Find every document that mentions a specific customer.
- Retrieve a contract by contract ID.
- Find references to an internal project codename.
- Locate exact legal clauses.
- Query policy violations tied to a named employee.
- Search for a ticket number or case ID.
These are not “similarity” questions. They are exactness questions.
A vector model may understand that “John Smith” and “employee onboarding” appear in related contexts. But when the user asks, “Show me documents for John Smith specifically,” the system has to retrieve John Smith’s documents, not generally related onboarding content.
At enterprise scale, exact references matter. Semantic similarity alone cannot guarantee deterministic recall.
Why Hybrid Retrieval Exists
This is why production-grade retrieval systems combine multiple techniques. Not because vector search is bad, but because vector search answers only one question:
“What is close?”
Enterprise retrieval has to answer more:
- “What exactly matches?”
- “What belongs to the right scope?”
- “What specific entity is being referenced?”
- “How are these things connected?”
- “What is this user allowed to see?”
These are different retrieval dimensions. Treating all of them as vector similarity creates brittle systems.
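Combining signals usually means merging ranked lists from separate retrievers. One common option, among several, is reciprocal rank fusion. The sketch below is a minimal version under stated assumptions: the chunk IDs and the two input rankings are invented, and k=60 is a conventional default rather than a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs into one ranking.

    Each list contributes 1 / (k + rank) for every chunk it contains,
    so chunks that rank well in more than one list rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)

# Hypothetical outputs from two retrievers for the same query.
vector_ranking = ["roadmap_overview", "planning_notes", "falcon_status"]
keyword_ranking = ["falcon_status", "vendor_contract"]

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print(fused[0])  # falcon_status
```

The chunk that only placed third in the vector ranking ends up first, because the keyword retriever also found it.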
A better retrieval system is a pipeline. Here are the five layers that make it work.
1. Vector Search: Finds Conceptually Related Content
If the user asks, “Which customers are at risk of churn?”, the relevant document may say, “Usage has dropped, renewal confidence is low, and the account has not adopted key features.”
The wording is different, but the concept is related. Vector search shines here.
2. Keyword Search: Finds Exact Matches
If the user asks for “Project Falcon,” the system should find documents that mention Project Falcon exactly. A vector search may return general roadmap or product planning content, but the exact codename matters.
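A minimal sketch of exact phrase matching, assuming a small in-memory inverted index. The chunk IDs and texts are invented, and a production system would more likely rely on a search engine with BM25 scoring than on hand-rolled code like this.

```python
import re
from collections import defaultdict

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(chunks: dict[str, str]) -> dict[str, set[str]]:
    """Map each token to the set of chunk IDs that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for chunk_id, text in chunks.items():
        for token in tokenize(text):
            index[token].add(chunk_id)
    return index

def phrase_search(phrase: str, chunks: dict[str, str],
                  index: dict[str, set[str]]) -> list[str]:
    """Return IDs of chunks that contain the phrase as adjacent, ordered tokens."""
    terms = tokenize(phrase)
    if not terms:
        return []
    # Step 1: cheap lookup of chunks that contain every term somewhere.
    candidates = set.intersection(*(index.get(term, set()) for term in terms))
    # Step 2: confirm the terms appear next to each other, in order.
    hits = []
    for chunk_id in sorted(candidates):
        tokens = tokenize(chunks[chunk_id])
        n = len(terms)
        if any(tokens[i:i + n] == terms for i in range(len(tokens) - n + 1)):
            hits.append(chunk_id)
    return hits

chunks = {
    "c1": "Roadmap planning, quarterly goals, vendor negotiations. Project Falcon slips to Q3.",
    "c2": "The falcon mascot project was approved.",
    "c3": "Infrastructure costs and launch timelines.",
}
index = build_inverted_index(chunks)
print(phrase_search("Project Falcon", chunks, index))  # ['c1']
```

Chunk c2 contains both words but not the phrase, so it is excluded. Chunk c1 is found no matter how much unrelated text surrounds the codename.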
3. Entity Extraction: Preserves Important Objects
Project names, customer names, employee names, ticket IDs, contract IDs, product names, legal clauses, and regulatory terms should not be lost inside a large vector. They need to be preserved as first-class retrieval signals.
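One way to keep entities as first-class signals is to extract them at indexing time and store them in their own lookup table. The sketch below assumes simple regular expressions are enough. The ID formats (TCK-1042, CN-2024-017) and the "Project <Name>" convention are hypothetical, and real systems often use a trained named-entity recognizer or a curated dictionary instead.

```python
import re
from collections import defaultdict

# Hypothetical entity formats, chosen only for illustration.
ENTITY_PATTERNS = {
    "ticket_id": re.compile(r"\bTCK-\d+\b"),
    "contract_id": re.compile(r"\bCN-\d{4}-\d+\b"),
    "project": re.compile(r"\bProject [A-Z][a-z]+\b"),
}

def extract_entities(text: str) -> set[tuple[str, str]]:
    """Return (entity_type, surface_form) pairs found in the text."""
    found = set()
    for label, pattern in ENTITY_PATTERNS.items():
        for match in pattern.findall(text):
            found.add((label, match))
    return found

def build_entity_index(chunks: dict[str, str]) -> dict[tuple[str, str], set[str]]:
    """Map each extracted entity to the chunk IDs that mention it."""
    index: dict[tuple[str, str], set[str]] = defaultdict(set)
    for chunk_id, text in chunks.items():
        for entity in extract_entities(text):
            index[entity].add(chunk_id)
    return index

chunks = {
    "c1": "Project Falcon is blocked by TCK-1042.",
    "c2": "Contract CN-2024-017 covers Project Delta.",
}
entity_index = build_entity_index(chunks)
print(entity_index[("project", "Project Falcon")])  # {'c1'}
print(entity_index[("contract_id", "CN-2024-017")])  # {'c2'}
```

The lookup is deterministic: if the entity was extracted from a chunk, that chunk is returned, regardless of what else the chunk discusses.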
4. Metadata Filtering: Narrows the Scope
If the user asks for “Q4 renewal risks for healthcare customers in Europe,” semantic similarity is not enough. The system also needs customer segment, region, time period, and status.
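A minimal sketch of scope filtering, assuming each chunk carries a flat metadata dictionary. The field names (segment, region, quarter) and the sample chunks are invented, and in practice the filter would normally be pushed down into the vector store or search engine rather than applied in application code.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    chunk_id: str
    text: str
    metadata: dict[str, str] = field(default_factory=dict)

def filter_by_metadata(chunks: list[Chunk], required: dict[str, str]) -> list[Chunk]:
    """Keep only chunks whose metadata matches every required key/value pair."""
    return [
        chunk for chunk in chunks
        if all(chunk.metadata.get(key) == value for key, value in required.items())
    ]

chunks = [
    Chunk("r1", "Renewal confidence is low for the hospital network account.",
          {"segment": "healthcare", "region": "Europe", "quarter": "Q4"}),
    Chunk("r2", "Renewal risk flagged for a retail account.",
          {"segment": "retail", "region": "Europe", "quarter": "Q4"}),
    Chunk("r3", "Renewal risk flagged for a clinic group.",
          {"segment": "healthcare", "region": "North America", "quarter": "Q4"}),
]

scope = {"segment": "healthcare", "region": "Europe", "quarter": "Q4"}
in_scope = filter_by_metadata(chunks, scope)
print([chunk.chunk_id for chunk in in_scope])  # ['r1']
```

All three chunks are semantically close to "renewal risks," but only one belongs to the scope the user asked about.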
5. Relationship Mapping: Connects Facts Across Documents
This one is often missed. Consider this scenario:
- Project Delta is an on-premises product initiative.
- Project Delta uses Component X.
- Component X is part of Platform Y.
- Platform Y has a security limitation.
Now the user asks:
“What Platform Y risks affect Project Delta?”
There may be no single document that directly says, “Platform Y risks affect Project Delta.” One document describes Project Delta. Another describes Component X. Another explains the Component X to Platform Y dependency. Another describes the security limitation in Platform Y.
Pure vector search may miss this because no single chunk is directly similar to the full question.
A relationship layer or knowledge graph can connect the facts:
Project Delta → uses Component X → is part of Platform Y → has a security limitation.
Now the system can retrieve the right context even when the answer spans multiple documents. That is more than similarity search.
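The chain above can be sketched as a small graph traversal. This assumes the relationships have already been extracted into (relation, target) edges, which is the hard part in practice. The adjacency dictionary and the breadth-first search are illustrative only, not a recommendation for a particular graph store.

```python
from collections import deque

# Hypothetical edges, stored as (relation, target) pairs per source node.
GRAPH = {
    "Project Delta": [("uses", "Component X")],
    "Component X": [("is part of", "Platform Y")],
    "Platform Y": [("has", "security limitation")],
}

def find_path(graph, start, goal):
    """Breadth-first search. Returns a list of (source, relation, target) hops,
    or None when the goal cannot be reached from the start node."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return None

# How is Project Delta connected to Platform Y?
path = find_path(GRAPH, "Project Delta", "Platform Y")
# What does Platform Y itself carry as a risk?
risks = [target for relation, target in GRAPH["Platform Y"] if relation == "has"]

for source, relation, target in path:
    print(f"{source} -> {relation} -> {target}")
print("Platform Y risks:", risks)
```

Each hop in the path points back to the document it was extracted from, so the retriever can pull all of those chunks into the context even though none of them resembles the question on its own.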
Why Re-Ranking Matters
Initial retrieval is usually about recall. It brings back a broad set of possible matches. But the top vector match is not always the best evidence.
To understand why, it helps to see how vector search and re-ranking are different.
How Vector Search Works
Vector search is built for scale. The system precomputes a vector for every chunk and stores it in an index. At query time, it creates a vector for the user’s question and quickly compares it against the stored vectors using cosine similarity.
That is fast because the document vectors are already computed. But it also means the query and the chunk were encoded separately. The system is comparing two compressed representations.
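A minimal sketch of that split between index time and query time. The three-dimensional vectors are hand-written placeholders for real embeddings, and the brute-force scan stands in for the approximate nearest-neighbor index a production system would typically use.

```python
import math

def normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else vector

# Index time: embed every chunk once and store the normalized vector.
# These vectors are invented placeholders, not output from a real model.
CHUNK_VECTORS = {
    "churn_note": [0.9, 0.1, 0.0],
    "pricing_doc": [0.1, 0.9, 0.2],
    "incident_report": [0.0, 0.2, 0.9],
}
INDEX = {chunk_id: normalize(vec) for chunk_id, vec in CHUNK_VECTORS.items()}

def search(query_vector: list[float], index: dict[str, list[float]], k: int = 2):
    """Query time: embed the query separately, then compare against stored vectors."""
    q = normalize(query_vector)
    scored = [
        (sum(a * b for a, b in zip(q, vec)), chunk_id)
        for chunk_id, vec in index.items()
    ]
    scored.sort(reverse=True)
    return [(chunk_id, score) for score, chunk_id in scored[:k]]

results = search([0.8, 0.2, 0.1], INDEX)
print([chunk_id for chunk_id, _ in results])  # ['churn_note', 'pricing_doc']
```

Note that the query vector never sees the chunk text. The only interaction between the two is a single dot product.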
How Re-Ranking Works
A re-ranker works differently. A common approach is a cross-encoder. Instead of comparing two precomputed vectors, it looks at the query and the retrieved chunk together. It examines the tokens in both and scores whether the chunk actually helps answer the question.
In simple terms, a re-ranker asks:
“Does this retrieved chunk actually help answer this question?”
That is a different question from cosine similarity, which asks:
“Is this chunk close to the query in vector space?”
Those two questions can produce very different rankings.
Why Not Use Re-Ranking Directly?
Because a cross-encoder has to read the query and each chunk together at query time, nothing can be precomputed. That makes it far more expensive than cosine-similarity search, and it is not practical across millions of chunks.
But it is practical after the initial retrieval has already narrowed the search space.
So the flow becomes:
- Vector search, keyword search, and metadata filters retrieve a smaller candidate set (say, 50 chunks).
- The re-ranker examines the query and each candidate chunk together.
- It reorders the chunks based on usefulness, not just similarity.
That is why re-ranking matters. Vector search finds possible matches. Re-ranking finds better evidence.
A Concrete Example
Suppose a user asks:
“What is the launch risk for Project Delta?”
Vector search may return:
- A chunk that repeats “Project Delta launch” several times.
- A meeting note saying the security review is blocked.
- A support ticket showing Component X is delayed.
- A roadmap note saying the launch is planned for June.
- A customer email saying the on-prem requirement is mandatory.
The first chunk may have the highest cosine score because the words match closely. But the best answer may require the meeting note, the support ticket, and the customer email.
The best answer is not always the most similar text. It is the most useful evidence.
Once re-ranking has promoted that evidence, the LLM can answer using the strongest retrieved context, rather than relying on chunks that merely looked closest in vector space.
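The re-ranking step for this example can be sketched as follows. The scoring function here is a hand-written keyword heuristic standing in for a trained cross-encoder, which is a large simplification. It is only meant to show the shape of the step: the scorer sees the query and the chunk together, and the candidate order changes as a result. The chunk IDs and texts are invented to mirror the list above.

```python
import re

# Words treated as risk evidence by the toy scorer (an assumption for this demo).
RISK_SIGNALS = {"blocked", "delayed", "mandatory"}

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def toy_relevance(query: str, chunk: str) -> float:
    """Placeholder for a cross-encoder: scores a (query, chunk) pair jointly."""
    query_terms, chunk_terms = tokens(query), tokens(chunk)
    if "risk" in query_terms:
        # For a risk question, reward chunks that state a concrete blocker.
        return float(len(RISK_SIGNALS & chunk_terms))
    return float(len(query_terms & chunk_terms))

def rerank(query, candidates, score_pair, top_n=3):
    """Reorder (chunk_id, text) candidates by the pairwise score, best first."""
    scored = [(score_pair(query, text), chunk_id) for chunk_id, text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable for ties
    return [chunk_id for _, chunk_id in scored[:top_n]]

# Candidates in the order vector search returned them.
candidates = [
    ("status_update", "Project Delta launch update: Project Delta launch prep continues."),
    ("meeting_note", "Security review for Project Delta is blocked."),
    ("support_ticket", "Component X is delayed."),
    ("roadmap_note", "Launch is planned for June."),
    ("customer_email", "On-prem requirement is mandatory."),
]

top_chunks = rerank("What is the launch risk for Project Delta?", candidates, toy_relevance)
print(top_chunks)  # ['meeting_note', 'support_ticket', 'customer_email']
```

The chunk that repeats the query wording drops out of the top three, and the three chunks that actually describe a risk move up.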
The Takeaway
Cosine similarity should be treated as a strong signal in the retrieval pipeline, not as the entire retrieval system.
Enterprise retrieval is not only about finding what is close. It is also about finding what is:
- Exact: the right entity, ID, or reference
- Scoped: filtered by region, time, segment, or status
- Entity-aware: preserves names, codenames, and identifiers
- Connected: links facts across multiple documents
- Useful: re-ranked for actual evidence value
- Allowed: respects access policies
- Answerable: produces context an LLM can reason over
The best systems use vector search where it shines, and combine it with keyword search, metadata, entity awareness, relationship mapping, and re-ranking where precision and control matter.