Tip
TL;DR: Bi-encoder vector similarity is good at pulling the right documents into a candidate pool and mediocre at ordering the top few. A cross-encoder reranker reads query and document together and fixes the ordering. The pattern is: overfetch N candidates with the vector store, rerank them with a cross-encoder, keep the top K for the prompt. This walkthrough shows how to wire it up on a hypothetical support-docs RAG, what the latency budget looks like, and which reranker families are worth trying.
Key takeaways:
- Bi-encoders and cross-encoders are not competitors. They sit at different stages — bi-encoder for recall across the whole corpus, cross-encoder for precise ordering within a small candidate pool.
- Most bi-encoder-only RAG systems have a latent quality ceiling. The right document is often in the top 50 but not the top 5, and the generator never sees it.
- Overfetch-N is the hyperparameter that matters most. Too small and you miss candidates; too large and latency balloons without improving quality.
- Reranker choice is a three-way tradeoff between quality, latency, and cost. There is no universal winner.
- Reranking does not fix chunking, embedding quality, or missing documents. It reorders what is already there. Diagnose your bottleneck before adding a pass.
- Evaluate reranking with metrics — top-K recall, nDCG, and downstream answer accuracy. See RAGAS Evaluation for the broader eval harness.
Why bi-encoder similarity hits a ceiling
Every RAG pipeline starts with a retriever, and almost every retriever starts with a bi-encoder embedding model. The query and each document get mapped to vectors independently, and similarity between vectors (cosine, dot product) proxies for relevance. This works — until you look closely at the top of the result list.
The architecture itself is the ceiling. A bi-encoder encodes each side of the match in isolation. It produces a single vector per query and a single vector per document, and those two vectors have to capture enough information about the full text that a simple dot product ranks them correctly. Fine-grained signal — "this exact phrase in the query is answered by this exact sentence in the document" — gets smeared into the average. The bi-encoder knows the document is roughly relevant. It does not know how relevant it is compared to nine other roughly-relevant documents.
The practical consequence is familiar. You look at your RAG system's failures and see a pattern: the generator is producing weak answers, and when you dig into the retrieved context, the right document is in position 6 or position 12, while positions 1-5 are full of almost-but-not-quite matches. The generator cannot recover from a truncated context that excludes the actual answer. That is not a generator problem. That is a ranking problem at the top of the retrieval list.
You can throw a bigger embedding model at this, and it helps a bit. You can tune chunk sizes — see chunking — and that also helps. You can add hybrid search with BM25, which helps on keyword-ish queries. But the bi-encoder architecture remains the bottleneck at the top of the list. What you actually need is a scoring function that reads the query and the candidate together.
The cross-encoder upgrade
A cross-encoder is the architectural opposite of a bi-encoder. It concatenates the query and the candidate document into a single input — something like [CLS] query [SEP] document [SEP] — and runs that joint input through a transformer. The output is a scalar relevance score. Because both sides go through the attention layers together, the model can directly compare tokens across them. "Does this phrase in the document actually answer the specific thing this query is asking?" is a question a cross-encoder can answer. A bi-encoder cannot.
The catch is obvious. A bi-encoder precomputes document vectors at index time, so retrieval is just nearest-neighbor search in vector space. A cross-encoder cannot precompute anything — each (query, document) pair has to be scored at query time. Running that over a ten-million-document corpus per query is not a real option. Running it over 50 or 100 candidates already surfaced by a fast bi-encoder is.
That is the whole pattern. Bi-encoder for the first-stage recall across the corpus. Cross-encoder reranker for the second-stage precision within a small candidate pool. Two stages, two tools, each doing what it is good at.
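The two-stage pattern can be sketched in a few lines. This is an illustrative toy, not production code: the bi-encoder is stood in for by bag-of-words cosine similarity and the cross-encoder by query-token overlap, and the names (`retrieve_then_rerank`, `overfetch_n`, `keep_k`) are hypothetical. In a real system the first scorer would be an embedding model plus a vector index, and the second a transformer reranker.

```python
import math
from collections import Counter

def embed(text):
    # Stand-in bi-encoder: a bag-of-words vector. A real system would
    # call an embedding model here and store the result in a vector DB.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cross_score(query, doc):
    # Stand-in cross-encoder: fraction of query tokens present in the doc.
    # A real reranker runs the concatenated pair through a transformer.
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def retrieve_then_rerank(query, corpus, overfetch_n=50, keep_k=5):
    # Stage 1: bi-encoder recall across the whole corpus.
    qv = embed(query)
    candidates = sorted(
        corpus, key=lambda d: cosine(qv, embed(d)), reverse=True
    )[:overfetch_n]
    # Stage 2: cross-encoder precision within the small candidate pool.
    return sorted(
        candidates, key=lambda d: cross_score(query, d), reverse=True
    )[:keep_k]
```

The structure is the point, not the scoring functions: stage 1 touches every document cheaply, stage 2 touches only `overfetch_n` documents expensively.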
Reranking is how you spend a tractable amount of extra inference to buy a material quality jump — especially for queries where recall was fine but ordering was off.
Worked example: a hypothetical support-docs RAG
Hypothetical scenario, not a shipped product. A SaaS company runs a RAG assistant over its help center — roughly 8,000 articles spanning pricing, setup, integrations, and troubleshooting. The user asks:
"How do I rotate the API key for our webhook integration without breaking in-flight deliveries?"
The initial pipeline is bi-encoder-only. A 1,024-dimensional embedding model indexes all articles at 400-token chunk size. At query time, the system embeds the question, pulls the top 5 chunks from the vector database, and passes them to the generator.
Here is what the top 5 (illustratively) looks like before any reranking:
- "Using API keys with our REST endpoints" — adjacent topic, scores high because of token overlap on "API key."
- "Authentication basics" — general auth article, high similarity on auth terminology.
- "Webhooks: an overview" — matches "webhook" but not rotation or in-flight deliveries.
- "Rate limits for API consumers" — marginal match.
- "Troubleshooting 401 errors" — adjacent but not the target.
The actual target article — "Rotating webhook API keys with graceful deploy" — sits at position 11 in the bi-encoder ranking. It is in the candidate pool if you look far enough down, but not in the context window. The generator, working from the top 5, produces a confident-sounding answer about API key rotation in general and misses the whole "in-flight deliveries" part. Top-5 recall on the eval set is (illustratively) 0.62.
Now add the reranker. Change the retrieval step to pull 50 candidates instead of 5, pass those 50 through a cross-encoder reranker, keep the top 5 by rerank score.
After reranking, the same query (illustratively) produces:
- "Rotating webhook API keys with graceful deploy" — exact target, rescued from position 11.
- "Webhook delivery guarantees and retries" — directly relevant to "in-flight deliveries."
- "Key rotation best practices" — supporting context.
- "Webhooks: an overview" — kept for general context.
- "Using API keys with our REST endpoints" — demoted from position 1 to position 5.
Top-5 recall on the eval set jumps from (illustratively) 0.62 to 0.88. End-to-end answer accuracy, measured with an LLM-as-judge on a golden set, improves by roughly 14 points. Added latency: around 150-250ms per query, depending on the reranker.
Two things to notice. First, the target document was already in the candidate pool after the bi-encoder — rerank could not have rescued it otherwise. That is the invariant: reranking fixes ordering, not recall. Second, the bi-encoder's top 5 was not wrong in any gross sense. Every document it returned was related to API keys. The cross-encoder's improvement is specifically about matching the intent of the query, not just its tokens.
The overfetch-N knob
The one hyperparameter worth thinking about carefully is N — the size of the candidate pool fed to the reranker.
Three forces are in tension. Quality goes up with N, up to a point. Latency goes up linearly with N. Cost goes up linearly with N. The question is where the quality curve flattens relative to your latency and cost budget.
Some practical guidance from the structure of the problem:
- N too small (e.g., N=5). You have not given the reranker enough to work with. If the right document is not in the top 5 of the bi-encoder, it will not be in the top 5 after rerank either. You are paying for a pass that cannot help.
- N moderate (e.g., N=25 to N=50). This is the sweet spot for most production systems. Big enough to include the documents the bi-encoder demoted unfairly, small enough that latency stays manageable.
- N too large (e.g., N=200). Quality gains flatten — you are mostly reranking noise — while latency and cost keep climbing. You are paying for scores on documents the cross-encoder will rank low anyway.
K — the number of documents you keep after reranking — is a separate knob tied to your prompt's context budget and the generator's ability to digest noise. K=3 to K=10 is typical. Smaller K reduces noise in the prompt; larger K is safer when any single chunk might be incomplete.
A useful rule: start with N=50 and K=5, measure top-K recall and downstream answer quality, then adjust. The goal is to find the smallest N where top-K recall plateaus. Anything larger is waste.
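That plateau search can be automated against an eval set. A minimal sketch, assuming you already have per-query bi-encoder rankings and one gold document per query; `sweep_overfetch` and `rerank_fn` are hypothetical names, and in practice `rerank_fn` would call your cross-encoder rather than a toy scorer.

```python
def top_k_recall(ranked_lists, gold, k):
    # Fraction of queries whose gold document appears in the top k.
    hits = sum(1 for r, g in zip(ranked_lists, gold) if g in r[:k])
    return hits / len(gold)

def sweep_overfetch(bi_rankings, gold, rerank_fn, k=5, n_grid=(5, 10, 25, 50, 100)):
    # For each candidate-pool size N: truncate the bi-encoder ranking to N,
    # rerank that pool, and measure top-K recall of the reranked list.
    # The smallest N where recall plateaus is the one to ship.
    results = {}
    for n in n_grid:
        reranked = [rerank_fn(r[:n]) for r in bi_rankings]
        results[n] = top_k_recall(reranked, gold, k)
    return results
```

If recall at N=50 and N=100 is identical, the extra 50 candidates are pure latency and cost.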
Choosing a reranker family
Several reranker families are worth knowing about as of 2026. Each solves the same problem with different tradeoffs.
BGE-reranker (open source). BAAI's BGE reranker family is the common open-source default. It comes in small and base sizes and runs locally on a GPU. The appeal is simple: no API dependency, no per-query cost, and the quality is strong enough for most production workloads. The cost is that you operate the model — GPU capacity, queue management, updates. For teams already running model inference, this is usually a small marginal cost.
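Wiring a local reranker is a few lines. A sketch, not a definitive implementation: the wrapper accepts any object exposing a `predict(pairs)` method, which matches the `CrossEncoder` class in the sentence-transformers library, and the model name in the comment is an assumption to verify against BAAI's current releases.

```python
def rerank_local(query, docs, model, top_k=5, batch_size=32):
    # Score (query, doc) pairs with any scorer exposing .predict(pairs),
    # e.g. sentence_transformers.CrossEncoder. Returns (doc, score) pairs
    # sorted by descending relevance, truncated to top_k.
    pairs = [(query, d) for d in docs]
    scores = model.predict(pairs, batch_size=batch_size)
    return sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)[:top_k]

# Illustrative wiring (model name is an assumption; check BAAI's releases):
#   from sentence_transformers import CrossEncoder
#   model = CrossEncoder("BAAI/bge-reranker-base")  # uses GPU if available
#   top5 = rerank_local("rotate webhook api key", candidates, model)
```

Keeping the model injectable also makes the reranker trivially swappable, which matters given the "reversible choice" point below.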
Cohere Rerank (managed API). Cohere's rerank endpoint is the common managed-service choice. You send the query and candidate list to an API, get back scored candidates. The appeal is operational: no GPU to manage, versioned model, good latency under their hosting. The cost is per-query fees and the usual tradeoffs of a managed dependency — you are subject to their pricing changes and their model updates.
Voyage Rerank (managed API). Voyage's reranker is another managed option with a similar API shape. The pitch is quality on specific domain benchmarks Voyage maintains, though as with any vendor benchmark, the honest move is to test on your own corpus rather than trust the leaderboard. Operationally similar to Cohere — you trade per-query cost for not running a model.
ColBERT and late-interaction models (research-leaning). ColBERT uses a middle-ground architecture: it encodes query and document separately like a bi-encoder but keeps per-token vectors, then computes a "late interaction" similarity that compares tokens across the two sides. This gets closer to cross-encoder quality at lower latency, but the index is much larger — you are storing per-token vectors per document, not one vector per document. For some teams, the storage tradeoff is worth it. For most, it is a research-flavored option that sits outside the usual cross-encoder vs. bi-encoder dichotomy.
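The late-interaction score itself is simple to state. A pure-Python sketch of ColBERT-style MaxSim: each query token vector takes its best cosine match among the document's token vectors, and those maxima sum into the score. Real ColBERT uses learned contextual token embeddings and optimized batch kernels; the vectors here are just placeholders for whatever an encoder produces.

```python
import math

def _cos(a, b):
    # Cosine similarity between two plain-list vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def maxsim(query_vecs, doc_vecs):
    # ColBERT-style late interaction: each query token vector grabs its
    # best-matching doc token vector; the document score is the sum of
    # those per-token maxima.
    return sum(max(_cos(q, d) for d in doc_vecs) for q in query_vecs)
```

Note the storage implication visible right in the signature: `doc_vecs` is a matrix per document, not one vector, which is exactly why late-interaction indexes are larger.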
The honest guidance: pick whichever you can ship fastest, evaluate on your own corpus, and treat reranker choice as reversible. Unlike an embedding model swap (which forces you to reindex), a reranker swap is a call-site change.
Failure modes
Four anti-patterns worth flagging before you ship.
Adding a reranker to fix a recall problem. If your bi-encoder is not surfacing the right document in the top N, a reranker cannot help — there is nothing to rescue. This is the most common misdiagnosis. If top-50 recall is low, fix chunking, try a better embedding model, add hybrid search, or consider HyDE — reranking is the wrong layer.
Picking N by guessing. Teams often default to N=10 or N=20 because it sounds reasonable, then declare the reranker "didn't help." The reranker cannot reorder documents that were never in the candidate pool. Measure at N=50 and N=100 at least once before concluding anything.
Ignoring latency in production. Rerank latency adds to every query, not just the slow ones. If your end-to-end p95 budget is 800ms and your reranker eats 300ms, that is a hard constraint on the rest of the pipeline. Measure, budget, and be willing to reduce N to fit.
Treating reranker scores as truth. Cross-encoder scores are more accurate than bi-encoder scores, but they are not ground truth. They are another model's opinion. Evaluate the full pipeline with RAGAS, semantic search quality metrics, and user-facing outcomes — not by staring at rerank scores.
Our position
Five opinionated stances.
- Every serious RAG system should ship with a reranker. Bi-encoder-only retrieval has a known ceiling that a cross-encoder fixes for modest latency. The cases where it does not help (pure-recall failures, voice-latency systems) are the minority. If you are shipping to real users at production scale, rerank is part of the stack.
- Start with N=50, K=5, and measure. Any other starting point is a guess. These numbers fit most workloads, leave room to tune, and give the reranker enough candidates to work with. Adjust from data, not intuition.
- Reranker choice is reversible. Embedding model choice is not. Spend your architectural care on the embedding and chunking layers — they force reindexes. A reranker swap is a call-site change. Pick something reasonable, ship it, swap later if needed.
- Evaluate reranking with downstream metrics, not rerank scores. Top-K recall, answer faithfulness, and user outcomes matter. Rerank score distributions are diagnostic, not a KPI. See RAGAS Evaluation for the full harness.
- Reranking is a context-engineering discipline, not a prompt trick. It shapes what the model sees, which is the whole game. Treat it as part of your context engineering maturity — see the Context Engineering Maturity Model for where rerank sits in the broader picture, and the Agentic Prompt Stack for how retrieval layers compose with agent loops.