Tip
TL;DR: Bi-encoder vector similarity is good at pulling the right documents into a candidate pool and mediocre at ordering the top few. A cross-encoder reranker reads query and document together and fixes the ordering. The pattern is: overfetch N candidates with the vector store, rerank them with a cross-encoder, keep the top K for the prompt. This walkthrough shows how to wire it up on a hypothetical support-docs RAG, what the latency budget looks like, and which reranker families are worth trying.
Key takeaways:
- Bi-encoders and cross-encoders are not competitors. They sit at different stages — bi-encoder for recall across the whole corpus, cross-encoder for precise ordering within a small candidate pool.
- Most bi-encoder-only RAG systems have a latent quality ceiling. The right document is often in the top 50 but not the top 5, and the generator never sees it.
- Overfetch-N is the hyperparameter that matters most. Too small and you miss candidates; too large and latency balloons without improving quality.
- Reranker choice is a three-way tradeoff between quality, latency, and cost. There is no universal winner.
- Reranking does not fix chunking, embedding quality, or missing documents. It reorders what is already there. Diagnose your bottleneck before adding a pass.
- Evaluate reranking with metrics — top-K recall, nDCG, and downstream answer accuracy. See RAGAS Evaluation for the broader eval harness.
Why bi-encoder similarity hits a ceiling
Every RAG pipeline starts with a retriever, and almost every retriever starts with a bi-encoder embedding model. The query and each document get mapped to vectors independently, and similarity between vectors (cosine, dot product) proxies for relevance. This works — until you look closely at the top of the result list.
The architecture itself is the ceiling. A bi-encoder encodes each side of the match in isolation. It produces a single vector per query and a single vector per document, and those two vectors have to capture enough information about the full text that a simple dot product ranks them correctly. Fine-grained signal — "this exact phrase in the query is answered by this exact sentence in the document" — gets smeared into the average. The bi-encoder knows the document is roughly relevant. It does not know how relevant it is compared to nine other roughly-relevant documents.
The practical consequence is familiar. You look at your RAG system's failures and see a pattern: the generator is producing weak answers, and when you dig into the retrieved context, the right document is in position 6 or position 12, while positions 1-5 are full of almost-but-not-quite matches. The generator cannot recover from a truncated context that excludes the actual answer. That is not a generator problem. That is a ranking problem at the top of the retrieval list.
You can throw a bigger embedding model at this, and it helps a bit. You can tune chunk sizes — see chunking — and that also helps. You can add hybrid search with BM25, which helps on keyword-ish queries. But the bi-encoder architecture remains the bottleneck at the top of the list. What you actually need is a scoring function that reads the query and the candidate together.
The cross-encoder upgrade
A cross-encoder is the architectural opposite of a bi-encoder. It concatenates the query and the candidate document into a single input — something like [CLS] query [SEP] document [SEP] — and runs that joint input through a transformer. The output is a scalar relevance score. Because both sides go through the attention layers together, the model can directly compare tokens across them. "Does this phrase in the document actually answer the specific thing this query is asking?" is a question a cross-encoder can answer. A bi-encoder cannot.
The catch is obvious. A bi-encoder precomputes document vectors at index time, so retrieval is just nearest-neighbor search in vector space. A cross-encoder cannot precompute anything — each (query, document) pair has to be scored at query time. Running that over a ten-million-document corpus per query is not a real option. Running it over 50 or 100 candidates already surfaced by a fast bi-encoder is.
That is the whole pattern. Bi-encoder for the first-stage recall across the corpus. Cross-encoder reranker for the second-stage precision within a small candidate pool. Two stages, two tools, each doing what it is good at.
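The two-stage pattern can be sketched in a few lines. This is an illustrative toy, not production code: the bi-encoder is stood in for by bag-of-words cosine similarity and the cross-encoder by query-token overlap, and the names (`retrieve_then_rerank`, `overfetch_n`, `keep_k`) are hypothetical. In a real system the first scorer would be an embedding model plus a vector index, and the second a transformer reranker.

```python
import math
from collections import Counter

def embed(text):
    # Stand-in bi-encoder: a bag-of-words vector. A real system would
    # call an embedding model here and store the result in a vector DB.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cross_score(query, doc):
    # Stand-in cross-encoder: fraction of query tokens present in the doc.
    # A real reranker runs the concatenated pair through a transformer.
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def retrieve_then_rerank(query, corpus, overfetch_n=50, keep_k=5):
    # Stage 1: bi-encoder recall across the whole corpus.
    qv = embed(query)
    candidates = sorted(
        corpus, key=lambda d: cosine(qv, embed(d)), reverse=True
    )[:overfetch_n]
    # Stage 2: cross-encoder precision within the small candidate pool.
    return sorted(
        candidates, key=lambda d: cross_score(query, d), reverse=True
    )[:keep_k]
```

The structure is the point, not the scoring functions: stage 1 touches every document cheaply, stage 2 touches only `overfetch_n` documents expensively.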
Reranking is how you spend a tractable amount of extra inference to buy a material quality jump — especially for queries where recall was fine but ordering was off.
Worked example: a hypothetical support-docs RAG
Hypothetical scenario, not a shipped product. A SaaS company runs a RAG assistant over its help center — roughly 8,000 articles spanning pricing, setup, integrations, and troubleshooting. The user asks:
"How do I rotate the API key for our webhook integration without breaking in-flight deliveries?"
The initial pipeline is bi-encoder-only. A 1,024-dimensional embedding model indexes all articles at 400-token chunk size. At query time, the system embeds the question, pulls the top 5 chunks from the vector database, and passes them to the generator.
Here is what the top 5 (illustratively) looks like before any reranking:
- "Using API keys with our REST endpoints" — adjacent topic, scores high because of token overlap on "API key."
- "Authentication basics" — general auth article, high similarity on auth terminology.
- "Webhooks: an overview" — matches "webhook" but not rotation or in-flight deliveries.
- "Rate limits for API consumers" — marginal match.
- "Troubleshooting 401 errors" — adjacent but not the target.
The actual target article — "Rotating webhook API keys with graceful deploy" — sits at position 11 in the bi-encoder ranking. It is in the candidate pool if you look far enough down, but not in the context window. The generator, working from the top 5, produces a confident-sounding answer about API key rotation in general and misses the whole "in-flight deliveries" part. Top-5 recall on the eval set is (illustratively) 0.62.
Now add the reranker. Change the retrieval step to pull 50 candidates instead of 5, pass those 50 through a cross-encoder reranker, keep the top 5 by rerank score.
After reranking, the same query (illustratively) produces:
- "Rotating webhook API keys with graceful deploy" — exact target, rescued from position 11.
- "Webhook delivery guarantees and retries" — directly relevant to "in-flight deliveries."
- "Key rotation best practices" — supporting context.
- "Webhooks: an overview" — kept for general context.
- "Using API keys with our REST endpoints" — demoted from position 1 to position 5.
Top-5 recall on the eval set jumps from (illustratively) 0.62 to 0.88. End-to-end answer accuracy, measured with an LLM-as-judge on a golden set, improves by roughly 14 points. Added latency: around 150-250ms per query, depending on the reranker.
Two things to notice. First, the target document was already in the candidate pool after the bi-encoder — rerank could not have rescued it otherwise. That is the invariant: reranking fixes ordering, not recall. Second, the bi-encoder's top 5 was not wrong in any gross sense. Every document it returned was related to API keys. The cross-encoder's improvement is specifically about matching the intent of the query, not just its tokens.
The overfetch-N knob
The one hyperparameter worth thinking about carefully is N — the size of the candidate pool fed to the reranker.
Three forces are in tension. Quality goes up with N, up to a point. Latency goes up linearly with N. Cost goes up linearly with N. The question is where the quality curve flattens relative to your latency and cost budget.
Some practical guidance from the structure of the problem:
- N too small (e.g., N=5). You have not given the reranker enough to work with. If the right document is not in the top 5 of the bi-encoder, it will not be in the top 5 after rerank either. You are paying for a pass that cannot help.
- N moderate (e.g., N=25 to N=50). This is the sweet spot for most production systems. Big enough to include the documents the bi-encoder demoted unfairly, small enough that latency stays manageable.
- N too large (e.g., N=200). Quality gains flatten — you are mostly reranking noise — while latency and cost keep climbing. You are paying for scores on documents the cross-encoder will rank low anyway.
K — the number of documents you keep after reranking — is a separate knob tied to your prompt's context budget and the generator's ability to digest noise. K=3 to K=10 is typical. Smaller K reduces noise in the prompt; larger K is safer when any single chunk might be incomplete.
A useful rule: start with N=50 and K=5, measure top-K recall and downstream answer quality, then adjust. The goal is to find the smallest N where top-K recall plateaus. Anything larger is waste.
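That plateau search can be automated against an eval set. A minimal sketch, assuming you already have per-query bi-encoder rankings and one gold document per query; `sweep_overfetch` and `rerank_fn` are hypothetical names, and in practice `rerank_fn` would call your cross-encoder rather than a toy scorer.

```python
def top_k_recall(ranked_lists, gold, k):
    # Fraction of queries whose gold document appears in the top k.
    hits = sum(1 for r, g in zip(ranked_lists, gold) if g in r[:k])
    return hits / len(gold)

def sweep_overfetch(bi_rankings, gold, rerank_fn, k=5, n_grid=(5, 10, 25, 50, 100)):
    # For each candidate-pool size N: truncate the bi-encoder ranking to N,
    # rerank that pool, and measure top-K recall of the reranked list.
    # The smallest N where recall plateaus is the one to ship.
    results = {}
    for n in n_grid:
        reranked = [rerank_fn(r[:n]) for r in bi_rankings]
        results[n] = top_k_recall(reranked, gold, k)
    return results
```

If recall at N=50 and N=100 is identical, the extra 50 candidates are pure latency and cost.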
Choosing a reranker family
Several reranker families are worth knowing about as of 2026. Each solves the same problem with different tradeoffs.
BGE-reranker (open source). BAAI's BGE reranker family is the common open-source default. It comes in small and base sizes and runs locally on a GPU. The appeal is simple: no API dependency, no per-query cost, and the quality is strong enough for most production workloads. The cost is that you operate the model — GPU capacity, queue management, updates. For teams already running model inference, this is usually a small marginal cost.
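Wiring a local reranker is a few lines. A sketch, not a definitive implementation: the wrapper accepts any object exposing a `predict(pairs)` method, which matches the `CrossEncoder` class in the sentence-transformers library, and the model name in the comment is an assumption to verify against BAAI's current releases.

```python
def rerank_local(query, docs, model, top_k=5, batch_size=32):
    # Score (query, doc) pairs with any scorer exposing .predict(pairs),
    # e.g. sentence_transformers.CrossEncoder. Returns (doc, score) pairs
    # sorted by descending relevance, truncated to top_k.
    pairs = [(query, d) for d in docs]
    scores = model.predict(pairs, batch_size=batch_size)
    return sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)[:top_k]

# Illustrative wiring (model name is an assumption; check BAAI's releases):
#   from sentence_transformers import CrossEncoder
#   model = CrossEncoder("BAAI/bge-reranker-base")  # uses GPU if available
#   top5 = rerank_local("rotate webhook api key", candidates, model)
```

Keeping the model injectable also makes the reranker trivially swappable, which matters given the "reversible choice" point below.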
Cohere Rerank (managed API). Cohere's rerank endpoint is the common managed-service choice. You send the query and candidate list to an API, get back scored candidates. The appeal is operational: no GPU to manage, versioned model, good latency under their hosting. The cost is per-query fees and the usual tradeoffs of a managed dependency — you are subject to their pricing changes and their model updates.
Voyage Rerank (managed API). Voyage's reranker is another managed option with a similar API shape. The pitch is quality on specific domain benchmarks Voyage maintains, though as with any vendor benchmark, the honest move is to test on your own corpus rather than trust the leaderboard. Operationally similar to Cohere — you trade per-query cost for not running a model.
ColBERT and late-interaction models (research-leaning). ColBERT uses a middle-ground architecture: it encodes query and document separately like a bi-encoder but keeps per-token vectors, then computes a "late interaction" similarity that compares tokens across the two sides. This gets closer to cross-encoder quality at lower latency, but the index is much larger — you are storing per-token vectors per document, not one vector per document. For some teams, the storage tradeoff is worth it. For most, it is a research-flavored option that sits outside the usual cross-encoder vs. bi-encoder dichotomy.
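The late-interaction score itself is simple to state. A pure-Python sketch of ColBERT-style MaxSim: each query token vector takes its best cosine match among the document's token vectors, and those maxima sum into the score. Real ColBERT uses learned contextual token embeddings and optimized batch kernels; the vectors here are just placeholders for whatever an encoder produces.

```python
import math

def _cos(a, b):
    # Cosine similarity between two plain-list vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def maxsim(query_vecs, doc_vecs):
    # ColBERT-style late interaction: each query token vector grabs its
    # best-matching doc token vector; the document score is the sum of
    # those per-token maxima.
    return sum(max(_cos(q, d) for d in doc_vecs) for q in query_vecs)
```

Note the storage implication visible right in the signature: `doc_vecs` is a matrix per document, not one vector, which is exactly why late-interaction indexes are larger.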
The honest guidance: pick whichever you can ship fastest, evaluate on your own corpus, and treat reranker choice as reversible. Unlike an embedding model swap (which forces you to reindex), a reranker swap is a call-site change.
Failure modes
Four anti-patterns worth flagging before you ship.
Adding a reranker to fix a recall problem. If your bi-encoder is not surfacing the right document in the top N, a reranker cannot help — there is nothing to rescue. This is the most common misdiagnosis. If top-50 recall is low, fix chunking, try a better embedding model, add hybrid search, or consider HyDE — reranking is the wrong layer.
Picking N by guessing. Teams often default to N=10 or N=20 because it sounds reasonable, then declare the reranker "didn't help." The reranker cannot reorder documents that were never in the candidate pool. Measure at N=50 and N=100 at least once before concluding anything.
Ignoring latency in production. Rerank latency adds to every query, not just the slow ones. If your end-to-end p95 budget is 800ms and your reranker eats 300ms, that is a hard constraint on the rest of the pipeline. Measure, budget, and be willing to reduce N to fit.
Treating reranker scores as truth. Cross-encoder scores are more accurate than bi-encoder scores, but they are not ground truth. They are another model's opinion. Evaluate the full pipeline with RAGAS, semantic search quality metrics, and user-facing outcomes — not by staring at rerank scores.
Our position
Five opinionated stances.
- Every serious RAG system should ship with a reranker. Bi-encoder-only retrieval has a known ceiling that a cross-encoder fixes for modest latency. The cases where it does not help (pure-recall failures, voice-latency systems) are the minority. If you are shipping to real users at production scale, rerank is part of the stack.
- Start with N=50, K=5, and measure. Any other starting point is a guess. These numbers fit most workloads, leave room to tune, and give the reranker enough candidates to work with. Adjust from data, not intuition.
- Reranker choice is reversible. Embedding model choice is not. Spend your architectural care on the embedding and chunking layers — they force reindexes. A reranker swap is a call-site change. Pick something reasonable, ship it, swap later if needed.
- Evaluate reranking with downstream metrics, not rerank scores. Top-K recall, answer faithfulness, and user outcomes matter. Rerank score distributions are diagnostic, not a KPI. See RAGAS Evaluation for the full harness.
- Reranking is a context-engineering discipline, not a prompt trick. It shapes what the model sees, which is the whole game. Treat it as part of your context engineering maturity — see the Context Engineering Maturity Model for where rerank sits in the broader picture, and the Agentic Prompt Stack for how retrieval layers compose with agent loops.