Reranking
Reranking is a secondary scoring pass over an initial set of retrieval candidates to improve their ordering before they are handed to the generator. A typical flow is: vector search returns the top 50 candidates by embedding similarity, then a reranker — usually a cross-encoder model that reads the query and document together — re-scores each pair more carefully, and only the new top 5 or 10 are used as context. Cross-encoders are more accurate than bi-encoder embedding similarity because they attend jointly over the query and document rather than scoring each independently in a shared space, at the cost of being too slow to run over the whole corpus. Reranking adds inference latency and cost but usually yields a material quality jump, especially on queries where the right document is in the candidate set but not at the top.
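The retrieve-then-rerank flow above can be sketched in a few lines. This is a minimal illustration, not any particular library's API: cross_encoder_score is a toy lexical-overlap stand-in for a real cross-encoder model, and rerank is a hypothetical helper that re-scores candidates jointly with the query and keeps the best few.

```python
def cross_encoder_score(query: str, doc: str) -> float:
    # Toy stand-in for a cross-encoder: fraction of query terms that
    # appear in the document. A real reranker would run a model over
    # the concatenated (query, document) pair instead.
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / max(len(q_terms), 1)

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    """Re-score each candidate jointly with the query; keep the top_k best."""
    scored = [(cross_encoder_score(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

# First-stage retrieval (e.g. vector search) would supply these candidates.
candidates = [
    "billing and invoices faq",
    "password policy overview",
    "how to reset your password",
]
print(rerank("reset my password", candidates, top_k=2))
```

The structure is the same whatever model does the scoring: the expensive joint scorer only ever sees the few dozen candidates the first stage returned, never the whole corpus.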
Example
A support-docs assistant retrieves 50 candidate passages per question with an embedding model, then passes them through a small cross-encoder reranker. Before reranking, the correct passage is in the top 5 on 62% of eval questions; after reranking, it is in the top 5 on 88%. End-to-end answer accuracy improves by 14 points, at the cost of roughly 200ms of added per-query latency.