Cross-Encoder
A cross-encoder is a transformer architecture that takes a query and a candidate document as a single joint input — typically concatenated with a separator token — and outputs one scalar relevance score. Because attention runs across both sides together, a cross-encoder can model fine-grained interactions between query terms and document terms — negation, exact matches inside a paraphrase, positional constraints — that a bi-encoder cannot. The cost is quadratic attention over the combined length and one full forward pass per (query, document) pair, which makes cross-encoders prohibitively slow to run over an entire corpus at query time. They are the standard reranker architecture: a fast first-stage retriever (BM25, bi-encoder, or both) narrows the corpus to the top 50-200 candidates, and a cross-encoder rescores only those to produce the final ordering.
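The interface described above can be sketched in a few lines. This is a toy stand-in, not a real model: `score_pair` substitutes simple term coverage for the transformer forward pass over the concatenated input, and all names here are hypothetical; a production system would load a fine-tuned cross-encoder instead.

```python
def score_pair(query: str, document: str) -> float:
    """Score one (query, document) pair jointly.

    Stand-in for a transformer forward pass over the concatenated
    input '[CLS] query [SEP] document [SEP]'. Here: the fraction of
    query terms that appear in the document (toy approximation).
    """
    q_terms = query.lower().split()
    d_terms = set(document.lower().split())
    if not q_terms:
        return 0.0
    return sum(t in d_terms for t in q_terms) / len(q_terms)

def rerank(query: str, candidates: list[str]) -> list[tuple[float, str]]:
    """One scoring pass per pair: this is the cross-encoder cost model."""
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    return sorted(scored, reverse=True)

docs = [
    "reset your password from the login page",
    "billing questions and refunds",
    "password reset email not arriving",
]
print(rerank("password reset email", docs)[0][1])
# → password reset email not arriving
```

The point the sketch makes is structural: the scorer sees both texts at once, so it can react to term-level interactions, but it must be run once per candidate, which is why it is applied only to a shortlist.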
Example
A technical-support search system retrieves the top 100 candidates per query with a bi-encoder in around 30 ms, then reranks the top 50 with a small cross-encoder in around 120 ms, for roughly 180 ms end-to-end. A control experiment running the cross-encoder over the full corpus of 2M documents was projected at 40+ seconds per query, which is operationally unusable. On illustrative numbers, the two-stage setup ships nDCG@10 of 0.81 versus 0.74 for the bi-encoder alone, with per-query latency the user does not notice.
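The two-stage shape of this example can be sketched as: a cheap score over the whole corpus to build a shortlist, then the expensive scorer over the shortlist only. Both scoring functions below are toy stand-ins (a bi-encoder dot product and a cross-encoder forward pass in a real system); the corpus, query, and `k` are illustrative.

```python
def cheap_score(query: str, doc: str) -> int:
    # Stand-in for a bi-encoder similarity (e.g. a dot product of
    # precomputed embeddings): number of shared terms.
    return len(set(query.split()) & set(doc.split()))

def expensive_score(query: str, doc: str) -> float:
    # Stand-in for a cross-encoder forward pass: query-term coverage.
    q = query.split()
    return sum(t in doc.split() for t in q) / len(q)

def search(query: str, corpus: list[str], k: int = 50) -> list[str]:
    # Stage 1: score the full corpus cheaply, keep the top-k shortlist.
    shortlist = sorted(corpus, key=lambda d: cheap_score(query, d),
                       reverse=True)[:k]
    # Stage 2: rescore only the shortlist with the expensive model.
    return sorted(shortlist, key=lambda d: expensive_score(query, d),
                  reverse=True)

corpus = [
    "how to reset a password",
    "refund policy",
    "password reset link expired",
    "contact support",
]
print(search("password reset expired", corpus, k=2)[0])
# → password reset link expired
```

The design choice the sketch encodes is the one the latency numbers in the example reflect: the expensive scorer's cost scales with `k`, not with the corpus size, so the corpus can grow without the rerank stage getting slower.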