Bi-Encoder
A bi-encoder is a dual-tower transformer architecture in which the query and the document are encoded independently, by the same (or twin) encoder, into separate fixed-size vectors; relevance is then computed as the cosine or dot-product similarity between those vectors. Because the two sides never attend to each other, documents can be embedded once at index time and stored in a vector database; at query time only the query needs encoding, and retrieval reduces to nearest-neighbor search, cheap enough to run over millions of documents in milliseconds. The tradeoff is accuracy: without cross-attention, the model cannot capture joint query-document interactions such as "the document talks about X but denies Y", so bi-encoders typically underperform cross-encoders on reranking benchmarks. Bi-encoders are the standard architecture for first-stage retrieval and for embedding-model APIs generally.
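As a minimal sketch of this split between index-time and query-time work, the snippet below uses a toy bag-of-words "encoder" in place of a real transformer; the corpus, vocabulary, and dimensionality are invented for illustration, and only the structure (independent encoding, then dot-product scoring against a precomputed matrix) matches a real bi-encoder:

```python
import numpy as np

# Toy corpus; a real system would embed millions of documents.
DOCS = [
    "red running shoes for men",
    "wireless noise cancelling headphones",
    "trail running shoes waterproof",
]

# Fixed vocabulary built from the corpus, so encoding is deterministic.
VOCAB = sorted({tok for doc in DOCS for tok in doc.lower().split()})

def toy_encode(text):
    """Stand-in for a transformer encoder: a unit-normalized
    bag-of-words vector. A real bi-encoder would pool transformer
    token states into a fixed-size embedding instead."""
    vec = np.zeros(len(VOCAB))
    for tok in text.lower().split():
        if tok in VOCAB:
            vec[VOCAB.index(tok)] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# Index time: embed every document once and store the matrix.
doc_matrix = np.stack([toy_encode(d) for d in DOCS])  # shape (n_docs, dim)

# Query time: encode only the query. With unit vectors, the dot
# product equals cosine similarity, so scoring is one matrix product.
query_vec = toy_encode("running shoes")
scores = doc_matrix @ query_vec
top = np.argsort(-scores)[:2]  # indices of the two nearest documents
```

Both shoe documents share tokens with the query, so `top` comes back as `[2, 0]`, while the headphones document, which shares none, scores zero. Note that nothing in scoring ever looks at a query token and a document token together, which is exactly the limitation the cross-attention discussion above describes.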
Example
A product-search team uses a bi-encoder to embed 4M product descriptions offline. At query time they embed the incoming query in roughly 5 ms, run approximate nearest-neighbor search in roughly 15 ms, and return the top 100 hits, for a total first-stage latency under 25 ms. A cross-encoder reranker then reorders the top 50 in another 100 ms. The bi-encoder does the cheap, broad sweep over millions of documents; the cross-encoder does the expensive, accurate reordering over 50. Neither architecture alone would be acceptable: a bi-encoder alone sacrifices accuracy, while a cross-encoder alone, at the roughly 2 ms per query-document pair implied by these numbers, would take thousands of seconds to score all 4M documents for a single query.
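The two-stage pipeline above can be sketched as follows. The random embeddings, the scaled-down index size, and the hash-based stand-in for a cross-encoder score are all placeholders for real models; the point is the control flow, a broad bi-encoder sweep followed by a narrow cross-encoder rerank:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: document embeddings precomputed offline by a bi-encoder.
# Random unit vectors stand in for real embeddings; 4000 docs stands in
# for the 4M-document index in the example.
N_DOCS, DIM = 4000, 32
doc_embs = rng.normal(size=(N_DOCS, DIM))
doc_embs /= np.linalg.norm(doc_embs, axis=1, keepdims=True)

def cross_encoder_score(query_text, doc_id):
    """Stub for the expensive joint model. A real cross-encoder would
    feed query and document through one transformer together; here a
    deterministic pseudo-score keeps the sketch self-contained."""
    return float((int(doc_id) * 2654435761) % 1000) / 1000.0

def search(query_emb, first_stage_k=100, rerank_k=50):
    # Stage 1 (bi-encoder): one matrix product over the whole index,
    # then keep the top `first_stage_k` candidates.
    scores = doc_embs @ query_emb
    candidates = np.argsort(-scores)[:first_stage_k]
    # Stage 2 (cross-encoder): score only the top `rerank_k` candidates
    # with the expensive model and reorder them by that score.
    rerank_pool = candidates[:rerank_k]
    reranked = sorted(rerank_pool, key=lambda d: -cross_encoder_score("q", d))
    # Final ranking: reranked head, then the untouched first-stage tail.
    return list(reranked) + list(candidates[rerank_k:])

query_emb = rng.normal(size=DIM)
query_emb /= np.linalg.norm(query_emb)
results = search(query_emb)
```

The cost asymmetry lives in the two stages: stage 1 touches every document but only via a matrix product over precomputed vectors, while stage 2 runs the per-pair model but only 50 times.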