ColBERT (Late Interaction Retrieval)
ColBERT is a retrieval architecture that sits between bi-encoders and cross-encoders. Instead of encoding a document into a single vector, it produces one contextual embedding per token; relevance against a query is then computed as the sum over query tokens of the maximum similarity between that query token and any document token — the so-called late-interaction or MaxSim operation.
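The late-interaction score described above can be written compactly. The notation is introduced here for illustration and is not taken from the source: $\mathbf{q}_i$ is the embedding of the $i$-th query token, $\mathbf{d}_j$ the embedding of the $j$-th document token, and $\mathrm{sim}$ a similarity function such as cosine similarity.

```latex
S(q, d) = \sum_{i=1}^{|q|} \max_{1 \le j \le |d|} \mathrm{sim}(\mathbf{q}_i, \mathbf{d}_j)
```

Each query token contributes only its best match in the document, so a single well-aligned rare term can raise the score even when the rest of the document is off-topic.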
Because the per-token embeddings are still precomputed at index time, ColBERT retains the offline-indexing property of bi-encoders and does not require a joint forward pass per (query, document) pair. Accuracy on paraphrase and rare-term queries is materially better than single-vector bi-encoders and approaches cross-encoder performance on several benchmarks. The main cost is storage: one vector per token instead of per document can inflate index size by 20-100x, which dedicated ColBERT indexes (PLAID, ColBERTv2 compression) partially mitigate.
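A back-of-the-envelope sketch of where the storage cost comes from, assuming uncompressed vectors. Every number below (corpus size, tokens per document, dimension, precision) is a hypothetical placeholder, and `index_bytes` is a helper made up for this illustration rather than part of any ColBERT library.

```python
def index_bytes(n_docs, vectors_per_doc, dim, bytes_per_value):
    """Raw size of an uncompressed embedding index, in bytes."""
    return n_docs * vectors_per_doc * dim * bytes_per_value


# Hypothetical corpus: 1M passages, about 100 tokens each, 768-d float32 vectors.
single_vector = index_bytes(n_docs=1_000_000, vectors_per_doc=1,
                            dim=768, bytes_per_value=4)
per_token = index_bytes(n_docs=1_000_000, vectors_per_doc=100,
                        dim=768, bytes_per_value=4)

# With dimension and precision held equal, the ratio is simply
# the average number of tokens per document.
ratio = per_token / single_vector
print(f"single-vector: {single_vector / 1e9:.1f} GB")
print(f"per-token:     {per_token / 1e9:.1f} GB ({ratio:.0f}x)")
```

Under these assumptions the blow-up equals the token count per document. Lowering the per-token dimension or precision, or applying compression, shrinks the ratio accordingly.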
Origin: Introduced by Khattab & Zaharia in "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT" (SIGIR 2020). ColBERTv2 (2022) added residual compression and denoised supervision.
How it works
1. At index time, every document is run through a BERT-style encoder and stored as a matrix of per-token embeddings rather than a single pooled vector.
2. At query time, the query is also encoded into per-token embeddings, but no joint forward pass with each candidate document is needed.
3. Relevance is the MaxSim score: for each query token, take its maximum cosine similarity against any token of the candidate document, then sum those maxima across query tokens.
4. Candidate filtering uses an approximate-nearest-neighbor pass over individual token embeddings, followed by an exact MaxSim rerank. This keeps latency close to bi-encoder retrieval while gaining most of the cross-encoder accuracy.
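The steps above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the ColBERT implementation: the 2-d vectors stand in for encoder output, the names `maxsim` and `retrieve` are invented here, and the candidate stage uses a brute-force scan where a real system would use an approximate-nearest-neighbor index.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def maxsim(query_tokens, doc_tokens):
    """Sum, over query tokens, of the best similarity to any document token."""
    return sum(max(cosine(q, d) for d in doc_tokens) for q in query_tokens)


def retrieve(query_tokens, index, n_candidates=2, top_k=1):
    """Two-stage retrieval: token-level candidate filtering, then MaxSim rerank."""
    # Stage 1 (brute-force stand-in for ANN): for each query token, find the
    # nearest document tokens and keep the documents they belong to.
    candidates = set()
    for q in query_tokens:
        hits = sorted(
            ((cosine(q, d), doc_id)
             for doc_id, tokens in index.items() for d in tokens),
            reverse=True,
        )
        candidates.update(doc_id for _, doc_id in hits[:n_candidates])
    # Stage 2: exact MaxSim over the surviving candidates only.
    ranked = sorted(candidates,
                    key=lambda doc_id: maxsim(query_tokens, index[doc_id]),
                    reverse=True)
    return ranked[:top_k]


# Hypothetical per-token embeddings, precomputed at index time.
index = {
    "doc_a": [[1.0, 0.0], [0.0, 1.0]],
    "doc_b": [[1.0, 0.0], [1.0, 0.1]],
}
query = [[1.0, 0.0], [0.0, 1.0]]

best = retrieve(query, index)
print(best, maxsim(query, index[best[0]]))
```

Here `doc_a` wins because both query tokens find a close match in it, whereas `doc_b` matches only the first query token well.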
Example
A biomedical-search team finds that single-vector bi-encoders miss queries where a rare technical term ("mTOR inhibitor") must align precisely with a specific passage rather than with the document's overall topic. They migrate to a ColBERTv2 index with residual compression. Index storage grows from 40 GB to roughly 300 GB for the same corpus. Recall@10 on rare-term queries improves from an illustrative 0.68 to 0.83, closing most of the gap to a cross-encoder reranker at a fraction of the query-time cost. Cross-encoder reranking becomes optional rather than necessary.
Not to be confused with
- Bi-encoder: Pools each document into a single fixed-size vector. Faster and smaller index than ColBERT, but loses the per-token alignment that helps with rare-term and paraphrase queries.
- Cross-encoder: Encodes (query, document) jointly in one forward pass, giving the best accuracy but at prohibitive query-time cost. ColBERT approximates cross-encoder quality with bi-encoder-style precomputation.
- BM25: Classical sparse lexical retrieval. ColBERT is a dense neural method; the two are complementary and are often combined via score fusion in hybrid retrieval systems.
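One simple way to combine BM25 and ColBERT scores is a weighted sum after min-max normalization. The sketch below shows that approach only as an example; the source does not say which fusion method is meant, and the function names, the weight `alpha`, and all scores are hypothetical.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse(bm25_scores, colbert_scores, alpha=0.5):
    """Weighted sum of normalized scores; alpha is the weight given to BM25."""
    bm25_n, colbert_n = min_max(bm25_scores), min_max(colbert_scores)
    doc_ids = set(bm25_n) | set(colbert_n)
    # A document missing from one ranker contributes 0 for that ranker.
    return {
        doc_id: alpha * bm25_n.get(doc_id, 0.0)
        + (1 - alpha) * colbert_n.get(doc_id, 0.0)
        for doc_id in doc_ids
    }


# Hypothetical raw scores from the two retrievers (on different scales).
bm25 = {"d1": 12.0, "d2": 4.0, "d3": 8.0}
colbert = {"d1": 20.0, "d2": 30.0, "d3": 28.0}

fused = fuse(bm25, colbert)
top = max(fused, key=fused.get)
print(top, round(fused[top], 2))
```

Normalizing first matters because BM25 and MaxSim scores live on unrelated scales. Here `d3` comes out on top because it ranks reasonably well under both retrievers, while `d1` and `d2` each score highly under only one.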