
ColBERT (Late Interaction Retrieval)

ColBERT is a retrieval architecture that sits between bi-encoders and cross-encoders. Instead of encoding a document into a single vector, it produces one contextual embedding per token; relevance is then computed as the sum, over query tokens, of the maximum similarity between that query token and any document token (the so-called late-interaction, or MaxSim, operation). Because the per-token embeddings are still precomputed at index time, ColBERT retains the offline-indexing property of bi-encoders and does not require a joint forward pass per (query, document) pair. Accuracy on paraphrase and rare-term queries is materially better than with single-vector bi-encoders and approaches cross-encoder performance on several benchmarks. The main cost is storage: one vector per token instead of one per document can inflate index size by 20-100x, which dedicated ColBERT indexes (PLAID, ColBERTv2 residual compression) partially mitigate.
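The MaxSim operation above can be sketched in a few lines of NumPy. This is a minimal illustration of the scoring rule only, not ColBERT's actual implementation; the embedding shapes and sizes are assumptions, and the rows are taken to be L2-normalized so dot products are cosine similarities.

```python
import numpy as np

# MaxSim sketch: query_emb has shape (num_query_tokens, dim),
# doc_emb has shape (num_doc_tokens, dim); rows are L2-normalized.
def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    sim = query_emb @ doc_emb.T          # (q_tokens, d_tokens) similarity matrix
    return float(sim.max(axis=1).sum())  # best doc token per query token, summed

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Toy usage with random embeddings (sizes are arbitrary illustrations).
rng = np.random.default_rng(0)
q = l2_normalize(rng.standard_normal((4, 8)))   # 4 query tokens, dim 8
d = l2_normalize(rng.standard_normal((20, 8)))  # 20 document tokens
score = maxsim_score(q, d)
```

Note that because each per-token maximum is a cosine similarity bounded by 1, scoring a document against itself yields exactly one per query token, so the score is capped at the number of query tokens.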

Example

A biomedical-search team finds that single-vector bi-encoders miss queries where a rare technical term ("mTOR inhibitor") must align precisely with a specific passage rather than with the document's overall topic. They migrate to a ColBERTv2 index with residual compression. Index storage grows from 40GB to roughly 300GB for the same corpus. Recall@10 on rare-term queries improves from an illustrative 0.68 to 0.83, closing most of the gap to a cross-encoder reranker at a fraction of the query-time cost, and cross-encoder reranking becomes optional rather than necessary.
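The storage blow-up in a scenario like this can be estimated back-of-the-envelope. Every number below is an illustrative assumption, not a figure from the example above: it only shows why the naive inflation factor is roughly the token count per document, and why compression brings real deployments well under that.

```python
# Hypothetical sizing helper: total bytes for an embedding index.
def index_bytes(num_docs: int, vectors_per_doc: int, dim: int, bytes_per_value: int) -> int:
    return num_docs * vectors_per_doc * dim * bytes_per_value

DOCS = 1_000_000  # assumed corpus size

# Single-vector bi-encoder: one 768-dim float32 vector per document.
single = index_bytes(DOCS, 1, 768, 4)

# Naive per-token index in the same format, assuming ~100 tokens per document.
per_token = index_bytes(DOCS, 100, 768, 4)

ratio = per_token / single  # the token count drives the blow-up (100x here)
# Residual compression (ColBERTv2) and lower-dimensional token embeddings
# pull this factor back down, consistent with the 40GB -> ~300GB growth above.
```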
