Contextual Compression
Contextual compression is a preprocessing step that sits between retrieval and generation in a RAG pipeline. Instead of passing the full retrieved chunks directly to the generator, a compression step filters or summarizes them against the specific query — dropping passages that turn out to be off-topic, extracting only the sentences that actually answer the question, or rewriting long passages into a tighter summary. The compressor can be rule-based (regex over keywords), embedding-based (a cosine-similarity filter against the query embedding), or a small LLM call. Contextual compression reduces generator input tokens, speeds up inference, lowers cost, and cuts the noise that degrades answer quality in long contexts. The tradeoff is that an over-aggressive compressor can strip information the generator actually needs, so tuning the compression threshold on an eval set matters.
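A minimal sketch of the embedding-based variant: sentences whose similarity to the query falls below a threshold are dropped before the context reaches the generator. To keep the example self-contained, a bag-of-words cosine similarity stands in for a real sentence-embedding model; the function names, the threshold value, and the regex-based sentence splitter are all illustrative choices, not a specific library's API.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real pipeline would call a
    # sentence-embedding model here instead.
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def compress(query: str, chunks: list[str], threshold: float = 0.2) -> list[str]:
    # Keep only sentences whose similarity to the query clears the
    # threshold; everything else is dropped before generation.
    q = embed(query)
    kept = []
    for chunk in chunks:
        for sentence in re.split(r"(?<=[.!?])\s+", chunk):
            if cosine(q, embed(sentence)) >= threshold:
                kept.append(sentence)
    return kept

chunks = [
    "The statute of limitations for fraud claims is six years. "
    "The court adjourned for lunch at noon."
]
print(compress("statute of limitations fraud claims", chunks))
```

The threshold is exactly the knob the definition warns about: set it too high and the compressor discards sentences the generator needs, which is why it should be tuned against an eval set rather than picked by hand.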
Example
A legal-research assistant retrieves ten 1,200-token case passages per query. Before contextual compression, the generator receives 12,000 tokens of context per call and often ignores relevant text buried deep in the prompt. A small compression step filters each passage down to the sentences semantically closest to the query, yielding roughly 3,000 tokens of context per call. Generator cost drops by 70%, p95 latency drops by 40%, and answer faithfulness on the eval set rises from 0.77 to 0.84 (illustrative figures) because the generator is no longer distracted by irrelevant surrounding text.
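The example above compresses to a target size rather than a similarity threshold: keep the sentences closest to the query until a per-passage token budget is spent. A hedged sketch of that budget-capped selection, again using word overlap as a toy stand-in for embedding similarity (the function name, budget value, and scoring rule are assumptions for illustration):

```python
import re

def compress_to_budget(query: str, passage: str, budget: int) -> str:
    """Keep the sentences sharing the most words with the query, up to
    `budget` whitespace-delimited tokens, preserving original sentence
    order so the surviving excerpt still reads coherently."""
    q = set(re.findall(r"[a-z']+", query.lower()))
    sentences = re.split(r"(?<=[.!?])\s+", passage)
    # Rank sentence indices by overlap with the query, best first.
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: -len(q & set(re.findall(r"[a-z']+", sentences[i].lower()))),
    )
    kept, used = set(), 0
    for i in ranked:
        n = len(sentences[i].split())
        if used + n <= budget:
            kept.add(i)
            used += n
    return " ".join(sentences[i] for i in sorted(kept))

passage = (
    "Fraud claims must be filed within six years. "
    "The courthouse cafeteria closes at three. "
    "Tolling may extend the filing deadline."
)
print(compress_to_budget("filing deadline for fraud claims", passage, budget=14))
```

Setting the budget to roughly a quarter of the original passage length reproduces the 12,000-token to 3,000-token reduction described above.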