Compression is what you reach for when there's more context than budget. The context window reaches a million tokens on recent models, but cost, latency, and "lost in the middle" don't disappear because the ceiling rose. Compression shrinks context while preserving what the model actually needs — and "actually needs" is load-bearing, because every technique trades fidelity for tokens on a different curve. This post, under the context engineering pillar, walks through three families and shows how to combine them without fooling yourself about what was lost.
Why Compression Matters Even at 1M Tokens
A million tokens is not the end of budgeting.
Cost scales with input. Filled prompts pay per token on every call. Caching helps on the static prefix, but the variable tail — history, retrieval, user input — still pays full freight. A compressed tail is a smaller invoice every call.
Latency scales with input. Time-to-first-token grows with context size. A million-token prompt that answers in thirty seconds is not the same UX as a 50,000-token prompt answering in four.
Attention is uneven. Content mid-prompt is recalled less reliably than content at either end. Noise dilutes signal even when it fits. See context window management strategies for the broader budgeting picture.
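The cost point is simple arithmetic, and worth making concrete. A minimal sketch — the price per million input tokens is a hypothetical placeholder, not any provider's actual rate:

```python
def call_cost_usd(input_tokens: int, price_per_mtok: float = 3.0) -> float:
    """Input-side cost of one call. `price_per_mtok` is a hypothetical
    $/1M-input-tokens rate; substitute your provider's actual pricing."""
    return input_tokens / 1_000_000 * price_per_mtok

# A 1M-token prompt vs a 50k-token compressed tail, per call:
full = call_cost_usd(1_000_000)   # 3.00
tail = call_cost_usd(50_000)      # 0.15
```

Multiply the difference by calls per day and the "smaller invoice every call" framing stops being abstract.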
Summarization-Based Compression
The oldest and most general family: rewrite earlier context as a shorter summary. Replay a 200-token summary plus the last few turns verbatim instead of 5,000 tokens of chat history.
How it works. A model — sometimes the main one, often a smaller cheaper one — reads the material and produces a condensed version that enters the prompt in place of the original. Common flavors:
- Rolling summary. Running summary of everything prior to the recent window. After each turn, update it and drop old verbatim turns.
- Recursive summarization. For very long source material, summarize in chunks, then summarize the summaries. Compression rates can be large, but each level loses more.
- Structured summary. Force the summary into a schema — facts, decisions, open questions, action items — so what survives is predictable rather than at the summarizer's whim.
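The rolling flavor can be sketched as a small buffer: keep the last few turns verbatim, and fold anything that falls off the window into a running summary. The `summarize` callable is an assumption — in practice it would wrap a call to a small, cheap model:

```python
from dataclasses import dataclass, field

@dataclass
class RollingSummaryBuffer:
    """Keep the last `window` turns verbatim; fold older turns into a summary."""
    summarize: callable          # (old_summary, evicted_turns) -> new summary string
    window: int = 5
    summary: str = ""
    recent: list = field(default_factory=list)

    def add_turn(self, turn: str) -> None:
        self.recent.append(turn)
        if len(self.recent) > self.window:
            evicted = self.recent[: -self.window]   # turns falling off the window
            self.recent = self.recent[-self.window :]
            self.summary = self.summarize(self.summary, evicted)

    def to_prompt(self) -> str:
        """What actually enters the prompt: summary plus verbatim tail."""
        return f"SUMMARY:\n{self.summary}\n\nRECENT_TURNS:\n" + "\n".join(self.recent)
```

Swapping the plain `summarize` for one that emits a fixed schema turns this into the structured flavor without changing the buffer logic.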
What you lose. Anything the summarizer judged unimportant. If the summary was produced before you knew a detail would matter, it's probably gone. Structured summaries mitigate this by fixing in advance what categories survive.
Where it fits. Chat histories, long document context queried repeatedly, multi-session agent state — coherent prose where downstream queries are reasonably predictable. Wrong tool for verbatim fidelity (quotes, code, contract language) or when future queries might ask about dropped details.
Semantic Chunking
Instead of rewriting, select. Chunk the source, embed each chunk, rank by relevance to the current query, pass only the top chunks. Most of the corpus never enters the prompt.
How it works. Standard retrieval pipeline — the machinery of RAG, pointed inward at the conversation or document you'd otherwise replay in full. Chunk size, embedding model, similarity threshold are tunable. Surviving chunks are verbatim; they just weren't all included.
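The pipeline reduces to three steps — chunk, score, select. A minimal sketch; the `embed` callable is an assumption standing in for whatever embedding model you actually use, and real systems chunk on structure rather than raw character counts:

```python
import math

def chunk(text: str, size: int = 200) -> list[str]:
    """Naive fixed-size chunking; real systems split on paragraphs or sections."""
    return [text[i : i + size] for i in range(0, len(text), size)]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k_chunks(chunks, query, embed, k=3, threshold=0.0):
    """Rank chunks by similarity to the query; keep the best k above threshold.
    Chunks that survive are verbatim -- selection, not rewriting."""
    q = embed(query)
    scored = [(cosine(embed(c), q), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored[:k] if score >= threshold]
```

Only the surviving chunks enter the prompt; everything below the cut never pays a token.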
What you lose. Anything below threshold. A feature when the query is specific and there's irrelevant material. A failure mode when the query is broad ("summarize what we discussed") or when relevance is flat and top-k misses something thinly spread.
Where it fits. Large corpora with narrow-slice queries — knowledge bases, long documents, histories where each question maps to a specific earlier exchange. Wrong tool when tasks need the whole picture. Chunking is also how hierarchical context loading decides what to pull at the leaf level.
Caveat. Embedding relevance is only as good as the model's match to your domain. Off-the-shelf embeddings on specialized jargon can miss. Measure — don't trust that top-k is actually top.
Token-Level Compression
The most aggressive family. Instead of rewriting or selecting, algorithmically remove tokens — the ones contributing least to meaning. LLMLingua is the best-known line of work: small language models score tokens by predictability, and low-information tokens are dropped.
How it works. A compressor reads the prompt, scores spans, outputs a shorter version stripping low-entropy filler — articles, redundant connectors, verbose phrasings — while keeping high-information tokens. Output often looks ungrammatical to a human but remains readable to the target LLM. Some approaches target a compression ratio; others a token budget.
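A crude stand-in makes the mechanism visible. LLMLingua scores tokens with a small language model's predictability; the sketch below fakes that signal with a filler-word list, which is a toy assumption, not the real method — but the shape (score, drop, fit a budget) is the same:

```python
# Toy approximation: a stopword list stands in for a small LM's
# predictability scores. Real compressors (e.g. LLMLingua) score every token.
FILLER = {"the", "a", "an", "of", "that", "which", "is", "are", "to", "in",
          "and", "it", "this", "very", "really", "just"}

def trim_to_budget(text: str, budget: int) -> str:
    """Drop likely-low-information tokens until the text fits `budget` tokens."""
    tokens = text.split()
    if len(tokens) <= budget:
        return text
    kept = [t for t in tokens if t.lower() not in FILLER]
    return " ".join(kept[:budget])  # hard cut if filler removal wasn't enough
```

The output reads choppy to a human — exactly the "ungrammatical but readable to the target LLM" property the real compressors bet on.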
What you lose. Mostly style, some redundancy, occasionally nuance. The bet is that the target model does not need pristine surface form — which holds much of the time and fails on material where precise wording matters (legal text, literal quotes, code). Fidelity loss is harder to predict than summarization because deletions are local, not holistic.
Where it fits. Prose where you control the compressor and have measured that task performance holds. Especially useful on top of summarization. Wrong tool for verbatim material or tasks that echo source back.
No specific ratio-vs-accuracy number here — published figures vary by model, task, and compressor, and your setup won't match any single paper. Measure on your traffic.
Comparing the Three Families
| Dimension | Summarization | Semantic chunking | Token-level |
|---|---|---|---|
| Operation | Rewrite | Select | Trim |
| Typical compression rate | High, tunable by target length | Very high when queries are narrow | Moderate per pass, stackable |
| Fidelity loss | Whatever the summarizer drops | Chunks below threshold | Style, some nuance; occasional meaning |
| Handles verbatim material | Poorly | Well (chunks are verbatim) | Poorly |
| Cost at compression time | One model call per update | Embedding + similarity | One model call per compression |
| Cost at inference time | None extra | None extra | None extra |
| Best for | Chat history, multi-session state | Large corpora, narrow queries | Prose context where you can measure loss |
| Weakest on | Broad queries that need dropped detail | Broad queries that need coverage | Tasks that need exact wording |
No family dominates. A well-run system picks per material — verbatim source stays verbatim and gets chunked, prose summaries get rewritten, and what remains gets token-trimmed at the end.
Hybrid Approaches
Most real systems combine two or three techniques. Common stacks:
- Summarize then chunk. Rolling summary keeps the thread; chunked retrieval pulls verbatim exchanges when a question needs them. Standard for long-lived chat assistants.
- Chunk then compress. Retrieval returns top-k; a token-level pass shrinks those chunks before they enter the prompt. Useful when top-k is already near budget.
- Structured summary + verbatim tail. Last N turns verbatim plus a structured summary of everything earlier. Recent precision plus older context at predictable cost.
- Tiered summarization. Daily summaries of hourly logs, weekly summaries of daily ones — for agents that run across long horizons. See token economics guide for how layers interact with caching.
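The tiered stack is one function applied at each level. A sketch — `summarize` is again a hypothetical callable wrapping a model, and `fan_in` is how many lower-tier summaries collapse into one upper-tier summary:

```python
def tier_summaries(items: list[str], summarize, fan_in: int = 24) -> list[str]:
    """Collapse one tier of summaries into the next, `fan_in` at a time.
    e.g. 24 hourly log summaries -> 1 daily summary (fan_in=24)."""
    return [summarize(items[i : i + fan_in]) for i in range(0, len(items), fan_in)]

# hourly -> daily -> weekly: same function, applied twice with different fan-in
# daily  = tier_summaries(hourly_summaries, summarize, fan_in=24)
# weekly = tier_summaries(daily, summarize, fan_in=7)
```

Each application is another lossy level — the recursive-summarization caveat from earlier applies at every tier.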
Rule of thumb: compress first where information is least valuable, and only as far as the downstream task can verifiably tolerate. Over-compression is silent — the model answers confidently on material you've quietly gutted.
Fidelity vs Compression Rate
Each family trades fidelity for rate on a different curve.
- Summarization drops in steps. Modest compression keeps most of what matters; past a point, whole topics disappear.
- Semantic chunking has a threshold cliff. Above threshold, verbatim; below, gone. Future queries that needed the dropped chunks see nothing.
- Token-level compression is smooth — each increment costs a little more fidelity on the whole text rather than losing pieces wholesale. Safest family to apply last.
In combination, curves compose unpredictably. The reliable way to find a good operating point is concrete tasks, a fidelity eval, and measurement.
Testing Compression
Compression changes outputs. A pipeline that applies it and never checks is flying blind.
- Set tasks that exercise what you wanted to keep. Not generic benchmarks — cases representative of your traffic. If users ask about a specific earlier exchange, the eval needs cases where the answer depends on that exchange.
- Measure uncompressed first. Run on full context to set a ceiling, then rerun each variant. The delta is your fidelity loss.
- Watch for silent failure. A compressed prompt that answers confidently but wrong is worse than one that refuses. Include cases where the compressor has plausibly dropped needed detail and check that the model either answers correctly or flags uncertainty.
- Re-evaluate on change. New model, new compressor, new source material — compression that worked can quietly stop working. Versioned config, not a one-time decision.
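The measure-uncompressed-first step is a small harness. A sketch under assumptions: `run` wraps a model call, `compress` is whatever pipeline you're testing, `score` maps an answer to [0, 1] against the expected one — all three are placeholders for your actual components:

```python
def fidelity_delta(cases, run, compress, score):
    """Score each eval case on full vs compressed context.
    Returns (full_avg, compressed_avg, delta); delta is your fidelity loss."""
    full = sum(score(run(c["context"], c["q"]), c["expected"]) for c in cases)
    comp = sum(score(run(compress(c["context"]), c["q"]), c["expected"]) for c in cases)
    n = len(cases)
    return full / n, comp / n, (full - comp) / n
```

A nonzero delta is not automatically a veto — it's the number you trade off against the token savings, case by case.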
Example: A Compressed Chat-History Prompt
Hypothetical — a customer support assistant where the conversation history is compressed with a structured rolling summary plus a verbatim tail.
[SYSTEM]
You are a support assistant for ACME. The user's conversation history is
provided in two forms below:
1. SUMMARY — a structured summary of all turns prior to the last 5.
Treat the SUMMARY as a faithful but lossy record. If a detail the user
references isn't in the SUMMARY or in RECENT_TURNS, ask them to
restate it rather than inventing it.
2. RECENT_TURNS — the last 5 turns verbatim. Treat these as authoritative
about what was just said.
[CONTEXT]
SUMMARY:
- Identified issue: {open_issue}
- Decisions made so far: {decisions}
- Open questions: {open_questions}
- Relevant account facts: {account_facts}
RECENT_TURNS:
{last_5_turns_verbatim}
[USER]
{user_message}
The shape: the model knows which block is compressed and which is verbatim, what to do when needed information is in neither (ask, don't invent), and sees a schema-structured summary whose shape determines what survives. Compare against the fully-verbatim approach in context window management strategies — same problem, different trade-off.
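Filling that template is mechanical once the summary is schema-structured. A sketch — the function name and the `summary` dict keys mirror the placeholders above and are illustrative, not a fixed API:

```python
def build_prompt(summary: dict, recent_turns: list[str], user_message: str) -> str:
    """Assemble the compressed-history prompt: structured summary + verbatim tail.
    Missing schema fields render as 'none' so the summary shape stays stable."""
    summary_block = "\n".join(
        f"- {label}: {summary.get(key, 'none')}"
        for key, label in [
            ("open_issue", "Identified issue"),
            ("decisions", "Decisions made so far"),
            ("open_questions", "Open questions"),
            ("account_facts", "Relevant account facts"),
        ]
    )
    return (
        "[CONTEXT]\nSUMMARY:\n" + summary_block
        + "\nRECENT_TURNS:\n" + "\n".join(recent_turns)
        + "\n[USER]\n" + user_message
    )
```

Keeping the field order and labels fixed is deliberate: a stable summary shape is what makes "what survives is predictable" true, and it keeps the block byte-stable for any downstream caching of the surrounding prefix.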
Common Anti-Patterns
- Compressing uniformly. The same ratio on every kind of material throws away verbatim fidelity where you needed it and keeps filler where you didn't. Tier by material.
- Unstructured summarization at scale. Without a schema, the summary drifts — one week it includes numbers, the next it doesn't. Structure the summary.
- Chunking without relevance measurement. Trusting off-the-shelf embeddings on your domain without checking is how retrieval silently misses what it should have found.
- Token-level compression on verbatim-sensitive text. Contracts, quotes, code, log lines — none survive token trimming intact. Different family, not a tighter ratio.
- No evaluation. "It still works in the demo" is how regressions reach production. Measure against tasks that depend on compressed material.
- Compressing cached prefixes. If what you're compressing was static and cacheable, you paid an inference cost to lose the bigger cache-hit discount. Compress variable material; cache the stable prefix.
FAQ
How much can I compress before the model gets worse?
Depends on material, compressor, target model, and task. Published ratios won't transfer. Build an eval on representative tasks, start modest, measure, and increase until the eval regresses. Back off one step.
Does compression interact with prompt caching?
Yes — often badly if applied naively. Compression mutates the prompt and invalidates the cache. Compress the variable part (history, retrieval); leave the stable prefix (system instructions, tool definitions) intact and cached. See token economics guide.
Summarization versus chunking — when?
Summarization fits coherent prose you reference repeatedly — chat history, meeting notes, document skims. Chunking fits large corpora where each query touches a narrow slice — knowledge bases, long reference documents. Many systems use both.
Is token-level compression worth it?
Often as a final pass after summarization and chunking have shaped the prompt. Rarely as a sole technique — structural compression usually wins first. Treat it as the last increment you squeeze out, not the primary mechanism.
What breaks if I over-compress?
Models answer confidently on material that no longer contains the answer. Hallucinations, subtle errors, contradictions with earlier conversation. The UI looks the same; only your eval catches the regression. Always have the eval.
Wrap-Up
Context compression is budget allocation in disguise. Summarization rewrites; chunking selects; token-level compression trims. Each trades fidelity on a different curve and fails in a characteristic way. Real systems combine them — structured summary plus verbatim tail plus top-k retrieval plus a final token-level pass — and measure at every step, because regressions are silent and confidence is cheap. Pick the family that matches what you can afford to lose, structure the output so what survives is predictable, and never ship a compression change you haven't evaluated.
For the broader frame, the context engineering pillar. For the budgeting picture, context window management strategies. For layered prompts, hierarchical context loading. For the cost side, token economics guide. For the vocabulary, context window.