Compression is what you reach for when there's more context than budget. The context window reaches a million tokens on recent models, but cost, latency, and "lost in the middle" don't disappear because the ceiling rose. Compression shrinks context while preserving what the model actually needs — and "actually needs" is load-bearing, because every technique trades fidelity for tokens on a different curve. This post, under the context engineering pillar, walks through three families and shows how to combine them without fooling yourself about what was lost.
Why Compression Matters Even at 1M Tokens
A million tokens is not the end of budgeting.
Cost scales with input. Filled prompts pay per token on every call. Caching helps on the static prefix, but the variable tail — history, retrieval, user input — still pays full freight. A compressed tail is a smaller invoice every call.
Latency scales with input. Time-to-first-token grows with context size. A million-token prompt that answers in thirty seconds is not the same UX as a 50,000-token prompt answering in four.
Attention is uneven. Content mid-prompt is recalled less reliably than content at either end. Noise dilutes signal even when it fits. See context window management strategies for the broader budgeting picture.
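The cost point is simple arithmetic, and worth making concrete. A minimal sketch — the price per million input tokens is a hypothetical placeholder, not any provider's actual rate:

```python
def call_cost_usd(input_tokens: int, price_per_mtok: float = 3.0) -> float:
    """Input-side cost of one call. `price_per_mtok` is a hypothetical
    $/1M-input-tokens rate; substitute your provider's actual pricing."""
    return input_tokens / 1_000_000 * price_per_mtok

# A 1M-token prompt vs a 50k-token compressed tail, per call:
full = call_cost_usd(1_000_000)   # 3.00
tail = call_cost_usd(50_000)      # 0.15
```

Multiply the difference by calls per day and the "smaller invoice every call" framing stops being abstract.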
Summarization-Based Compression
The oldest and most general family: rewrite earlier context as a shorter summary. Replay a 200-token summary plus the last few turns verbatim instead of 5,000 tokens of chat history.
How it works. A model — sometimes the main one, often a smaller cheaper one — reads the material and produces a condensed version that enters the prompt in place of the original. Common flavors:
- Rolling summary. Running summary of everything prior to the recent window. After each turn, update it and drop old verbatim turns.
- Recursive summarization. For very long source material, summarize in chunks, then summarize the summaries. Compression rates can be large, but each level loses more.
- Structured summary. Force the summary into a schema — facts, decisions, open questions, action items — so what survives is predictable rather than at the summarizer's whim.
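The rolling flavor can be sketched as a small buffer: keep the last few turns verbatim, and fold anything that falls off the window into a running summary. The `summarize` callable is an assumption — in practice it would wrap a call to a small, cheap model:

```python
from dataclasses import dataclass, field

@dataclass
class RollingSummaryBuffer:
    """Keep the last `window` turns verbatim; fold older turns into a summary."""
    summarize: callable          # (old_summary, evicted_turns) -> new summary string
    window: int = 5
    summary: str = ""
    recent: list = field(default_factory=list)

    def add_turn(self, turn: str) -> None:
        self.recent.append(turn)
        if len(self.recent) > self.window:
            evicted = self.recent[: -self.window]   # turns falling off the window
            self.recent = self.recent[-self.window :]
            self.summary = self.summarize(self.summary, evicted)

    def to_prompt(self) -> str:
        """What actually enters the prompt: summary plus verbatim tail."""
        return f"SUMMARY:\n{self.summary}\n\nRECENT_TURNS:\n" + "\n".join(self.recent)
```

Swapping the plain `summarize` for one that emits a fixed schema turns this into the structured flavor without changing the buffer logic.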
What you lose. Anything the summarizer judged unimportant. If the summary was produced before you knew a detail would matter, it's probably gone. Structured summaries mitigate this by fixing in advance what categories survive.
Where it fits. Chat histories, long document context queried repeatedly, multi-session agent state — coherent prose where downstream queries are reasonably predictable. Wrong tool for verbatim fidelity (quotes, code, contract language) or when future queries might ask about dropped details.
Semantic Chunking
Instead of rewriting, select. Chunk the source, embed each chunk, rank by relevance to the current query, pass only the top chunks. Most of the corpus never enters the prompt.
How it works. Standard retrieval pipeline — the machinery of RAG, pointed inward at the conversation or document you'd otherwise replay in full. Chunk size, embedding model, similarity threshold are tunable. Surviving chunks are verbatim; they just weren't all included.
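The pipeline reduces to three steps — chunk, score, select. A minimal sketch; the `embed` callable is an assumption standing in for whatever embedding model you actually use, and real systems chunk on structure rather than raw character counts:

```python
import math

def chunk(text: str, size: int = 200) -> list[str]:
    """Naive fixed-size chunking; real systems split on paragraphs or sections."""
    return [text[i : i + size] for i in range(0, len(text), size)]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k_chunks(chunks, query, embed, k=3, threshold=0.0):
    """Rank chunks by similarity to the query; keep the best k above threshold.
    Chunks that survive are verbatim -- selection, not rewriting."""
    q = embed(query)
    scored = [(cosine(embed(c), q), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored[:k] if score >= threshold]
```

Only the surviving chunks enter the prompt; everything below the cut never pays a token.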
What you lose. Anything below threshold. A feature when the query is specific and there's irrelevant material. A failure mode when the query is broad ("summarize what we discussed") or when relevance is flat and top-k misses something thinly spread.
Where it fits. Large corpora with narrow-slice queries — knowledge bases, long documents, histories where each question maps to a specific earlier exchange. Wrong tool when tasks need the whole picture. Chunking is also how hierarchical context loading decides what to pull at the leaf level.
Caveat. Embedding relevance is only as good as the model's match to your domain. Off-the-shelf embeddings on specialized jargon can miss. Measure — don't trust that top-k is actually top.
Token-Level Compression
The most aggressive family. Instead of rewriting or selecting, algorithmically remove tokens — the ones contributing least to meaning. LLMLingua is the best-known line of work: small language models score tokens by predictability, and low-information tokens are dropped.
How it works. A compressor reads the prompt, scores spans, outputs a shorter version stripping low-entropy filler — articles, redundant connectors, verbose phrasings — while keeping high-information tokens. Output often looks ungrammatical to a human but remains readable to the target LLM. Some approaches target a compression ratio; others a token budget.
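A crude stand-in makes the mechanism visible. LLMLingua scores tokens with a small language model's predictability; the sketch below fakes that signal with a filler-word list, which is a toy assumption, not the real method — but the shape (score, drop, fit a budget) is the same:

```python
# Toy approximation: a stopword list stands in for a small LM's
# predictability scores. Real compressors (e.g. LLMLingua) score every token.
FILLER = {"the", "a", "an", "of", "that", "which", "is", "are", "to", "in",
          "and", "it", "this", "very", "really", "just"}

def trim_to_budget(text: str, budget: int) -> str:
    """Drop likely-low-information tokens until the text fits `budget` tokens."""
    tokens = text.split()
    if len(tokens) <= budget:
        return text
    kept = [t for t in tokens if t.lower() not in FILLER]
    return " ".join(kept[:budget])  # hard cut if filler removal wasn't enough
```

The output reads choppy to a human — exactly the "ungrammatical but readable to the target LLM" property the real compressors bet on.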
What you lose. Mostly style, some redundancy, occasionally nuance. The bet is that the target model does not need pristine surface form — which holds much of the time and fails on material where precise wording matters (legal text, literal quotes, code). Fidelity loss is harder to predict than summarization because deletions are local, not holistic.
Where it fits. Prose where you control the compressor and have measured that task performance holds. Especially useful on top of summarization. Wrong tool for verbatim material or tasks that echo source back.
No specific ratio-vs-accuracy number here — published figures vary by model, task, and compressor, and your setup won't match any single paper. Measure on your traffic.
Comparing the Three Families
| Dimension | Summarization | Semantic chunking | Token-level |
|---|---|---|---|
| Operation | Rewrite | Select | Trim |
| Typical compression rate | High, tunable by target length | Very high when queries are narrow | Moderate per pass, stackable |
| Fidelity loss | Whatever the summarizer drops | Chunks below threshold | Style, some nuance; occasional meaning |
| Handles verbatim material | Poorly | Well (chunks are verbatim) | Poorly |
| Cost at compression time | One model call per update | Embedding + similarity | One model call per compression |
| Cost at inference time | None extra | None extra | None extra |
| Best for | Chat history, multi-session state | Large corpora, narrow queries | Prose context where you can measure loss |
| Weakest on | Broad queries that need dropped detail | Broad queries that need coverage | Tasks that need exact wording |
No family dominates. A well-run system picks per material — verbatim source stays verbatim and gets chunked, prose summaries get rewritten, and what remains gets token-trimmed at the end.
Hybrid Approaches
Most real systems combine two or three techniques. Common stacks:
- Summarize then chunk. Rolling summary keeps the thread; chunked retrieval pulls verbatim exchanges when a question needs them. Standard for long-lived chat assistants.
- Chunk then compress. Retrieval returns top-k; a token-level pass shrinks those chunks before they enter the prompt. Useful when top-k is already near budget.
- Structured summary + verbatim tail. Last N turns verbatim plus a structured summary of everything earlier. Recent precision plus older context at predictable cost.
- Tiered summarization. Daily summaries of hourly logs, weekly summaries of daily ones — for agents that run across long horizons. See token economics guide for how layers interact with caching.
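The tiered stack is one function applied at each level. A sketch — `summarize` is again a hypothetical callable wrapping a model, and `fan_in` is how many lower-tier summaries collapse into one upper-tier summary:

```python
def tier_summaries(items: list[str], summarize, fan_in: int = 24) -> list[str]:
    """Collapse one tier of summaries into the next, `fan_in` at a time.
    e.g. 24 hourly log summaries -> 1 daily summary (fan_in=24)."""
    return [summarize(items[i : i + fan_in]) for i in range(0, len(items), fan_in)]

# hourly -> daily -> weekly: same function, applied twice with different fan-in
# daily  = tier_summaries(hourly_summaries, summarize, fan_in=24)
# weekly = tier_summaries(daily, summarize, fan_in=7)
```

Each application is another lossy level — the recursive-summarization caveat from earlier applies at every tier.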
Rule of thumb: compress first where information is least valuable, and only as far as the downstream task can verifiably tolerate. Over-compression is silent — the model answers confidently on material you've quietly gutted.
Fidelity vs Compression Rate
Each family trades fidelity for rate on a different curve.
- Summarization drops in steps. Modest compression keeps most of what matters; past a point, whole topics disappear.
- Semantic chunking has a threshold cliff. Above threshold, verbatim; below, gone. Future queries that needed the dropped chunks see nothing.
- Token-level compression is smooth — each increment costs a little more fidelity on the whole text rather than losing pieces wholesale. Safest family to apply last.
In combination, curves compose unpredictably. The reliable way to find a good operating point is concrete tasks, a fidelity eval, and measurement.
Testing Compression
Compression changes outputs. A pipeline that applies it and never checks is flying blind.
- Set tasks that exercise what you wanted to keep. Not generic benchmarks — cases representative of your traffic. If users ask about a specific earlier exchange, the eval needs cases where the answer depends on that exchange.
- Measure uncompressed first. Run on full context to set a ceiling, then rerun each variant. The delta is your fidelity loss.
- Watch for silent failure. A compressed prompt that answers confidently but wrong is worse than one that refuses. Include cases where the compressor has plausibly dropped needed detail and check that the model either answers correctly or flags uncertainty.
- Re-evaluate on change. New model, new compressor, new source material — compression that worked can quietly stop working. Versioned config, not a one-time decision.
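The measure-uncompressed-first step is a small harness. A sketch under assumptions: `run` wraps a model call, `compress` is whatever pipeline you're testing, `score` maps an answer to [0, 1] against the expected one — all three are placeholders for your actual components:

```python
def fidelity_delta(cases, run, compress, score):
    """Score each eval case on full vs compressed context.
    Returns (full_avg, compressed_avg, delta); delta is your fidelity loss."""
    full = sum(score(run(c["context"], c["q"]), c["expected"]) for c in cases)
    comp = sum(score(run(compress(c["context"]), c["q"]), c["expected"]) for c in cases)
    n = len(cases)
    return full / n, comp / n, (full - comp) / n
```

A nonzero delta is not automatically a veto — it's the number you trade off against the token savings, case by case.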
Example: A Compressed Chat-History Prompt
Hypothetical — a customer support assistant where the conversation history is compressed with a structured rolling summary plus a verbatim tail.
[SYSTEM]
You are a support assistant for ACME. The user's conversation history is
provided in two forms below:
1. SUMMARY — a structured summary of all turns prior to the last 5.
Treat the SUMMARY as a faithful but lossy record. If a detail the user
references isn't in the SUMMARY or in RECENT_TURNS, ask them to
restate it rather than inventing it.
2. RECENT_TURNS — the last 5 turns verbatim. Treat these as authoritative
about what was just said.
[CONTEXT]
SUMMARY:
- Identified issue: {open_issue}
- Decisions made so far: {decisions}
- Open questions: {open_questions}
- Relevant account facts: {account_facts}
RECENT_TURNS:
{last_5_turns_verbatim}
[USER]
{user_message}
The shape: the model knows which block is compressed and which is verbatim, what to do when needed information is in neither (ask, don't invent), and sees a schema-structured summary whose shape determines what survives. Compare against the fully-verbatim approach in context window management strategies — same problem, different trade-off.
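Filling that template is mechanical once the summary is schema-structured. A sketch — the function name and the `summary` dict keys mirror the placeholders above and are illustrative, not a fixed API:

```python
def build_prompt(summary: dict, recent_turns: list[str], user_message: str) -> str:
    """Assemble the compressed-history prompt: structured summary + verbatim tail.
    Missing schema fields render as 'none' so the summary shape stays stable."""
    summary_block = "\n".join(
        f"- {label}: {summary.get(key, 'none')}"
        for key, label in [
            ("open_issue", "Identified issue"),
            ("decisions", "Decisions made so far"),
            ("open_questions", "Open questions"),
            ("account_facts", "Relevant account facts"),
        ]
    )
    return (
        "[CONTEXT]\nSUMMARY:\n" + summary_block
        + "\nRECENT_TURNS:\n" + "\n".join(recent_turns)
        + "\n[USER]\n" + user_message
    )
```

Keeping the field order and labels fixed is deliberate: a stable summary shape is what makes "what survives is predictable" true, and it keeps the block byte-stable for any downstream caching of the surrounding prefix.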
Common Anti-Patterns
- Compressing uniformly. The same ratio on every kind of material throws away verbatim fidelity where you needed it and keeps filler where you didn't. Tier by material.
- Unstructured summarization at scale. Without a schema, the summary drifts — one week it includes numbers, the next it doesn't. Structure the summary.
- Chunking without relevance measurement. Trusting off-the-shelf embeddings on your domain without checking is how retrieval silently misses what it should have found.
- Token-level compression on verbatim-sensitive text. Contracts, quotes, code, log lines — none survive token trimming intact. Different family, not a tighter ratio.
- No evaluation. "It still works in the demo" is how regressions reach production. Measure against tasks that depend on compressed material.
- Compressing cached prefixes. If what you're compressing was static and cacheable, you paid an inference cost to lose the bigger cache-hit discount. Compress variable material; cache the stable prefix.
FAQ
How much can I compress before the model gets worse?
Depends on material, compressor, target model, and task. Published ratios won't transfer. Build an eval on representative tasks, start modest, measure, and increase until the eval regresses. Back off one step.
Does compression interact with prompt caching?
Yes — often badly if applied naively. Compression mutates the prompt and invalidates the cache. Compress the variable part (history, retrieval); leave the stable prefix (system instructions, tool definitions) intact and cached. See token economics guide.
Summarization versus chunking — when?
Summarization fits coherent prose you reference repeatedly — chat history, meeting notes, document skims. Chunking fits large corpora where each query touches a narrow slice — knowledge bases, long reference documents. Many systems use both.
Is token-level compression worth it?
Often as a final pass after summarization and chunking have shaped the prompt. Rarely as a sole technique — structural compression usually wins first. Treat it as the last increment you squeeze out, not the primary mechanism.
What breaks if I over-compress?
Models answer confidently on material that no longer contains the answer. Hallucinations, subtle errors, contradictions with earlier conversation. The UI looks the same; only your eval catches the regression. Always have the eval.
Wrap-Up
Context compression is budget allocation in disguise. Summarization rewrites; chunking selects; token-level compression trims. Each trades fidelity on a different curve and fails in a characteristic way. Real systems combine them — structured summary plus verbatim tail plus top-k retrieval plus a final token-level pass — and measure at every step, because regressions are silent and confidence is cheap. Pick the family that matches what you can afford to lose, structure the output so what survives is predictable, and never ship a compression change you haven't evaluated.
For the broader frame, the context engineering pillar. For the budgeting picture, context window management strategies. For layered prompts, hierarchical context loading. For the cost side, token economics guide. For the vocabulary, context window.