Context windows keep getting bigger. Output quality doesn't track linearly — past a certain fill level, it goes the other way. That drift is context rot: observable quality decay inside a context still well within the model's advertised limit. Rot is the model quietly missing references, re-asking for information already in-window, skipping an instruction that sits three thousand tokens above the current question. This post, under the context engineering pillar, unpacks what rot is, why it happens, how to notice it, and what to do about it.
What Context Rot Is
Context rot is quality decay that kicks in before you hit the context window limit. The request fits; the model accepts it; the response looks plausible. Under the hood, accuracy on mid-context content is softer, references get dropped, and the model skews toward patterns from the top and bottom of the window.
Nothing breaks visibly. A system that worked fine at 8K misses things at 40K. Tests pass if they only probe the ends. Production complaints trickle in as "the model ignored my spec" or "it forgot what I told it earlier." The window didn't overflow. The quality ceiling dropped.
A working definition: context rot is the gap between the advertised context length and the length at which the model still uses all the content reliably. The gap exists for every model and is larger than marketing suggests.
Why It Happens
Three patterns drive rot.
Attention thins as the window fills. Attention heads concentrate around anchors — the latest user turn, the system prompt, the nearest labeled section — and coverage thins as you move away. A dense 8K context gives each token a meaningful share. A 100K context distributes that share across vastly more tokens, so any individual passage gets less.
Middle content is weaker. Long-context evaluations repeatedly show material buried in the middle is recalled less reliably than material at the start or end. The exact shape varies by model and task, but the direction holds. We go deeper in needle-in-a-haystack prompting.
Irrelevant context is active noise. Filler doesn't just dilute — it competes. Unrelated documents pull attention, surface distractor facts, and raise the chance the model grabs a plausible-looking passage instead of the right one. "More context" is not free.
Put together: fill the window, push the important piece into the middle, surround it with near-but-not-quite-relevant content, and effective accuracy drops well before the hard limit.
Rot vs. Limit — Different Problems
Treating rot as the same as overflow is a common mistake. They're not.
| Problem | Symptom | Fix |
|---|---|---|
| Context limit hit | Request truncates or errors | Trim, compress, or split |
| Context rot | Request succeeds, output quality drops | Compress, selectively retrieve, or restructure |
| Prompt ambiguity | Inconsistent answers on short inputs | Rewrite prompt, add examples |
| Knowledge gap | Wrong answers regardless of context | Ground with retrieval or tools |
Limit is a binary you trip over. Rot is a gradient you slide down. A system can sit under the limit and still rot — 80K used of a 200K window, answers at the 60K mark worse than at 20K. The fix isn't more tokens. It's less.
Signals That Rot Is in Play
Rot rarely announces itself. You spot it by its fingerprints.
- Reference questions fail. The model answers confidently but misquotes or skips the specific fact that was in the supplied context.
- It re-asks for information already given. A clarifying question about something stated two thousand tokens above.
- Cross-references get missed. Two items need joining; the model handles each alone but doesn't connect them.
- Hallucination climbs. Not wild fabrication — small plausible invented details replacing harder-to-retrieve real ones.
- Instructions leak. A system-prompt rule ("always cite the source file") gets dropped on longer requests but honored on short ones.
- Quality degrades with fill, not complexity. Same task shape, same difficulty, accuracy drops as context grows. The gradient is the tell.
If several of these show up together on longer inputs but not shorter ones, rot is the working hypothesis.
Detecting Rot Before It Hurts Production
You can't eyeball rot. It hides behind plausible answers. Active probes are the move.
Known-answer probes. Embed a deliberately placed fact the user would never supply — a synthetic customer ID, a made-up style rule — and ask a follow-up whose correct answer depends on it. Run the same probe at different context lengths. If pass rate drops as fill scales, rot is in play.
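A minimal sketch of a known-answer probe sweep. Everything here is hypothetical scaffolding: `call_model(context, question)` stands in for whatever model client you actually use, and the filler construction is synthetic.

```python
def build_context(filler_words: int, fact: str) -> str:
    """Synthetic context: filler with a planted known-answer fact in the middle."""
    filler = "lorem ipsum " * filler_words
    mid = len(filler) // 2
    return filler[:mid] + " " + fact + " " + filler[mid:]

def probe_pass_rate(call_model, fact, question, expected, fill_levels, trials=5):
    """Run the same known-answer probe at several fill levels.

    call_model(context, question) -> answer string is a hypothetical
    stand-in for your real model client.
    """
    results = {}
    for fill in fill_levels:
        passes = 0
        for _ in range(trials):
            answer = call_model(build_context(fill, fact), question)
            passes += expected.lower() in answer.lower()
        results[fill] = passes / trials
    return results
```

Because the probe itself never changes between runs, a falling pass rate at higher fill levels is attributable to context size, not task difficulty.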
Reference questions. After the main answer, append a question that requires specific content from the supplied material: "which section states the cancellation policy?" If the answer cites content that isn't in the supplied doc, or cites nothing, the model isn't reading what you think.
Needle-in-a-haystack at production shape. The academic benchmark puts a fact in filler. Production rot involves structured context — retrieved chunks, tool outputs, conversation. Build a needle test using your real context shape, with the needle at varying depths.
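A sketch of a depth sweep over production-shaped chunks. The chunk list, needle, and `call_model(context, question)` signature are all assumptions, not a real harness.

```python
def plant_needle(chunks, needle, depth_fraction):
    """Insert a needle chunk at a relative depth (0.0 = start, 1.0 = end)
    in a list of production-shaped context chunks."""
    if not 0.0 <= depth_fraction <= 1.0:
        raise ValueError("depth_fraction must be in [0, 1]")
    pos = round(depth_fraction * len(chunks))
    return chunks[:pos] + [needle] + chunks[pos:]

def depth_sweep(call_model, chunks, needle, question, expected, depths):
    """Record whether the model recovers the needle at each depth.

    call_model(context, question) -> answer string is hypothetical."""
    outcomes = {}
    for d in depths:
        ctx = "\n\n".join(plant_needle(chunks, needle, d))
        answer = call_model(ctx, question)
        outcomes[d] = expected.lower() in answer.lower()
    return outcomes
```

Feeding it real retrieved chunks or tool outputs, rather than lorem-ipsum filler, is the point: rot shows up at different depths for structured context than for academic filler.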
Ablation. Keep the question constant. Trim context progressively and see where accuracy recovers. The size at which answers return is your rot boundary for that task.
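One way to sketch the ablation loop, under a simplifying assumption: trimming keeps the tail of the context (the part nearest the question), so the needed fact is presumed to survive each cut. A real harness would trim filler only. `call_model` is again a hypothetical client.

```python
def find_rot_boundary(call_model, full_context, question, expected, step=0.25):
    """Shrink the context geometrically until the answer recovers;
    return the first (largest) size at which it is correct."""
    size = len(full_context)
    while size > 0:
        trimmed = full_context[-size:]  # keep the tail, nearest the question
        answer = call_model(trimmed, question)
        if expected.lower() in answer.lower():
            return size  # the size at which answers return: the rot boundary
        size = int(size * (1 - step))
    return None  # never recovered: likely not a rot problem
```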
Single-length evaluations miss rot. Evaluations that sweep length catch it.
An Example Rot Probe
A hypothetical reference-question probe for a support agent reading a policy document. Illustrative, not real code.
```
[SYSTEM]
You are a support agent. Answer using only the supplied policy. If the policy doesn't cover the question, say so.

[POLICY DOCUMENT — 40,000 tokens]
...long real policy text...

[Inserted at 25K depth — the probe]
§ Internal note 7.B: Accounts flagged with the internal code
"AUBERGINE-9" receive a 72-hour response SLA regardless of plan tier.
This applies even for free-tier accounts.

...remainder of policy text...

[USER]
Primary: A free-tier account tagged "AUBERGINE-9" has opened a ticket. What SLA applies?
Reference check: Quote the exact section you used to answer, by section number.
```
Expected pass: "72-hour SLA" cited as "§ 7.B (Internal note)." Rot signals: "standard free-tier SLA" (missed the fact), correct answer with a fabricated section cite, or ignoring the reference check. Move the note to 5K, 15K, 35K depth. The depth at which reference accuracy falls off is your rot boundary for that shape of context.
Mitigations
Four levers, roughly in order of reach.
1. Compression. Shrink what you send. Summarize prior turns into a running synopsis. Strip boilerplate from retrieved docs. Drop tool outputs to relevant fields. Less filler per useful token, less rot. See context compression techniques for methods that keep fidelity while cutting volume.
2. Selective retrieval. Don't dump top-k. Rank aggressively, cap k, and cut below a relevance threshold. Adding a fifth marginal chunk usually introduces more distractor risk than it adds signal. Tighter retrieval beats bigger retrieval.
3. Task splitting. If one prompt asks the model to digest a corpus and answer a specific question, split. First prompt: extract relevant passages. Second prompt: answer using only those. The second runs on a tiny dense context, well below any rot boundary. Map-reduce over documents — more accurate than single-shot long-context for tasks that tolerate the extra hop.
4. Hierarchical loading. Same size, better order. Task-specific material nearest the question, stable defaults in the system prompt, supporting material in between. Rot hits the middle hardest — if the middle is lower-priority, the damage is less. Covered in long context prompting guide.
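The first two levers can be sketched in a few lines. The field names, score format, and thresholds are hypothetical placeholders to tune per task, not a real API.

```python
def compress_tool_output(response: dict, keep: set) -> dict:
    """Lever 1 (compression): drop a tool response to only the fields
    the task actually needs."""
    return {k: v for k, v in response.items() if k in keep}

def select_chunks(scored_chunks, k_max=5, min_score=0.6):
    """Lever 2 (selective retrieval): rank aggressively, cut below a
    relevance threshold, then cap k.

    scored_chunks: (score, text) pairs from your retriever; the
    threshold and cap are placeholders to tune."""
    ranked = sorted(scored_chunks, key=lambda sc: sc[0], reverse=True)
    return [text for score, text in ranked if score >= min_score][:k_max]
```

Note the order of operations in `select_chunks`: the threshold cut happens before the cap, so a marginal chunk never displaces a strong one.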
For most production systems, compression plus selective retrieval clears the bulk of rot. Task splitting handles the rest when the corpus is genuinely large.
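Task splitting as a minimal map-reduce sketch. The prompt wording and the single-string `call_model(prompt)` client are assumptions; the structure is what matters: a cheap extraction pass per document, then one answer pass over a tiny dense context.

```python
def extract_then_answer(call_model, documents, question):
    """Map: extract relevant passages per document.
    Reduce: answer from the small combined extract.

    call_model(prompt) -> str is a hypothetical model client."""
    passages = []
    for doc in documents:
        extract = call_model(
            "From the document below, copy only passages relevant to: "
            f"{question}\nIf nothing is relevant, reply NONE.\n\n{doc}"
        )
        if extract.strip() != "NONE":
            passages.append(extract)
    combined = "\n\n".join(passages)  # tiny, dense second-stage context
    return call_model(
        f"Answer using only these passages:\n\n{combined}\n\nQ: {question}"
    )
```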
When Context Rot Isn't the Problem
Not every long-context failure is rot. Two common misdiagnoses.
It's actually a prompt problem. If the model fails on a short 2K request the same way it fails on 60K, the context isn't rotting — the instructions are ambiguous, the output format is underspecified, or the task is under-scoped. Fix the prompt first. Re-run with long context only if the short version works.
It's a knowledge gap. Sometimes the model doesn't know the domain well enough, and no amount of rearrangement fixes it. Signal: even a minimal, perfect context with the exact answer highlighted still produces a wrong response. The fix is grounding, not rot mitigation.
A cheap check: reproduce on short context with the same content density. Still fails? Not rot. Passes short, fails long? Working hypothesis: rot.
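The differential check above can be written down directly. `call_model(context, question)` is a hypothetical client; the two contexts are assumed to carry the same content at the same density, differing only in length.

```python
def rot_or_not(call_model, short_context, long_context, question, expected):
    """Cheap diagnosis: fails short -> not rot (prompt or knowledge
    problem); passes short, fails long -> working hypothesis is rot."""
    def ok(ctx):
        return expected.lower() in call_model(ctx, question).lower()
    if not ok(short_context):
        return "not-rot"
    return "rot" if not ok(long_context) else "healthy"
```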
Common Anti-Patterns
Rot-inducing habits show up often.
- Dumping top-20 retrieval. Past the top few, added chunks are mostly distractors.
- Never trimming conversation history. Unbounded chat threads guarantee rot by turn 40.
- Treating "fits in window" as "works in window." The advertised limit is an input-size bound, not a quality guarantee.
- Testing only at one context length. Evals that never vary fill miss rot completely. Sweep at least two sizes.
- Tool-output sprawl. Pasting full API responses when the agent needed one field.
- Invisible duplication. Same fact in retrieval, memory, and system prompt. Deduplicate before send.
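The last item, deduplication before send, is mechanical enough to sketch. Normalization here is deliberately crude (strip and lowercase); a real pipeline might normalize harder.

```python
import hashlib

def dedupe_sections(sections):
    """Drop verbatim-duplicate sections before send, keeping the first
    occurrence. Sections may come from retrieval, memory, and the
    system prompt at once."""
    seen, kept = set(), []
    for text in sections:
        key = hashlib.sha256(text.strip().lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept
```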
FAQ
Is context rot the same as "lost in the middle"?
Related but not identical. Lost in the middle is one mechanism — middle content is recalled less reliably. Rot is broader: it also includes attention thinning across the full window and noise from low-relevance content.
At what context length does rot start?
It depends on the model and task shape, and no single published number generalizes. Measure it for your setup with a known-answer probe sweep. Run at several fill levels, watch where pass rate drops, use that as your planning boundary.
Does a bigger context window help or hurt?
Both. A bigger window fits more when you need to, but a bigger budget encourages filling it, and the rot gradient doesn't care the window got larger. Bigger windows raise the ceiling, not the floor.
Can I trust the model to pick what's important?
Not reliably. Models skim long context with attention, not comprehension, and are more likely to latch onto a confident-looking passage than the right one. Selection is a job you do before sending.
How often should I re-measure rot boundaries?
Whenever the model, retrieval system, average fill, or task shape changes. Model upgrades often shift the boundary. Don't assume last quarter's mitigations are still the right size.
Wrap-Up
Context rot is real, gradual, and usually invisible to anyone not measuring it. The window fits, the answer reads fine, accuracy has slipped. Catch it with known-answer probes and reference questions that sweep context length. Mitigate with compression, selective retrieval, task splitting, and better ordering — compression and retrieval discipline clear most cases. Rot is not a hard failure, which is exactly what makes it dangerous.
For the broader frame, the context engineering pillar. For length-aware prompting, long context prompting. For depth-sensitive evaluation, needle-in-a-haystack prompting. For shrinking what you send, context compression techniques. For the base term, context window.