A context window in 2026 is wide enough — over a million tokens on recent Claude and Gemini models — that the temptation is to fill it. Don't. Whatever fits is not the same as whatever helps, and a prompt stuffed to the brim is usually both slower and worse than a deliberately curated one. The real job is budget allocation: decide what deserves space at full fidelity, what should be summarized, what should be truncated, and what should come from retrieval on demand. This post sits under the context engineering pillar and walks through the four strategies that do that work.
Why Context Windows Need Management Even at 1M Tokens
Three reasons the budget matters even when the ceiling is high.
Cost scales with input tokens. A filled million-token prompt on every call is a line item. Even with caching on the static prefix, the variable tail pays per token on every request.
Latency scales with input tokens too. Time-to-first-token grows with context size on every provider. A prompt that fits in the window but takes twenty seconds to start streaming is not a good user experience.
Models don't attend equally across the window. The "lost in the middle" phenomenon is well observed across providers: content in the middle of a long context is retrieved less reliably than content at the beginning or end. Putting a critical instruction in token 400,000 of a 900,000-token prompt is not the same as putting it in token 400. See the long context prompting guide for the positional patterns.
The four strategies below — truncation, summarization, sliding windows, priority ordering — are how production systems keep the budget honest.
Truncation: Drop What No Longer Earns Its Place
Truncation is the bluntest tool and often the right one. Cap the context at a target size and cut what doesn't fit. Two common policies:
- Drop oldest. Keep the most recent N messages or tokens. Best when recency correlates with relevance.
- Drop least relevant. Score items by a signal (recency + retrieval score + user action) and drop the lowest. Best for mixed feeds where old items can still matter.
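Both policies reduce to a few lines. A minimal sketch, assuming each context item is a dict with illustrative `tokens`, `score`, and `ts` (timestamp) fields — the field names and scoring signal are placeholders, not a specific framework's API:

```python
def truncate_drop_oldest(items, budget):
    """Keep the most recent items that fit within the token budget."""
    kept, used = [], 0
    for item in reversed(items):            # walk newest to oldest
        if used + item["tokens"] > budget:
            break
        kept.append(item)
        used += item["tokens"]
    return list(reversed(kept))             # restore chronological order


def truncate_drop_least_relevant(items, budget):
    """Drop the lowest-scoring items until the rest fit the budget."""
    ranked = sorted(items, key=lambda i: i["score"], reverse=True)
    kept, used = [], 0
    for item in ranked:
        if used + item["tokens"] <= budget:
            kept.append(item)
            used += item["tokens"]
    # survivors go back in chronological order for the prompt
    return sorted(kept, key=lambda i: i["ts"])
```

Note the asymmetry: drop-oldest stops at the first item that overflows, while drop-least-relevant keeps scanning for smaller items that still fit.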
Truncation's virtue is that nothing remaining is paraphrased. What the model sees is the original wording. If the task depends on exact quotes, terminology, or numbers, truncation preserves them where summarization would risk drift.
The cost is that truncated content is gone. Pair truncation with a retrieval backstop so cut content remains addressable — the user or agent can pull it back in on demand. Truncation without retrieval is amnesia; truncation with retrieval is focus.
Summarization: Compress When You Can Afford Fidelity Loss
Summarization replaces a section of context with a shorter description. Multi-turn chat applications do this to keep long sessions tractable: once a conversation exceeds a threshold, the earliest N turns collapse into a "summary of the session so far" block, and the raw turns drop.
Summarization shines when exact wording doesn't matter downstream, a compressed representation is enough to maintain continuity, and you need some signal from many turns rather than losing them entirely.
It hurts when fine details — names, numbers, quoted phrasing, code — need to survive verbatim, when the summary comes from a weaker model that drops nuance the main model could have used, or when repeated layers of summarization drift (a summary of a summary of a summary compresses away specifics no one knew would matter).
A production pattern: summarize old turns but keep the last few raw, preserving recency detail while bounding total cost. See context compression techniques for when to summarize, when to extract structured facts, and when to just drop.
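The pattern is a few lines of orchestration around a summarizer call. A hypothetical sketch — the turn format, the thresholds, and the injected `summarize` callable (in practice an LLM call) are all assumptions:

```python
KEEP_RAW = 4    # last N turns stay verbatim
TRIGGER = 20    # start compressing only past this many turns

def compress_history(turns, summarize):
    """Collapse older turns into one summary block; keep the tail raw.

    `summarize` is any callable mapping a list of turns to a short
    string; injecting it keeps the policy model-agnostic.
    """
    if len(turns) <= TRIGGER:
        return turns
    old, recent = turns[:-KEEP_RAW], turns[-KEEP_RAW:]
    summary_turn = {
        "role": "system",
        "content": f"Summary of earlier conversation: {summarize(old)}",
    }
    return [summary_turn] + recent
```

The `TRIGGER` guard matters: summarizing short sessions wastes a model call and degrades detail for no budget benefit.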
Sliding Windows: Keep Only the Recent N
A sliding window keeps the last N turns (or last N tokens) and drops everything before. It is a common default for chat and voice assistants where the near past dominates relevance.
Sliding windows are cheap, predictable, and easy to reason about. The trade is that anything earlier than N is gone — including instructions the user gave on turn 1 that still apply on turn 50. The fix is to anchor stable content outside the window: system prompt, user preferences, tool definitions, and session-wide rules sit in a fixed prefix that doesn't slide. Only the conversational portion slides.
Two knobs worth tuning: window size (start conservative at 10-20 turns, widen only if accuracy justifies it) and unit (turn count is simpler; token count adapts when some turns are short and others long). Pick one per surface. The hierarchical context loading guide covers the related stable-versus-dynamic tiers pattern.
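The anchored-prefix shape is simple enough to show directly. A sketch under assumed message-dict conventions; `build_prompt` and its parameters are illustrative names, not a provider API:

```python
def build_prompt(system_prompt, preferences, turns, window=12):
    """Fixed prefix never slides; only the conversational tail does."""
    prefix = [
        {"role": "system", "content": system_prompt},
        {"role": "system", "content": f"User preferences: {preferences}"},
    ]
    return prefix + turns[-window:]   # keep only the last N turns
```

Everything session-stable lives in `prefix`; the slice applies only to the conversational portion, so turn-1 rules survive to turn 50.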
Priority Ordering: Put Load-Bearing Content in Load-Bearing Positions
Priority ordering is not about what to include but where. Given that models attend more reliably to the beginning and end of a long context, place critical content there.
- Top: system identity, rules, output format, high-priority instructions. Reliably attended.
- Middle: reference material, retrieved documents, historical turns. Available, not guaranteed-attended.
- End (near the user message): the specific task, key retrieved facts, constraints that must bind this answer.
This is why a long prompt often benefits from restating the critical ask at the end — not as a verbosity tic, but because the tail is one of the two reliably-attended regions. If the request involves a long document, "Given the above, answer X with constraints Y and Z" as the final sentence routinely outperforms burying the question at the top.
Example: Hypothetical Priority-Ordered Prompt
[SYSTEM — top, load-bearing]
You are a contracts analyst. Respond only with JSON matching the schema below.
Schema: { "risks": [...], "ambiguities": [...], "questions": [...] }
[REFERENCE MATERIAL — middle, less reliably attended]
Company redlines policy:
<...policy text, 3,000 tokens...>
Relevant past clauses from similar contracts:
<...retrieved clauses, 5,000 tokens...>
[CONTRACT UNDER REVIEW — later middle]
<...full contract text, 40,000 tokens...>
[USER MESSAGE — end, load-bearing]
Review the contract above against the redlines policy.
Output risks, ambiguities, and questions, strictly as JSON in the schema.
Focus especially on indemnification clauses and termination terms.
This is a hypothetical template, but the shape is typical: rules and schema at the top, bulk material in the middle, explicit task restatement at the end. The middle is not wasted — the model uses it — but the two anchor positions carry the instructions that must not be missed.
Combining Strategies: Most Real Systems Use a Hybrid
The strategies above are not exclusive. Production context managers usually combine all four: priority ordering defines where each tier sits, a sliding window caps the conversational tail, summarization collapses older turns once they slide out of the raw window, and truncation is the hard backstop when even the summarized history exceeds budget. Retrieval pairs with all of them, pulling back content from outside the window on demand.
A typical assembly for a long-running chat agent: stable system prompt at the top; hierarchical reference material priority-ordered; summarized older conversation compressed to a few hundred tokens; sliding window of last N raw turns; current user message at the end; truncation check as a final backstop dropping from the middle first. This layering is what hierarchical context loading formalizes — each tier has a different rate of change and a different strategy for when the budget tightens.
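That layering can be sketched as a single assembly function. Everything here is hypothetical scaffolding: the section names, the halving backstop, and `count_tokens` (any tokenizer callable) are assumptions, not a specific provider's API:

```python
def assemble_context(system, reference, summary, raw_turns, user_msg,
                     budget, count_tokens):
    """Layered assembly: stable top, reference middle, summary,
    raw conversational tail, user message last. The truncation
    backstop drops from the middle (reference) first, then the
    summary, never the anchors."""
    sections = {
        "system": system,        # stable, load-bearing top
        "reference": reference,  # bulk middle, first to shrink
        "summary": summary,      # compressed older turns
        "raw": raw_turns,        # recent turns kept verbatim
        "user": user_msg,        # load-bearing end
    }

    def total():
        return sum(count_tokens(s) for s in sections.values())

    # backstop: repeatedly halve the expendable tiers until we fit
    for key in ("reference", "summary"):
        while total() > budget and sections[key]:
            sections[key] = sections[key][: len(sections[key]) // 2]

    return "\n\n".join(s for s in sections.values() if s)
```

Halving is a crude stand-in for smarter shrinking (re-retrieving fewer chunks, re-summarizing tighter), but the ordering is the point: the system prompt and user message are never touched.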
Strategy Selector: Failure Mode to Strategy
When the current approach is failing, the symptom points to the fix.
| Symptom / Failure Mode | Likely Cause | Strategy That Fits |
|---|---|---|
| Model forgets early instructions mid-session | Sliding window dropped them | Anchor stable instructions outside the window (priority ordering) |
| Accuracy drops as context grows | Lost-in-the-middle on critical content | Re-order: move critical content to top or end (priority ordering) |
| Cost/latency growing linearly with session length | No compression on old turns | Summarization of older turns |
| Model hallucinates facts from earlier context | Summarization dropped specifics that mattered | Switch to truncation + retrieval for those items |
| Model cites wrong version of a document | Multiple versions in context, no clear priority | Truncate older versions; priority-order latest at top |
| Long reference doc in middle isn't being used | Middle-of-context position | Chunk and retrieve relevant sections, place near end |
| Session context overflows hard token limit | No backstop | Truncation with retrieval fallback |
| Answers feel generic in long sessions | Everything summarized, nothing raw | Keep last N turns raw (sliding window) |
Signals That Your Strategy Is Failing
Treat these as alarms:
- Accuracy drops as context grows. If evals score worse on longer sessions, something in the middle is getting ignored or misread. Check positional placement.
- Missed references. Model responds as if content you included wasn't there — either dead middle zone, or summarized past usefulness.
- Hallucinated specifics. Invented names, dates, or quotes. Truncation or over-aggressive summarization cut something the model now "remembers" by confabulation.
- Latency and cost creep. Time-to-first-token and input tokens climb with every turn. Compression isn't firing, or isn't firing enough.
Each maps to one of the strategies above.
Common Anti-Patterns
- Filling the window because you can. A million-token prompt with 30,000 tokens of actual relevant content is worse than a well-curated 30,000-token prompt — slower, more expensive, more lost in the middle.
- Summarizing without preserving recency. Collapsing the last few turns into a summary loses the raw detail the model needs for the next response. Keep the tail raw.
- Sliding the whole window, including system prompt. Instructions that should be permanent get dropped. Anchor stable tiers outside the sliding region.
- Putting critical instructions in the middle. The reliably-attended positions are top and end. Critical constraints belong in one of those two, usually both.
- Truncating without a retrieval backstop. Cut content should stay addressable. Otherwise the model confabulates or repeatedly asks the user to re-paste.
- Summarizing with a weaker model than the consumer. A cheap summarizer feeding a strong downstream model usually drops signal the downstream model could have used. Match capability to the task.
FAQ
How big should my context budget be?
Start with the smallest that covers the task and grow only when evals justify it. Bigger context is not free — it costs input tokens, latency, and positional reliability. A budget of 8,000-30,000 tokens handles most production applications; reaching for hundreds of thousands should be driven by specific need, not availability.
Should I summarize with the same model I use for inference?
Often yes. A weaker summarizer drops signal the downstream model could have used; a stronger one is overkill. If cost matters, test a cheaper summarizer against evals before committing.
When is retrieval better than any window strategy?
When the content changes often, is larger than the budget, or is only relevant to some calls. Retrieval keeps total context small and targeted; window strategies manage what's already there. They complement each other.
Does prompt caching change how I manage the window?
Yes — the stable front becomes cheap on repeat calls, which raises the cost of putting volatile content early. With caching on, priority ordering pairs with hierarchy: stable-cache-first, then dynamic tiers. See the prompt caching guide 2026.
What's the simplest viable context manager?
System prompt at top, sliding window of last 10 turns raw, truncation as backstop, retrieval for anything pulled back from further out. Add summarization when session length costs; add priority ordering when middle content starts being ignored.
Wrap-Up
The four strategies — truncation, summarization, sliding windows, priority ordering — are not alternatives. They are tools in a budget-allocation toolkit that production systems combine. Pick truncation when what remains must be verbatim. Pick summarization when you can trade specifics for continuity. Pick a sliding window when recency dominates relevance. Use priority ordering over all of them, because where content sits matters as much as whether it's there.
For the broader picture, the context engineering pillar frames window management as one part of a larger practice. For positional patterns, see the long context prompting guide. For compression, see context compression techniques and hierarchical context loading. For the one-paragraph definition, the context window glossary entry.