Long context is no longer the constraint it once was. Recent Claude and Gemini models accept over a million tokens in a single context window — enough for entire codebases, hundred-page contracts, or multi-day agent transcripts. The mistake is treating that ceiling as a free pass. Putting content into the window is half the job; the other half is prompting in a way that lets the model navigate, weight, and retrieve from it. This guide, under the context engineering pillar, covers the three skills long-context prompting actually requires: structural markers, placement strategy, and knowing when retrieval beats packing.
When a 1M-Token Window Actually Helps
The big window is genuinely useful when the whole input needs to be co-present and cross-referenced.
- Codebase analysis and refactoring. Following a function call across files, reasoning about a trace through ten modules, generating a migration that depends on types defined in a dozen places. Retrieval can miss cross-file edges that co-presence surfaces naturally.
- Long-document Q&A. Contracts, filings, and papers where the answer depends on definitions in section 2 and exceptions in section 47. Summarizing first risks dropping the exact clause the question turns on.
- Multi-turn agents with rich state. Agents running for hours with tool-call logs and observations that feed later decisions. Aggressive trimming loses the thread.
- Low-latency multi-document synthesis. When retrieval + reranking + generation would blow the latency budget and the source set is small enough to just include.
In each case, the task genuinely needs the model attending across the whole thing.
When It Doesn't
The window is not the right tool when the task is precision retrieval, when it's short, or when the cost-to-accuracy ratio gets worse as context grows.
- Precision retrieval from a large corpus. If the answer is "find the indemnification clause in this 200-page contract," a retriever pulling the five most relevant sections feeds the model better-focused context than dumping the whole document.
- Short, self-contained tasks. Translation, classification, single-file completion. Extra context is pure cost — dollars, latency, and noise.
- Cost-sensitive SLAs. Input tokens are billed on every request. Caching the static front helps; the variable tail still pays per token on every call.
- Latency-critical paths. Time-to-first-token rises with input size on every provider. A 900,000-token prompt that takes twenty seconds to start streaming is a bad UX even when the answer is right.
A useful reflex: if you can't explain why the task needs the whole window co-present, it probably doesn't.
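The cost asymmetry is easy to quantify. A minimal back-of-envelope sketch; the per-token price here is an illustrative assumption, not any provider's real rate:

```python
# Back-of-envelope: packing the corpus vs. sending retrieved sections.
# PRICE_PER_M_INPUT is an assumed illustrative rate, not a real price.
PRICE_PER_M_INPUT = 3.00  # dollars per million input tokens (assumed)

def prompt_cost(tokens: int, price_per_m: float = PRICE_PER_M_INPUT) -> float:
    """Dollar cost of the input side of one call."""
    return tokens / 1_000_000 * price_per_m

packed = prompt_cost(900_000)   # whole corpus on every call
retrieved = prompt_cost(6_000)  # instructions + a few retrieved sections

print(f"packed:    ${packed:.2f}/call")
print(f"retrieved: ${retrieved:.4f}/call")
print(f"ratio:     {packed / retrieved:.0f}x")
```

The ratio compounds with call volume: a prompt shape that costs 150x more per call costs 150x more per day.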
Structural Markers: Help the Model Navigate
A million tokens of undifferentiated text is a haystack. A million tokens with clear boundaries and consistent tags is a library with a table of contents. Markers do three jobs:
- Segment. Let the model treat chunks as distinct units instead of a blur.
- Label. Attach semantics to each chunk ("this is the contract," "this is the policy").
- Anchor. Give the model something to refer back to when the instruction says "using the policy in Section 2, review the contract in Section 3."
Concrete patterns that work:
- XML-style tags. <document>...</document>, <policy>...</policy>, <transcript>...</transcript>. Models parse these cleanly as scoping boundaries, and they make programmatic assembly easier on the app side.
- Markdown section headers. ## Contract Under Review, ### Appendix A — Definitions. Useful when the content is already document-like.
- Consistent delimiters. ===== DOCUMENT 1 ===== / ===== DOCUMENT 2 =====. Less structured than XML but unambiguous when you just need clear breaks.
- Numbered sections. "Section 1," "Section 2," then instructions like "using Section 1, analyze Section 2." Numbering gives the model a cheap way to address content precisely.
The cardinal rule: pick a convention and use it consistently. Mixing XML, Markdown, and plain ----- delimiters forces the model to re-learn structure in every section. Consistency lets it generalize after seeing the pattern once.
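One way to guarantee consistency is to enforce the convention in code rather than by hand. A minimal Python sketch of prompt assembly with a single XML-style convention; the section names and content are hypothetical:

```python
# Sketch: assemble a long prompt with one consistent marker convention
# (XML-style tags). Section names and bodies here are hypothetical.

def tag(name: str, body: str) -> str:
    """Wrap a chunk in a named XML-style scoping boundary."""
    return f"<{name}>\n{body.strip()}\n</{name}>"

def build_prompt(sections: dict[str, str], instruction: str) -> str:
    """Tag every section the same way; put the instruction last."""
    parts = [tag(name, body) for name, body in sections.items()]
    return "\n\n".join(parts + [instruction])

prompt = build_prompt(
    {
        "policy": "No unlimited indemnification. ...",
        "contract": "## Section 1 - Definitions\n...",
    },
    "Using the policy in <policy>, review the contract in <contract>.",
)
```

Because the tags come from the same dictionary keys the instruction names, the convention cannot drift between sections.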
Placement Strategy: The Middle Decays
The "lost in the middle" pattern is well observed across providers and context sizes: attention is not uniform across a long context. Content near the beginning and end is retrieved and used more reliably than content in the middle. The effect is strong enough to shape where load-bearing content goes.
- Top is prime real estate. System role, output format, rules, the schema a response must match. Whatever the model must attend to goes here.
- End is the other prime position. The specific task, binding constraints, and critical restatements of the ask. "Answer in JSON using the schema above; focus especially on X" as the last thing the model sees routinely outperforms burying it earlier.
- Middle is not useless — it's where bulk reference material lives. Retrieved documents, prior conversation, long source content. Just don't put a critical instruction there.
The operational move: restate the critical ask at the end of a long prompt. Not verbosity — the tail is one of the two reliably-attended regions. If your prompt is 400,000 tokens of contract text with a question at the top, that question sits in a weaker position than if you'd repeated it at the bottom. The context window management strategies guide covers this priority-ordering pattern and how it interacts with truncation and summarization.
Position-to-Purpose Map
| Position | Reliability | What belongs here |
|---|---|---|
| Top (first few thousand tokens) | High | System role, output schema, rules, high-priority instructions |
| Upper middle | Medium | Stable reference material, policies, background documents |
| Middle | Lowest | Bulk retrieved content, historical turns, long source documents |
| Lower middle | Medium | Task-specific context, recent tool outputs |
| End (last few thousand tokens, near user message) | High | The specific task, key constraints, restated ask |
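The map translates directly into an assembly order. A sketch assuming the five tiers above; the parameter names are illustrative:

```python
# Sketch: place load-bearing content at top and end, bulk in the middle,
# mirroring the position-to-purpose map. Names are illustrative.

def order_prompt(system: str, reference: list[str], bulk: list[str],
                 task_context: list[str], ask: str) -> str:
    """Assemble prompt parts in reliability-aware order."""
    parts = (
        [system]        # top: rules, schema, high-priority instructions
        + reference     # upper middle: stable policies, background docs
        + bulk          # middle: long source documents, historical turns
        + task_context  # lower middle: recent tool outputs
        + [ask]         # end: the specific task and restated constraints
    )
    return "\n\n".join(parts)
```

The point of encoding it as a function is that no caller can accidentally append a 400,000-token document after the ask.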
Retrieval vs. Packing: When Retrieval Wins
The availability of a million-token window makes it tempting to skip retrieval entirely and just pack everything in. Sometimes that's right. Often it isn't.
Retrieval beats brute-force packing when any of these hold:
- The corpus is larger than the window. Obvious but worth stating.
- Most of the corpus is irrelevant to any given question. Reading 800,000 tokens to find the 2,000 that matter pays for all 800,000 and parks the relevant slice in the weakest-attended middle of the window.
- Content changes frequently. Retrieval re-indexes once; packing re-sends the whole thing every call.
- You need precision over breadth. A good retriever + small focused context often beats the same model given the full corpus on fact-extraction.
- Cost or latency is constrained. Retrieval keeps tokens-per-call small and predictable.
Packing beats retrieval when the task genuinely needs cross-section reasoning (codebase analysis, synthesis across chapters) or when the retrieval step introduces errors that swamp the gain.
A useful hybrid: packed context for cross-section reasoning, retrieved context for precision lookup. Many production systems do both. The needle-in-a-haystack prompting guide covers single-fact retrieval; the context rot problem covers what happens when packed context degrades quality instead of helping.
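The decision above can be sketched as a routing function. The thresholds are assumptions chosen to illustrate the trade-offs, not tuned values:

```python
# Sketch: route between packing and retrieval. The 5% relevance
# threshold is an assumed illustrative cutoff, not a tuned value.

def choose_strategy(corpus_tokens: int, window_tokens: int,
                    needs_cross_section: bool,
                    relevant_fraction: float) -> str:
    if corpus_tokens > window_tokens:
        return "retrieve"  # can't pack what doesn't fit
    if needs_cross_section:
        return "pack"      # co-presence is the point of the task
    if relevant_fraction < 0.05:
        return "retrieve"  # paying for mostly-irrelevant tokens
    return "pack"

choose_strategy(2_000_000, 1_000_000, True, 0.5)  # -> "retrieve"
choose_strategy(450_000, 1_000_000, True, 0.1)    # -> "pack"
```

A hybrid system would call this per sub-task: pack for the synthesis pass, retrieve for each precision lookup.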
Prompting Across Long Contexts: Explicit References and Anchors
Once content is segmented and placed, the last piece is telling the model how to use it. Two patterns carry most of the weight.
Explicit reference instructions. Don't say "using the context above." Say "using the policy in Section 2, review the contract in Section 3." Named anchors beat vague gestures. If sections are numbered or tagged, use the numbers and tags in the instruction — the model will follow them.
Repeated anchors near the ask. For very long prompts, re-declare the key sections near the end: "Recall: the redlines policy is in <policy>, the contract under review is in <contract>, output schema is at the top. Produce the analysis as JSON now." That's a mini table-of-contents placed at a load-bearing position, letting the model re-index into the middle from a strong vantage point.
Testing Long-Context Prompts: The Needle Check
Before trusting a long-context prompt in production, verify it empirically. The core test is a needle-in-a-haystack check: plant a small, specific, easy-to-verify fact somewhere in the middle of the real prompt, then ask the model for it.
- If the model retrieves it reliably across multiple positions, your markers and placement are working.
- If it misses the needle in the middle but finds it at top or end, that's classic lost-in-the-middle — consider retrieval, or re-order so critical content sits in reliable zones.
- If it confabulates an answer instead of saying "not found" when the needle is absent, tighten the instructions with "if the fact is not present, say so explicitly."
Run the test on the real prompt shape with a handful of planted facts and positions. It takes minutes and tells you more than any static inspection.
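A minimal harness for the check. `ask_model` is a placeholder for your actual model call and the needle text is invented; everything else runs as-is:

```python
# Sketch of a needle-in-a-haystack check. `ask_model` stands in for
# your real model call; the needle fact is invented for the test.

NEEDLE = "The vendor's internal project code is ZEPHYR-9."

def plant_needle(haystack: str, needle: str, position: float) -> str:
    """Insert the needle at a fractional depth (0.0 = top, 1.0 = end)."""
    paragraphs = haystack.split("\n\n")
    idx = min(int(position * len(paragraphs)), len(paragraphs))
    return "\n\n".join(paragraphs[:idx] + [needle] + paragraphs[idx:])

def needle_check(haystack: str, ask_model, positions=(0.1, 0.5, 0.9)) -> dict:
    """Ask the same question with the needle planted at several depths."""
    question = ("What is the vendor's internal project code? "
                "If it is not present, say NOT FOUND.")
    results = {}
    for pos in positions:
        prompt = plant_needle(haystack, NEEDLE, pos) + "\n\n" + question
        results[pos] = "ZEPHYR-9" in ask_model(prompt)
    return results
```

A position where the result flips to False marks a dead zone in your prompt shape; that is where lost-in-the-middle is biting.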
Example: Prompt Structured for a 500k-Token Input
Hypothetical but representative — a contracts-review prompt with markers at boundaries and the ask restated at the end.
[SYSTEM — top, load-bearing]
You are a contracts analyst. Respond only with JSON matching this schema:
{
"risks": [{ "clause": string, "severity": "low"|"med"|"high", "why": string }],
"ambiguities": [{ "clause": string, "reading_a": string, "reading_b": string }],
"questions": [string]
}
<policy>
# Redlines policy (~5,000 tokens)
...
</policy>
<reference_clauses>
# Retrieved similar-contract clauses for comparison (~25,000 tokens)
...
</reference_clauses>
<contract>
# Full contract under review (~450,000 tokens)
## Section 1 — Definitions
...
## Section 2 — Scope
...
## Section 47 — Termination
...
</contract>
[USER MESSAGE — end, load-bearing]
Using the redlines policy in <policy> and the reference clauses in
<reference_clauses>, review the full contract in <contract>.
Focus especially on indemnification, limitation of liability, and termination.
For any risk flagged, quote the exact clause text from <contract> — do not
paraphrase. If a topic required by the policy is not addressed anywhere in
the contract, list it under "questions" rather than inventing coverage.
Output strictly as JSON matching the schema at the top. No prose outside JSON.
The shape holds up at scale: schema at the top, bulk material tagged and segmented in the middle, explicit reference instructions at the end, and a final restatement of the output format. The model never has to guess where anything is.
Common Anti-Patterns
- Filling the window because you can. A million-token prompt with 30,000 tokens of actually relevant content is worse than a curated 30,000-token prompt — slower, pricier, and more subject to middle-decay.
- No structural markers. Pasting ten documents back-to-back with no delimiters and expecting the model to know where one ends and the next begins. It often guesses wrong.
- Critical instruction buried in the middle. The ask at token 400,000 of a 900,000-token prompt is in the weakest position in the window. Move it to top or end — ideally both.
- Mixed marker conventions. XML tags in one section, Markdown headings in another, ----- delimiters in a third. Pick one.
- Vague reference instructions. "Using the context above" forces the model to search. "Using the policy in <policy>" tells it exactly where to look.
- No needle test before shipping. Shipping a long-context prompt without empirical retrieval checks is shipping on vibes.
FAQ
How big is "long context" in 2026?
Over a million tokens on recent Claude and Gemini models. Exact ceilings vary by model and tier; check the provider's docs for the specific number.
Does the model attend equally across the window?
No. The "lost in the middle" pattern — stronger attention at the beginning and end, weaker in the middle — is observable across providers. It's why placement matters even when everything technically fits.
Should I always use retrieval instead of long context?
No. Retrieval wins on precision lookup over large corpora; packing wins on cross-section reasoning where the whole needs to be co-present. Many production systems use both.
XML tags or Markdown headers?
Use XML tags for unambiguous scoping and easy programmatic assembly (<policy>, <contract>). Use Markdown headers when the content is already document-like. Pick one convention per prompt.
What's the simplest durable long-context shape?
System rules at the top. Content segmented with consistent tags in the middle. A user message at the end that names the tags explicitly, restates the schema, and asks the question. Run a needle test before trusting it.
Wrap-Up
Long context solves "will it fit?" It doesn't solve "will the model use it well?" — that's still on the prompter. Structural markers give the model a map. Placement strategy puts load-bearing content in load-bearing positions. Retrieval stays in the toolkit for when precision matters more than breadth. Test with a needle check before shipping.
For the broader frame, the context engineering pillar. For budget allocation across the window, context window management strategies. For precision lookup, needle-in-a-haystack prompting. For what goes wrong when packed context stops helping, the context rot problem. For the definition, the context window glossary entry.