Most of what makes an LLM feature work in production is decided before the model runs — in how the context window gets packed, cached, and ordered. This post is the distillation of the context engineering pillar: a 12-point checklist that ties the discipline's moving pieces together. Each item is short, with a link back to the deep-dive. Read it as the last-step review before you ship.
One-paragraph framing: context engineering is the practice of deciding, per call, what information the model sees, in what order, at what cost, and with what caching behavior. Prompt engineering is a subset — the wording layer. Everything below sits above and around that wording layer.
The 12-Point Checklist
1. Budget the context window as tokens, not "fill it up"
Treat the window like a cost line item, not free space. Every request has a latency budget and a dollar budget; both grow roughly linearly with input tokens, and attention cost grows super-linearly once windows get large. "The model supports 200K tokens" is not an invitation to send 200K — it's a ceiling. Pick a working budget per route (system, retrieved context, few-shot, user, reserved output), enforce it in code, and fail loudly when something overshoots. More detail and concrete budgets by route type in context window management strategies and the related deep-dive on long-context prompting.
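"Enforce it in code" can be as simple as a per-route budget table and a check that raises instead of silently truncating. A minimal sketch — route names, slot names, and the numbers are all illustrative, not recommendations:

```python
# Hypothetical per-route, per-slot token budgets; every number here is
# an example, not a recommendation — tune per route from your own traffic.
ROUTE_BUDGETS = {
    "support_chat": {
        "system": 1200, "retrieved": 3000, "few_shot": 800,
        "user": 1000, "reserved_output": 1000,
    },
}

def check_budget(route: str, slot_tokens: dict) -> None:
    """Fail loudly when any slot overshoots its budget for this route."""
    budget = ROUTE_BUDGETS[route]
    for slot, used in slot_tokens.items():
        cap = budget[slot]
        if used > cap:
            raise ValueError(
                f"{route}/{slot}: {used} tokens exceeds budget of {cap}"
            )

# Passes silently when every slot is under its cap.
check_budget("support_chat", {"system": 1100, "retrieved": 2500})
```

The point is the failure mode: an overshoot should be an error you see in a log or a test, not a silent reorder of what the model receives.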
2. Put stable content first
The top of the prompt should be the part that never changes across requests in a session — role, policies, domain guide, tool definitions, invariant examples. The bottom should be the part that does change — retrieved docs, the user turn, the tool results. This ordering isn't only stylistic; it's what makes caching work and what keeps critical instructions out of the lossy middle of the window. See system prompt vs user prompt context for what belongs where, and prompt caching guide 2026 for why prefix stability matters economically.
3. Use explicit cache markers where supported
If your provider exposes cache control (Anthropic's cache breakpoints, OpenAI's automatic prefix caching, Gemini's explicit cache), use them. The price delta on cached input tokens is large enough that "most of the prompt is reused" should almost always translate to "most of the prompt is cached." Mark a stable system block, mark a stable tool-schema block, and keep those blocks byte-identical across calls. The compare-and-contrast between providers is in Claude vs OpenAI prompt caching, and the broader trade-offs with embedding-based reuse live in semantic caching vs prompt caching. See also the prompt caching glossary entry.
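As a concrete shape: Anthropic's Messages API takes `cache_control` markers on content blocks. The sketch below shows where the markers go; the model name and payload strings are placeholders, and you should check current provider docs for limits like minimum cacheable prefix length.

```python
# Illustrative request body using Anthropic-style cache breakpoints.
# The cache_control field shape follows Anthropic's documented API;
# model name and block contents are placeholders.
STABLE_SYSTEM = "You are a support assistant for Acme Corp's billing system."
STABLE_TOOLS = '{"tools": "...stable schema JSON, byte-identical per call..."}'

def build_request(user_turn: str, retrieved: str) -> dict:
    return {
        "model": "claude-model-placeholder",
        "system": [
            # Stable blocks first, marked cacheable, never regenerated
            # per request — any byte difference breaks the cache hit.
            {"type": "text", "text": STABLE_SYSTEM,
             "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": STABLE_TOOLS,
             "cache_control": {"type": "ephemeral"}},
        ],
        "messages": [
            # Dynamic content stays below the cached prefix.
            {"role": "user", "content": f"{retrieved}\n\n{user_turn}"},
        ],
    }
```

The structural rule transfers across providers even where the syntax doesn't: everything above the last cache marker must be byte-identical across calls.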
4. Retrieve selectively, not exhaustively
Sending 30 retrieved chunks "so the model has enough" is a common failure mode. Retrieval recall past a certain point buys nothing — and often loses accuracy because the relevant chunk gets drowned. A retrieval layer is a ranking-and-filtering layer, not a dump. Tune k for each route (often 3–8), require a minimum score threshold, and deduplicate near-identical passages before they hit the prompt. Groundedness instructions and evidence-format patterns are in retrieval-augmented prompting patterns.
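The three filters above — top-k, score threshold, near-duplicate removal — compose into one small selection pass. A sketch, where the token-overlap `similarity` function is a stand-in for whatever near-duplicate measure you actually use:

```python
def select_chunks(scored_chunks, k=5, min_score=0.35, dedup_threshold=0.9):
    """Rank, threshold, and deduplicate retrieved chunks before they
    hit the prompt. scored_chunks is a list of (text, score) pairs;
    thresholds here are illustrative, not tuned values."""
    def similarity(a, b):
        # Crude token-overlap stand-in for a real near-dup measure.
        ta, tb = set(a.split()), set(b.split())
        return len(ta & tb) / max(len(ta | tb), 1)

    kept = []
    for text, score in sorted(scored_chunks, key=lambda c: c[1], reverse=True):
        if score < min_score:
            continue  # below the relevance floor: drop, don't pad
        if any(similarity(text, prev) >= dedup_threshold for prev, _ in kept):
            continue  # near-duplicate of something already selected
        kept.append((text, score))
        if len(kept) == k:
            break
    return kept
```

Note that the function can legitimately return fewer than k chunks; an honest "only two passages cleared the bar" beats a padded thirty.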
5. Load context hierarchically
Context arrives in layers: system prompt, session memory, retrieved evidence, tool outputs, the user turn. Design each layer as its own slot with its own eligibility rules and its own budget. Higher-stability layers sit at the top (cacheable); lower-stability layers sit at the bottom (fresh). When something goes wrong, you can debug one layer at a time instead of re-reading a monolithic string. The slot-by-slot design pattern is covered in hierarchical context loading and the request-time composition step in dynamic context assembly patterns.
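One way to make "each layer is its own slot" literal is a small slot type that carries its own stability flag and budget, with assembly sorting stable slots to the top. A sketch under those assumptions (character counts stand in for token counts):

```python
from dataclasses import dataclass

@dataclass
class Slot:
    name: str
    stable: bool   # stable slots sort to the top (cacheable prefix)
    budget: int    # max characters here, standing in for tokens
    content: str = ""

def assemble(slots):
    """Compose the prompt stable-first, enforcing each slot's own budget,
    so each layer can be inspected and debugged independently."""
    for s in slots:
        if len(s.content) > s.budget:
            raise ValueError(f"slot {s.name} over budget")
    ordered = sorted(slots, key=lambda s: not s.stable)
    return "\n\n".join(
        f"<{s.name}>\n{s.content}\n</{s.name}>"
        for s in ordered if s.content
    )
```

When a prompt misbehaves, you diff one slot's content against the last known-good version instead of eyeballing a monolithic string.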
6. Pick few-shot examples by similarity and diversity
A handful of well-chosen examples beats a big fixed set. Select by semantic similarity to the incoming query, then apply a diversity filter so you're not shipping near-duplicates. Order weakest-to-strongest so the most-relevant example sits closest to the user turn, where recency bias helps. Examples must model the exact output shape you want — format, length, label vocabulary — because when instructions and examples disagree, examples usually win. Full treatment in the few-shot example selection guide.
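Similarity-plus-diversity selection is essentially maximal marginal relevance. A greedy MMR-style sketch, assuming you already have embeddings (plain lists here) for the query and the example pool; the diversity penalty is an illustrative knob, not a tuned value:

```python
def pick_examples(query_vec, examples, k=3, diversity_penalty=0.5):
    """Greedy similarity-with-diversity selection (an MMR-style sketch).
    examples: list of (vector, text) pairs; vectors are plain lists."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    chosen, pool = [], list(examples)
    while pool and len(chosen) < k:
        def mmr(ex):
            sim = dot(query_vec, ex[0])
            # Penalize overlap with what's already selected.
            redundancy = max((dot(ex[0], c[0]) for c in chosen), default=0.0)
            return sim - diversity_penalty * redundancy
        best = max(pool, key=mmr)
        chosen.append(best)
        pool.remove(best)

    # Weakest first, so the most-relevant example sits closest
    # to the user turn, where recency bias helps.
    chosen.sort(key=lambda ex: dot(query_vec, ex[0]))
    return [text for _, text in chosen]
```

With normalized embeddings the dot product is cosine similarity; swap in your embedding model's similarity of choice.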
7. Compress before you truncate
When a thread or a document set won't fit, the default reflex is to truncate — drop the oldest turns, clip the tail of the document. Compression is almost always a better first move: summarize older turns into a rolling recap; extract the key facts from long documents into a few sentences; keep the raw source only for the spans that actually matter. Compression preserves the signal truncation throws away, and it's cheap compared to the bad completions truncation produces. Techniques, when to run them, and cost math in context compression techniques.
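The rolling-recap pattern for chat history can be sketched in a few lines. The `summarize` parameter stands in for an LLM summarization call; the first-sentence fallback below is a deliberately naive placeholder, not a real compression strategy:

```python
def compress_history(turns, max_turns=6, summarize=None):
    """Keep the newest turns verbatim; fold older turns into a rolling
    recap instead of dropping them. `summarize` is a stand-in for an
    LLM summarization call."""
    if len(turns) <= max_turns:
        return None, turns  # everything fits; no recap needed
    older, recent = turns[:-max_turns], turns[-max_turns:]
    if summarize is None:
        # Naive placeholder: first sentence of each older turn.
        summarize = lambda ts: " ".join(t.split(".")[0] + "." for t in ts)
    return summarize(older), recent
```

In production the recap itself should be cached and extended incrementally, so you summarize each turn once rather than re-summarizing the whole head of the thread on every call.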
8. Be explicit about groundedness
If the model is meant to answer from retrieved evidence, say so — and say what to do when the evidence is missing. "Answer only from the passages below. If the passages do not contain the answer, say you don't know" is the minimum. Tag each passage with a stable identifier and ask the model to cite the tag it used. Without this, retrieval silently degrades into "the model's prior, lightly flavored by what you retrieved," and the groundedness you thought you had isn't there. More patterns and failure modes in retrieval-augmented prompting patterns.
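Tagging and the groundedness instruction are mechanical enough to live in a formatter, which also guarantees the instruction never gets forgotten on one route. A minimal sketch; the rule wording and tag scheme mirror the advice above but are otherwise illustrative:

```python
GROUNDING_RULES = (
    "Answer only from the passages below. If the passages do not contain "
    "the answer, say you don't know. Cite the passage tag for each claim."
)

def format_evidence(passages):
    """Tag each passage with a stable identifier the model can cite.
    passages: list of {"id": ..., "text": ...} dicts."""
    lines = [f"[{p['id']}] {p['text']}" for p in passages]
    return GROUNDING_RULES + "\n\n" + "\n".join(lines)
```

Stable tags also make the citations checkable downstream: a cited tag that isn't in the passage set is a hallucinated citation you can catch in code.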
9. Use the right memory shape for the task
Memory is not one thing. A chat agent wants rolling session memory with periodic summarization. A research agent wants a scratchpad it can read and rewrite. A long-running assistant wants user-level memories with extract/write/retrieve stages and an eviction policy. Picking the wrong shape — sticking a vector DB onto a problem that needed a running summary, or a running summary onto a problem that needed per-user facts — wastes tokens and makes the agent feel forgetful. The taxonomy, and guidance on when to use which, is in the AI memory systems guide.
10. Test for context rot and needles
Long context isn't "the same as short context, just more." Accuracy drops as windows fill, specific positions in the window get worse, and small facts in the middle get missed. Build two kinds of probes into your eval harness: context rot probes that measure quality degradation as the window grows, and needle-in-a-haystack probes that insert a known fact at varying depths and check whether the model retrieves it. If you ship without these and a customer reports "it forgot the thing we told it earlier," you'll have no way to reproduce it cheaply. Background on both in the context rot problem explained and needle-in-a-haystack prompting. For the reasoning-depth angle on long prompts, see extended thinking prompts in Claude.
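A needle probe harness is small enough to write once and run on every deploy. A sketch, assuming `model_fn` is whatever wraps your LLM call; filler, needle, and depths are supplied by your eval set:

```python
def make_needle_probe(filler_paragraphs, needle, depth):
    """Insert a known fact at a relative depth (0.0 = top, 1.0 = bottom)."""
    idx = round(depth * len(filler_paragraphs))
    docs = filler_paragraphs[:idx] + [needle] + filler_paragraphs[idx:]
    return "\n\n".join(docs)

def run_needle_eval(model_fn, filler, needle, question, expected, depths):
    """Return pass/fail per depth; model_fn wraps your actual LLM call."""
    results = {}
    for d in depths:
        context = make_needle_probe(filler, needle, d)
        answer = model_fn(context + "\n\n" + question)
        results[d] = expected.lower() in answer.lower()
    return results
```

Run the same probe across several depths and window sizes; a pass rate that sags in the middle depths is the classic lost-in-the-middle signature, and now you can reproduce it on demand instead of waiting for a customer report.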
11. Tier models by task
Not every call needs the flagship. Draft with a cheap model, re-rank or judge with a medium one, reserve the frontier model for the turns that actually need it. Reasoning-heavy models cost more per output token and run slower, so using them on cheap classification or boilerplate generation burns money without lifting quality. Context engineering includes the routing layer: what context goes to which model, under which budget. Practical routing patterns and cost math in token economics guide 2026.
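The routing layer can start as a plain lookup before it earns anything smarter. A sketch — the tier table, model names, and complexity scores are all placeholders for your own routing policy:

```python
# Hypothetical tier table; model names and thresholds are placeholders.
TIERS = {
    "cheap":    {"model": "small-model",    "max_complexity": 1},
    "medium":   {"model": "medium-model",   "max_complexity": 2},
    "frontier": {"model": "frontier-model", "max_complexity": 3},
}

def route(task_kind: str) -> str:
    """Map a task kind to the cheapest tier that can handle it.
    Unknown task kinds fall through to the frontier model by default."""
    complexity = {
        "classification": 1, "boilerplate": 1,
        "rerank": 2, "judge": 2,
        "multi_step_reasoning": 3,
    }.get(task_kind, 3)
    for tier in ("cheap", "medium", "frontier"):
        if complexity <= TIERS[tier]["max_complexity"]:
            return TIERS[tier]["model"]
```

Defaulting unknowns upward is the safe direction: you pay a little more on unclassified traffic rather than silently degrading quality.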
12. Instrument and iterate
Ship nothing you can't see. Log the assembled prompt (or a hash plus slot sizes if privacy rules require it), model and parameters, cache-hit status, latency, and token counts — per call, correlated with outputs and feedback. Run known-answer probes on every deploy to catch regressions before users do. Teams that stay ahead of context-quality drift aren't the ones with the cleverest prompts; they're the ones whose observability catches a dropping cache hit rate or a failing needle eval.
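The per-call record described above fits in one structured log line. A sketch — field names are illustrative, and the prompt is hashed rather than logged verbatim for the privacy-constrained case:

```python
import hashlib
import json
import time

def log_call(prompt, slot_sizes, model, params, cache_hit, latency_ms,
             tokens_in, tokens_out, sink=print):
    """Emit one structured record per call. Hash the prompt when privacy
    rules forbid logging it verbatim; slot sizes still make it debuggable."""
    record = {
        "ts": time.time(),
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "slot_sizes": slot_sizes,   # e.g. {"system": 1100, "retrieved": 2400}
        "model": model,
        "params": params,
        "cache_hit": cache_hit,
        "latency_ms": latency_ms,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
    }
    sink(json.dumps(record))
    return record
```

With this in place, a dropping cache hit rate or a slot that quietly doubled in size is a dashboard query, not an archaeology project.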
Putting It Together: A Production Prompt Skeleton
The following is a hypothetical assembly that shows several of the practices above at once — stable-first ordering, explicit cache markers, hierarchical slots, selective retrieval, groundedness instruction, and a freshly assembled user turn at the bottom.
[CACHE: stable prefix — marked for provider cache, byte-identical across calls]
<system>
You are a support assistant for Acme Corp's billing system.
Follow the Acme response style guide (below) in every answer.
If the retrieved passages do not contain the answer, reply:
"I don't see this in our billing docs — please open a ticket."
Cite the passage tag (e.g. [doc:refund-policy#3]) for any factual claim.
</system>
<style_guide>
- Short paragraphs. No emojis. No apologies for missing info.
- Prefer numbered steps for procedures.
- Dollar amounts always with currency symbol and two decimals.
</style_guide>
<tool_schemas>
{tool schema JSON, stable across calls}
</tool_schemas>
[/CACHE]
[DYNAMIC: composed per request]
<session_summary>
User is on the Pro plan, billed annually, last invoice 2026-03-14.
Previously asked about proration; resolved.
</session_summary>
<retrieved_passages>
[doc:refund-policy#3] Annual plans are refundable within 14 days of renewal...
[doc:proration#1] Mid-cycle upgrades are prorated against the remaining term...
[doc:billing-contacts#2] Billing disputes route to billing@acme.example...
</retrieved_passages>
<user>
I upgraded from Pro to Team 6 days ago and want to know what I'd be
refunded if I cancel today.
</user>
Read it as shape, not template. Top half is cacheable, written once, rarely changed. Bottom half is assembled per request, small, evidence-tagged, and grounded by explicit instruction. The division is what makes both caching and debugging tractable.
Closing
None of these practices are individually clever. The discipline is in doing all of them, consistently, on every route — and treating the assembly layer as code, not copy-paste. A prompt you can version, test, cache, and observe will beat a "better-worded" prompt that gets rewritten ad hoc every sprint.
For the full framing, the context engineering pillar is the entry point. For the vocabulary, the glossary entry for context engineering and the glossary entry for prompt caching. For where the boundary with prompt engineering sits, context engineering vs prompt engineering.
FAQ
Where should a team new to context engineering start?
With the two practices that compound the fastest: stable-first prompt ordering plus explicit cache markers (items 2 and 3). Those two together usually move cost and latency before any other change, and they set up the structure the rest of the checklist assumes. After that, instrumentation (item 12) — because without logs, every later tweak is a guess.
Is this checklist provider-specific?
The practices are provider-agnostic; some of the mechanisms aren't. Cache-marker syntax, maximum cached prefix length, and discount percentages differ across Anthropic, OpenAI, and Gemini. The Claude vs OpenAI prompt caching comparison covers the big ones. Treat the checklist as what to do; consult your provider's docs for how.
How do I know a practice is actually helping?
Pair each change with an eval. For caching, watch cache hit rate and cost per request. For retrieval changes, watch groundedness and answer-accuracy on a held-out set. For compression, watch the summary-faithfulness metric and downstream accuracy. A change without an eval isn't a best practice — it's a hypothesis.