Tip
TL;DR: The Context Engineering Maturity Model is a 5-level framework for how teams assemble the context a language model sees — from Level 1 static hand-written prompts through Level 5 multi-source orchestration with semantic caching, dynamic context-window budgeting, and evaluation loops. Each level names a real capability jump, with concrete symptoms and a specific upgrade path to the next.
Key takeaways:
- Five levels, patterned loosely on CMMI: static prompts, parameterized templates, dynamic assembly, cached and layered with memory, multi-source orchestration with evals.
- Every level has concrete symptoms. If you cannot describe the pain you are in, you are not ready to move up.
- In our experience, many teams are Level 2 or 3 but think they are Level 4. Real Level 4 requires measurement, not just tools.
- Skipping levels produces fragile systems. A team jumping from Level 2 directly to "we built an agent" usually ends up running expensive, unreliable Level 3 on the surface with Level 5 ambitions underneath.
- Pair this with the SurePrompts Quality Rubric for the prompt-level audit and the Context Engineering pillar for the discipline overview.
Why a maturity model for context engineering?
Context engineering has become the 2026 category vocabulary for how systems assemble what a model sees on each call. The label is new; the problem is not. Teams have been doing context engineering ad hoc for two years without naming it. What is missing is a shared way to describe how well they are doing it.
Software engineering has had maturity models since CMMI. Security has one. Data engineering has one. Context engineering does not. Without a model, every team invents its own vocabulary and engineering leaders have no way to say "we are here, we need to be there, this is the gap."
This model exists to fill that gap. Five levels, each naming a real capability jump that changes what the system can do and what it costs to run. Each level carries concrete symptoms and a specific upgrade path.
One warning. Maturity models degrade into checklist theater when they reward surface indicators over real capability. A team with a vector database wired up is not automatically at Level 3 if nobody measures whether retrieval helps. A team that deploys prompt caching is not automatically at Level 4 if nobody tracks cache hit rates. Every level here requires evidence, not props.
The five levels
Level 1 — Static hand-written prompts
What it looks like. Prompts are hand-written, one per task, usually living in code files or a Notion doc. They are copy-pasted from previous work, edited in place, and shipped whenever the author is happy. No templating, no retrieval, no conversation history management beyond what the chat API returns, no caching, no evaluation beyond "I tried it a few times and it seemed fine." The prompt is the whole context surface, and it is static text.
Typical stack. A chat completion endpoint and a string. Sometimes a helper that concatenates the user question onto a hard-coded system prompt. Source control for the string, sometimes.
Symptoms you're stuck here. Two engineers change the same prompt and produce diverging outputs. The same prompt works on Monday and breaks on Thursday. You cannot answer "what changed?" when quality shifts.
Upgrade path. Introduce parameterized templates. Identify the varying parts (user name, product, task parameters), convert them into named slots filled at runtime, and commit the template to version control. A one-afternoon project in most codebases.
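That conversion can be sketched in a few lines with Python's standard-library `string.Template`. The template text, slot names, and `render_prompt` helper are illustrative, not a prescribed API:

```python
from string import Template

# Level 1 version (hypothetical): values baked into a hard-coded string,
# e.g. "You are a support agent for Acme. Help Dana with a refund request."

# Level 2: the varying parts become named slots, the template is committed
# to version control, and values are filled at runtime.
SUPPORT_TEMPLATE = Template(
    "You are a support agent for $product. "
    "Help $user_name with a $task request."
)

def render_prompt(user_name: str, product: str, task: str) -> str:
    # substitute() raises KeyError on a missing slot, which is what you
    # want from a versioned template; safe_substitute() would hide gaps.
    return SUPPORT_TEMPLATE.substitute(
        user_name=user_name, product=product, task=task
    )

prompt = render_prompt("Dana", "Acme", "refund")
```

The same template now serves every instance of the task, and a change to the wording is one reviewable diff instead of three divergent copies.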
Level 2 — Parameterized templates
What it looks like. Prompts are templates with variable slots — Mustache-style placeholders, f-strings, or a templating engine. The same template handles many instances of the same task. Templates live somewhere discoverable, versioned, usually tested against a few golden inputs. The system prompt is still static, and dynamism comes from slotting runtime values into fixed positions.
Typical stack. A template engine (Handlebars, Jinja, the SurePrompts Mustache-style {{placeholders}} pattern), a shared template library, and a test harness that runs fixed inputs on change. Many teams adopt a prompt structure at this level — RCAF fits naturally, because Role, Context, Action, and Format become the four labeled slots a template fills. Few-shot examples, if used, belong inside Context — see few-shot prompting.
Symptoms you're stuck here. Templates work for the common case but fail on users whose situation is not captured by the slots. You add more slots to cover edge cases and the template becomes an unreadable wall of conditionals. Users ask questions that require information the template has no slot for — because it lives in a database, a document, or a prior conversation, and the prompt cannot reach it. This is where teams discover they need retrieval.
Upgrade path. Introduce dynamic context assembly. Add retrieval (vector search, keyword search, or direct queries against your own data), conditional context blocks, and a deliberate split between system prompt and user prompt. This is the jump to Level 3 — the point at which context engineering becomes a discipline rather than a string-formatting exercise.
Level 3 — Dynamic context assembly
What it looks like. Context is assembled at runtime from multiple sources. Retrieval (often RAG, sometimes direct queries) pulls relevant documents based on input. Conditional context blocks include framing only when a condition is met — user tier, task type, prior interaction signal. The system prompt is separated from the user turn and carries stable identity, rules, and reference material. Conversation history is usually handled by a simple "keep the last N turns verbatim" rule.
Typical stack. A vector database (Pinecone, Weaviate, pgvector) or structured retrieval over the product's own database, a chunking and embedding pipeline, a context assembler that builds the final prompt, and a token counter to avoid blowing past the context window. LangChain, LlamaIndex, or custom equivalents often appear here.
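A stripped-down assembler showing the three Level 3 moves: retrieval, a conditional context block, and a separated system prompt, with a token check at assembly time. The keyword lookup stands in for a vector store, and the chars/4 heuristic stands in for a real tokenizer; everything here is a sketch, not a specific library's API:

```python
# Stand-in corpus; a real system queries a vector DB or its own database.
DOCS = {
    "refunds": "Refunds are processed within 5 business days.",
    "shipping": "Standard shipping takes 3-7 days.",
}

def retrieve(query: str) -> list[str]:
    # Toy keyword match standing in for vector or keyword search.
    return [text for key, text in DOCS.items() if key in query.lower()]

def estimate_tokens(text: str) -> int:
    return len(text) // 4  # rough heuristic, not a real tokenizer

def assemble(query: str, user_tier: str, max_tokens: int = 1000) -> list[dict]:
    system_parts = ["You are a support assistant."]
    if user_tier == "premium":  # conditional context block
        system_parts.append("This user is premium; offer expedited options.")
    for doc in retrieve(query):
        if estimate_tokens("\n".join(system_parts + [doc])) > max_tokens:
            break  # stop before blowing past the context window
        system_parts.append(doc)
    return [
        {"role": "system", "content": "\n".join(system_parts)},
        {"role": "user", "content": query},  # user turn kept separate
    ]

messages = assemble("How do refunds work?", user_tier="premium")
```

The point is the shape: context is decided per request, and the system prompt and user turn travel as distinct messages rather than one concatenated string.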
Symptoms you're stuck here. Retrieval quality is unmeasured — nobody knows what percentage of answers actually used the retrieved context correctly. Costs climb because every call pays full input-token price on a long assembled prompt, and nothing is cached. Conversation history either gets dropped at an arbitrary cutoff or replayed in full, and neither is right. Agent behavior, if you have it, is unpredictable because context rebuilds from scratch on every step.
Upgrade path. Introduce prompt caching and treat memory as a first-class layer. Design the system prompt so stable content sits at the top and rarely changes, making it cacheable. Compress older history into a rolling summary; keep recent turns verbatim. Split user-level and session-level memory into distinct layers. Instrument the system: cache hit rates, retrieval precision, token cost, latency — per request. Without measurement you cannot claim you moved up.
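The history rule above, "recent turns verbatim, older turns compressed," reduces to a few lines. `summarize` here is a placeholder returning a stub digest; in practice it would be a cheap model call:

```python
def summarize(turns: list[str]) -> str:
    # Placeholder: a real implementation calls a small model to compress.
    return f"[summary of {len(turns)} earlier turns]"

def compress_history(turns: list[str], keep_verbatim: int = 3) -> list[str]:
    if len(turns) <= keep_verbatim:
        return turns                      # short histories pass through
    older, recent = turns[:-keep_verbatim], turns[-keep_verbatim:]
    return [summarize(older)] + recent    # rolling digest, then verbatim tail

history = [f"turn {i}" for i in range(8)]
compact = compress_history(history)       # 8 turns -> 1 digest + 3 verbatim
```

Run the summarizer on a schedule (every N turns, or when the digest layer nears its token budget) rather than on every call, so the compressed prefix stays stable and cacheable.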
Level 4 — Cached and layered context with memory
What it looks like. Context is assembled in deliberate layers, each with its own cache lifetime and source of truth. The system prompt is long, stable, and aggressively cached — every call reuses it. Conversation history is compressed on a schedule: last two or three turns verbatim, everything before summarized into a rolling digest. User-level and session-level memory exist as separate layers, surfaced when the task calls for them. Tool use outputs are shaped for reuse across steps, not dumped raw. The team can point at a cache hit rate, a retrieval precision number, and a cost-per-request trend.
Typical stack. Prompt caching offered by major providers (e.g., Claude cache breakpoints, OpenAI prompt caching), a scheduled summarizer, a memory store (Redis, a database, or purpose-built) keyed by user and session, a tool-output formatter, and a telemetry layer reporting cache hit rate, token cost, and latency per request. Many teams introduce a nightly eval harness here — fixed inputs run through the system on a schedule.
Symptoms you're stuck here. The system works for the cases you designed for and fails invisibly for the ones you did not. Retrieval returns near-duplicates that inflate cost without adding signal. Multiple agents share context inconsistently — one agent's memory is invisible to another when it shouldn't be. The eval harness catches regressions after they ship because it only runs nightly. Cost control is reactive — you notice a bill spike and then chase it down.
Upgrade path. Introduce semantic caching, dynamic context-window budgeting, and evals in the loop. Cache not just prefixes but semantically similar queries. Allocate the context window as a budget across five inputs — system prompt, retrieval, history, tool outputs, examples — and enforce it at assembly time. Run evals inline on a sample of production traffic. Share context across agents through a common layer with explicit access rules.
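Budgeting the window across the five inputs can be as simple as a share table enforced at assembly time. The shares and the crude truncation rule below are illustrative assumptions, as is the chars/4 token heuristic:

```python
BUDGET_SHARES = {            # fraction of the window per input (assumed)
    "system": 0.30, "retrieval": 0.30, "history": 0.20,
    "tools": 0.15, "examples": 0.05,
}

def estimate_tokens(text: str) -> int:
    return len(text) // 4    # rough heuristic, not a real tokenizer

def enforce_budget(blocks: dict[str, str], window: int) -> dict[str, str]:
    out = {}
    for name, text in blocks.items():
        allowed = int(window * BUDGET_SHARES[name])
        if estimate_tokens(text) > allowed:
            text = text[: allowed * 4]   # crude cut at the budget line
        out[name] = text
    return out

blocks = {"system": "rules " * 200, "retrieval": "doc " * 500,
          "history": "turn " * 100, "tools": "result", "examples": "ex"}
fitted = enforce_budget(blocks, window=1000)
total = sum(estimate_tokens(t) for t in fitted.values())
```

A production version would truncate on chunk or turn boundaries and let unused share from one input spill into another, but the invariant is the same: the total can never exceed the window, by construction rather than by luck.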
Level 5 — Multi-source orchestration with semantic caching and evaluation loops
What it looks like. Context is orchestrated, not assembled. Multiple sources — retrieval, memory, tool outputs, prior agent steps, user state — feed into a budgeting layer that decides what goes in the prompt per request, based on the task. Semantic caching sits on top of prefix caching and catches near-duplicate queries. Context window management is dynamic: long contexts use different strategies than short ones, and the system chooses. Evals run inline on a sample of real traffic, and their results feed back into retrieval parameters, cache policies, and budgeting. Cost and latency are tracked per context block. Multiple agents share context through a defined protocol, not by accident.
Typical stack. An in-house context orchestrator, semantic cache (vector-indexed past queries and responses), a budgeting module allocating tokens at runtime, an inline eval harness, a shared memory and context layer across agents, and telemetry granular enough to attribute spend to individual context blocks. By Level 5 the system is bespoke — a sign of maturity, not a warning.
Symptoms you're stuck here. Almost nothing technical — Level 5 teams work at the frontier. The symptoms are organizational: context orchestration becomes a platform every product team depends on, and platform/product tension emerges. The eval loop feeds into too many decisions to reason about cleanly. New capabilities cost more to build because the system is complex.
Upgrade path. There is no Level 6. The move past Level 5 is discipline about when not to use it — dropping back to Level 3 or 4 for features that do not justify the overhead. Sophistication is a tool, not a destination.
Summary table
| Level | Characteristics | Typical stack | Upgrade trigger |
|---|---|---|---|
| 1 — Static hand-written | Hand-written prompts, copy-pasted, no templating, no retrieval, no caching. | Chat completion API plus a hard-coded string. | Prompts diverge across engineers; same prompt breaks from Monday to Thursday. |
| 2 — Parameterized templates | Variable slots filled at runtime; shared template library; golden-input tests. | Template engine, versioned template repo, small test harness. | Edge cases outgrow available slots; information the prompt needs lives elsewhere. |
| 3 — Dynamic context assembly | Retrieval wired in, conditional context blocks, separated system and user prompts. | Vector DB or structured retrieval, chunking pipeline, context assembler, token counter. | Retrieval quality unmeasured; costs climb; history strategy ad hoc; agents unpredictable. |
| 4 — Cached + layered with memory | Cache-friendly system prompt, deliberate summarization, memory as first-class layer, measurement. | Prompt caching, summarizer, memory store, formatted tool outputs, nightly eval harness. | Regressions caught too late; cost control reactive; context silos across products. |
| 5 — Multi-source orchestration | Semantic caching, dynamic budgeting, inline evals, shared context across agents. | In-house orchestrator, semantic cache, budgeting module, inline eval harness, per-block telemetry. | No further level — discipline becomes "when not to use Level 5." |
How to self-assess
Answer the questions in order, honestly. Each question names the level a "no" leaves you at, so your level is set by the first "no" you hit.
- Are your prompts stored in version control? If no, you are Level 1. If yes, continue.
- Do your prompts have variable slots filled at runtime? If no, you are Level 1. If yes, continue.
- Do you have a shared template library reused across more than one feature? If no, you are Level 2. If yes, continue.
- Do your prompts include runtime-retrieved content (documents, database rows, prior context)? If no, you are Level 2. If yes, continue.
- Is your system prompt physically separated from the user turn at the API level? If no, you are Level 2 dressed up as Level 3. If yes, continue.
- Do you measure retrieval precision or relevance on production traffic? If no, you are Level 3. If yes, continue.
- Do you use prompt caching, and can you report a cache hit rate? If no, you are Level 3. If yes, continue.
- Does conversation history use deliberate summarization rather than a fixed-N-turns cutoff? If no, you are Level 3 or early Level 4. If yes, continue.
- Do you have an inline eval harness sampling production traffic (not just nightly fixed inputs)? If no, you are Level 4. If yes, continue.
- Do you budget the context window across the five inputs (system, retrieval, history, tools, examples) at assembly time? If no, you are Level 4. If yes, you are at Level 5.
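The checklist encodes directly as an ordered gate. The encoding below is ours, with one simplification: question 8's "Level 3 or early Level 4" is rounded down to 3.

```python
# Level you land at if question i is your first "no"; all ten "yes" -> 5.
LEVEL_IF_FIRST_NO = [1, 1, 2, 2, 2, 3, 3, 3, 4, 4]

def assess(answers: list[bool]) -> int:
    """Map ten ordered yes/no answers to a maturity level."""
    for stop_level, yes in zip(LEVEL_IF_FIRST_NO, answers):
        if not yes:
            return stop_level   # first "no" decides the level
    return 5                    # all gates passed

# A team with templates and retrieval but no measurement (first "no"
# at question 6) assesses at Level 3, whatever tools they own.
level = assess([True] * 5 + [False] * 5)
```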
We've found that teams commonly overestimate by one level. If your answer to question 6 is "we check retrieval sometimes" or "the eng lead eyeballs it," that is not measurement — that is vibes.
Upgrade paths and common traps
Each level has its own characteristic failure mode. Recognizing which one you are in tells you whether you are stuck or moving.
Stuck at Level 1. The team treats prompts as configuration, not code. Edits go uncommitted, changes go undocumented, and quality becomes personality-dependent. The fix is always the same: commit the prompt, version it, and start tracking changes.
Stuck at Level 2. Template sprawl. Each new edge case adds a conditional block or a new slot, and the template becomes an unreadable tangle. The underlying problem is that the prompt needs information the template has no access to, and the team is papering over the gap with structure. Retrieval is the answer — not more slots.
Stuck at Level 3. Retrieval is in place but nobody measures it. The team adopts RAG, ships it, and declares victory. Quality stops improving and costs keep climbing. The missing piece is measurement: retrieval precision, answer-uses-retrieval rate, cost per request. Without numbers, the team cannot tell whether retrieval is helping or hurting, and they cannot move to Level 4.
Stuck at Level 4. The team has prompt caching, memory, and a nightly eval harness. It looks mature. What is missing is the loop — evals catch regressions after they ship, cost control happens after the bill arrives, and retrieval parameters are tuned manually once a quarter. Moving to Level 5 means closing the loop: inline evals, dynamic budgeting, cost telemetry granular enough to act on.
Stuck at Level 5. Rare. The context orchestration platform every product depends on becomes a bottleneck. The fix is organizational — treat it like any shared dependency, with SLAs, documented interfaces, and escalation paths. Level 5 is an engineering achievement; keeping it useful is a management one.
A separate trap: skipping levels. A team that jumps from Level 2 to "we built an agent" usually ends up running a Level 2 template library behind a Level 5 architecture — no caching, no memory layer, no measurement, and a codebase nobody can reason about. Each level builds capabilities the next one assumes.
Our position
- It's our hypothesis that many teams are Level 2 or 3 and believe they are Level 4. Real Level 4 requires measurement — cache hit rate, retrieval precision, cost per request — not just the tools. Owning a vector database is not Level 3; owning a vector database with measured retrieval quality is.
- Level 5 is rare, expensive, and not always worth it. Match the level to the stakes. Internal tools top out at Level 3. Consumer features often land at Level 4. Level 5 pays for itself only on products where context quality drives revenue or risk directly.
- Prompt caching is the enabling primitive for Level 4, not a nice-to-have. Without caching, long stable system prompts are prohibitively expensive per call, and the architectural moves that define Level 4 — long system prompt, deliberate memory layer, structured tool outputs — are not economic. Caching unlocks affordability and affordability unlocks the architecture.
- Skipping levels produces fragile systems. The jump from hand-written prompts to a multi-agent architecture looks impressive on a slide and breaks quietly in production because the intermediate capabilities (measurement, summarization, cache discipline) were never built.
- Evaluation belongs inside the loop by Level 5, not outside it. A nightly eval harness is Level 4 infrastructure. An eval harness that samples production traffic and feeds results back into retrieval parameters, cache policies, and budgeting decisions is what makes Level 5 different from a well-instrumented Level 4.
- The Maturity Model is a targeting tool, not a prestige ladder. The right level is the one that matches the stakes. A team running every product at Level 5 is over-invested; a team running a high-stakes agent at Level 2 is under-invested. Neither is "mature."
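The caching-as-enabling-primitive claim above is ultimately arithmetic. The prices and the 10x cached discount below are illustrative assumptions for the sketch, not any provider's actual rates:

```python
PRICE_PER_MTOK = 3.00    # $ per million uncached input tokens (assumed)
CACHED_DISCOUNT = 0.10   # cached tokens billed at 10% of that (assumed)

def monthly_cost(system_tokens: int, dynamic_tokens: int,
                 calls: int, hit_rate: float) -> float:
    """Monthly input-token spend for a stable system prompt plus
    per-call dynamic context, at a given cache hit rate."""
    cached = system_tokens * hit_rate * CACHED_DISCOUNT
    uncached = system_tokens * (1 - hit_rate) + dynamic_tokens
    return (cached + uncached) * calls * PRICE_PER_MTOK / 1_000_000

# A 20k-token stable system prompt, 2k dynamic tokens, 1M calls/month:
no_cache = monthly_cost(20_000, 2_000, 1_000_000, hit_rate=0.0)
with_cache = monthly_cost(20_000, 2_000, 1_000_000, hit_rate=0.95)
```

Under these assumptions the bill drops from $66,000 to $14,700 a month, which is the difference between a long stable system prompt being an architectural choice and being a line item nobody approves.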
Related reading
- Context Engineering: The 2026 Replacement for Prompt Engineering — the discipline overview this model operates inside.
- The SurePrompts Quality Rubric — the prompt-level audit. Context sufficiency is one of its seven dimensions.
- The RCAF Prompt Structure — the drafting skeleton that fits naturally at Level 2 and above.
- Prompt Caching Guide 2026 — the enabling primitive for Level 4.
- Context Window Management Strategies — the budgeting work Level 5 formalizes.
- RAG Prompt Engineering Guide — one common path into Level 3.
- AI Memory Systems Guide — the memory layer that defines Level 4.
- Retrieval-Augmented Prompting Patterns — tactical patterns for Levels 3 and 4.
- Context Engineering Best Practices 2026 — the companion practice guide.