Tip
TL;DR: A 30-minute workshop applying the Context Engineering Maturity Model to one specific product or flow. Three to six people, ten diagnostic questions, five minutes of disagreement resolution, one concrete upgrade committed before next week. Output is a dated ticket, not a slide.
Key takeaways:
- Score one feature, not the whole team. The Maturity Model is per-product; averaging destroys the signal.
- Ten questions, ten minutes of answering, five minutes resolving disagreement. Keep it tight.
- Default to the lower score when people disagree. It gives you room to prove you are higher.
- Output is a single dated ticket with an owner, not a quarterly roadmap.
- Run the workshop quarterly while actively upgrading; annually once stable.
- Pair with the SurePrompts Quality Rubric — the Maturity Model grades your infrastructure, the Rubric grades the prompt itself.
Why run this workshop
Engineering teams drift on context engineering the same way they drift on code quality, testing discipline, and documentation. Somebody sets up retrieval and declares the team is at Level 3. Somebody else adds prompt caching and declares Level 4. Six months later nobody can agree on where the system actually sits, what works, or what the next move is. Meanwhile the prompts keep getting longer, the bill keeps climbing, and regressions keep shipping.
Half the battle is alignment. Once a team agrees on where one specific feature sits on the Context Engineering Maturity Model and what the next concrete upgrade is, the path forward writes itself. Without that alignment, every conversation about context engineering turns into the same circular debate — architecture diagrams, vendor pitches, someone's blog post — with no decision at the end.
This workshop exists to produce one decision in 30 minutes. One feature scored. One weakest dimension identified. One upgrade committed to, with an owner and a date. That is enough. Anything more ambitious takes longer than 30 minutes and belongs in a follow-up.
The workshop is also a falsifiability exercise. In our experience, teams consistently overestimate their level by one. The workshop surfaces the overestimate by making people cite specific evidence for each question — and by defaulting disagreement to the lower estimate. If a team leaves the workshop having downgraded itself from "we think we're Level 4" to "we're honestly Level 3," that is a successful workshop, not a failure.
Before the workshop
Five minutes of prep, no more.
- Pick one target. One product, one flow, one feature. Not "our AI work." A customer-facing support agent. A document-summarization endpoint. A code-completion feature. If your team owns three features, pick the highest-stakes one and run the workshop three times, once each. Averaging across features produces a misleading score.
- Gather three to six participants. The engineers who write and maintain the prompts. A PM who owns the feature. Optionally a data or ML engineer, a support lead, or anyone else who regularly touches the retrieval or prompt layer. Fewer than three and you miss the disagreement that surfaces reality. More than six and the 30 minutes collapses under process overhead.
- Set up a shared surface. A whiteboard for in-person, a shared doc for remote. One column per participant for silent individual answers, then a consensus column. Nothing fancy.
- Print or pin the canonical. The Context Engineering Maturity Model post itself. You will reference it in the first five minutes and consult it during disagreement. Do not try to run the workshop from memory.
That is all the prep. No pre-reading assignment, no survey, no warm-up slides. The workshop is the work.
The 30 minutes, laid out
A minute-by-minute facilitation guide. Use a timer. The time pressure is a feature, not a bug — it stops the workshop from drifting into open-ended architectural discussion.
0–5 min: Review the five levels. The facilitator reads the one-sentence definition of each level aloud. L1: static hand-written prompts. L2: parameterized templates. L3: dynamic context assembly with retrieval. L4: cached and layered with memory, measured. L5: multi-source orchestration with semantic caching and inline evals. No Q&A yet. The canonical is on the wall or in the shared doc for reference.
5–15 min: Silent individual answers. Every participant answers the 10 diagnostic questions silently, in their own column. No discussion, no peeking. The questions are below. Ten minutes is enough for honest answers and not enough for negotiation with yourself.
15–20 min: Compare and resolve. The facilitator reads each question and each participant's answer aloud. When answers match, note the score and move on. When answers differ, spend no more than 60 seconds on that question — the person scoring lower explains what evidence they saw, the person scoring higher explains what evidence they saw, and the group defaults to the lower score with a note to investigate later. The goal is not to argue to consensus; it is to produce a defensible floor.
20–25 min: Identify the weakest dimension. At your current level, five dimensions matter — retrieval, caching, memory, evals, orchestration. Which one is the weakest link? The one the group can point at and say "if we fixed this, we'd move up." Often the weakest dimension is measurement itself — retrieval is live but unmeasured, caching is on but the hit rate is unknown. Name it specifically.
25–30 min: Commit to one concrete upgrade. One ticket. Owner named. Done before next week's workshop-review checkpoint (typically the following Monday). Not "we should measure retrieval precision." A specific ticket: "Instrument retrieval precision on support-agent traffic; target dashboard live by Friday; owner @priya." If you cannot write the ticket in five minutes, the upgrade is too big — scope it down.
That is the whole workshop. Everything after this point is homework.
The 10 diagnostic questions
These mirror the canonical self-assessment, rephrased for a group setting. Each question has a clear binary-ish answer and a specific level implication. Read the question, answer yes or no, move on.
Q1. Are your prompts stored in version control?
Yes means a git repo, a commit history, and prompt changes that go through review. No includes "they live in a Notion doc." If no, the feature is at L1.
Q2. Do your prompts have variable slots filled at runtime?
Yes means a template with named placeholders (Mustache, f-strings, Jinja) that the system fills per request. No includes hand-concatenated strings. If no, L1.
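For a concrete sense of what a "yes" looks like, here is a minimal Level 2 sketch using Python's stdlib string.Template; the template text and slot names are invented for illustration, not prescribed by the model:

```python
from string import Template

# A minimal parameterized template (Level 2): named slots filled per request.
# The template text and field names are illustrative.
SUPPORT_TEMPLATE = Template(
    "You are a support assistant for $product.\n"
    "Customer tier: $tier\n"
    "Question: $question"
)

def render_prompt(product: str, tier: str, question: str) -> str:
    # substitute() raises KeyError on a missing slot, which is the behavior
    # you want: a silently empty slot is a hand-concatenation bug in disguise.
    return SUPPORT_TEMPLATE.substitute(product=product, tier=tier, question=question)

prompt = render_prompt("Acme CRM", "enterprise", "How do I export contacts?")
```

Any templating engine works; the test is that slots are named and filled by the system at runtime, not pasted by hand.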
Q3. Do you have a shared template library reused across more than one feature?
Yes means at least one template is used in two or more places. No means every feature has its own one-off prompt. If no, L2 at best.
Q4. Do your prompts include runtime-retrieved content?
Content pulled from a vector store, a database, a document store, or prior conversation — anything that was not known at template-authoring time. If no, L2. If yes, continue. See the RAG prompt engineering guide for the canonical Level 3 path and RAG for the term.
Q5. Is your system prompt physically separated from the user turn at the API level?
Yes means you call the API with distinct system and user fields (or the provider equivalent). No means you concatenate everything into a single user message. See system prompt. If no, you are L2 cosplaying L3.
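As a shape check, "yes" looks like the request payload below, which mirrors the common chat-completions message format; field names vary by provider, and the model name is a placeholder:

```python
# Level 3 shape: the system prompt travels in its own field, the user turn
# in another. Field names follow the widespread chat-completions convention;
# adjust to your provider's equivalent.
def build_request(system_prompt: str, user_message: str) -> dict:
    return {
        "model": "some-model",  # placeholder, not a real model name
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
    }

req = build_request("You are a support assistant.", "How do I export contacts?")
```

If your code builds one big string and sends it all as the user message, the answer is no, regardless of how the string is structured internally.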
Q6. Do you measure retrieval precision or relevance on production traffic?
This is where most teams discover they are not at the level they thought. "We check it sometimes" is no. Yes requires a number that updates automatically on at least a daily cadence. If no, L3.
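The number in question can be as simple as precision@k over a judged sample. A minimal sketch, assuming relevance judgments come from an automated judge or human labels (the sample data here is invented):

```python
# Precision@k over a judged sample of production retrievals. Each inner list
# holds relevance flags for one query's top-k results; where the judgments
# come from (rubric model, human labels) is up to your pipeline.
def precision_at_k(judgments: list[list[bool]], k: int = 5) -> float:
    per_query = [sum(flags[:k]) / k for flags in judgments]
    return sum(per_query) / len(per_query)

sample = [
    [True, True, False, True, False],    # 3/5 relevant
    [True, False, False, False, False],  # 1/5 relevant
]
score = precision_at_k(sample, k=5)  # (0.6 + 0.2) / 2 = 0.4
```

"Yes" to Q6 means something like this runs on a daily schedule against live traffic and lands on a dashboard, not that someone could compute it if asked.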
Q7. Do you use prompt caching, and can you report a cache hit rate?
Both halves matter. Caching without measurement does not count — you can ship provider caching and get zero actual hits because your system prompt drifts per request. See prompt caching and the prompt caching guide. If no to either, L3.
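Reporting a hit rate is a small amount of accounting. A hedged sketch: providers that support prompt caching typically report cached versus total input tokens per response, but the field names below are illustrative, so map them to your provider's actual usage object:

```python
# Cache hit-rate accounting over per-response usage records. The
# "cached_input_tokens" / "input_tokens" field names are illustrative;
# your provider's usage object will have its own names.
def cache_hit_rate(usage_records: list[dict]) -> float:
    cached = sum(r["cached_input_tokens"] for r in usage_records)
    total = sum(r["input_tokens"] for r in usage_records)
    return cached / total if total else 0.0

records = [
    {"input_tokens": 4000, "cached_input_tokens": 3000},
    {"input_tokens": 4000, "cached_input_tokens": 0},  # drifting system prompt: zero hits
]
rate = cache_hit_rate(records)  # 3000 / 8000 = 0.375
```

A rate near zero on a feature that "has caching enabled" is exactly the drift failure described above, and this is how you catch it.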
Q8. Does conversation history use deliberate summarization rather than a fixed-N-turns cutoff?
Yes means older turns are compressed into a rolling summary; recent turns are kept verbatim; the transition is engineered, not arbitrary. A hard cutoff at "last 10 turns" is no. See the AI memory systems guide. If no, L3 or early L4.
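The engineered transition can be sketched in a few lines. Here `summarize` stands in for whatever model call or heuristic your system uses; the toy version below exists only so the sketch runs:

```python
from typing import Callable

# Deliberate history compression: recent turns stay verbatim, older turns
# collapse into a rolling summary. `summarize` is a stand-in for your real
# summarization call.
def compress_history(turns: list[str], keep_verbatim: int,
                     summarize: Callable[[list[str]], str]) -> list[str]:
    if len(turns) <= keep_verbatim:
        return turns
    older, recent = turns[:-keep_verbatim], turns[-keep_verbatim:]
    return [f"[summary] {summarize(older)}"] + recent

def toy_summarize(turns: list[str]) -> str:
    return f"{len(turns)} earlier turns about the user's export problem"

history = [f"turn {i}" for i in range(1, 13)]
compressed = compress_history(history, keep_verbatim=4, summarize=toy_summarize)
# 1 summary line plus the last 4 turns
```

The difference from a hard cutoff is the first element: older context is compressed, not discarded.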
Q9. Do you have an inline eval harness sampling production traffic?
Inline means on live traffic, not a nightly batch of fixed inputs. Sampling means a subset (1 to 5 percent typically), evaluated automatically. A nightly eval that runs a fixed golden set is L4 infrastructure, not L5. If no inline harness, L4.
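The shape of an inline harness is a sampling hook in the request path. A sketch under stated assumptions: `grade` stands in for your judge (a rubric-model call, a regex check, a schema validator), and the 2 percent rate is illustrative:

```python
import random

# Inline eval hook: sample a small slice of live traffic and run an
# automatic grader on just that slice. `grade` is a stand-in for your judge;
# the sample rate is illustrative.
def maybe_evaluate(prompt, response, grade, sample_rate=0.02, rng=random):
    if rng.random() >= sample_rate:
        return None  # most traffic passes through unevaluated
    return grade(prompt, response)  # score gets recorded for the dashboard

def toy_grade(prompt, response):
    return 1.0 if response.strip() else 0.0

rng = random.Random(0)  # seeded here only so the sketch is reproducible
scores = [maybe_evaluate("p", "r", toy_grade, rng=rng) for _ in range(1000)]
sampled = [s for s in scores if s is not None]  # roughly 2% of requests
```

The distinguishing property is that sampling happens per live request; a nightly job replaying a fixed golden set never touches this code path.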
Q10. Do you budget the context window across the five inputs at assembly time?
The five inputs are system prompt, retrieval, conversation history, tool outputs, and examples. Budgeting means each block has a token cap, and the assembler enforces it before calling the model. If you just trim the whole prompt to fit the context window, that is not budgeting. If no, L4. If yes, you are at L5.
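Budget enforcement at assembly time can be sketched as a per-block cap check. The cap numbers are illustrative, and the whitespace-split token count is a placeholder for a real tokenizer:

```python
# Per-block context budgeting: each of the five inputs has its own token cap,
# enforced before the model call. Caps are illustrative; the whitespace split
# is a placeholder for your actual tokenizer.
BUDGET = {
    "system": 1000,
    "retrieval": 4000,
    "history": 2000,
    "tools": 1500,
    "examples": 1500,
}

def assemble(blocks: dict[str, str]) -> str:
    parts = []
    for name, cap in BUDGET.items():
        tokens = blocks.get(name, "").split()
        if len(tokens) > cap:
            tokens = tokens[:cap]  # trim this block only, not the whole prompt
        parts.append(" ".join(tokens))
    return "\n\n".join(p for p in parts if p)

prompt = assemble({"system": "You are a support assistant.",
                   "retrieval": "doc " * 5000})  # retrieval alone overflows
```

Note what does not happen here: the oversized retrieval block cannot crowd out the system prompt or the examples, because each block is trimmed against its own cap.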
Your score is the highest level whose qualifying questions all get a yes. The questions are ordered by level, so the first "no" tells you the ceiling.
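The scoring rule is mechanical enough to encode. A small sketch, mapping each question to the ceiling its "no" implies (Q8's "L3 or early L4" is treated as L3 here, the conservative reading):

```python
# Ceiling implied by a "no" on each of Q1..Q10, in order. Q8's ambiguous
# "L3 or early L4" is scored conservatively as 3.
CEILING_IF_NO = [1, 1, 2, 2, 2, 3, 3, 3, 4, 4]

def score(answers: list[bool]) -> int:
    # The first "no" sets the ceiling; all ten yeses means Level 5.
    for ceiling, yes in zip(CEILING_IF_NO, answers):
        if not yes:
            return ceiling
    return 5

level = score([True] * 5 + [False] * 5)  # first "no" at Q6 -> Level 3
```

This also makes the consensus step concrete: when two participants disagree on a question, the group's answer defaults to "no," which is exactly the lower-score default from the facilitation guide.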
After the workshop — the commitment protocol
The workshop produces one artifact: the commitment. Everything else is notes. The commitment has four required fields and one optional one.
The ticket. A single, specifically scoped change in your team's existing ticket system. Title includes the target feature and the dimension. Example: "Support-agent: instrument retrieval precision on production sample." Not a label, not an OKR — a ticket with a description and acceptance criteria.
The owner. One named person. Not a team, not "whoever gets to it." The owner lands the change or escalates if blocked.
The rollback plan. What happens if the change makes things worse. For a measurement change, usually "turn off the instrumentation" — trivial. For an architectural change, a feature flag or a revertable deploy. Write it down even when trivial — the habit matters more than the complexity.
The re-check date. When the team reviews whether the change landed. For a one-week commitment, a 10-minute standup check the following week. No open-ended commitments.
(Optional) The next candidate. The second-weakest dimension, noted but not committed. When the first commitment lands, the group can check the note and decide whether it is still the next priority or whether reality has shifted.
The commitment protocol is what distinguishes this workshop from a vague agreement to "work on our context engineering." The ticket, the owner, the rollback, and the re-check date turn 30 minutes of discussion into something the team actually ships.
Common workshop failure modes
Four anti-patterns show up repeatedly. Watch for them.
Scoring the team instead of a specific flow. The Maturity Model is per feature. "What level is our team at?" has no useful answer — most teams span three or four levels simultaneously. If the workshop starts with "we're assessing the whole team's AI maturity," stop and pick one feature. Run it separately for each feature that matters. See context engineering best practices — they all apply per feature.
Everyone agrees on L4 with no evidence. If the group breezes through all ten questions in under five seconds each, somebody is not reading carefully. Real L4 teams have visible hesitation on Q6 and Q9 — the measurement questions. Unanimous, fast, confident L4 scoring is almost always overestimation. Push back: "What's our cache hit rate?" Silence is the answer.
"We'll do everything next quarter." The workshop is designed to produce one commitment, not a roadmap. A team that leaves with six upgrades ships none of them. If the group keeps adding items, the facilitator should say: "We are picking one. Which one unblocks the most?" The others go in a parking lot.
Treating the workshop as a retrospective. A retrospective reviews what happened and assigns blame. The workshop is forward-looking: what level are we at, what do we do next. If the conversation drifts into "we should have done this six months ago," cut it short.
Our position
- Default disagreement to the lower score. It gives the team room to prove they are higher later. Starting from an inflated score means every future conversation starts from a false baseline. We would rather a team leave saying "we're honestly at L3" and surprise themselves at the next workshop than leave claiming L4 and quietly regress.
- Score one feature per workshop. Never average. An average of L4, L3, L3, L2 is not "L3." It is a misleading single number that hides the real distribution. If you need a team-wide view, list the features and their levels side by side. Let the variance show.
- The commitment is the workshop output, not the score. A team that scores itself at L3 and commits to instrument retrieval precision by Friday has a successful workshop. A team that scores itself at L4 with no commitment has a failed workshop regardless of how high the number is.
- Run this separately from the SurePrompts Quality Rubric. The Rubric grades an individual prompt on seven dimensions including context sufficiency — see the Rubric glossary entry. The Maturity Model grades the system that produces the prompt. Both matter; they measure different things. Running them in the same meeting collapses the distinction and blurs both signals.
- Thirty minutes is a hard cap, not a suggestion. If the workshop is running long, the scope is wrong — either the feature is too broad, the group is too large, or the discussion is drifting into architecture instead of assessment. End at 30 minutes even if you have not finished. A partial result produced on time beats a complete result that takes two hours and nobody wants to repeat.
Related reading
- The Context Engineering Maturity Model — the canonical model this workshop applies. Re-read it before facilitating.
- Context Engineering: The 2026 Replacement for Prompt Engineering — the discipline overview this all operates inside.
- The SurePrompts Quality Rubric — the prompt-level companion assessment. Complementary, not overlapping.
- Prompt Caching Guide 2026 — the enabling primitive for Level 4 commitments.
- RAG Prompt Engineering Guide — the common path into Level 3.
- AI Memory Systems Guide — the memory layer that defines Level 4.
- Retrieval-Augmented Prompting Patterns — tactical patterns for Levels 3 and 4.
- Context Engineering Best Practices 2026 — the companion practice guide for teams actively moving up the model.