A static prompt is a string you hand-write once and send as-is. A real production chat app, agent, or copilot almost never ships a static prompt. What it ships is an assembled prompt — built fresh for each request from a template, retrieved documents, recent memory, the latest tool output, and the current user turn. Dynamic context assembly is the pattern that governs how all those pieces come together at runtime. This post, part of the context engineering pillar, walks through the four patterns that do most of the work.
What Dynamic Context Assembly Is
Dynamic context assembly is the runtime process of constructing the prompt from parts. Instead of "here is the prompt I wrote," the application has a set of building blocks — a persona template, a retrieval function, a memory store, a tool-output formatter — and an assembler that stitches them into a final payload per request.
The contrast with static templating is worth keeping in mind. A static template has placeholders filled with fixed values: name goes here, question goes there. Dynamic assembly is strictly richer. Every piece can be gated, sized, ordered, compressed, or dropped based on the current query and available budget. Two successive requests from the same user may produce prompts that differ in which documents were pulled, whether a memory snippet fired, and how the tool trace was truncated. Same template, different assembly.
Why It Matters
Three kinds of application effectively require dynamic assembly.
- Agents. Long-running loops accumulate tool calls, observations, and intermediate reasoning. The prompt at step 20 is not the prompt at step 2; what's in scope depends on what just happened.
- Multi-turn chat with retrieval. Each turn may pull different documents. Stale retrievals from earlier turns need to be evicted or the window fills up with context that no longer matches the current question.
- Personalization. Different users have different tiers, preferences, and history. A static template can't adapt; an assembler can.
The payoff is that the model sees the right context for this turn, sized to fit, ordered so the important parts land where attention is strongest. The cost is a harder thing to test — more on that in the pitfalls section.
Pattern 1: Template + Slots
The foundational pattern. Start with a template that defines the shape of the prompt — sections, headings, placeholders — and fill the slots with values computed per request.
A minimal shape:
[PERSONA] → filled from persona config for this app
[RULES] → filled from policy store for this role
[MEMORY] → filled from memory retrieval (or empty)
[DOCS] → filled from RAG retrieval (or empty)
[HISTORY] → filled from conversation history (possibly compressed)
[TASK] → filled from the current user message
Slots are named, typed, and independent. Each has a fetcher — a function that returns a string (or empty) given the current request. The template doesn't know how [DOCS] gets populated; it only knows where to put it and how to format the section heading.
This pattern by itself isn't dynamic — it's just templating. It becomes dynamic when combined with the other three patterns, but the slot structure is the prerequisite. Without named slots you can't reason about what's in the prompt or change it programmatically. See the note on system vs user prompt split for which slots typically belong in which role.
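A minimal sketch of the slot structure in JavaScript. The fetcher map, `fillSlots`, and `render` are illustrative names for this post, not a real library API, and the example slot contents are invented:

```javascript
// Each slot has a fetcher: a function that returns a string (or empty)
// given the current request. The template only knows slot names.
const fetchers = {
  persona: () => "You are a support assistant for Acme.",
  rules: (req) =>
    req.role === "admin"
      ? "Admins may issue refunds."
      : "Escalate refunds to a human.",
  task: (req) => req.userMessage,
};

function fillSlots(req) {
  const slots = {};
  for (const [name, fetch] of Object.entries(fetchers)) {
    slots[name] = fetch(req); // each fetcher is independent and testable
  }
  return slots;
}

function render(slots) {
  // The template defines the shape: a heading plus content per filled slot
  return Object.entries(slots)
    .filter(([, value]) => value) // empty slots produce no section
    .map(([name, value]) => `## ${name.toUpperCase()}\n${value}`)
    .join("\n\n");
}

const prompt = render(fillSlots({ role: "admin", userMessage: "Refund order 123?" }));
```

Because each fetcher is independent, swapping how a slot is populated never touches the template or the other slots.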
Pattern 2: Conditional Inclusion
Not every slot gets filled on every turn. Conditional inclusion is the rule set that decides: include this section only if some condition holds.
Typical conditions:
- Retrieval relevance threshold. Include [DOCS] only if top-k retrieval scores exceed some minimum. If the best hit is weak, better to include nothing than to pad the prompt with irrelevant text that the model may overweight.
- Memory applicability. Include [MEMORY] only if a memory retrieval returned a match for the current topic or user. An empty memory lookup produces no section, not an empty "Memory: none" line.
- Tool output freshness. Include the last tool result only if it was generated in this turn or the prior one; drop older traces.
- User tier or role. Include extra policy rules only for power-tier users where they apply.
- Task type. Include a code-style section only when the task is identified as a coding task.
The payoff is token savings and attention focus. Empty sections burn budget and invite the model to react to content that isn't there ("Memory: none" can read as "the system is telling me memory is unavailable — flag that"). Conditional inclusion keeps the prompt clean when parts don't apply.
The cost is a branching surface. A prompt with six conditionally-included sections has up to 2^6 = 64 possible shapes. You can't eyeball-test all of them; you need evals that exercise the conditional logic. More on that under pitfalls.
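The conditions above can be sketched as small gating functions. The thresholds (0.6, 0.7) and the `gateDocs`/`gateMemory` names are illustrative assumptions, not recommendations:

```javascript
// A gate returns the content to include, or null for "no section at all".
function gateDocs(hits, minScore = 0.6) {
  const relevant = hits.filter((h) => h.score >= minScore);
  return relevant.length > 0 ? relevant : null; // weak hits => no [DOCS] section
}

function gateMemory(hit, minScore = 0.7) {
  // No "Memory: none" placeholder; a miss simply produces nothing
  return hit && hit.score >= minScore ? hit : null;
}

// Weak retrieval produces no section rather than padding the prompt
gateDocs([{ id: "a", score: 0.41 }]); // => null
gateDocs([{ id: "a", score: 0.82 }, { id: "b", score: 0.3 }]); // => only doc "a"
```

The null-vs-content convention keeps the assembler simple: a null gate result means the template skips that section entirely.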
Pattern 3: Ordered Injection
Once you know which slots are included, you have to decide in what order to concatenate them. Hierarchical context loading is the strategy: most specific and relevant content nearest the query, general fallback pushed to the edges.
Applied dynamically, ordering is computed per request rather than fixed by the template. A template might specify a default order — persona, rules, memory, docs, history, task — but the assembler may reshuffle based on what fired:
- If retrieval returned a single high-confidence document, place it immediately before the task.
- If multiple documents returned, order them so the strongest lands closest to the task (ascending by score when the task sits at the end).
- If memory fired with a user preference relevant to the task, hoist it near the task even if the default order put it earlier.
- If tool output is the actionable context, put it last, right before the task restatement.
The logic is usually: the closer to the current user turn, the higher the attention weight. Reserve those slots for content the model must condition on to answer. Everything else — persona, background rules — can sit in the stable prefix where it's cacheable and still in view.
Ordering is a free lever. It costs zero extra tokens. Most teams under-use it because default template order becomes invisible — worth revisiting when answers start drifting for reasons you can't pin on retrieval quality.
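A sketch of per-request ordering, assuming the task sits last and a `hoist` flag marks content that should land beside it. The rank table and flag are illustrative conventions for this post, not a standard API:

```javascript
function orderSections(sections) {
  // Default template order: stable prefix first, task last
  const defaultRank = { persona: 0, rules: 1, memory: 2, history: 3, docs: 4, toolOutput: 5, task: 99 };
  const rank = (s) => (s.hoist ? 90 : defaultRank[s.name] ?? 50);
  return sections
    .slice() // don't mutate the caller's array
    .sort((a, b) => rank(a) - rank(b)) // hoisted content sorts just before the task
    .map((s) => s.text)
    .join("\n\n");
}

const out = orderSections([
  { name: "persona", text: "PERSONA" },
  { name: "memory", text: "MEMORY", hoist: true }, // fired with a task-relevant preference
  { name: "docs", text: "DOCS" },
  { name: "task", text: "TASK" },
]);
// memory is hoisted past docs to sit just before the task
```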
Pattern 4: Size-Aware Assembly
The fourth pattern is the budget check. After slots are filled and ordered, the assembler measures the total token count and compares against a budget — typically well under the model's context window, since you want output tokens and some headroom.
If the assembled prompt fits, ship it. If it exceeds the budget, drop or compress the least-important content until it fits. Drop rules, ordered by importance:
- History first. Oldest turns go first; compress or summarize surviving turns.
- Low-ranked retrieval next. If five documents were pulled, keep the top two, drop the rest.
- Memory last among dynamic content. Memory is usually small but contextually valuable.
- Core sections never. Persona, rules, and the current task itself never get dropped.
The right order is domain-specific. A coding agent may prize the tool trace over history; a support bot may prize history over retrieved docs. The pattern is that there's an explicit priority list and an explicit budget, enforced by the assembler rather than left to chance. See context compression techniques for how to shrink rather than drop.
Hard rule: never ship a prompt that exceeds the model's limit. API calls that run over budget fail outright or get silently truncated on the provider side, often cutting the part you care about most — the end of the prompt, where the task usually lives.
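A sketch of the budget check, under one loud assumption: `countTokens` here is a rough four-characters-per-token heuristic, where a real system would use the model's actual tokenizer. The drop order mirrors the priority list above:

```javascript
// Crude token estimate for illustration only; use the model's tokenizer in practice
const countTokens = (text) => Math.ceil(text.length / 4);

// Earlier entries get dropped first; core sections (persona, rules, task) are absent
const DROP_ORDER = ["history", "docs", "memory"];

function fitToBudget(slots, budget) {
  const render = (s) => Object.values(s).filter(Boolean).join("\n\n");
  const out = { ...slots };
  for (const name of DROP_ORDER) {
    if (countTokens(render(out)) <= budget) break; // already fits, stop dropping
    out[name] = null; // real systems compress before dropping outright
  }
  const prompt = render(out);
  if (countTokens(prompt) > budget) {
    // Fail loudly rather than ship something the provider will truncate
    throw new Error("Prompt exceeds budget even after dropping optional sections");
  }
  return prompt;
}
```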
The Four Patterns Together
| Pattern | Question it answers | Runs when |
|---|---|---|
| Template + slots | What are the sections of my prompt? | Design-time |
| Conditional inclusion | Should this slot be filled this turn? | Per request, after fetchers return |
| Ordered injection | In what order should filled slots concatenate? | Per request, after retrieval |
| Size-aware assembly | Does the result fit my budget? | Per request, last step |
The patterns compose into a pipeline. A request comes in; the assembler runs the pipeline; the prompt goes to the model. Each stage is independent enough to test on its own and swap implementations without touching the others.
Where Assembly Logic Lives
The orchestration code has to live somewhere. Common placements:
- In the application layer. A handful of functions inside your app — buildPrompt(request) — pull pieces together. Simplest, most flexible, least reusable. Fine for single applications.
- In a prompt library or helper module. A shared module that exposes a declarative API — register slots, register conditions, register a budget — and runs the pipeline. Good for teams running multiple prompts with similar structure.
- In an agent framework. Agent frameworks typically own assembly for you — you declare tools, memory, and retrieval sources, and the framework assembles the prompt each turn. Lowest flexibility, highest leverage when the framework's shape matches yours.
- In middleware. An assembly service sits between the app and the model API, receiving structured requests and emitting the final prompt. Useful when multiple apps share the same pattern; adds a network hop.
There's no right answer; the pattern is more important than the location. What matters is that assembly logic is centralized enough to reason about and test, not scattered across a dozen string concatenations in UI components.
An Illustrative Assembly Function
The following is pseudocode to show the four patterns together. It's a sketch for illustration, not a production implementation — real systems add error handling, logging, caching, and more nuanced priority rules.
function assemblePrompt(request, config) {
// Pattern 1: template slots
let slots = { // `let`, not `const`: shrink() reassigns it below
persona: config.persona, // stable
rules: config.rulesForRole(request.role),
memory: null,
docs: null,
toolOutput: null,
history: null,
task: request.userMessage,
};
// Pattern 2: conditional inclusion
const memoryHit = memory.lookup(request.userId, request.userMessage);
if (memoryHit && memoryHit.score > 0.7) {
slots.memory = formatMemory(memoryHit);
}
const docs = retrieval.search(request.userMessage, { topK: 5 });
const relevantDocs = docs.filter(d => d.score > 0.6);
if (relevantDocs.length > 0) {
slots.docs = formatDocs(relevantDocs);
}
if (request.lastToolOutput && isRecent(request.lastToolOutput)) {
slots.toolOutput = formatToolOutput(request.lastToolOutput);
}
if (request.history && request.history.length > 0) {
slots.history = compress(request.history);
}
// Pattern 3: ordered injection
// Stable prefix first (cacheable), then dynamic content nearest the task
const ordered = [
slots.persona,
slots.rules,
slots.memory,
slots.history,
slots.docs,
slots.toolOutput,
slots.task,
].filter(Boolean);
// Pattern 4: size-aware assembly
let prompt = ordered.join("\n\n");
let tokens = countTokens(prompt);
const budget = config.tokenBudget;
// Drop priority: history, low-ranked docs, memory; never persona/rules/task
while (tokens > budget && canShrink(slots)) {
slots = shrink(slots); // compresses history, trims docs, etc.
prompt = rebuild(slots);
tokens = countTokens(prompt);
}
if (tokens > budget) {
throw new Error("Cannot fit prompt within budget after shrinking");
}
return prompt;
}
Read it as a shape, not a spec. The point is that each pattern corresponds to a distinct stage — easy to test, easy to change, easy to swap.
Pitfalls
- Runtime complexity. An assembled prompt is harder to debug than a static one. Log the assembled prompt for every request — or a sampled subset — so when the model misbehaves you can see what it actually saw.
- Testing surface. Conditional inclusion multiplies possible prompt shapes. Golden-file tests that snapshot the assembled prompt for known inputs catch drift. Full eval suites need to cover the conditional branches, not just the happy path.
- Cache invalidation. Dynamic assembly can accidentally break prompt caching if it injects per-request content into the stable prefix. Keep the stable prefix — persona, rules, format — truly stable; put all dynamic content after it.
- Silent truncation. If you don't enforce the budget, the provider will, often by cutting the tail. Measure tokens; fail loudly rather than truncate silently.
- Empty-section artifacts. Including a heading with no content ("Retrieved documents: none") is usually worse than omitting the section. Either condition on inclusion or make the empty state genuinely informative.
- Order drift. Changes to the template or to ordering rules can quietly change the prompt shape for every request. Tie ordering to a version string and log it so behavior changes can be traced to their cause.
Common Anti-Patterns
- String concatenation scattered across the app. Prompt pieces assembled inline in handlers. Works until it doesn't; no single place to debug or change the shape.
- No budget check. Assembling whatever retrieval returns and shipping it. Works until a long document or long history pushes past the window.
- Per-request content in the cached prefix. Injecting the timestamp or user ID at the top of the system prompt. Breaks caching on every request.
- All-or-nothing retrieval. Including retrieved documents regardless of score. Low-relevance content takes budget from the task and can mislead the answer.
- Order fixed by template only. Ignoring runtime signal about which content matters most for this turn.
- Unlogged assemblies. Dynamic prompts that are never captured. Makes it impossible to debug model behavior after the fact.
FAQ
How is dynamic assembly different from prompt chaining?
Assembly builds a single prompt from parts, then makes one model call. Chaining makes multiple model calls, each with its own prompt, passing intermediate results forward. An agent loop combines both: each step assembles a prompt (dynamic) and the overall task progresses through multiple calls (chaining).
Do I need all four patterns from day one?
Usually no. Start with template + slots and a fixed order. Add size-aware assembly as soon as you enable retrieval or long histories — the budget check is cheap insurance. Add conditional inclusion when you notice irrelevant content hurting answers. Add dynamic ordering last, when answers drift for reasons you can't pin on retrieval quality.
Should I version my assembly template?
Yes, especially once multiple conditions fire. Tag each request with the template version that produced it. When a regression appears, you can filter to requests using the old version and confirm whether the change caused it.
What's a reasonable token budget to target?
Depends on the model, the task, and whether you need room for a long response. A common starting point is 60–70% of the model's context window for input, leaving the rest for output. Tighter budgets usually produce better answers — long contexts aren't always better.
Where should I put conversation history in the assembled prompt?
Before the retrieved documents and before the current task is a common choice, so the model reads history as background before reading the query's retrieved evidence. If history is long and compressed, it can also move earlier — the compressed summary becomes part of the stable framing. Test both; the right order depends on your history length and retrieval quality.
Wrap-Up
Dynamic context assembly is how real chat apps and agents build prompts at runtime. Four patterns cover most of the work: template + slots defines the shape, conditional inclusion decides what to include, ordered injection decides where pieces land, and size-aware assembly enforces the budget. Compose them into a pipeline, centralize the logic, log the result, and you get prompts that adapt per request without the behavior going opaque.
The tradeoff is real: assembled prompts are harder to test and debug than static ones. The way through is discipline — named slots, versioned templates, logged outputs, evals that exercise the conditional branches. Do that, and the flexibility pays back many times over in answer quality and token efficiency.
For the pillar, context engineering. For ordering strategy, hierarchical context loading. For what goes in memory vs prompt, AI memory systems. For retrieval patterns, retrieval-augmented prompting. For the term itself, context engineering.