Tip
TL;DR: This is a layer-by-layer walkthrough building a hypothetical research agent with the Agentic Prompt Stack. Each of the six layers — Goals, Tool permissions, Planning scaffold, Memory access, Output validation, Error recovery — is shown with the literal prompt text that goes in it, the failure modes that layer prevents, and how the finished prompt scores against the SurePrompts Quality Rubric.
Key takeaways:
- The Agentic Prompt Stack is six layers; the research agent below is the smallest realistic scenario that exercises all of them.
- The literal prompt text per layer matters more than the names. Most "Layer 3 problems" are really "no worked example of the scaffold was ever shown."
- Layer 5 (Output validation) is where shape bugs go to hide. A self-critique block inside the prompt plus programmatic schema checking outside it is the minimum that holds up.
- Layer 6 (Error recovery) is where infinite loops are prevented. A retry cap, a give-up condition, and a graceful-abort shape are the three pieces that must exist.
- The assembled prompt at the end of this post scores 32/35 on the SurePrompts Quality Rubric. A higher score would be easy to chase; a production-grade agent needs layer completeness more than Rubric perfection.
- Debugging by symptom-to-layer mapping is faster than debugging by prompt rewriting. The stack's value is mostly post-launch, not pre-launch.
The research agent we're building
Hypothetical scenario, not a shipped product. Given a topic like "current approaches to retrieval-augmented generation for long-tail technical documentation", the agent produces a structured research brief. Three tools: web_search (with query-count budget), fetch_url (read a specific page), and save_note (commits facts and citations to a structured notes buffer the agent reads back between steps). Final output: a JSON object with summary, five key findings, and cited sources. Trajectory caps at fifteen steps.
Small enough to fit one system prompt, large enough that all six layers of the Agentic Prompt Stack pull weight.
Layer 1 — Goals
Layer 1 answers three questions: what is "done," what is out of scope, what global constraints hold. The agent needs a success criterion it can self-check.
Goal: produce a structured research brief on the user-supplied topic.
Done when you emit a FINAL JSON (schema in Layer 5) containing:
- summary: 120–180 words,
- exactly 5 distinct key findings,
- ≥3 cited sources with URL and one-sentence attribution.
Out of scope: rewriting the topic; citing sources not fetched this
session; exceeding Layer 2 budgets; any output besides the FINAL JSON.
Hard constraints:
- Never fabricate URLs, quotations, or statistics.
- No fact without a source in notes.
- If you cannot meet the criterion within budget, emit PARTIAL
(schema in Layer 5), not padded content.
What this prevents. Infinite exploration, or completion claimed on a half-formed brief. The word range, fixed finding count, and citation floor give the agent a boolean to evaluate.
Rubric note. The RCAF Prompt Structure is the drafting skeleton underneath — Role (research agent), Context (user topic), Action (produce a brief), Format (Layer 5 schema). Layer 1 is the constraint-rich version of those slots.
Layer 2 — Tool permissions
Enumerates tools, their schemas, their call conditions, and their budgets. This is where tool use and function calling live — and where destructive failures originate, so the wording is stricter than Layer 1.
Allowed tools (no others):
1. web_search(query) -> [{title, url, snippet}]
Budget: 5 calls. Use to discover candidate sources.
2. fetch_url(url) -> {status, title, text} // text truncated to 8000 chars
Budget: 8 calls. Use when a snippet is insufficient AND you intend
to cite the page. Only URLs returned by a prior web_search.
3. save_note(kind, payload) -> ok
kind="fact": {claim, source_url, quote?}
kind="dead_end": {what_you_tried, why_failed}
kind="source": {url, title, credibility_hint}
If you need a tool not listed, emit PARTIAL with limitation="required
tool unavailable". Never invent tools or arguments.
tool_choice consideration. Set tool_choice="auto" for the main loop so the model freely picks search, fetch, save, or FINAL. On the Layer 5 validation pass, set tool_choice="none" — the agent should be thinking, not tool-calling. Forcing tool_choice="required" is usually a mistake here: the research loop has steps where no tool call is the right move.
What this prevents. Malformed arguments. Invented tools. Budget overruns. The prompt caps are belt-and-suspenders with runtime enforcement.
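The runtime half of that belt-and-suspenders can be sketched in a few lines. This is illustrative only: `ToolBudget` and `BudgetExceeded` are invented names, not part of any real SDK, and save_note's cap is an assumption since the prompt text gives it none.

```python
class BudgetExceeded(Exception):
    """Raised when a call hits an unknown tool or exceeds its prompt-stated cap."""

class ToolBudget:
    def __init__(self, limits):
        self.limits = dict(limits)
        self.used = {name: 0 for name in limits}

    def charge(self, tool_name):
        """Count one call; an overrun aborts toward PARTIAL instead of proceeding."""
        if tool_name not in self.limits:
            raise BudgetExceeded(f"tool not in allow-list: {tool_name}")
        self.used[tool_name] += 1
        if self.used[tool_name] > self.limits[tool_name]:
            raise BudgetExceeded(f"{tool_name} budget exhausted")

# Mirrors the caps in the prompt text above; the save_note cap is assumed.
budget = ToolBudget({"web_search": 5, "fetch_url": 8, "save_note": 40})
```

On `BudgetExceeded`, the agent loop stops executing tools and steers the model toward the PARTIAL shape defined in Layer 5, so the prompt cap and the code cap fail in the same direction.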
Layer 3 — Planning scaffold
How the agent structures reasoning between steps. The choice is plan-and-execute vs ReAct. We default to plan-execute-reflect for this agent: trajectories are long enough (eight to fifteen steps) that up-front planning pays back. ReAct alone tends to drift past step six under noisy observations.
Three phases.
PHASE 1 — PLAN
Emit PLAN: 3–5 numbered sub-questions whose answers satisfy the goal,
plus a one-sentence CITATION STRATEGY for reaching ≥3 distinct sources.
PHASE 2 — EXECUTE
Per step, emit exactly:
THOUGHT: sub-question index, what notes already say, next action
and why.
ACTION: one tool call.
OBSERVATION: verbatim result (runtime-filled).
Once a sub-question is answerable from notes, save facts and advance.
Do not keep searching past that point.
PHASE 3 — REFLECT
Before FINAL, emit REFLECT listing each sub-question with its
answering note_ids, flagging any gaps, and confirming ≥3 distinct
source URLs. If a gap remains and budget allows, return to EXECUTE
for up to two steps; else emit PARTIAL.
Why plan-execute-reflect over ReAct. Past five steps, a research trajectory without an explicit plan substitutes breadth for depth: every sub-question spawns a new search, coverage gets patchy. PLAN commits to a sub-question tree; REFLECT checks coverage before finalizing. Together they cost two extra model calls and cut the "searched ten times, wrote a thin brief anyway" failure mode.
What this prevents. Reactive searching, overshot plans, silent coverage gaps. Modeling the scaffold with a worked example (not just describing it) is what makes the agent actually follow it.
Layer 4 — Memory access
What the agent remembers across steps, on what surface, read back when. The wrong answer is "put all observations in the prompt" — noise drowns the goal. The right answer is managed memory with explicit read/write rules.
We use summary buffer plus last-three-verbatim: working memory at any step is the full PLAN, a running summary of earlier notes, and the last three OBSERVATIONs verbatim. Older observations are compressed into the summary via save_note.
Surfaces:
- NOTES: structured buffer from save_note; runtime injects a compact
version at the top of each step.
- PLAN: Phase 1 plan, persistent.
- RECENT: last 3 OBSERVATION blocks verbatim; older ones drop once
compressed into NOTES.
Write rules:
- save_note(fact) immediately when an OBSERVATION contains a citable
claim you intend to use.
- save_note(dead_end) when a sub-question cannot be answered from a
given source, to avoid retrying the same path.
- save_note(source) once per new domain.
Read rules:
- Before each web_search, scan NOTES for facts/dead_ends on the same
sub-question. Do not re-search what is already answered.
- Before REFLECT, scan NOTES for coverage.
- Never quote from a NOTES summary. Quotations come only from RECENT
OBSERVATION text or from a save_note "quote" field.
Why not full history. Stale observations get treated as authoritative if left in context — the agent quotes a step-3 snippet at step twelve as if fresh. The summary buffer marks old content as summary; last-three-verbatim keeps the fresh signal. The Context Engineering Maturity Model calls this Level 4: committed memory with explicit policy.
What this prevents. Repeated searches. Stale quotations. Drowning in tool-result noise past step six.
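Assembled per step, summary-buffer-plus-last-three-verbatim is roughly the following. The surface names match the prompt text; the function itself is a hypothetical sketch of what the runtime injects each turn.

```python
def build_context(plan, notes_summary, observations, keep_verbatim=3):
    """Working memory: full PLAN, compressed NOTES, last N OBSERVATIONs verbatim."""
    recent = observations[-keep_verbatim:]
    compressed = max(0, len(observations) - keep_verbatim)
    return "\n\n".join([
        "PLAN:\n" + plan,
        f"NOTES (summary; {compressed} earlier observations compressed):\n" + notes_summary,
        "RECENT:\n" + "\n---\n".join(recent),
    ])
```

Labeling the NOTES block as a summary in the injected text is deliberate: it is what lets the "never quote from a summary" read rule bind to something the model can see.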
Layer 5 — Output validation
The output schema plus the self-critique that happens before emitting it. Uncompromising rule: the prompt carries the schema, the runtime enforces it. Do not trust the agent to self-correct on JSON shape.
{
"kind": "FINAL",
"topic": "string",
"summary": "string (120-180 words)",
"key_findings": [{ "finding": "string", "note_ids": ["string"] }],
"sources": [{ "url": "string", "title": "string", "attribution": "string" }],
"coverage": { "sub_questions_answered": 0, "sub_questions_total": 0 }
}
PARTIAL has the same shape plus a "limitation" field for graceful aborts.
Before emitting FINAL/PARTIAL, emit a VALIDATION block, yes/no each:
1. kind is "FINAL" or "PARTIAL"?
2. summary is 120–180 words?
3. key_findings has exactly 5 items?
4. every finding has ≥1 note_id?
5. sources has ≥3 items with distinct domains?
6. every source URL appears in a saved note (source or fact) from
this session?
7. no placeholder text ("TBD", "fill in", "example")?
8. every cited claim in the summary is backed by a note?
If any answer is no, revise and re-run. Only emit when all are yes.
Output the JSON on a single line, no prose preamble.
Self-critique in the prompt; schema check in the runtime. VALIDATION catches semantic wrongness the schema cannot see — placeholder text, uncited claims, off-by-one finding count. The runtime catches shape wrongness the agent is unreliable on — trailing commas, missing fields, wrong types. Both are needed. This is where structured output and self-refine meet. On runtime parse failure, feed the parse error back as an OBSERVATION and let Layer 6 handle the retry.
What this prevents. "Almost JSON." Fabricated sources. Placeholder text. Briefs that parse cleanly but are semantically wrong.
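The runtime half of the split can be a small function. A minimal sketch covering VALIDATION items 1, 2, 3, and 5, the items a program can verify without reading the notes buffer; it is not a full JSON Schema validator, and the field names simply follow the schema above.

```python
import json
from urllib.parse import urlparse

def check_brief(raw):
    """Return a list of failure messages; an empty list means the shape passes."""
    try:
        brief = json.loads(raw)
    except json.JSONDecodeError as e:
        return [f"parse: {e}"]  # fed back to the agent as an OBSERVATION
    errors = []
    if brief.get("kind") not in ("FINAL", "PARTIAL"):
        errors.append("kind must be FINAL or PARTIAL")
    words = len(brief.get("summary", "").split())
    if not 120 <= words <= 180:
        errors.append(f"summary is {words} words, need 120-180")
    if brief.get("kind") == "FINAL" and len(brief.get("key_findings", [])) != 5:
        errors.append("key_findings must have exactly 5 items")
    domains = {urlparse(s.get("url", "")).netloc for s in brief.get("sources", [])}
    if len(domains - {""}) < 3:
        errors.append("need >=3 sources with distinct domains")
    return errors
```

Items 4, 6, and 8 need the notes buffer to check, which is exactly why the in-prompt VALIDATION block still exists alongside a checker like this.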
Layer 6 — Error recovery
What happens on tool failures, dead ends, and validation loops. Under-built Layer 6 is the single most common reason demo agents break in production.
Tool error (timeout, 4xx, 5xx, malformed): save_note(dead_end, ...)
and continue. Never ignore.
Rate limit: retry once with same args; on second failure, try a
reformulated query. If budget exhausted, emit PARTIAL.
Contradictory sources: save both facts with opposing attribution and
note the conflict in the finding. Do not pick silently.
Validation loop: if VALIDATION fails the same item twice, stop
self-revision and emit PARTIAL with limitation describing the gap.
Retry caps: 2 retries per tool call with different args; 3 searches
per sub-question before declaring dead_end; hard stop at 15 steps.
Graceful abort (emit PARTIAL, not FINAL) when:
- step budget exhausted before REFLECT passes,
- fewer than 3 distinct source URLs saved,
- VALIDATION failed the same check twice,
- any required tool unavailable.
PARTIAL beats FINAL built on retries, padding, or fabrication.
What this prevents. Infinite retries. Silent tool failures. Fabricated padding when the agent cannot meet the citation floor. The explicit preference for PARTIAL over FINAL is the most important line in this layer — it tells the agent that giving up gracefully is a success, not a failure. Reflexion patterns fit here if you want the agent to learn from prior dead ends within a session.
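The retry-cap-then-abort rule is worth enforcing in code as well as prompt text. A sketch under assumptions: `call_tool` is a stand-in executor, `ToolError` a stand-in failure type, and `reformulate` a hypothetical hook for rewording arguments between attempts.

```python
class ToolError(Exception):
    """Stand-in for timeouts, 4xx/5xx, and malformed-result failures."""

def call_with_retries(call_tool, name, args, reformulate, max_retries=2):
    """Attempt the call, retry with reworded args up to the cap, else abort."""
    attempt_args = args
    for _ in range(1 + max_retries):
        try:
            return call_tool(name, attempt_args)
        except ToolError:
            attempt_args = reformulate(attempt_args)  # e.g. a reformulated query
    # Cap reached: record a dead_end note upstream and prefer PARTIAL over padding.
    return {"kind": "PARTIAL", "limitation": f"{name} failed after {max_retries} retries"}
```

Returning a PARTIAL shape rather than raising keeps the failure on the same path the prompt already teaches the model to expect.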
The assembled prompt
Stitched as a single system prompt for an autonomous agent with agentic AI tool-calling enabled. Copy-pasteable — drop it into your runtime, wire up the three tools, point it at a topic.
SYSTEM
You are a research agent. You take a topic and produce a structured
research brief backed by cited sources.
--- LAYER 1: GOALS ---
[goal, scope, hard constraints block from above]
--- LAYER 2: TOOL PERMISSIONS ---
[tool list, schemas, budgets, tool_choice defaults from above]
--- LAYER 3: PLANNING SCAFFOLD ---
[three-phase plan-execute-reflect block from above]
Example of one execute step (abridged):
THOUGHT: Sub-question 2, "what benchmarks are commonly used." Notes
have one fact from step 3. Need one more source. Will search
for review papers.
ACTION: web_search(query="RAG benchmarks survey 2025")
OBSERVATION: [filled by runtime]
--- LAYER 4: MEMORY ACCESS ---
[NOTES / PLAN / RECENT surfaces and read/write rules from above]
--- LAYER 5: OUTPUT VALIDATION ---
[schema + 8-item VALIDATION checklist from above]
--- LAYER 6: ERROR RECOVERY ---
[tool error, retry, graceful-abort rules from above]
USER
<topic goes here>
In deployment, Layers 1, 2, and 4 cache as a stable system prefix. Layer 3's per-step THOUGHT/ACTION/OBSERVATION flows through turns. Layers 5 and 6 are also enforced programmatically — schema parse, budget check, step count — regardless of what the model emits. This matches the caching guidance in the Agentic Prompt Stack.
Score it with the Rubric
Applying the SurePrompts Quality Rubric to the assembled prompt. Target for a production agent prompt: 30 or higher out of 35.
| Dimension | Score | Notes |
|---|---|---|
| Role clarity | 4/5 | "Research agent" is clear but generic. A sharper role ("cite-first technical research agent") would tighten this. |
| Context specificity | 4/5 | The scenario and sub-question framing are concrete; missing: domain-specific source-quality heuristics. |
| Action definition | 5/5 | Three-phase scaffold with explicit per-step shape and a named termination signal. |
| Constraint tightness | 5/5 | Word counts, finding counts, citation floor, budgets, and give-up conditions are all specific. |
| Output specification | 5/5 | JSON schema + VALIDATION checklist + runtime enforcement. This is as tight as this dimension gets. |
| Example quality | 4/5 | One abridged execute-step example. Adding a worked PLAN and a worked REFLECT example would push this to 5. |
| Failure-mode handling | 5/5 | Layer 6 is spelled out end-to-end; graceful abort is preferred over padding. |
Total: 32/35. Held back by role clarity, context specificity, and example quality. The RCAF Prompt Structure would tighten those in a focused pass. For agent prompts, constraint tightness and output specification weigh heavier than role clarity — we would ship this.
Debugging by layer
Pattern-match symptom to layer first, read the trace at that layer second, change prompt text third. Rewriting the whole prompt when Layer 6 is what broke is how one-line fixes become three-day regressions.
- Emits FINAL with fewer than three sources. Layer 1 (success criterion not self-checked) or Layer 5 (VALIDATION skipped).
- Searches the same query six times. Layer 4 (dead_end write rule not firing) or Layer 6 (no per-sub-question retry cap).
- Fabricates a URL in sources. Layer 5 — VALIDATION item 6 missing or skipped. Enforce it programmatically.
- Runs to step 15, emits malformed JSON. Layer 5 (runtime not enforcing schema) or Layer 6 (no validation-failure-loop rule).
- Ignores a 500 error. Layer 6 — tool-error rule not followed; check that the prompt isn't truncated.
- Real sources, but findings share citations unevenly. Layer 5 — VALIDATION item 4 too loose; tighten to "each source cited by at least one finding."
- Plans five sub-questions, executes three, quits. Layer 3 (REFLECT not emitted) or Layer 1 ("done" read too loosely).
- Never saves notes. Layer 4 — write rule described but not modeled; add a worked example.
Our position
- Plan-execute-reflect beats ReAct alone past five steps. Under five, ReAct is lighter and fine. Past five, PLAN and REFLECT are cheap insurance against drift and thin coverage.
- Layer 5 is non-negotiable. Even the simplest agent needs a JSON schema, a self-critique step, and runtime parse-and-retry. Trusting the model to self-correct on format is the top reason "demo worked, production didn't."
- PARTIAL over FINAL on budget exhaustion. A PARTIAL with a limitation field is honest. A padded FINAL is a hallucination machine. Preferring PARTIAL in prompt text changes agent behavior more than almost any other single instruction.
- Summary buffer plus last-three-verbatim is the right memory starting pattern. Full-history context is expensive and degrades past step six. Pure summaries lose citation provenance. The hybrid holds.
- The stack is a debugging tool more than a drafting tool. Its value shows up when an agent breaks at step nine in production and you know within two minutes it's a Layer 4 write-rule problem, not a model quality problem.
Related reading
- The Agentic Prompt Stack — the canonical reference for the six-layer model this walkthrough applies.
- Agentic AI Prompting Guide — the broader discipline underneath.
- The Complete Guide to Prompting AI Coding Agents (2026) — the coding-agent application of the same stack.
- Multi-Agent Prompting Guide — N stacks plus a coordination protocol.
- Plan-and-Execute Prompting — the Layer 3 scaffold used here.
- ReAct Prompting Guide — the most common alternative Layer 3 scaffold.
- Tool Use Prompting Patterns — Layer 2 tactical companion, including tool_choice patterns.
- Self-Refine Prompting Guide — a Layer 5 self-critique pattern.
- Reflexion Prompting Guide — a Layer 6 error-recovery pattern.
- The SurePrompts Quality Rubric — the prompt-level audit used above.
- Context Engineering Maturity Model — infrastructure model underneath Layer 4.