Tip
TL;DR: This is a layer-by-layer walkthrough building a hypothetical research agent with the Agentic Prompt Stack. Each of the six layers — Goals, Tool permissions, Planning scaffold, Memory access, Output validation, Error recovery — is shown with the literal prompt text that goes in it, the failure modes that layer prevents, and how the finished prompt scores against the SurePrompts Quality Rubric.
Key takeaways:
- The Agentic Prompt Stack is six layers; the research agent below is the smallest realistic scenario that exercises all of them.
- The literal prompt text per layer matters more than the names. Most "Layer 3 problems" are really "no worked example of the scaffold was ever shown."
- Layer 5 (Output validation) is where shape bugs go to hide. A self-critique block inside the prompt plus programmatic schema checking outside it is the minimum that holds up.
- Layer 6 (Error recovery) is where infinite loops are prevented. A retry cap, a give-up condition, and a graceful-abort shape are the three pieces that must exist.
- The assembled prompt at the end of this post scores 32/35 on the SurePrompts Quality Rubric. A higher score would be easy to chase; a production-grade agent needs layer completeness more than Rubric perfection.
- Debugging by symptom-to-layer mapping is faster than debugging by prompt rewriting. The stack's value is mostly post-launch, not pre-launch.
The research agent we're building
Hypothetical scenario, not a shipped product. Given a topic like "current approaches to retrieval-augmented generation for long-tail technical documentation", the agent produces a structured research brief. Three tools: web_search (with query-count budget), fetch_url (read a specific page), and save_note (commits facts and citations to a structured notes buffer the agent reads back between steps). Final output: a JSON object with summary, five key findings, and cited sources. Trajectory caps at fifteen steps.
Small enough to fit one system prompt, large enough that all six layers of the Agentic Prompt Stack pull weight.
Layer 1 — Goals
Layer 1 answers three questions: what is "done," what is out of scope, what global constraints hold. The agent needs a success criterion it can self-check.
Goal: produce a structured research brief on the user-supplied topic.
Done when you emit a FINAL JSON (schema in Layer 5) containing:
- summary: 120–180 words,
- exactly 5 distinct key findings,
- ≥3 cited sources with URL and one-sentence attribution.
Out of scope: rewriting the topic; citing sources not fetched this
session; exceeding Layer 2 budgets; any output besides the FINAL JSON.
Hard constraints:
- Never fabricate URLs, quotations, or statistics.
- No fact without a source in notes.
- If you cannot meet the criterion within budget, emit PARTIAL
(schema in Layer 5), not padded content.
What this prevents. Infinite exploration, or completion claimed on a half-formed brief. The word range, fixed finding count, and citation floor give the agent a boolean to evaluate.
Rubric note. The RCAF Prompt Structure is the drafting skeleton underneath — Role (research agent), Context (user topic), Action (produce a brief), Format (Layer 5 schema). Layer 1 is the constraint-rich version of those slots.
Layer 2 — Tool permissions
Enumerates tools, their schemas, their call conditions, and their budgets. This is where tool use and function calling live — and where destructive failures originate, so the wording is stricter than Layer 1.
Allowed tools (no others):
1. web_search(query) -> [{title, url, snippet}]
Budget: 5 calls. Use to discover candidate sources.
2. fetch_url(url) -> {status, title, text} // text truncated to 8000 chars
Budget: 8 calls. Use when a snippet is insufficient AND you intend
to cite the page. Only URLs returned by a prior web_search.
3. save_note(kind, payload) -> ok
kind="fact": {claim, source_url, quote?}
kind="dead_end": {what_you_tried, why_failed}
kind="source": {url, title, credibility_hint}
If you need a tool not listed, emit PARTIAL with limitation="required
tool unavailable". Never invent tools or arguments.
tool_choice consideration. Set tool_choice="auto" for the main loop so the model freely picks search, fetch, save, or FINAL. On the Layer 5 validation pass, set tool_choice="none" — the agent should be thinking, not tool-calling. Forcing tool_choice="required" is usually a mistake here: the research loop has steps where no tool call is the right move.
What this prevents. Malformed arguments. Invented tools. Budget overruns. The prompt caps are belt-and-suspenders with runtime enforcement.
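The runtime half of that belt-and-suspenders can be sketched in a few lines. This is illustrative only: `ToolBudget` and `BudgetExceeded` are invented names, not part of any real SDK, and save_note's cap is an assumption since the prompt text gives it none.

```python
class BudgetExceeded(Exception):
    """Raised when a call hits an unknown tool or exceeds its prompt-stated cap."""

class ToolBudget:
    def __init__(self, limits):
        self.limits = dict(limits)
        self.used = {name: 0 for name in limits}

    def charge(self, tool_name):
        """Count one call; an overrun aborts toward PARTIAL instead of proceeding."""
        if tool_name not in self.limits:
            raise BudgetExceeded(f"tool not in allow-list: {tool_name}")
        self.used[tool_name] += 1
        if self.used[tool_name] > self.limits[tool_name]:
            raise BudgetExceeded(f"{tool_name} budget exhausted")

# Mirrors the caps in the prompt text above; the save_note cap is assumed.
budget = ToolBudget({"web_search": 5, "fetch_url": 8, "save_note": 40})
```

On `BudgetExceeded`, the agent loop stops executing tools and steers the model toward the PARTIAL shape defined in Layer 5, so the prompt cap and the code cap fail in the same direction.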
Layer 3 — Planning scaffold
How the agent structures reasoning between steps. The choice is plan-and-execute vs ReAct. We default to plan-execute-reflect for this agent: trajectories are long enough (eight to fifteen steps) that up-front planning pays back. ReAct alone tends to drift past step six under noisy observations.
Three phases.
PHASE 1 — PLAN
Emit PLAN: 3–5 numbered sub-questions whose answers satisfy the goal,
plus a one-sentence CITATION STRATEGY for reaching ≥3 distinct sources.
PHASE 2 — EXECUTE
Per step, emit exactly:
THOUGHT: sub-question index, what notes already say, next action
and why.
ACTION: one tool call.
OBSERVATION: verbatim result (runtime-filled).
Once a sub-question is answerable from notes, save facts and advance.
Do not keep searching past that point.
PHASE 3 — REFLECT
Before FINAL, emit REFLECT listing each sub-question with its
answering note_ids, flagging any gaps, and confirming ≥3 distinct
source URLs. If a gap remains and budget allows, return to EXECUTE
for up to two steps; else emit PARTIAL.
Why plan-execute-reflect over ReAct. Past five steps, a research trajectory without an explicit plan substitutes breadth for depth: every sub-question spawns a new search, coverage gets patchy. PLAN commits to a sub-question tree; REFLECT checks coverage before finalizing. Together they cost two extra model calls and cut the "searched ten times, wrote a thin brief anyway" failure mode.
What this prevents. Reactive searching, overshot plans, silent coverage gaps. Modeling the scaffold with a worked example (not just describing it) is what makes the agent actually follow it.
Layer 4 — Memory access
What the agent remembers across steps, on what surface, read back when. The wrong answer is "put all observations in the prompt" — noise drowns the goal. The right answer is managed memory with explicit read/write rules.
We use summary buffer plus last-three-verbatim: working memory at any step is the full PLAN, a running summary of earlier notes, and the last three OBSERVATIONs verbatim. Older observations are compressed into the summary via save_note.
Surfaces:
- NOTES: structured buffer from save_note; runtime injects a compact
version at the top of each step.
- PLAN: Phase 1 plan, persistent.
- RECENT: last 3 OBSERVATION blocks verbatim; older ones drop once
compressed into NOTES.
Write rules:
- save_note(fact) immediately when an OBSERVATION contains a citable
claim you intend to use.
- save_note(dead_end) when a sub-question cannot be answered from a
given source, to avoid retrying the same path.
- save_note(source) once per new domain.
Read rules:
- Before each web_search, scan NOTES for facts/dead_ends on the same
sub-question. Do not re-search what is already answered.
- Before REFLECT, scan NOTES for coverage.
- Never quote from a NOTES summary. Quotations come only from RECENT
OBSERVATION text or from a save_note "quote" field.
Why not full history. Stale observations get treated as authoritative if left in context — the agent quotes a step-3 snippet at step twelve as if fresh. The summary buffer marks old content as summary; last-three-verbatim keeps the fresh signal. The Context Engineering Maturity Model calls this Level 4: committed memory with explicit policy.
What this prevents. Repeated searches. Stale quotations. Drowning in tool-result noise past step six.
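Assembled per step, summary-buffer-plus-last-three-verbatim is roughly the following. The surface names match the prompt text; the function itself is a hypothetical sketch of what the runtime injects each turn.

```python
def build_context(plan, notes_summary, observations, keep_verbatim=3):
    """Working memory: full PLAN, compressed NOTES, last N OBSERVATIONs verbatim."""
    recent = observations[-keep_verbatim:]
    compressed = max(0, len(observations) - keep_verbatim)
    return "\n\n".join([
        "PLAN:\n" + plan,
        f"NOTES (summary; {compressed} earlier observations compressed):\n" + notes_summary,
        "RECENT:\n" + "\n---\n".join(recent),
    ])
```

Labeling the NOTES block as a summary in the injected text is deliberate: it is what lets the "never quote from a summary" read rule bind to something the model can see.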
Layer 5 — Output validation
The output schema plus the self-critique that happens before emitting it. Uncompromising rule: the prompt carries the schema, the runtime enforces it. Do not trust the agent to self-correct on JSON shape.
{
"kind": "FINAL",
"topic": "string",
"summary": "string (120-180 words)",
"key_findings": [{ "finding": "string", "note_ids": ["string"] }],
"sources": [{ "url": "string", "title": "string", "attribution": "string" }],
"coverage": { "sub_questions_answered": 0, "sub_questions_total": 0 }
}
PARTIAL has the same shape plus a "limitation" field for graceful aborts.
Before emitting FINAL/PARTIAL, emit a VALIDATION block, yes/no each:
1. kind is "FINAL" or "PARTIAL"?
2. summary is 120–180 words?
3. key_findings has exactly 5 items?
4. every finding has ≥1 note_id?
5. sources has ≥3 items with distinct domains?
6. every source URL appears in a saved note (source or fact) from
this session?
7. no placeholder text ("TBD", "fill in", "example")?
8. every cited claim in the summary is backed by a note?
If any answer is no, revise and re-run. Only emit when all are yes.
Output the JSON on a single line, no prose preamble.
Self-critique in the prompt; schema check in the runtime. VALIDATION catches semantic wrongness the schema cannot see — placeholder text, uncited claims, off-by-one finding count. The runtime catches shape wrongness the agent is unreliable on — trailing commas, missing fields, wrong types. Both are needed. This is where structured output and self-refine meet. On runtime parse failure, feed the parse error back as an OBSERVATION and let Layer 6 handle the retry.
What this prevents. "Almost JSON." Fabricated sources. Placeholder text. Briefs that parse cleanly but are semantically wrong.
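The runtime half of the split can be a small function. A minimal sketch covering VALIDATION items 1, 2, 3, and 5, the items a program can verify without reading the notes buffer; it is not a full JSON Schema validator, and the field names simply follow the schema above.

```python
import json
from urllib.parse import urlparse

def check_brief(raw):
    """Return a list of failure messages; an empty list means the shape passes."""
    try:
        brief = json.loads(raw)
    except json.JSONDecodeError as e:
        return [f"parse: {e}"]  # fed back to the agent as an OBSERVATION
    errors = []
    if brief.get("kind") not in ("FINAL", "PARTIAL"):
        errors.append("kind must be FINAL or PARTIAL")
    words = len(brief.get("summary", "").split())
    if not 120 <= words <= 180:
        errors.append(f"summary is {words} words, need 120-180")
    if brief.get("kind") == "FINAL" and len(brief.get("key_findings", [])) != 5:
        errors.append("key_findings must have exactly 5 items")
    domains = {urlparse(s.get("url", "")).netloc for s in brief.get("sources", [])}
    if len(domains - {""}) < 3:
        errors.append("need >=3 sources with distinct domains")
    return errors
```

Items 4, 6, and 8 need the notes buffer to check, which is exactly why the in-prompt VALIDATION block still exists alongside a checker like this.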
Layer 6 — Error recovery
What happens on tool failures, dead ends, and validation loops. Under-built Layer 6 is the single most common reason demo agents break in production.
Tool error (timeout, 4xx, 5xx, malformed): save_note(dead_end, ...)
and continue. Never ignore.
Rate limit: retry once with same args; on second failure, try a
reformulated query. If budget exhausted, emit PARTIAL.
Contradictory sources: save both facts with opposing attribution and
note the conflict in the finding. Do not pick silently.
Validation loop: if VALIDATION fails the same item twice, stop
self-revision and emit PARTIAL with limitation describing the gap.
Retry caps: 2 retries per tool call with different args; 3 searches
per sub-question before declaring dead_end; hard stop at 15 steps.
Graceful abort (emit PARTIAL, not FINAL) when:
- step budget exhausted before REFLECT passes,
- fewer than 3 distinct source URLs saved,
- VALIDATION failed the same check twice,
- any required tool unavailable.
PARTIAL beats FINAL built on retries, padding, or fabrication.
What this prevents. Infinite retries. Silent tool failures. Fabricated padding when the agent cannot meet the citation floor. The explicit preference for PARTIAL over FINAL is the most important line in this layer — it tells the agent that giving up gracefully is a success, not a failure. Reflexion patterns fit here if you want the agent to learn from prior dead ends within a session.
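The retry-cap-then-abort rule is worth enforcing in code as well as prompt text. A sketch under assumptions: `call_tool` is a stand-in executor, `ToolError` a stand-in failure type, and `reformulate` a hypothetical hook for rewording arguments between attempts.

```python
class ToolError(Exception):
    """Stand-in for timeouts, 4xx/5xx, and malformed-result failures."""

def call_with_retries(call_tool, name, args, reformulate, max_retries=2):
    """Attempt the call, retry with reworded args up to the cap, else abort."""
    attempt_args = args
    for _ in range(1 + max_retries):
        try:
            return call_tool(name, attempt_args)
        except ToolError:
            attempt_args = reformulate(attempt_args)  # e.g. a reformulated query
    # Cap reached: record a dead_end note upstream and prefer PARTIAL over padding.
    return {"kind": "PARTIAL", "limitation": f"{name} failed after {max_retries} retries"}
```

Returning a PARTIAL shape rather than raising keeps the failure on the same path the prompt already teaches the model to expect.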
The assembled prompt
Stitched as a single system prompt for an autonomous agent with agentic AI tool-calling enabled. Copy-pasteable — drop it into your runtime, wire up the three tools, point it at a topic.
SYSTEM
You are a research agent. You take a topic and produce a structured
research brief backed by cited sources.
--- LAYER 1: GOALS ---
[goal, scope, hard constraints block from above]
--- LAYER 2: TOOL PERMISSIONS ---
[tool list, schemas, budgets, tool_choice defaults from above]
--- LAYER 3: PLANNING SCAFFOLD ---
[three-phase plan-execute-reflect block from above]
Example of one execute step (abridged):
THOUGHT: Sub-question 2, "what benchmarks are commonly used." Notes
have one fact from step 3. Need one more source. Will search
for review papers.
ACTION: web_search(query="RAG benchmarks survey 2025")
OBSERVATION: [filled by runtime]
--- LAYER 4: MEMORY ACCESS ---
[NOTES / PLAN / RECENT surfaces and read/write rules from above]
--- LAYER 5: OUTPUT VALIDATION ---
[schema + 8-item VALIDATION checklist from above]
--- LAYER 6: ERROR RECOVERY ---
[tool error, retry, graceful-abort rules from above]
USER
<topic goes here>
In deployment, Layers 1, 2, and 4 cache as a stable system prefix. Layer 3's per-step THOUGHT/ACTION/OBSERVATION flows through turns. Layers 5 and 6 are also enforced programmatically — schema parse, budget check, step count — regardless of what the model emits. This matches the caching guidance in the Agentic Prompt Stack.
Score it with the Rubric
Applying the SurePrompts Quality Rubric to the assembled prompt. Target for a production agent prompt: 30 or higher out of 35.
| Dimension | Score | Notes |
|---|---|---|
| Role clarity | 4/5 | "Research agent" is clear but generic. A sharper role ("cite-first technical research agent") would tighten this. |
| Context specificity | 4/5 | The scenario and sub-question framing are concrete; missing: domain-specific source-quality heuristics. |
| Action definition | 5/5 | Three-phase scaffold with explicit per-step shape and a named termination signal. |
| Constraint tightness | 5/5 | Word counts, finding counts, citation floor, budgets, and give-up conditions are all specific. |
| Output specification | 5/5 | JSON schema + VALIDATION checklist + runtime enforcement. This is as tight as this dimension gets. |
| Example quality | 4/5 | One abridged execute-step example. Adding a worked PLAN and a worked REFLECT example would push this to 5. |
| Failure-mode handling | 5/5 | Layer 6 is spelled out end-to-end; graceful abort is preferred over padding. |
Total: 32/35. Held back by role clarity, context specificity, and example quality. The RCAF Prompt Structure would tighten those in a focused pass. For agent prompts, constraint tightness and output specification weigh heavier than role clarity — we would ship this.
Debugging by layer
Pattern-match symptom to layer first, read the trace at that layer second, change prompt text third. Rewriting the whole prompt when Layer 6 is what broke is how one-line fixes become three-day regressions.
- Emits FINAL with fewer than three sources. Layer 1 (success criterion not self-checked) or Layer 5 (VALIDATION skipped).
- Searches the same query six times. Layer 4 (dead_end write rule not firing) or Layer 6 (no per-sub-question retry cap).
- Fabricates a URL in sources. Layer 5 — VALIDATION item 6 missing or skipped. Enforce it programmatically.
- Runs to step 15, emits malformed JSON. Layer 5 (runtime not enforcing schema) or Layer 6 (no validation-failure-loop rule).
- Ignores a 500 error. Layer 6 — tool-error rule not followed; check that the prompt isn't truncated.
- Real sources, but findings share citations unevenly. Layer 5 — VALIDATION item 4 too loose; tighten to "each source cited by at least one finding."
- Plans five sub-questions, executes three, quits. Layer 3 (REFLECT not emitted) or Layer 1 ("done" read too loosely).
- Never saves notes. Layer 4 — write rule described but not modeled; add a worked example.
Our position
- Plan-execute-reflect beats ReAct alone past five steps. Under five, ReAct is lighter and fine. Past five, PLAN and REFLECT are cheap insurance against drift and thin coverage.
- Layer 5 is non-negotiable. Even the simplest agent needs a JSON schema, a self-critique step, and runtime parse-and-retry. Trusting the model to self-correct on format is the top reason "demo worked, production didn't."
- PARTIAL over FINAL on budget exhaustion. A PARTIAL with a limitation field is honest. A padded FINAL is a hallucination machine. Preferring PARTIAL in prompt text changes agent behavior more than almost any other single instruction.
- Summary buffer plus last-three-verbatim is the right memory starting pattern. Full-history context is expensive and degrades past step six. Pure summaries lose citation provenance. The hybrid holds.
- The stack is a debugging tool more than a drafting tool. Its value shows up when an agent breaks at step nine in production and you know within two minutes it's a Layer 4 write-rule problem, not a model quality problem.
Related reading
- The Agentic Prompt Stack — the canonical reference for the six-layer model this walkthrough applies.
- Agentic AI Prompting Guide — the broader discipline underneath.
- The Complete Guide to Prompting AI Coding Agents (2026) — the coding-agent application of the same stack.
- Multi-Agent Prompting Guide — N stacks plus a coordination protocol.
- Plan-and-Execute Prompting — the Layer 3 scaffold used here.
- ReAct Prompting Guide — the most common alternative Layer 3 scaffold.
- Tool Use Prompting Patterns — Layer 2 tactical companion, including tool_choice patterns.
- Self-Refine Prompting Guide — a Layer 5 self-critique pattern.
- Reflexion Prompting Guide — a Layer 6 error-recovery pattern.
- The SurePrompts Quality Rubric — the prompt-level audit used above.
- Context Engineering Maturity Model — infrastructure model underneath Layer 4.