Tags: agentic RAG, RAG, AI agents, tool use, retrieval, context engineering, multi-hop

Agentic RAG: A Walkthrough of Retrieval as a Tool Call

Agentic RAG treats retrieval as a tool the model calls on demand, not a fixed first step. This walkthrough contrasts it with linear RAG, traces a multi-hop research agent, and names the control plane that keeps costs bounded.

SurePrompts Team
April 22, 2026
13 min read

TL;DR

Classic RAG retrieves once, then generates. Agentic RAG turns retrieval into a tool the model calls as many times as it needs — issuing sub-queries, noticing gaps, and stopping when it has enough. The pattern unlocks multi-hop questions at the cost of latency variance and runaway loops, so the controls — max iterations, token budgets, explicit stop conditions, tool-use traces — are not optional; they are the pattern. This walkthrough traces a hypothetical research agent and shows how those budgets, caps, and traces keep the pattern from eating your bill.

Key takeaways:

  • RAG with a fixed "retrieve then generate" shape hits a ceiling on multi-hop questions — the right query for step two depends on what step one returned.
  • Agentic RAG reshapes retrieval as a tool call inside an agent loop. The model decides when to retrieve, what to query, and when to stop.
  • The cost is non-trivial: latency variance, harder evals, and agent-specific failure modes — runaway loops, over-retrieval, drift. Left uncontrolled, these are paid for in dollars and user trust.
  • The control plane is four things: a hard iteration cap, a token budget enforced by the orchestrator, explicit stop conditions the model evaluates, and per-query tool-use traces.
  • Agentic RAG is overkill for single-fact lookups. Use it where questions genuinely require multi-hop reasoning or where the corpus sometimes does not contain the answer.
  • It composes with corrective RAG and self-RAG — they are orthogonal ideas, not alternatives.

Why fixed-step RAG hits a ceiling

A linear RAG pipeline does three things per query, in the same order, every time: embed the user question, fetch top-k chunks from the vector store, hand question and chunks to the generator. That shape is good enough for a large class of questions — single-fact lookups, policy citations, glossary queries — and the RAG prompt engineering guide covers how to build it well.

The shape breaks on two categories.

First, multi-hop questions. "Why did my invoice total change between last month and this one?" requires pricing tiers, the account's recent subscription changes, and the specific invoice line items. A single retrieval on the raw question returns generic billing-policy documents. The second query you would want to issue — "what subscription changes happened on this account in the last 30 days?" — cannot be written without reading the first batch. A fixed-step pipeline has no way to issue it.

Second, questions where the corpus does not contain the answer. Linear RAG returns the five least-bad chunks anyway, and the generator writes a fluent answer on top of weak context. The failure is silent unless something explicitly grades the retrieval.

Both are retrieval-shape problems, not retrieval-quality problems. You can tune your chunking, your reranker, your embedding model forever and still not write the second sub-query that only exists in the context of the first answer. The fix is not a better retriever. The fix is letting the model run retrieval as a tool, not a preamble.

What agentic RAG actually changes

The definition is narrow: retrieval is a tool the agent calls, not a fixed first step. Everything else follows.

Concretely, the model is an AI agent inside a loop. On each turn it can call retrieve(query) with whatever query it generates, read the returned chunks, call retrieve again with a refined query, call a different tool, or emit a final answer. The loop terminates when the model says it has enough or when the orchestrator's iteration cap fires. This is the same shape as ReAct prompting or plan-and-execute — the agentic part is that retrieval is one tool among several, not a privileged first step.
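The loop itself is small. Here is a minimal sketch in Python, assuming a hypothetical `llm` helper that returns the model's chosen tool call and a `tools` dict mapping names to functions; none of this is a real library API:

```python
def agent_loop(question, llm, tools, max_iters=5):
    """Run the agent until it emits final_answer or the iteration cap fires."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_iters):
        action = llm(messages, tools)  # model picks a tool and its arguments
        if action["tool"] == "final_answer":
            return action["args"]["text"]
        # Retrieval is just one entry in tools, not a privileged first step.
        result = tools[action["tool"]](**action["args"])
        messages.append({"role": "tool", "content": str(result)})
    # Iteration cap fired: surface that fact instead of looping forever.
    raise RuntimeError(f"iteration cap ({max_iters}) reached")
```

The point of the shape: nothing outside the loop decides when retrieval happens, and the cap is enforced by the orchestrator rather than trusted to the model.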

Three capabilities come for free once retrieval is a tool.

The model can skip it. "What's the capital of France" does not need to hit your corpus. A linear pipeline retrieves anyway, wastes tokens, and sometimes pulls irrelevant chunks that actively hurt the answer. An agent with retrieval as an optional tool can answer from parametric memory and move on.

The model can issue a sub-query conditioned on the first batch. This is where multi-hop questions stop being pathological. The agent reads "the invoice total changed by $40" from chunk one, notices the question asks why, and issues a follow-up retrieval for subscription changes on that account.

The model can notice the corpus does not contain the answer. Three retrievals return nothing relevant; a well-prompted agent refuses to answer from weak context instead of fabricating from what it has. This is the same discipline corrective RAG bakes into the pipeline, just at a different layer.

None of this is magic. All of it lives or dies on the control plane.

Worked example: a hypothetical research-assistant agent

Hypothetical scenario, not a shipped product. A legal research assistant runs over a corpus of internal case files, jurisdiction-specific statutes, and firm memos. The user asks:

"In our 2023 matter for Northstar Logistics, which regulatory argument did we win on, and has that statute been amended since?"

This is two hops — the matter record, then the statute history — and the second query depends on what the first returns. A linear pipeline would embed the raw question, retrieve a mix of Northstar-tagged documents and vaguely regulatory statutes, and produce a confident-sounding synthesis with unverified citations.

Here is how an agentic RAG system walks it. The model has three tools: retrieve_matters, retrieve_statutes, final_answer. Max iterations: five. Token budget: 12k tokens for the full trace.
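As a sketch, the orchestrator side of this scenario might register the tools and limits like so. The tool names match the walkthrough; the search functions are stand-ins for real vector-store calls, and the registry shape is an assumption:

```python
def search_matters(query):
    # Stand-in for a real vector-store query over matter files.
    return [f"matter chunk for: {query}"]

def search_statutes(query):
    # Stand-in for a real vector-store query over statute texts.
    return [f"statute chunk for: {query}"]

TOOLS = {
    "retrieve_matters": {
        "fn": search_matters,
        "description": "Up to 5 chunks from internal matter files; "
                       "use for questions about specific matters.",
    },
    "retrieve_statutes": {
        "fn": search_statutes,
        "description": "Statute text and amendment history by citation.",
    },
}
MAX_ITERATIONS = 5       # hard cap on tool calls per query
TOKEN_BUDGET = 12_000    # full trace: input + output + tool results
```

Note that the tool descriptions carry behavioral weight; the agent reads them when deciding which tool to call.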

Turn 1. The agent issues its first retrieval — a matter lookup, not a statute one, because the question roots in the Northstar matter.

tool: retrieve_matters
query: "Northstar Logistics 2023 regulatory winning argument"

Returns three matter memos. The agent reads them and notices one names a specific statute — Section 12-406 of the state transport code — as the winning argument. Progress.

Turn 2. The agent now has a concrete citation and issues a targeted statute retrieval, which it could not have written before reading turn one.

tool: retrieve_statutes
query: "Section 12-406 state transport code amendment history"

Returns the current statute text and two amendment entries from 2024 and 2025. The agent reads them and has both pieces of evidence it needs.

Turn 3. The agent evaluates its stop condition — "do I have enough to answer?" — and decides yes. It emits the final answer, citing the matter memo and both amendment entries. Three tool calls total; the iteration cap never fires; the token budget is well within limits.

Now the failure-mode version of the same query. Same prompt, no controls.

Turn 1. Matter retrieval. Good.

Turn 2. Statute retrieval on Section 12-406. Good.

Turn 3. The agent, prompted vaguely, decides to "be thorough" and retrieves statutes adjacent to Section 12-406 — related transport code sections the user never asked about.

Turn 4. The adjacent statutes mention another matter involving Northstar from 2019. The agent retrieves that matter "for context".

Turn 5. The 2019 matter mentions a subsidiary. The agent retrieves the subsidiary's matters.

Turn 6. The iteration cap (if you set one) fires. The final answer is padded with three paragraphs of tangentially relevant material the user did not want. Cost is three times that of the disciplined version; latency is longer than the user's patience.

The first trace is agentic RAG working as designed. The second is agentic RAG with no control plane. The pattern itself is the same. The difference is the controls.

The control plane

Four controls. None of them are optional.

Max iterations. A hard cap on how many tool calls the agent can make per query. Three to five is a reasonable default for retrieval-heavy agents. The cap exists because the model will occasionally decide to keep retrieving forever, and "forever" on a paid API is a line item on your bill. Set it low, let real traces tell you which questions need more, raise it for those specifically.

Token budget. A ceiling on total tokens — input plus output plus tool results — the agent can spend on one query. Enforced by the orchestrator, not by asking the model nicely. When the budget is 80% consumed, the orchestrator forces a final_answer call with whatever the agent has. This catches the case where the iteration cap is generous but each retrieval returns a huge chunk that balloons context.
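Both hard limits can live in a single orchestrator check, run before every tool call. A minimal sketch, with illustrative numbers:

```python
def must_force_answer(iteration, spent_tokens, max_iters=5, budget=12_000):
    """True when the orchestrator should force final_answer now.

    Enforced in code, not by asking the model nicely: the iteration cap
    is absolute, and the 80% token threshold leaves room for the answer.
    """
    return iteration >= max_iters or spent_tokens >= 0.8 * budget
```

When this returns True, the orchestrator injects a forced final_answer call with whatever evidence the agent has collected so far.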

Explicit stop conditions. Prompt the agent, after each retrieval, to answer a branching question: "Do you have enough to answer the original question? If yes, emit final_answer. If no, what specific missing information are you retrieving next?" This makes stopping an actual decision the model has to defend, not a thing that happens by accident when the iteration cap fires.
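As a sketch, the branching question can be appended to the conversation after every retrieval; the wording is illustrative, not a canonical prompt:

```python
STOP_CHECK = """You just read the retrieved chunks.
1. Do you have enough to answer the ORIGINAL question? If yes, call final_answer now.
2. If no, state the specific missing fact, then issue exactly one retrieval for it.
Do not retrieve "for context" or to "be thorough"."""

def with_stop_check(messages):
    # Appended after each tool result so stopping is a defended decision,
    # not a side effect of the iteration cap firing.
    return messages + [{"role": "user", "content": STOP_CHECK}]
```

The prohibition in the last line targets the over-retrieval failure mode directly: thoroughness prompts produce extra tool calls.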

Tool-use traces. Log every tool call per query — query text, returned chunks (at minimum their IDs), latency, tokens. Sample a small percentage of traces for manual review. The pattern you are looking for: queries that keep retrieving without new information (drift), queries that stop early on easy questions (efficient), queries where the second retrieval is conditional on the first (the reason you adopted agentic RAG in the first place). Traces are how you tell whether the pattern is paying for itself.

This is the context-engineering discipline applied at the orchestrator layer. The Context Engineering Maturity Model calls out traceability and observability at Levels 4 and 5; agentic RAG without traces is a Level 2 habit wearing a Level 5 costume.

When NOT to use agentic RAG

The pattern is not free. Skip it when the question shape does not justify the overhead.

Single-fact lookups. "What is our refund policy?" The answer is one chunk in your help-center corpus. Linear RAG gets it in one shot. Wrapping it in an agent loop adds two tool-call hops of latency and triples cost for zero quality gain.

Fixed-schema Q&A. If every question follows a template — "show me the status of order X", "what does field Y do" — the best retrieval query is the template, not something the model invents. Agentic RAG's upside is in query creativity; fixed-schema questions have nowhere for creativity to help.

Latency-sensitive paths. Agentic RAG's latency is a distribution, not a number. Easy queries are fast; hard queries take as long as they take. If your product surface is an interactive chat where p95 latency matters more than average, linear RAG with a high-quality reranker is the safer choice.

Small, high-quality corpora. If your corpus is 500 documents all curated by hand, linear RAG with hybrid search and a good reranker outperforms agentic approaches on most queries. The multi-hop upside shows up when the corpus is large and messy enough that no single query reliably surfaces the right mix.

The general rule: use agentic RAG when the shape of the retrieval — how many queries, what those queries are, in what order — genuinely depends on what earlier retrievals returned. Where it does not, it is architecture theater.

Failure modes

Four anti-patterns that show up repeatedly in agentic RAG systems that skipped the control plane.

Runaway loops. The agent keeps calling retrieve, each call returning adjacent-but-not-useful results, and the loop never converges on an answer. Caused by weak stop conditions and missing iteration caps. The cheapest fix is a hard iteration cap; the durable fix is an explicit "do you have enough" prompt after each retrieval.

Over-retrieval on simple questions. A well-instrumented team once discovered their agent was calling retrieve three to four times on questions a linear pipeline answered in one. The model had been prompted to "be thorough" and interpreted thoroughness as more tool calls. Fix: prompt for sufficiency, not thoroughness. "Retrieve only what you need to answer" beats "retrieve thoroughly" every time.

Drift. The agent starts on topic, retrieves something tangentially related, and follows that thread into documents the user never asked about. The original question quietly disappears from the conversation. This is what query rewriting can help with at the pipeline layer, but in an agentic loop it also requires the model to re-anchor each retrieval against the original question, not the most recent one.

Silent truncation. Tool results accumulate in the conversation context. By iteration four, earlier retrievals may have been compressed out of the prompt window, and the final answer is written from a subset of the evidence the agent actually collected. Fix: summarize intermediate results into a structured working memory the agent maintains explicitly, rather than relying on the model to keep its own growing context straight.
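One way to implement that working memory is a structure the orchestrator maintains outside the prompt window. A sketch, with `summarize` standing in for an assumed LLM summarization call:

```python
class WorkingMemory:
    """Explicit evidence notes, immune to context-window truncation.

    summarize is an assumed LLM call that compresses chunks to one short
    note; any callable with that shape works for testing.
    """

    def __init__(self, summarize):
        self.summarize = summarize
        self.notes = []

    def add(self, query, chunks):
        # Compress immediately, while the chunks are still in context.
        self.notes.append((query, self.summarize(chunks)))

    def render(self):
        # Injected into every subsequent prompt; grows one line per hop,
        # so early evidence survives however long the loop runs.
        return "\n".join(f"- [{q}] {note}" for q, note in self.notes)
```

The final answer is then written against `render()` plus the last retrieval, not against a prompt window that may have silently dropped turn one.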

All four are why you instrument tool-use traces from day one. They are not visible in output quality until they are expensive.

Our position

Four opinionated stances.

  • Default to linear RAG. Earn your way to agentic. Most RAG workloads do not need an agent loop. Build the linear pipeline first, measure where it fails — multi-hop, corpus-miss, query-shape questions — and adopt agentic RAG for those slices specifically. A system that is linear for 80% of queries and agentic for the 20% that need it is cheaper and better than one that is agentic for everything.
  • The control plane is the pattern. Iteration caps, token budgets, stop conditions, and traces are not operational polish — they are what distinguishes agentic RAG from "a bunch of retrievals in a loop." A team that has the loop but no traces has not implemented agentic RAG. They have implemented a bill.
  • Evaluate on the hard slice, not the average. Average-case evals hide the payoff. Curate a golden set of multi-hop questions the linear pipeline fails on, and measure agentic RAG against that set specifically. RAGAS on that slice — especially context recall — is where the pattern's contribution shows up. See the RAGAS evaluation walkthrough for the metrics plumbing.
  • Compose with CRAG and Self-RAG, do not choose between them. Agentic RAG, corrective RAG, and self-RAG attack the same weak-context failure at different layers. An agentic system can use a CRAG-style relevance grader as one of its tools and still benefit from self-RAG-trained generation. Pick the pattern that matches your failure mode; do not treat them as mutually exclusive schools.
  • Prompts still matter. The agent's tool-use behavior is driven by its system prompt and tool descriptions as much as by the orchestrator. A sloppy tool description ("retrieves relevant documents") produces a sloppy calling pattern; a precise one ("retrieves up to 5 chunks relevant to the query; use for questions requiring factual citations from internal matter files") produces disciplined calls. Score the controlling prompts against the SurePrompts Quality Rubric alongside whatever eval harness you run over the agent itself.
