
Tool Use Prompting Patterns: Getting Reliable Tool Calls (2026)

Prompt patterns that make tool use reliable — clear tool descriptions, tool-forcing vs tool-permitting, error recovery, and handling malformed arguments.

SurePrompts Team
April 20, 2026
12 min read

TL;DR

Tool use as a capability is universal, but reliable tool use requires prompt patterns — clear tool descriptions, explicit guidance on when to call vs answer directly, and error recovery when calls fail.

Tool use — also called function calling — is supported by every major frontier model in 2026. Claude, GPT, and Gemini can all receive a list of tools, decide when to call one, emit structured arguments, and incorporate the result. That is the capability. Reliable tool use is a different problem: picking the right tool, calling it at the right time, passing valid arguments, handling errors, and stopping when done. Almost all of that is controlled by the prompt. This guide covers the patterns that turn raw capability into reliable behavior — see the tool-use glossary entry for the compact definition.

What Tool Use Means

At the capability level, tool use works roughly the same across providers. You give the model a system prompt and a list of tool schemas — each with a name, a description, and a parameter definition. When the model decides a tool is needed, instead of answering the user it emits a structured call with the tool name and argument values. Your harness executes the tool, captures the result, and sends it back to the model as a follow-up turn. The model then either calls another tool or replies.

The model does not run the tool itself — it only emits the call. Everything else (executing, returning results, looping) is the harness's job. That means the prompt is where the decision-making happens: which tool, when, with what arguments, and when to stop. For the reasoning loop that wraps this capability, see ReAct prompting.
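The loop described above can be sketched in a few lines. This is a minimal illustration with a stubbed model standing in for a real provider API call; the `get_time` tool, the message shapes, and `fake_model` are hypothetical, not any specific provider's format.

```python
def get_time(city):
    # Hypothetical tool implementation the harness executes.
    return {"city": city, "time": "14:05"}

TOOLS = {"get_time": get_time}

def fake_model(messages):
    # Stub standing in for a provider API call: request the tool once,
    # then answer from its result on the following turn.
    last = messages[-1]
    if last["role"] == "tool":
        return {"type": "text",
                "text": f"It is {last['content']['time']} in {last['content']['city']}."}
    return {"type": "tool_call", "name": "get_time", "args": {"city": "Oslo"}}

def run(user_message, max_turns=5):
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_turns):           # hard cap: the harness owns the loop
        reply = fake_model(messages)
        if reply["type"] == "text":      # model chose to answer: stop
            return reply["text"]
        tool = TOOLS[reply["name"]]      # model emitted a call: execute it
        result = tool(**reply["args"])
        messages.append({"role": "tool", "content": result})
    return "Turn budget exhausted."

print(run("What time is it in Oslo?"))   # It is 14:05 in Oslo.
```

Note that the model function only ever returns a call or a text reply; every side effect happens in the harness loop.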

Why Tool Use Fails

Most failed tool-use runs trace back to one of five failure modes. They are worth naming because each has a different prompt-level fix.

| Failure mode | What it looks like | Typical cause |
|---|---|---|
| Wrong tool | Model calls search when it should have called read_file | Overlapping tool descriptions; no guidance on when to prefer one |
| No tool | Model answers from its training data when it should have looked up current data | System prompt does not require tool use for that class of question |
| Malformed arguments | JSON is invalid, required field missing, type wrong | Schema unclear; no example in the tool description |
| Infinite loop | Model keeps calling the same tool forever | No stop condition; tool error treated as "try again" |
| Trusted hallucinated result | Tool returns nothing useful, model invents content anyway | No instruction about what to do on empty or error results |

All five are prompt-level problems, not model-level. A smarter model reduces the rate but does not remove the failure mode. Prompt patterns do.

Writing Good Tool Descriptions

The tool description is where most bugs live. The model uses the description to decide whether to call the tool at all, and to distinguish it from other tools. A vague description breaks both decisions.

A good description has five parts:

  • Name. Short, verb-first, distinctive. search_docs beats docs.
  • Purpose. One sentence on what the tool does and what problem it solves, written from the model's perspective: "Use this when the user asks for…"
  • Parameters. Each one named, typed, described. Required vs optional marked.
  • Example use. One concrete sample argument set and what the tool returns. Short.
  • What it does NOT do. An explicit negative. "Does not search the web. Use web_search for that." This line prevents most wrong-tool errors.

The negative is the part people skip, and it is often the most useful. When the model has two tools whose descriptions both sound plausible, it picks essentially at random. An explicit "this is not that" line forces the distinction.
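Put together, a five-part description might look like the sketch below. The schema shape loosely follows the JSON-Schema-style parameter definitions most provider APIs accept; the search_docs tool and its fields are hypothetical, not any specific provider's exact format.

```python
# A tool schema following the five-part checklist above (hypothetical tool).
search_docs = {
    "name": "search_docs",                       # short, verb-first, distinctive
    "description": (
        "Search the internal documentation index and return matching pages. "
        "Use this when the user asks about internal processes or policies. "
        "Example: search_docs(query='vacation policy', limit=3) -> "
        "[{'doc_id': 'doc_1042', 'title': 'Time Off', 'snippet': '...'}]. "
        "Does NOT search the public web. Use web_search for that."  # explicit negative
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Keywords to search for."},
            "limit": {"type": "integer", "description": "Max results. Default 5."},
        },
        "required": ["query"],                   # required vs optional marked
    },
}
```

The description packs purpose, example, and negative into one string because that string is all the model sees when choosing between tools.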

Tool-Forcing vs Tool-Permitting

Most provider APIs expose three tool-use modes:

  • Auto. The model decides whether to call a tool or answer directly. Default.
  • Any / required. The model must call a tool (not answer in text). Lets it pick which one.
  • Specific tool. The model must call a named tool. No choice.

The prompt-level equivalents — useful as reinforcement or when the provider does not expose modes — are:

  • Tool-permitting language. "You may use these tools if helpful." Leaves it to the model.
  • Tool-forcing language. "You must call one of these tools before responding." Requires a call.
  • Phase-scoped forcing. "In the research phase, you must call search at least once. In the writing phase, do not call any tools." Different regimes at different stages.

Forcing is right when the question cannot be answered from training data (time-sensitive, user's own files, fresh lookup). Permitting is right when the model might know the answer and a tool call would add latency without value. The mistake is defaulting to auto for everything. Auto lets the model guess, and for questions where a stale answer is worse than a slow one, that guess is often wrong.
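One way to operationalize this is to pick the mode per question class instead of defaulting to auto. The helper below is a hypothetical sketch; the actual field names differ by provider (for example, OpenAI's tool_choice accepts "auto", "required", or a named function, while Anthropic's accepts auto, any, or a specific tool).

```python
def tool_choice_for(question_kind):
    # Map the question class to a tool mode, rather than using auto everywhere.
    # Mode names here are illustrative, not a real provider's API.
    if question_kind == "time_sensitive":
        return {"mode": "required"}                   # must call some tool
    if question_kind == "user_files":
        return {"mode": "tool", "name": "read_file"}  # must call this tool
    return {"mode": "auto"}                           # model decides
```

The point is that the mode is a per-request decision the harness makes, not a global setting.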

Handling Tool Errors in Prompts

Tools fail. A search returns no results, an API rate-limits, a file does not exist, a query is malformed. The model sees the error message in the next turn and has to decide what to do. Without guidance, it usually does one of two things, both wrong: retry the same call forever, or silently invent a result and keep going.

The prompt pattern is a short error protocol in the system prompt:

```
When a tool call returns an error:
1. Read the error message — do not ignore it.
2. If the error is transient (timeout, rate limit), retry once with the same args.
3. If the error is about your arguments (invalid format, missing field),
   fix the arguments and retry once.
4. If the error persists or indicates the resource does not exist, stop
   calling that tool and tell the user what went wrong. Do not invent
   a result.
```

Four short rules in the system prompt, and the model stops looping on errors. The retry cap matters — without it, the model will keep retrying with tiny variations for as many turns as the harness allows.
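The same retry cap can be enforced harness-side as a backstop, refusing to forward a call the model has already made identically too many times. A minimal sketch, assuming the harness inspects each call before executing it; make_repeat_guard is a hypothetical helper:

```python
import json

def make_repeat_guard(max_repeats=2):
    seen = {}
    def allow(name, args):
        # Canonical identity for a call: tool name plus sorted-key args.
        key = (name, json.dumps(args, sort_keys=True))
        seen[key] = seen.get(key, 0) + 1
        return seen[key] <= max_repeats
    return allow

allow = make_repeat_guard(max_repeats=2)
```

If `allow` returns False, the harness can inject a synthetic error result telling the model to stop and report, mirroring step 4 of the protocol.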

Handling Tool Results in Prompts

The other side of the same problem is what to do with successful tool results. By default the model treats tool output as gospel, which is a problem when the tool is search (results may be irrelevant), retrieval (chunks may be low-quality), or scraping (pages may be wrong). A good prompt forces the model to evaluate results before using them.

Three lines that help:

```
When a tool returns results:
- Read them carefully. Do not assume they answer the question.
- If the results are empty, irrelevant, or contradictory, say so and
  call the tool again with different args OR ask the user for
  clarification. Do not fill in the gap from memory.
- When you cite a result in your answer, reference which tool call it
  came from (e.g. "search result 2").
```

The citation requirement is the quiet heavy-lifter. When the model has to cite which tool call each claim came from, it cannot smoothly merge tool output with hallucinated content — the citation exposes the gap.

Constraining Tool Use Deliberately

Reliability also comes from limiting tool use. A few constraints worth applying:

  • Budget limits. "Use at most 5 tool calls for this task." Prevents drift on open-ended questions. The model learns to plan its calls rather than scatter them.
  • Phase-specific tools. "In the planning phase, only read_file and grep are allowed. In the implementation phase, edit_file is added. In the verification phase, only run_tests is allowed." Matches plan-and-execute prompting where different phases need different capabilities.
  • Order constraints. "You must call read_file on the file before calling edit_file on it." Prevents blind edits to files the model has not looked at.
  • Forbidden combinations. "Do not call delete_file in the same turn as create_file on the same path. If you need to rewrite a file, use edit_file." Stops destructive patterns.

These are brittle — a stubborn model will sometimes break the rule — but they catch most violations and make the rest visible in traces.
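The same rules can also be checked harness-side, so a violation fails loudly in the trace instead of slipping through. A minimal sketch covering the budget and order constraints, using the hypothetical tool names from the bullets above:

```python
class ConstraintChecker:
    """Harness-side checks mirroring the prompt-level constraints (sketch)."""

    def __init__(self, budget=5):
        self.budget = budget
        self.calls = 0
        self.files_read = set()

    def check(self, name, args):
        self.calls += 1
        if self.calls > self.budget:                  # budget limit
            return "budget exceeded"
        if name == "edit_file" and args["path"] not in self.files_read:
            return "must read_file a path before edit_file"  # order constraint
        if name == "read_file":
            self.files_read.add(args["path"])
        return "ok"

checker = ConstraintChecker(budget=5)
```

A non-"ok" result can be fed back to the model as a tool error, which keeps the correction inside the normal error protocol.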

A Good Tool-Use Setup — Hypothetical Example

Here is an illustrative system prompt for a research agent with three tools. The specifics are hypothetical and meant to show the shape of a well-formed setup, not a production spec.

```
You are a research assistant. You answer questions using the tools below.
You must call at least one tool before answering — do not answer from memory
on topics where current data matters.

Tools:

1. web_search(query: string) -> list of {title, url, snippet}
   Purpose: Find recent articles, news, or documentation on the open web.
   Use when the user asks about current events, recent releases, or
   general public information.
   Example: web_search("Anthropic Claude 4.7 release date")
   Does NOT: search private documents or the user's files.

2. read_doc(doc_id: string) -> string
   Purpose: Retrieve the full text of an internal document by ID.
   Use when the user references a specific internal doc or when a
   web_search result links to an internal doc_id (format: "doc_XXXX").
   Example: read_doc("doc_1042")
   Does NOT: search — you must already have the doc_id.

3. finish(answer: string) -> terminates the loop
   Purpose: Emit the final answer and stop.
   Use ONLY when you are confident the answer is grounded in tool results
   you have just seen.

Error protocol:
- On error, read the message. Retry once if transient or if your args
  were wrong. Otherwise stop and report the failure to the user.

Result protocol:
- Cite the tool call each claim came from ("from web_search result 2").
- If results are empty or irrelevant, call the tool again with different
  args or ask the user for clarification. Do not invent.

Budget: at most 6 tool calls per question.
```

This setup does six things: it enumerates tools with explicit negatives, forces at least one tool call, defines an explicit terminator, sets an error protocol, sets a result protocol, and caps the budget. Each one corresponds to a failure mode from the table above.

Common Anti-Patterns

  • Vague tool descriptions. "This tool searches things." The model cannot tell it apart from other tools. Fix: specify purpose, example, and what it does NOT do.
  • Too many tools. A 40-tool toolbox overwhelms selection. Fix: group by phase (planning tools vs execution tools) and expose only the phase-relevant subset.
  • No stop condition. No finish tool, no budget, no "when done, reply to the user." Model loops. Fix: always define an explicit terminator.
  • Trusting every result. No instruction to evaluate results before using. Model merges garbage into its answer. Fix: require citation and evaluation in the system prompt.
  • Single-mode thinking. Setting everything to auto tool choice regardless of whether the question needs fresh data. Fix: use tool-forcing for time-sensitive questions, tool-permitting for general ones.
  • Ignoring errors. No error protocol, so the model retries the same broken call until it runs out of turns. Fix: four-line error protocol with a retry cap.

For agents that loop over tool calls with reasoning in between, these patterns compose with multi-agent prompting — tool descriptions and protocols become part of what each agent sees.

FAQ

How many tools is too many?

No hard number, but selection accuracy degrades noticeably past around 10–15 tools in a single prompt based on common practitioner reports. Past that point, either scope the tools to the phase (research tools in research, execution tools in execution), or group tools under a dispatcher — a single do(action, params) tool with action names instead of individual tools. The dispatcher shifts the decision into the arguments, which the model handles differently than tool selection.
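The dispatcher idea can be sketched as a single routing function. The do tool, its action names, and the handlers below are hypothetical:

```python
def _search(params):
    # Hypothetical handler behind the dispatcher.
    return f"searched for {params['query']}"

def _read(params):
    return f"read {params['doc_id']}"

ACTIONS = {"search": _search, "read": _read}

def do(action, params):
    # The single tool the model sees; the harness routes on `action`.
    handler = ACTIONS.get(action)
    if handler is None:
        # Return the valid actions so the model can self-correct.
        return f"unknown action '{action}'; valid: {sorted(ACTIONS)}"
    return handler(params)
```

The unknown-action branch matters: returning the valid action list turns a selection mistake into a recoverable error rather than a dead end.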

Should tool descriptions be long or short?

Long enough to disambiguate, short enough to not bloat the prompt. A useful shape is: one-sentence purpose, parameter list, one example, one-sentence negative. Five to ten lines per tool. Avoid paragraphs of marketing about what the tool can do.

When should I use tool-forcing over tool-permitting?

Force a tool call when answering without the tool risks stale or invented data — anything about current events, specific user files, live database state, or external systems. Permit when the model might genuinely know the answer and a tool call would add latency without value. When in doubt, force — the cost of an unnecessary tool call is a few hundred milliseconds; the cost of a hallucinated answer is trust.

What about when tools contradict each other?

Instruct the model to surface the contradiction rather than pick. "If two tools return conflicting information, state both findings and explain which appears more authoritative and why, or ask the user to resolve it." This matters for research agents pulling from multiple sources — hiding contradictions to produce a clean answer is a worse failure than flagging them.

How do I debug a tool-use run that went sideways?

Look at the trace turn by turn. Which tool was called? What arguments? What did it return? What did the model say next? Most bugs are visible in one specific transition — a result comes back, the next turn ignores it. From there, the fix is usually a prompt-level instruction about how to handle that case. The agent debugging prompts guide covers the trace-reading workflow in detail.

Wrap-Up

Tool use works across every frontier model in 2026, but reliable tool use comes from the prompt. Good descriptions with explicit negatives stop wrong-tool errors. Tool-forcing stops stale-answer errors. Error and result protocols stop silent failures. Budget and phase constraints stop drift. The whole set fits in a system prompt under 300 words, and each line corresponds to a failure mode you would otherwise debug in production. For the broader picture see the complete guide to prompting AI coding agents; for adjacent patterns see ReAct prompting and multi-agent prompting.
