
Tool Use Prompting Patterns: Getting Reliable Tool Calls (2026)

Prompt patterns that make tool use reliable — clear tool descriptions, tool-forcing vs tool-permitting, error recovery, and handling malformed arguments.

SurePrompts Team
April 20, 2026
12 min read

TL;DR

Tool use as a capability is universal, but reliable tool use requires prompt patterns — clear tool descriptions, explicit guidance on when to call vs answer directly, and error recovery when calls fail.

Tool use — also called function calling — is supported by every major frontier model in 2026. Claude, GPT, and Gemini can all receive a list of tools, decide when to call one, emit structured arguments, and incorporate the result. That is the capability. Reliable tool use is a different problem: picking the right tool, calling it at the right time, passing valid arguments, handling errors, and stopping when done. Almost all of that is controlled by the prompt. This guide covers the patterns that turn raw capability into reliable behavior — see the tool-use glossary entry for the compact definition.

What Tool Use Means

At the capability level, tool use works roughly the same across providers. You give the model a system prompt and a list of tool schemas — each with a name, a description, and a parameter definition. When the model decides a tool is needed, instead of answering the user it emits a structured call with the tool name and argument values. Your harness executes the tool, captures the result, and sends it back to the model as a follow-up turn. The model then either calls another tool or replies.

The model does not run the tool itself — it only emits the call. Everything else (executing, returning results, looping) is the harness's job. That means the prompt is where the decision-making happens: which tool, when, with what arguments, and when to stop. For the reasoning loop that wraps this capability, see ReAct prompting.
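The loop described above can be sketched in a few lines. This is a minimal illustration with a stubbed model standing in for a real provider API call; the `get_time` tool, the message shapes, and `fake_model` are hypothetical, not any specific provider's format.

```python
def get_time(city):
    # Hypothetical tool implementation the harness executes.
    return {"city": city, "time": "14:05"}

TOOLS = {"get_time": get_time}

def fake_model(messages):
    # Stub standing in for a provider API call: request the tool once,
    # then answer from its result on the following turn.
    last = messages[-1]
    if last["role"] == "tool":
        return {"type": "text",
                "text": f"It is {last['content']['time']} in {last['content']['city']}."}
    return {"type": "tool_call", "name": "get_time", "args": {"city": "Oslo"}}

def run(user_message, max_turns=5):
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_turns):           # hard cap: the harness owns the loop
        reply = fake_model(messages)
        if reply["type"] == "text":      # model chose to answer: stop
            return reply["text"]
        tool = TOOLS[reply["name"]]      # model emitted a call: execute it
        result = tool(**reply["args"])
        messages.append({"role": "tool", "content": result})
    return "Turn budget exhausted."

print(run("What time is it in Oslo?"))   # It is 14:05 in Oslo.
```

Note that the model function only ever returns a call or a text reply; every side effect happens in the harness loop.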

Why Tool Use Fails

Most failed tool-use runs trace back to one of five failure modes. They are worth naming because each has a different prompt-level fix.

| Failure mode | What it looks like | Typical cause |
|---|---|---|
| Wrong tool | Model calls search when it should have called read_file | Overlapping tool descriptions; no guidance on when to prefer one |
| No tool | Model answers from its training data when it should have looked up current data | System prompt does not require tool use for that class of question |
| Malformed arguments | JSON is invalid, required field missing, type wrong | Schema unclear; no example in the tool description |
| Infinite loop | Model keeps calling the same tool forever | No stop condition; tool error treated as "try again" |
| Trusted hallucinated result | Tool returns nothing useful, model invents content anyway | No instruction about what to do on empty or error results |

All five are prompt-level problems, not model-level. A smarter model reduces the rate but does not remove the failure mode. Prompt patterns do.

Writing Good Tool Descriptions

The tool description is where most bugs live. The model uses the description to decide whether to call the tool at all, and to distinguish it from other tools. A vague description breaks both decisions.

A good description has five parts:

  • Name. Short, verb-first, distinctive. search_docs beats docs.
  • Purpose. One sentence on what the tool does and what problem it solves, written from the model's perspective: "Use this when the user asks for…"
  • Parameters. Each one named, typed, described. Required vs optional marked.
  • Example use. One concrete sample argument set and what the tool returns. Short.
  • What it does NOT do. An explicit negative. "Does not search the web. Use web_search for that." This line prevents most wrong-tool errors.

The negative is the part people skip, and it is often the most useful. When the model has two tools whose descriptions both sound plausible, it picks essentially at random. An explicit "this is not that" line forces the distinction.
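Put together, a five-part description might look like the sketch below. The schema shape loosely follows the JSON-Schema-style parameter definitions most provider APIs accept; the search_docs tool and its fields are hypothetical, not any specific provider's exact format.

```python
# A tool schema following the five-part checklist above (hypothetical tool).
search_docs = {
    "name": "search_docs",                       # short, verb-first, distinctive
    "description": (
        "Search the internal documentation index and return matching pages. "
        "Use this when the user asks about internal processes or policies. "
        "Example: search_docs(query='vacation policy', limit=3) -> "
        "[{'doc_id': 'doc_1042', 'title': 'Time Off', 'snippet': '...'}]. "
        "Does NOT search the public web. Use web_search for that."  # explicit negative
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Keywords to search for."},
            "limit": {"type": "integer", "description": "Max results. Default 5."},
        },
        "required": ["query"],                   # required vs optional marked
    },
}
```

The description packs purpose, example, and negative into one string because that string is all the model sees when choosing between tools.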

Tool-Forcing vs Tool-Permitting

Most provider APIs expose three tool-use modes:

  • Auto. The model decides whether to call a tool or answer directly. Default.
  • Any / required. The model must call a tool (not answer in text). Lets it pick which one.
  • Specific tool. The model must call a named tool. No choice.

The prompt-level equivalents — useful as reinforcement or when the provider does not expose modes — are:

  • Tool-permitting language. "You may use these tools if helpful." Leaves it to the model.
  • Tool-forcing language. "You must call one of these tools before responding." Requires a call.
  • Phase-scoped forcing. "In the research phase, you must call search at least once. In the writing phase, do not call any tools." Different regimes at different stages.

Forcing is right when the question cannot be answered from training data (time-sensitive, user's own files, fresh lookup). Permitting is right when the model might know the answer and a tool call would add latency without value. The mistake is defaulting to auto for everything. Auto lets the model guess, and for questions where a stale answer is worse than a slow one, that guess is often wrong.
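One way to operationalize this is to pick the mode per question class instead of defaulting to auto. The helper below is a hypothetical sketch; the actual field names differ by provider (for example, OpenAI's tool_choice accepts "auto", "required", or a named function, while Anthropic's accepts auto, any, or a specific tool).

```python
def tool_choice_for(question_kind):
    # Map the question class to a tool mode, rather than using auto everywhere.
    # Mode names here are illustrative, not a real provider's API.
    if question_kind == "time_sensitive":
        return {"mode": "required"}                   # must call some tool
    if question_kind == "user_files":
        return {"mode": "tool", "name": "read_file"}  # must call this tool
    return {"mode": "auto"}                           # model decides
```

The point is that the mode is a per-request decision the harness makes, not a global setting.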

Handling Tool Errors in Prompts

Tools fail. A search returns no results, an API rate-limits, a file does not exist, a query is malformed. The model sees the error message in the next turn and has to decide what to do. Without guidance, it usually does one of two things, both wrong: retry the same call forever, or silently invent a result and keep going.

The prompt pattern is a short error protocol in the system prompt:

```
When a tool call returns an error:
1. Read the error message — do not ignore it.
2. If the error is transient (timeout, rate limit), retry once with the same args.
3. If the error is about your arguments (invalid format, missing field),
   fix the arguments and retry once.
4. If the error persists or indicates the resource does not exist, stop
   calling that tool and tell the user what went wrong. Do not invent
   a result.
```

Four short rules in the system prompt, and the model stops looping on errors. The retry cap matters — without it, the model will keep retrying with tiny variations for as many turns as the harness allows.
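The same retry cap can be enforced harness-side as a backstop, refusing to forward a call the model has already made identically too many times. A minimal sketch, assuming the harness inspects each call before executing it; make_repeat_guard is a hypothetical helper:

```python
import json

def make_repeat_guard(max_repeats=2):
    seen = {}
    def allow(name, args):
        # Canonical identity for a call: tool name plus sorted-key args.
        key = (name, json.dumps(args, sort_keys=True))
        seen[key] = seen.get(key, 0) + 1
        return seen[key] <= max_repeats
    return allow

allow = make_repeat_guard(max_repeats=2)
```

If `allow` returns False, the harness can inject a synthetic error result telling the model to stop and report, mirroring step 4 of the protocol.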

Handling Tool Results in Prompts

The other side of the same problem is what to do with successful tool results. By default the model treats tool output as gospel, which is a problem when the tool is search (results may be irrelevant), retrieval (chunks may be low-quality), or scraping (pages may be wrong). A good prompt forces the model to evaluate results before using them.

Three lines that help:

```
When a tool returns results:
- Read them carefully. Do not assume they answer the question.
- If the results are empty, irrelevant, or contradictory, say so and
  call the tool again with different args OR ask the user for
  clarification. Do not fill in the gap from memory.
- When you cite a result in your answer, reference which tool call it
  came from (e.g. "search result 2").
```

The citation requirement is the quiet heavy-lifter. When the model has to cite which tool call each claim came from, it cannot smoothly merge tool output with hallucinated content — the citation exposes the gap.

Constraining Tool Use Deliberately

Reliability also comes from limiting tool use. A few constraints worth applying:

  • Budget limits. "Use at most 5 tool calls for this task." Prevents drift on open-ended questions. The model learns to plan its calls rather than scatter them.
  • Phase-specific tools. "In the planning phase, only read_file and grep are allowed. In the implementation phase, edit_file is added. In the verification phase, only run_tests is allowed." Matches plan-and-execute prompting where different phases need different capabilities.
  • Order constraints. "You must call read_file on the file before calling edit_file on it." Prevents blind edits to files the model has not looked at.
  • Forbidden combinations. "Do not call delete_file in the same turn as create_file on the same path. If you need to rewrite a file, use edit_file." Stops destructive patterns.

These are brittle — a stubborn model will sometimes break the rule — but they catch most violations and make the rest visible in traces.
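The same rules can also be checked harness-side, so a violation fails loudly in the trace instead of slipping through. A minimal sketch covering the budget and order constraints, using the hypothetical tool names from the bullets above:

```python
class ConstraintChecker:
    """Harness-side checks mirroring the prompt-level constraints (sketch)."""

    def __init__(self, budget=5):
        self.budget = budget
        self.calls = 0
        self.files_read = set()

    def check(self, name, args):
        self.calls += 1
        if self.calls > self.budget:                  # budget limit
            return "budget exceeded"
        if name == "edit_file" and args["path"] not in self.files_read:
            return "must read_file a path before edit_file"  # order constraint
        if name == "read_file":
            self.files_read.add(args["path"])
        return "ok"

checker = ConstraintChecker(budget=5)
```

A non-"ok" result can be fed back to the model as a tool error, which keeps the correction inside the normal error protocol.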

A Good Tool-Use Setup — Hypothetical Example

Here is an illustrative system prompt for a research agent with three tools. The specifics are hypothetical and meant to show the shape of a well-formed setup, not a production spec.

```
You are a research assistant. You answer questions using the tools below.
You must call at least one tool before answering — do not answer from memory
on topics where current data matters.

Tools:

1. web_search(query: string) -> list of {title, url, snippet}
   Purpose: Find recent articles, news, or documentation on the open web.
   Use when the user asks about current events, recent releases, or
   general public information.
   Example: web_search("Anthropic Claude 4.7 release date")
   Does NOT: search private documents or the user's files.

2. read_doc(doc_id: string) -> string
   Purpose: Retrieve the full text of an internal document by ID.
   Use when the user references a specific internal doc or when a
   web_search result links to an internal doc_id (format: "doc_XXXX").
   Example: read_doc("doc_1042")
   Does NOT: search — you must already have the doc_id.

3. finish(answer: string) -> terminates the loop
   Purpose: Emit the final answer and stop.
   Use ONLY when you are confident the answer is grounded in tool results
   you have just seen.

Error protocol:
- On error, read the message. Retry once if transient or if your args
  were wrong. Otherwise stop and report the failure to the user.

Result protocol:
- Cite the tool call each claim came from ("from web_search result 2").
- If results are empty or irrelevant, call the tool again with different
  args or ask the user for clarification. Do not invent.

Budget: at most 6 tool calls per question.
```

This setup does six things: it enumerates tools with explicit negatives, forces at least one tool call, defines an explicit terminator, sets an error protocol, sets a result protocol, and caps the budget. Each one corresponds to a failure mode from the table above.

Common Anti-Patterns

  • Vague tool descriptions. "This tool searches things." The model cannot tell it apart from other tools. Fix: specify purpose, example, and what it does NOT do.
  • Too many tools. A 40-tool toolbox overwhelms selection. Fix: group by phase (planning tools vs execution tools) and expose only the phase-relevant subset.
  • No stop condition. No finish tool, no budget, no "when done, reply to the user." Model loops. Fix: always define an explicit terminator.
  • Trusting every result. No instruction to evaluate results before using. Model merges garbage into its answer. Fix: require citation and evaluation in the system prompt.
  • Single-mode thinking. Setting everything to auto tool choice regardless of whether the question needs fresh data. Fix: use tool-forcing for time-sensitive questions, tool-permitting for general ones.
  • Ignoring errors. No error protocol, so the model retries the same broken call until it runs out of turns. Fix: four-line error protocol with a retry cap.

For agents that loop over tool calls with reasoning in between, these patterns compose with multi-agent prompting — tool descriptions and protocols become part of what each agent sees.

FAQ

How many tools is too many?

No hard number, but selection accuracy degrades noticeably past around 10–15 tools in a single prompt based on common practitioner reports. Past that point, either scope the tools to the phase (research tools in research, execution tools in execution), or group tools under a dispatcher — a single do(action, params) tool with action names instead of individual tools. The dispatcher shifts the decision into the arguments, which the model handles differently than tool selection.
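The dispatcher idea can be sketched as a single routing function. The do tool, its action names, and the handlers below are hypothetical:

```python
def _search(params):
    # Hypothetical handler behind the dispatcher.
    return f"searched for {params['query']}"

def _read(params):
    return f"read {params['doc_id']}"

ACTIONS = {"search": _search, "read": _read}

def do(action, params):
    # The single tool the model sees; the harness routes on `action`.
    handler = ACTIONS.get(action)
    if handler is None:
        # Return the valid actions so the model can self-correct.
        return f"unknown action '{action}'; valid: {sorted(ACTIONS)}"
    return handler(params)
```

The unknown-action branch matters: returning the valid action list turns a selection mistake into a recoverable error rather than a dead end.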

Should tool descriptions be long or short?

Long enough to disambiguate, short enough to not bloat the prompt. A useful shape is: one-sentence purpose, parameter list, one example, one-sentence negative. Five to ten lines per tool. Avoid paragraphs of marketing about what the tool can do.

When should I use tool-forcing over tool-permitting?

Force a tool call when answering without the tool risks stale or invented data — anything about current events, specific user files, live database state, or external systems. Permit when the model might genuinely know the answer and a tool call would add latency without value. When in doubt, force — the cost of an unnecessary tool call is a few hundred milliseconds; the cost of a hallucinated answer is trust.

What about when tools contradict each other?

Instruct the model to surface the contradiction rather than pick. "If two tools return conflicting information, state both findings and explain which appears more authoritative and why, or ask the user to resolve it." This matters for research agents pulling from multiple sources — hiding contradictions to produce a clean answer is a worse failure than flagging them.

How do I debug a tool-use run that went sideways?

Look at the trace turn by turn. Which tool was called? What arguments? What did it return? What did the model say next? Most bugs are visible in one specific transition — a result comes back, the next turn ignores it. From there, the fix is usually a prompt-level instruction about how to handle that case. The agent debugging prompts guide covers the trace-reading workflow in detail.

Wrap-Up

Tool use works across every frontier model in 2026, but reliable tool use comes from the prompt. Good descriptions with explicit negatives stop wrong-tool errors. Tool-forcing stops stale-answer errors. Error and result protocols stop silent failures. Budget and phase constraints stop drift. The whole set fits in a system prompt under 300 words, and each line corresponds to a failure mode you would otherwise debug in production. For the broader picture see the complete guide to prompting AI coding agents; for adjacent patterns see ReAct prompting and multi-agent prompting.
