Tau-bench

Tau-bench is an agent evaluation benchmark that tests tool-use accuracy across multi-turn customer-service-style tasks. It measures whether an agent reliably calls the right tools, passes valid arguments, and reaches the right outcome across realistic flows.

How it works

  1. The benchmark simulates customer-service tasks with multiple turns of interaction, where each turn may require a tool call, a clarification, or a final answer.

  2. Tool calls are validated against expected schemas and argument values, not just the surface text of the response.

  3. Evaluation runs across the full multi-turn trajectory so a single missed tool call or wrong argument propagates and lowers the score.

  4. Final scores reflect end-to-end task success rate, which is a tighter signal of production reliability than single-step accuracy on isolated function-call prompts.
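The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual harness: the `ToolCall` type, the function names, and the sample tools (`get_order`, `issue_refund`) are all hypothetical, and the strict in-order comparison is an assumption made to keep the idea visible.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """Hypothetical record of one tool call: tool name plus its arguments."""
    name: str
    args: dict


def call_matches(expected: ToolCall, actual: ToolCall) -> bool:
    # A call counts only if both the tool name and every argument value match,
    # regardless of how the surrounding response text is worded.
    return expected.name == actual.name and expected.args == actual.args


def task_succeeded(expected: list[ToolCall], actual: list[ToolCall]) -> bool:
    # Judge the whole trajectory: one missing or wrong call fails the task.
    if len(expected) != len(actual):
        return False
    return all(call_matches(e, a) for e, a in zip(expected, actual))


def success_rate(results: list[bool]) -> float:
    # End-to-end score: the fraction of tasks completed without any error.
    return sum(results) / len(results) if results else 0.0


# Illustrative data only (assumed refund flow).
expected = [
    ToolCall("get_order", {"order_id": "A1"}),
    ToolCall("issue_refund", {"order_id": "A1", "amount": 20}),
]
good_run = [
    ToolCall("get_order", {"order_id": "A1"}),
    ToolCall("issue_refund", {"order_id": "A1", "amount": 20}),
]
bad_run = [
    ToolCall("get_order", {"order_id": "A1"}),
    ToolCall("issue_refund", {"order_id": "A1", "amount": 25}),  # wrong argument
]

results = [task_succeeded(expected, good_run), task_succeeded(expected, bad_run)]
score = success_rate(results)
```

Here the second run gets the first call right and still fails the task, because the refund amount is wrong. That is the sense in which trajectory-level scoring is stricter than grading each call on its own.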

Example

When teams talk about an agent being "Tau-bench reliable," they mean its multi-step tool-calling holds up across the kinds of branching customer-service flows the benchmark simulates — not just one-shot function calls in isolation.
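A rough back-of-the-envelope calculation shows why one-shot accuracy can overstate multi-step reliability. The sketch below assumes every step succeeds independently with the same probability, which is a simplification for illustration and not something the benchmark itself assumes; the function name and the numbers are hypothetical.

```python
def end_to_end_success(per_step_accuracy: float, steps: int) -> float:
    # Under the independence assumption, the task succeeds only if every
    # step succeeds, so the per-step probabilities multiply.
    return per_step_accuracy ** steps


# An agent that is right 95% of the time on a single call, run over a
# hypothetical 8-step flow, completes the whole flow about 66% of the time.
estimate = end_to_end_success(0.95, 8)
```

Real flows rarely have independent, equally difficult steps, so treat this as intuition for why errors compound and not as a prediction of any benchmark score.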
