
Terminal-Bench

Terminal-Bench is an evaluation benchmark for AI agents that measures their ability to complete long-horizon, multi-step shell tasks: git operations, build and test loops, file manipulation, system configuration, and recovery from intermediate errors. It targets the kinds of jobs terminal-native coding agents are asked to do in practice, where success depends on reliable tool use and on tracking state across many steps rather than on single-turn reasoning. Terminal-Bench complements code-patch benchmarks like SWE-Bench by exercising the surrounding environment, not just the diff. Tasks typically run in disposable containers and are scored on whether the agent produces the specified end state, not on how many turns or tokens it took to get there.
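
To make that scoring model concrete, here is a minimal sketch of a binary end-state check in Python. The function name, its parameters, and the Docker-based setup are illustrative assumptions, not Terminal-Bench's actual harness API; real harnesses define their own task format.

```python
import subprocess

def score_end_state(container_id: str, check_cmd: str, timeout_s: int = 300) -> bool:
    """Hypothetical sketch: run a task's verification command inside the
    container the agent worked in. Scoring is binary and depends only on
    the resulting state, not on how many turns or tokens the agent spent.
    """
    result = subprocess.run(
        ["docker", "exec", container_id, "bash", "-lc", check_cmd],
        capture_output=True,
        timeout=timeout_s,
    )
    # Exit code 0 from the check command means the required end state holds.
    return result.returncode == 0
```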

Example

A Terminal-Bench task instructs an agent to clone a specific repository into a fresh container, install its dependencies, run the test suite, identify why one test fails, and produce a fix. The task succeeds only if the agent's changes land on disk and the previously failing test now passes when re-run. Failures typically stem from a mistyped shell command, a wrong working directory, or an agent that gives up before the test loop converges.
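
For a task like this, the end-state check might simply re-run the once-failing test against whatever the agent left on disk. Every concrete name below, the container, repository path, and test identifier, is a hypothetical stand-in for illustration; a real task pins its own.

```python
import subprocess

# Hypothetical identifiers; the actual task defines its own container,
# repository location, and failing test.
CONTAINER = "tbench-example-task"
CHECK_CMD = "cd /app/repo && python -m pytest tests/test_parser.py::test_roundtrip"

# Re-running the test against the files on disk captures both success
# conditions: the fix was actually written, and the test now passes.
result = subprocess.run(
    ["docker", "exec", CONTAINER, "bash", "-lc", CHECK_CMD],
    capture_output=True,
    timeout=600,
)
print("task passed" if result.returncode == 0 else "task failed")
```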
