Key takeaways:
- Public benchmarks for AI coding agents describe capability ceilings on specific kinds of work. They do not predict how an agent will perform on your codebase, your build system, or your team's review standards.
- The three benchmarks worth knowing in 2026 are SWE-Bench Verified (issue resolution on real Python repos), Aider Polyglot (cross-language edit accuracy with hidden tests), and Terminal-Bench (long-horizon shell work). Each measures a different slice; none measures everything.
- The honest framing is "public benchmarks are a filter, internal evals are the verdict." Use public scores to narrow a shortlist; use an internal eval against your own tasks to make the actual decision.
- Coding-agent evals are structurally different from chat evals — multi-turn, tool-using, environment-stateful — so the right unit of measurement is the final state of a repository, not the agent's last message.
- Be skeptical of small score differences. Harness differences, test contamination, and "teach to the test" optimization make a few-point gap roughly meaningless.
- The internal eval pattern is small (ten to fifty tasks), real (drawn from your own issue tracker), and reproducible (same access, same prompt template, scored by automated checks plus an LLM-as-judge or human review).
- Re-run on three triggers — model upgrade, scaffolding change, and a monthly calendar tick — and track regressions by category, not as a single average.
Most teams choosing an AI coding agent look at a leaderboard, pick whichever model is on top this month, and wonder six weeks later why the agent that looked best on paper keeps producing diffs nobody wants to merge. The leaderboard is not lying — it is answering a different question than the one the team needed to ask. Public benchmarks tell you what an agent can do on a curated slice of public code with a specific harness. Whether it can do the work your team actually ships is a separate measurement, and you have to make it yourself.
This is a working-engineer reference for that measurement. It covers the three public benchmarks worth understanding in 2026 and walks through the internal-eval discipline that turns "this model scored well" into "this model is the right default for our team." It pairs with the pillar guide on prompting AI coding agents and the Prompt Evaluation Complete Guide.
Tip
Public benchmarks rank capability ceilings; internal evals predict shipped quality. Use both — but trust internal evals more.
Why coding-agent evals are different from chat evals
A chat eval scores a single response. The model sees a prompt, produces text, and a judge decides whether it was good. That is the right shape for chatbot, RAG, and summarization evaluation — one input, one output, a verdict.
A coding-agent eval cannot work that way. The agent does not produce one response; it runs a loop. It reads files, runs tests, writes a patch, reads the test output, edits the patch, runs tests again, and only stops when it thinks it is done. The thing you want to score is not the agent's final message — the agent might say "done!" while leaving the repository broken. You want to score the final state of the repository. Did the tests pass? Did the diff stay focused? Are types still clean? Did adjacent code break? Those are properties of the environment after the loop runs, not properties of any one message inside it.
This is why agentic coding needs its own evaluation discipline. The unit of measurement is a task with a starting state, a goal, and an outcome verified by running checks against the resulting environment. The model is not the only thing under test — tool definitions, the system prompt, the iteration limit, and how errors are surfaced all change the outcome. A weaker model with a better harness routinely beats a stronger model with a worse one, and a benchmark that does not control for harness is hard to interpret.
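To make that concrete, here is a minimal sketch of one task as a unit of measurement. The specific commands (a per-task test command, mypy for types) are assumptions; substitute whatever your project actually runs.

```python
import subprocess
from dataclasses import dataclass

@dataclass
class AgentTask:
    repo_path: str        # checkout of the repo at the task's starting commit
    goal: str             # the issue text or task description given to the agent
    test_cmd: list[str]   # command whose exit code verifies the goal

def score_final_state(task: AgentTask) -> dict:
    """Score the repository after the agent's loop finishes, not its last message."""
    tests = subprocess.run(task.test_cmd, cwd=task.repo_path)
    types = subprocess.run(["mypy", "."], cwd=task.repo_path)  # assumed type checker
    diff = subprocess.run(["git", "diff", "--shortstat"], cwd=task.repo_path,
                          capture_output=True, text=True)
    return {
        "tests_pass": tests.returncode == 0,
        "types_clean": types.returncode == 0,
        "diff_summary": diff.stdout.strip(),  # e.g. "3 files changed, 42 insertions(+)"
    }
```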
This maps naturally onto how engineers already think about software quality. We do not score code by reading the commit message; we score it by running the tests, looking at the diff, and asking whether a reviewer would merge it. A coding-agent eval is the same — we just automate what can be automated and structure the rest.
SWE-Bench (and SWE-Bench Verified)
SWE-Bench is the most-cited coding-agent benchmark and the one most likely to come up in a vendor pitch. It was built by taking real GitHub issues from popular open-source Python projects (the original corpus drew from a dozen repositories including Django, sympy, and scikit-learn), recording the repository state at the commit before each issue was fixed, and packaging each as a task. The agent gets the repository, the issue text, and a working test environment, and must produce a patch that resolves the issue.
The grading is unambiguous. Each task has "fail-to-pass" tests that must turn green and "pass-to-pass" tests that must stay green. A task is solved only when both conditions hold. There is no LLM-as-judge at the grading layer — it is all the project's own pytest output.
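A sketch of that grading rule, with illustrative function and field names; the real harness also pins each task's environment, and this only shows the pass/fail logic.

```python
import subprocess

def run_tests(repo: str, test_ids: list[str]) -> bool:
    """True only if every named test passes."""
    return subprocess.run(["python", "-m", "pytest", *test_ids], cwd=repo).returncode == 0

def task_solved(repo_after_patch: str, fail_to_pass: list[str], pass_to_pass: list[str]) -> bool:
    # Solved only if the targeted tests turn green AND the previously green tests stay green.
    return (run_tests(repo_after_patch, fail_to_pass)
            and run_tests(repo_after_patch, pass_to_pass))
```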
This is what makes SWE-Bench credible: real inputs, mechanical grading, binary per-task verdicts. The score measures end-to-end issue-resolution capability on real Python codebases — read the issue, find the relevant code, write a fix, verify it doesn't break adjacent tests.
What SWE-Bench Verified adds
The original SWE-Bench shipped with rough edges. Some tasks had flaky test infrastructure, some issues lacked the information needed to solve them from text alone, and some fail-to-pass tests did not actually verify what the issue was complaining about. Two agents with identical reasoning could get different scores depending on which noisy tasks fell their way.
SWE-Bench Verified is a human-curated subset where annotators kept only the tasks where the test mapping is clean, the issue is solvable from the text, and the infrastructure runs reliably. It is the version most credible reporting uses in 2026 and the one to default to when comparing models. Original and Verified are not directly comparable, so when a vendor quotes a SWE-Bench number with no qualifier, the version matters.
What SWE-Bench does not measure
The corpus is Python. If your codebase is TypeScript, Go, Rust, or anything else, SWE-Bench scores tell you almost nothing about how an agent will perform on your work. Cross-language transfer is real but uneven, and the harness assumptions (pip environments, pytest, Python-OSS project layouts) bake in.
The tasks are bug fixes and small features, not architecture or refactors. An agent that crushes SWE-Bench may still produce poor design decisions on greenfield work. The tasks are also single-repo and self-contained — nothing in the dataset measures cross-service work, infrastructure changes, or refactors touching hundreds of files.
The dataset is public. The issues, test patches, and canonical fixes are all on the open web, and there is a real concern some have made it into training data. Vendors who report Verified scores generally apply contamination filters, but the concern does not disappear. Treat any SWE-Bench number as a noisy estimate.
Aider Polyglot
Aider Polyglot is a different shape of benchmark. Where SWE-Bench measures end-to-end issue resolution in one language, Aider Polyglot measures cross-language edit accuracy across six languages — Python, JavaScript, Go, Rust, C++, and Java — using Exercism-style coding problems. Each problem has a hidden test suite the agent does not see. The agent gets a problem description and a stub file, and must edit the file so the hidden tests pass.
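Reduced to a sketch, one problem looks roughly like this. The `agent_edit` callable stands in for whatever call asks the model to rewrite the stub; it is not part of Aider's actual API.

```python
import subprocess
from pathlib import Path

def run_polyglot_problem(stub_file: str, problem_text: str,
                         hidden_test_cmd: list[str], agent_edit) -> bool:
    """agent_edit: callable that asks the model to rewrite the stub for the given problem."""
    stub = Path(stub_file)
    edited = agent_edit(problem_text, stub.read_text())
    stub.write_text(edited)                        # the agent sees the stub, never the tests
    return subprocess.run(hidden_test_cmd).returncode == 0
```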
The eval runs through the Aider tool itself, which is meaningful. Scores capture not just the model's reasoning but how well it works with Aider's edit format and conversation pattern. A model great at writing code but poor at producing the diff format Aider expects scores lower than its raw capability suggests. Harness matters in every agent eval, but Aider Polyglot is more transparent about it than most.
What it is good for: comparing models head-to-head on a wide language surface. The same task is posed to every model, graded by the same hidden test, with broader language coverage than SWE-Bench's Python-only corpus. If your team works across multiple languages, this tells you something SWE-Bench cannot.
What it does not measure: production-codebase complexity. The problems are self-contained exercises, not real repos with build systems, test fixtures, and adjacent files. An agent that aces Aider Polyglot may still struggle with a multi-file change that requires understanding how a service is wired together. The benchmark also doesn't measure debugging — you fill in a stub, not repair buggy code — or architectural judgment.
Contamination applies here too. Exercism problems and their canonical solutions are public. Models that have seen the solutions in training will outperform those that have not, on tasks that are essentially memorization. Use these scores as a comparative signal across models tested on the same harness at the same time, not as an absolute measurement.
Terminal-Bench
Terminal-Bench is the newest of the three and the most relevant for agents that operate inside a shell — running commands, reading output, recovering from errors, chaining long sequences of work. Tasks look like what a developer does when they open a terminal: configure a service from a README, debug a broken environment, complete a git workflow with rebasing and conflict resolution, set up a project, run a complex build pipeline, manipulate files at scale.
The harness is genuinely a terminal. The agent issues commands, the system runs them, the agent reads the output, and grading is based on the resulting state — files in the right place, services responding, git history matching expectations. Some tasks have many correct command sequences; only the final state matters.
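A state-based check might look like the sketch below, with made-up task details (the config path, health endpoint, and expected commit are all illustrative). The grader inspects the resulting environment, not the command transcript.

```python
import subprocess
from pathlib import Path

def check_final_state(workdir: str) -> bool:
    """Pass only if the environment ended up in the required state,
    regardless of which command sequence produced it."""
    config_ok = Path(workdir, "app/config.yaml").exists()                  # hypothetical required file
    service_ok = subprocess.run(
        ["curl", "-sf", "http://localhost:8080/health"]).returncode == 0   # hypothetical service check
    head = subprocess.run(["git", "log", "--oneline", "-1"], cwd=workdir,
                          capture_output=True, text=True).stdout
    history_ok = "release v2" in head                                      # hypothetical expected commit
    return config_ok and service_ok and history_ok
```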
This is the eval most relevant to terminal-native agents like Claude Code. It stresses the exact loop those agents run — propose a command, observe its effect, adapt, plan the next step — rather than the produce-a-patch loop SWE-Bench measures. A model great at writing patches but poor at recovering from a failed npm install looks fine on SWE-Bench and miserable on Terminal-Bench, and that gap matters if you're deploying a terminal agent.
What Terminal-Bench measures: tool-use reliability, multi-step shell reasoning, error recovery, and knowing when a task is done. What it doesn't measure: creative coding, architectural judgment, large-diff refactors, or IDE-style editing. It depends heavily on the specific shell tools the harness exposes — strength with one toolset doesn't guarantee strength with another.
Contamination for Terminal-Bench is less worrying right now than for the other two; the tasks involve specific environment configurations rather than canonical solutions on the public web. But any benchmark on the public internet eventually leaks into training corpora.
Reading public benchmarks honestly
Each benchmark above is a slice of capability, not a coronation. A higher Aider Polyglot score does not mean an agent will be better on your gnarly TypeScript monorepo. A higher SWE-Bench Verified score does not mean it will write good React components. A higher Terminal-Bench score does not mean architecturally clean diffs.
A few things to watch for:
Tooling differences. Some scores are reported with elaborate scaffolding — multi-agent setups, test-time search, retry logic, retrieval over the codebase. Others are reported with a minimal harness. A "SWE-Bench Verified score of X%" with a custom 5-agent system tells you about the system; the same number with a vanilla single-pass setup tells you about the model. Compare like with like.
Test contamination. Public datasets eventually leak into training data. Vendors generally apply some filtering, but there is no perfect way to verify a model has not seen a task. Treat benchmark numbers as noisy estimates with an unknown contamination floor. Newer releases tend to be cleaner signal than older ones.
"Teach to the test." When a benchmark drives vendor messaging, vendors tune for it. That tuning may help on adjacent real-world work — or it may produce a model that has implicitly memorized test patterns and underperforms on work the benchmark doesn't cover. A score that climbs sharply on one benchmark while moving little on others is a flag, not a celebration.
Single-number averages. "X% on SWE-Bench Verified" averages across hundreds of tasks. The interesting story is usually inside the average — does the agent crush easy tasks and choke on hard ones? Strong on bug fixes, weak on features? Aggregates hide structure. Per-category breakdowns are more useful than the headline.
The honest summary: public benchmarks are a filter. They tell you which models are in the running. They do not tell you which one to ship.
Building internal evals that actually predict your team's experience
Once a model has cleared the public-benchmark filter, the question that matters is whether it works on your work. The answer is an internal eval — a small, real, reproducible test suite drawn from your own codebase. It is the same shape as the broader Prompt Evaluation Complete Guide discipline, specialized for agentic coding.
Curate a golden set of real tasks
Pull ten to fifty tasks from your issue tracker or git history. Aim for diversity, not volume. A good starting set includes:
- Easy bugs (a typo, a missing null check, a regression caught by a test)
- Harder bugs that required reading multiple files to find
- Small features (add a flag, add a field, add a route)
- Refactors that touch two or three files
- One or two tasks that previously surprised a senior engineer — the kind where a junior would have shipped the wrong fix
Real tasks beat synthetic ones for the same reason real production traffic beats imagined inputs in chatbot evals. Synthetic tasks systematically miss the failures that come from the gap between what an author imagines and what real work looks like. Build your golden set from the work your team has actually done.
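One lightweight way to record such a set is a plain data file in your repo. Every value below (commit hashes, paths, prompts) is illustrative; the point is that each entry pins a real starting state, a check command, and scope limits.

```python
# eval/golden_set.py -- each entry points at a real commit from your own history
GOLDEN_SET = [
    {
        "id": "bug-missing-null-check",
        "category": "easy_bug",
        "start_commit": "abc1234",                 # illustrative: repo state before the fix landed
        "prompt": "Users hit a 500 when a profile has no avatar. Fix it.",
        "check_cmd": ["pytest", "tests/test_profile.py"],
        "max_diff_lines": 200,
        "allowed_paths": ["src/profile/", "tests/"],
    },
    {
        "id": "feature-export-json-flag",
        "category": "small_feature",
        "start_commit": "def5678",                 # illustrative
        "prompt": "Add a --format=json option to the export command.",
        "check_cmd": ["pytest", "tests/test_export.py"],
        "max_diff_lines": 600,
        "allowed_paths": ["src/cli/", "tests/"],
    },
]
```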
Define success the way your reviewers would
For each task, write down what success means. Not a vague description — a checklist a reviewer could apply. A typical set looks like:
- All affected tests pass
- Type checking passes
- The diff is below a size threshold (varies by task; "below 200 lines for a bug fix, below 600 for a feature" is a reasonable starting heuristic)
- The change does not touch files unrelated to the task
- A senior engineer reading the diff would approve it on style and design grounds
The first three are automated. The fourth is mostly automated (file-scope check). The fifth is where LLM-as-judge review or a human reviewer comes in, and it is the part most teams under-build. A patch that passes tests but is structurally embarrassing is still a problem; the eval has to catch that.
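The automated layer can be one function run against the repository the agent leaves behind. This sketch assumes the task records above and a Python toolchain; swap in your own type checker and test runner.

```python
import subprocess

def automated_checks(repo: str, task: dict) -> dict:
    tests_pass = subprocess.run(task["check_cmd"], cwd=repo).returncode == 0
    types_pass = subprocess.run(["mypy", "."], cwd=repo).returncode == 0   # swap in tsc, go vet, etc.
    numstat = subprocess.run(["git", "diff", "--numstat"], cwd=repo,
                             capture_output=True, text=True).stdout
    rows = [line.split("\t") for line in numstat.strip().splitlines()]
    diff_lines = sum(int(a) + int(d) for a, d, _ in rows if a.isdigit() and d.isdigit())
    scoped = all(path.startswith(tuple(task["allowed_paths"])) for _, _, path in rows)
    return {
        "tests_pass": tests_pass,
        "types_pass": types_pass,
        "diff_small_enough": diff_lines <= task["max_diff_lines"],
        "scoped": scoped,
    }
```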
Run candidates through the same harness
Run each candidate agent through the same task list with the same access, prompt template, iteration budget, and tool permissions. This sounds obvious and is hard. Letting "agent A use the system prompt it was tuned for, and agent B use the prompt it was tuned for" produces a comparison where you cannot tell whether the difference is the model or the prompt. Fix the harness as much as possible and let the model be the variable. Where you do change harness elements per agent, write down what and why.
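In practice that means pinning everything but the model in one place. A minimal sketch, where `run_task` is whatever entry point your own harness exposes:

```python
# Everything here is pinned across candidates; only `model` varies per run.
HARNESS = {
    "system_prompt": open("eval/system_prompt.txt").read(),
    "max_iterations": 25,
    "tools": ["read_file", "edit_file", "run_command"],   # illustrative tool names
    "temperature": 0.0,
}

def run_candidate(model: str, tasks: list[dict], run_task) -> list[dict]:
    """run_task: your harness entry point; called identically for every candidate."""
    return [run_task(model=model, task=t, **HARNESS) for t in tasks]
```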
Score with a mix of automated and judged checks
Most of the work should be automated. Tests pass or do not. Types pass or do not. Diff size is a number. File scope is a list. Build those checks once and re-run them on every model change without thinking.
The judgment layer — code-quality review — is where the SurePrompts Quality Rubric and broader scoring come in. Run this with a human reviewer on a sample, an LLM-as-judge across all outputs, or both. The judgment layer is noisier than the automated layer; treat it as one signal, not the decider.
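An LLM-as-judge pass can be one rubric prompt applied to every diff. The sketch below assumes a `call_llm` helper you already have; the axes mirror the reviewer checklist above.

```python
import json

JUDGE_PROMPT = """You are reviewing a diff produced by a coding agent.

Task:
{task}

Diff:
{diff}

Return JSON with integer scores 1-5:
- "scope": does the diff stay on the files the task requires?
- "design": would a senior reviewer approve this on style and design grounds?
- "risk": how likely is this change to break adjacent behavior? (5 = very unlikely)
"""

def judge(task_prompt: str, diff: str, call_llm) -> dict:
    """call_llm: whatever LLM client you already use. Treat the result as one noisy signal."""
    return json.loads(call_llm(JUDGE_PROMPT.format(task=task_prompt, diff=diff)))
```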
The full discipline is closer to a small eval harness than a one-off comparison. Once the scaffolding exists for the first model, running it for the next one is cheap.
Track regressions across upgrades
The most useful artifact this discipline produces is a score history. When a new model version ships, re-run the eval, compare scores by category, and see whether anything regressed. A model upgrade that lifts the average two points and tanks one category by twenty is not a free upgrade — it is a routing decision. The score history is what makes that visible.
The same applies to prompt-scaffolding changes, tool-definition changes, and how errors are surfaced. Anything that touches the agent's loop should run through the eval, and the diff in scores is the signal.
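Tracking this takes little code once each result carries a category label. A sketch, assuming each run produces one record per task with a `category` and a boolean `passed`:

```python
from collections import defaultdict

def by_category(results: list[dict]) -> dict[str, float]:
    buckets: dict[str, list[bool]] = defaultdict(list)
    for r in results:
        buckets[r["category"]].append(r["passed"])
    return {cat: sum(passed) / len(passed) for cat, passed in buckets.items()}

def regressions(old_run: list[dict], new_run: list[dict], drop: float = 0.10) -> dict:
    """Categories where the new run's pass rate fell by more than `drop`."""
    old, new = by_category(old_run), by_category(new_run)
    return {cat: (old[cat], new.get(cat, 0.0))
            for cat in old if old[cat] - new.get(cat, 0.0) > drop}
```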
What to do with the data once you have it
A working internal eval supports a few decisions you couldn't make confidently before:
Model selection. Which agent should be the team default? The eval gives you a defensible answer instead of a vibes-based one. The answer often surprises — a cheaper model with the right scaffolding routinely beats a more expensive one with weaker scaffolding.
Routing. Many teams run multiple models for different work — a fast cheap model for easy bugs, a stronger one for complex refactors, a specialized one for test-driven development with AI coding agents. The per-category breakdown tells you where each model earns its keep.
Regression detection. Vendors update models, and sometimes updates tank a category you care about. Without an eval, you hear about it from your team complaining about diff quality. With one, you find out the morning after rollout.
Prompt-pattern A/B testing. When you change the system prompt, the agentic prompt stack, or how tasks are framed, you can score the change and see whether it actually helped. This is what lets prompt engineering compound instead of churn.
Comparing operating modes. Vibe coding, Cline-style flows, and spec-driven work each stress the agent differently. An eval set that includes tasks from each mode gives a clearer picture than a single task type.
The RCAF framework and other prompt-quality scaffolding sit under these decisions — they are how you keep the prompt side under control while measuring the model side. The eval cares about the system, not just the model.
Common mistakes when running coding-agent evals
A few failure patterns show up often enough to warn about:
The eval set is too small. Five tasks does not produce a stable signal. A model that solves three of five looks like a 60% model; on a different five tasks it could look like 40% or 80%. Ten tasks is the floor for noisy signal, twenty is reasonable, fifty is good for production use.
The eval set leaks into training data. If you build your golden set from public OSS issues, vendors who train on public data may have seen the canonical fixes. Either keep the set private (your own internal codebase) or accept that the score floor is uncertain.
Sloppy success criteria. "Did it look right?" is not a criterion. "Tests pass, types pass, diff under 200 lines, scoped to the affected file, reviewer approves on style" is. Sloppy criteria make the score a vibes meter in a number costume.
The eval doesn't isolate the variable. Comparing two models on different prompts, tool definitions, and environments tells you about the system, not the model. For model selection, fix everything else. For studying a prompt change, fix the model.
Reporting a single average. "Model A scored 62%, Model B scored 67%" hides whether B is uniformly better or crushes one category and tanks another. Always break scores down by task type.
Treating the eval as one-time. A model selected this quarter may not be right next quarter. Provider updates, new releases, and codebase changes all shift the answer. The eval is a regression suite you maintain, not a benchmark you run once.
What to read next
For the broader picture of how to prompt and operate AI coding agents, the pillar guide covers frameworks, patterns, and mode selection. For the deeper evaluation discipline, the Prompt Evaluation Complete Guide is the canonical reference. For test-first workflows that pair particularly cleanly with internal evals, the test-driven development with AI coding agents tutorial is the sibling.
The pattern across all three: stop trusting a single number, start measuring against the work you actually ship, and treat the eval as part of the system you maintain — not a benchmark you read.