SWE-Bench
SWE-Bench is an evaluation benchmark from researchers at Princeton University and the University of Chicago that measures an AI agent's ability to resolve real GitHub issues by producing patches that the affected project's test suite accepts. Each task pins a repository at a specific commit and pairs it with an issue description; an agent must produce a code change such that, when applied, the held-out tests associated with the issue pass while the project's existing tests continue to pass. SWE-Bench Verified is a human-validated subset in which the issue is genuinely solvable and the tests genuinely map to it. The benchmark is widely cited as a capability proxy for coding agents on real-world Python codebases. Its limitations include a Python-heavy composition, possible overlap with model training data, and limited coverage of frontend or architectural work.
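To make the task structure concrete, the sketch below loads one task from the publicly released dataset and prints its key fields. It assumes the Hugging Face datasets library and the princeton-nlp/SWE-bench_Verified dataset name; the field names shown reflect the public release and should be checked against the current listing.

```python
# Minimal sketch: inspect one SWE-Bench Verified task instance.
# Assumes the Hugging Face "datasets" library and the public dataset name.
from datasets import load_dataset

tasks = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

task = tasks[0]
print(task["repo"])               # repository under test, e.g. "django/django"
print(task["base_commit"])        # commit the repository is pinned to
print(task["problem_statement"])  # the GitHub issue text handed to the agent
print(task["FAIL_TO_PASS"])       # held-out tests that must pass after the patch
print(task["PASS_TO_PASS"])       # existing tests that must keep passing
```

The agent never sees the FAIL_TO_PASS tests; it works only from the issue text and the repository checkout.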
Example
A SWE-Bench task hands an agent a Django repository pinned at a specific commit, along with an issue describing a regression in a date-parsing utility. The agent reads the relevant files, locates the bug, and submits a patch. The harness applies the patch, runs the held-out tests for that issue, and records pass or fail. Aggregated across the 500-task Verified set, the agent's score is the percentage of issues whose tests pass after its patch is applied.
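The loop below is a simplified illustration of that scoring flow, not the real harness, which runs each task in an isolated container with a per-repository environment. The apply_patch and run_tests helpers and the task dictionary keys are hypothetical names used for the sketch.

```python
# Illustrative scoring loop: apply each agent patch, run the held-out tests,
# and report the fraction of resolved tasks. Helper names are hypothetical.
import subprocess

def apply_patch(repo_dir: str, patch: str) -> bool:
    """Apply the agent's unified diff to the checked-out repository."""
    proc = subprocess.run(
        ["git", "apply", "-"], cwd=repo_dir, input=patch,
        text=True, capture_output=True,
    )
    return proc.returncode == 0

def run_tests(repo_dir: str, test_ids: list[str]) -> bool:
    """Run the held-out tests; succeed only if all of them pass."""
    proc = subprocess.run(
        ["python", "-m", "pytest", *test_ids],
        cwd=repo_dir, capture_output=True,
    )
    return proc.returncode == 0

def score(tasks: list[dict]) -> float:
    """Percentage-style resolve rate across a list of task records."""
    resolved = 0
    for task in tasks:
        if apply_patch(task["repo_dir"], task["model_patch"]) and \
           run_tests(task["repo_dir"], task["fail_to_pass"]):
            resolved += 1
    return resolved / len(tasks)
```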