Aider Polyglot
Aider Polyglot is a multi-language coding benchmark, originated by the Aider open-source project, that evaluates a model's ability to solve Exercism exercises in six languages: C++, Go, Java, JavaScript, Python, and Rust. Each task hands the model a problem statement and stub files; success requires editing those files so that the exercise's unit tests, which the model does not see up front, pass. The harness allows a second attempt after showing the model the failing test output, and the headline score counts tasks solved within those two tries. Because tasks span multiple languages, the benchmark surfaces cross-language edit accuracy and instruction-following in a way that Python-only benchmarks like SWE-Bench cannot. It is one of the standard reference benchmarks cited in coding-agent evaluations alongside SWE-Bench Verified and Terminal-Bench.
Example
An agent receives an Exercism-style task in Rust along with stub files; the test file is withheld. It must produce edits that compile cleanly and pass every assertion when the test file is run. The harness repeats this across 225 tasks spanning all supported languages, then reports the overall pass rate, which can also be broken down per language. A model that excels in Python but stumbles on Rust borrow-checker edits shows that asymmetry directly in the per-language breakdown.
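The per-language breakdown described above can be sketched in a few lines of Python. This is an illustrative sketch, not the Aider harness itself: the function name `pass_rates`, the `(language, passed)` record shape, and the sample data are all assumptions made for the example.

```python
from collections import defaultdict


def pass_rates(results):
    """Compute per-language and overall pass rates.

    `results` is an iterable of (language, passed) pairs, one per task.
    This record shape is hypothetical, not the benchmark's real output format.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for language, passed in results:
        totals[language] += 1
        passes[language] += int(passed)

    per_language = {lang: passes[lang] / totals[lang] for lang in totals}
    task_count = sum(totals.values())
    overall = sum(passes.values()) / task_count if task_count else 0.0
    return per_language, overall


# Made-up results: strong on Python, weaker on Rust.
sample = [
    ("python", True),
    ("python", True),
    ("rust", False),
    ("rust", True),
    ("go", True),
]
per_language, overall = pass_rates(sample)
```

With this made-up sample, the overall rate hides the Rust weakness that the per-language dictionary makes visible, which is the asymmetry the example is meant to illustrate.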