Benchmark
A benchmark in AI is a standardized test suite with predefined tasks, datasets, and evaluation metrics used to measure and compare model performance. Because every model is scored on the same tasks with the same metrics, benchmark results are directly comparable across capabilities such as reasoning, coding, math, language understanding, and safety. They enable researchers and practitioners to track progress, identify strengths and weaknesses, and make informed model selection decisions.
Example
The MMLU benchmark tests models with multiple-choice questions across 57 academic subjects, from elementary math to professional law. When a new model scores 86% on MMLU compared to the previous best of 83%, the three-point gain is a concrete, comparable measure of improvement across a wide range of knowledge and reasoning tasks.
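The scoring behind a comparison like this can be sketched in a few lines. The snippet below is only an illustration, not MMLU's actual harness: the four questions, the `accuracy` helper, and the two stand-in "models" (`always_first` and `candidate`) are all hypothetical, chosen to show how one fixed question set and one metric make two scores comparable.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical toy items: (question, answer choices, index of the correct choice).
Item = Tuple[str, Sequence[str], int]

ITEMS: Sequence[Item] = [
    ("2 + 2 = ?", ["3", "4", "5", "6"], 1),
    ("Capital of France?", ["Paris", "Rome", "Oslo", "Bern"], 0),
    ("H2O is commonly called?", ["salt", "air", "water", "iron"], 2),
    ("Largest planet in the Solar System?", ["Mars", "Venus", "Earth", "Jupiter"], 3),
]

# A "model" here is any function that returns the index of its chosen answer.
Model = Callable[[str, Sequence[str]], int]


def accuracy(model: Model, items: Sequence[Item]) -> float:
    """Fraction of items for which the model picks the correct choice."""
    if not items:
        raise ValueError("benchmark has no items")
    correct = sum(1 for question, choices, answer in items
                  if model(question, choices) == answer)
    return correct / len(items)


def always_first(question: str, choices: Sequence[str]) -> int:
    """Stand-in baseline: always picks the first choice."""
    return 0


# Stand-in "improved" model: knows two answers, otherwise picks the first choice.
KNOWN = {"2 + 2 = ?": 1, "H2O is commonly called?": 2}


def candidate(question: str, choices: Sequence[str]) -> int:
    return KNOWN.get(question, 0)


baseline_score = accuracy(always_first, ITEMS)
candidate_score = accuracy(candidate, ITEMS)
gain_points = (candidate_score - baseline_score) * 100
print(f"baseline {baseline_score:.0%}, candidate {candidate_score:.0%}, "
      f"gain {gain_points:.0f} percentage points")
```

Both stand-ins are scored on the same items with the same metric, which is what makes the gap between them meaningful. A real benchmark applies the same idea to thousands of questions and typically reports per-subject breakdowns as well.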