Benchmark Contamination
Benchmark contamination occurs when an AI model's training data accidentally or deliberately includes questions and answers from the benchmark tests used to evaluate it. Because the model has effectively "seen the exam" during training, its benchmark scores are artificially inflated and no longer reflect genuine capability. This makes it harder to compare models fairly and can mislead users and researchers about a model's true performance on novel tasks.
Example
A model scores 95% on the MMLU benchmark, but researchers discover that thousands of MMLU questions appeared verbatim in its training data. When tested on fresh, unseen questions covering the same topics, the model's accuracy drops to 78% — revealing that the original score reflected memorization rather than true understanding.
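In practice, contamination is often screened for by checking whether long word n-grams from benchmark items appear verbatim in the training corpus (several LLM training pipelines have used checks of this kind). The sketch below is a minimal, illustrative version of that idea; the function names, the 8-gram length, and the exact-match criterion are assumptions, not a specific lab's method.

```python
# Hedged sketch: a minimal word n-gram overlap check for benchmark
# contamination. The n-gram length (8) and exact-match rule are
# illustrative choices, not any particular lab's pipeline.

def ngrams(text, n=8):
    """Return the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(benchmark_item, training_corpus, n=8):
    """Flag a benchmark item if any of its n-grams also appears
    in any training document."""
    item_grams = ngrams(benchmark_item, n)
    return any(item_grams & ngrams(doc, n) for doc in training_corpus)

corpus = ["the quick brown fox jumps over the lazy dog near the river bank"]
leaked = "what does the quick brown fox jumps over the lazy dog sentence mean"
clean = "explain how photosynthesis converts sunlight into chemical energy in plants"

print(is_contaminated(leaked, corpus))  # shares an 8-gram with the corpus -> True
print(is_contaminated(clean, corpus))   # no long verbatim overlap -> False
```

Real pipelines work at far larger scale (hashing n-grams, fuzzy matching, deduplication before training), but the underlying signal is the same: long verbatim overlaps between evaluation items and training text suggest the benchmark score may be inflated by memorization.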