Golden Set

A golden set is a curated collection of input-output pairs that represent the correct behavior for a given task. It is used as the gold standard for evaluation: every new prompt or model version is scored against the golden set before shipping.

Golden sets are typically small — 50 to 200 examples — labeled by experts, and treated as immutable; updates go through review rather than ad-hoc edits. They differ from training data: golden sets are eval-only and must never leak into training or fine-tuning pipelines, or scores on them become meaningless. A good golden set is diverse, covers edge cases, and is stable enough that a score change signals a real behavior change, not test noise.
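One way to enforce the "immutable, updates go through review" property is to fingerprint the golden set and check that fingerprint in CI. The sketch below is a minimal illustration, not a prescribed implementation; the `GoldenExample` type and `golden_set_fingerprint` helper are hypothetical names introduced here.

```python
from dataclasses import dataclass
import hashlib
import json

@dataclass(frozen=True)  # frozen: examples cannot be mutated after creation
class GoldenExample:
    input_text: str
    expected_output: str

def golden_set_fingerprint(examples):
    """Hash the golden set so any ad-hoc edit is detectable in CI.

    A reviewed update to the set changes the fingerprint, which must
    then be updated deliberately alongside the change.
    """
    payload = json.dumps(
        [(e.input_text, e.expected_output) for e in examples],
        sort_keys=True,
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

golden = (
    GoldenExample("2+2?", "4"),
    GoldenExample("Capital of France?", "Paris"),
)
print(golden_set_fingerprint(golden))
```

A CI job can compare this hash against a checked-in value and fail the build if they differ, turning any silent edit into a visible review step.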

Example

A summarization team maintains a 120-example golden set of (article, human-written summary) pairs labeled by two editors. Each candidate prompt is scored against the golden set via an LLM-as-judge rubric and the team's own pairwise preference tests. When a prompt change drops the rubric score by more than two points or loses more than 35% of pairwise comparisons, the change is blocked pending review.
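The team's gating rule can be sketched as a small function. This is an illustrative reading of the thresholds above, not the team's actual code; the function name and parameters are hypothetical.

```python
def should_block(baseline_rubric, candidate_rubric, pairwise_win_rate,
                 max_rubric_drop=2.0, min_win_rate=0.65):
    """Return True if a prompt change should be blocked pending review.

    Blocks when the rubric score drops by more than max_rubric_drop
    points, or when the candidate loses more than 35% of pairwise
    comparisons (i.e. wins fewer than min_win_rate of them).
    """
    if baseline_rubric - candidate_rubric > max_rubric_drop:
        return True
    if pairwise_win_rate < min_win_rate:
        return True
    return False

print(should_block(8.5, 8.0, 0.70))  # small drop, wins 70% -> False
print(should_block(8.5, 5.0, 0.70))  # rubric dropped 3.5 points -> True
print(should_block(8.5, 8.0, 0.60))  # loses 40% of comparisons -> True
```

Keeping the thresholds as explicit parameters makes it easy to tighten or loosen the gate per task without touching the evaluation harness.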
