Golden Set

A golden set is a curated collection of input-output pairs that represent the correct behavior for a given task. It is used as the gold standard for evaluation: every new prompt or model version is scored against the golden set before shipping.

Golden sets are typically small — 50 to 200 examples — labeled by experts, and treated as immutable; updates go through review rather than ad-hoc edits. They differ from training data: golden sets are eval-only and must never leak into training or fine-tuning pipelines, or scores on them become meaningless. A good golden set is diverse, covers edge cases, and is stable enough that a score change signals a real behavior change, not test noise.
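One way to enforce the "immutable, updates go through review" property is to fingerprint the golden set and check that fingerprint in CI. The sketch below is a minimal illustration, not a prescribed implementation; the `GoldenExample` type and `golden_set_fingerprint` helper are hypothetical names introduced here.

```python
from dataclasses import dataclass
import hashlib
import json

@dataclass(frozen=True)  # frozen: examples cannot be mutated after creation
class GoldenExample:
    input_text: str
    expected_output: str

def golden_set_fingerprint(examples):
    """Hash the golden set so any ad-hoc edit is detectable in CI.

    A reviewed update to the set changes the fingerprint, which must
    then be updated deliberately alongside the change.
    """
    payload = json.dumps(
        [(e.input_text, e.expected_output) for e in examples],
        sort_keys=True,
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

golden = (
    GoldenExample("2+2?", "4"),
    GoldenExample("Capital of France?", "Paris"),
)
print(golden_set_fingerprint(golden))
```

A CI job can compare this hash against a checked-in value and fail the build if they differ, turning any silent edit into a visible review step.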

Example

A summarization team maintains a 120-example golden set of (article, human-written summary) pairs labeled by two editors. Each candidate prompt is scored against the golden set via an LLM-as-judge rubric and the team's own pairwise preference tests. When a prompt change drops the rubric score by more than two points or loses more than 35% of pairwise comparisons, the change is blocked pending review.
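The team's gating rule can be sketched as a small function. This is an illustrative reading of the thresholds above, not the team's actual code; the function name and parameters are hypothetical.

```python
def should_block(baseline_rubric, candidate_rubric, pairwise_win_rate,
                 max_rubric_drop=2.0, min_win_rate=0.65):
    """Return True if a prompt change should be blocked pending review.

    Blocks when the rubric score drops by more than max_rubric_drop
    points, or when the candidate loses more than 35% of pairwise
    comparisons (i.e. wins fewer than min_win_rate of them).
    """
    if baseline_rubric - candidate_rubric > max_rubric_drop:
        return True
    if pairwise_win_rate < min_win_rate:
        return True
    return False

print(should_block(8.5, 8.0, 0.70))  # small drop, wins 70% -> False
print(should_block(8.5, 5.0, 0.70))  # rubric dropped 3.5 points -> True
print(should_block(8.5, 8.0, 0.60))  # loses 40% of comparisons -> True
```

Keeping the thresholds as explicit parameters makes it easy to tighten or loosen the gate per task without touching the evaluation harness.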
