
Eval Harness

An eval harness is infrastructure that runs a prompt or model against a fixed test set and computes aggregate scores per metric. It decouples what you test (the eval set) from how you run it (the harness), so the same tests can run against different models, prompt variants, or decoding settings with no code changes. A typical harness handles dataset loading, prompt templating, inference, scoring, and report generation. It is distinct from unit tests: eval harnesses produce distributional scores — accuracy, pass@k, F1, judge scores — not binary pass/fail assertions. A mature eval harness is the backbone of any prompt or model change review, making regressions visible before they reach production.
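The pipeline described above — dataset loading, prompt templating, inference, scoring, report generation — can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `Example` record, the template format, and the accuracy-only scoring are assumptions, and `model_fn` stands in for whatever inference call your stack provides.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Example:
    """One eval-set item: raw input plus the expected answer."""
    input: str
    expected: str

def run_harness(model_fn: Callable[[str], str],
                template: str,
                dataset: list[Example]) -> dict:
    """Run one model/prompt variant over a fixed eval set and report accuracy."""
    hits = []
    for ex in dataset:
        prompt = template.format(input=ex.input)        # prompt templating
        output = model_fn(prompt)                        # inference (stubbed here)
        hits.append(output.strip() == ex.expected)       # scoring
    return {"n": len(hits), "accuracy": sum(hits) / len(hits)}  # report

# Usage with a trivial keyword "model" standing in for a real LLM call.
dataset = [Example("refund request", "billing"), Example("app crashes", "bug")]
fake_model = lambda p: "billing" if "refund" in p else "bug"
report = run_harness(fake_model, "Classify the ticket: {input}", dataset)
```

Because `model_fn`, `template`, and `dataset` are plain arguments, the same harness runs any model, prompt variant, or decoding setting without code changes — the decoupling the definition above describes.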

Example

A support team wires a harness around a 300-example ticket-classification eval set. Before any prompt change ships, the harness runs the candidate prompt against all 300 tickets, computes per-class F1, logs per-example diffs against the previous prompt, and posts the report to the PR. A prompt change that raises overall F1 but drops "billing" F1 by 6 points is caught and revised before merge.
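The per-class regression gate in this example can be sketched as follows. The F1 computation is standard; the 5-point `max_drop` threshold and the function names are illustrative choices, not part of the scenario above.

```python
from collections import defaultdict

def per_class_f1(labels: list[str], preds: list[str]) -> dict[str, float]:
    """Compute F1 for each class from gold labels and predictions."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for y, p in zip(labels, preds):
        if y == p:
            tp[y] += 1
        else:
            fp[p] += 1  # predicted p, was actually y
            fn[y] += 1  # missed the true class y
    f1 = {}
    for c in set(labels) | set(preds):
        prec = tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if (tp[c] + fn[c]) else 0.0
        f1[c] = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
    return f1

def regressions(baseline: dict[str, float],
                candidate: dict[str, float],
                max_drop: float = 0.05) -> dict[str, float]:
    """Classes where the candidate prompt drops F1 by more than max_drop."""
    return {c: baseline[c] - candidate.get(c, 0.0)
            for c in baseline
            if baseline[c] - candidate.get(c, 0.0) > max_drop}

# Usage: a candidate that misclassifies one "billing" ticket gets flagged.
gold = ["billing", "billing", "bug"]
base_f1 = per_class_f1(gold, ["billing", "billing", "bug"])   # baseline: perfect
cand_f1 = per_class_f1(gold, ["bug", "billing", "bug"])       # candidate: 1 miss
flagged = regressions(base_f1, cand_f1)
```

A non-empty `flagged` dict is what would block the merge in the scenario above, even when overall F1 improves.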

Put this into practice

Build polished, copy-ready prompts in under 60 seconds with SurePrompts.

Try SurePrompts