LLM-as-Judge

LLM-as-judge is an evaluation pattern in which an LLM scores the outputs of another model against a rubric. Two common modes are pointwise (score each output 1–N on each criterion) and pairwise (given outputs A and B, pick the better one). The approach is attractive because it scales cheaply compared to human evaluation, but it has known failure modes: position bias (favoring the first or last option), verbosity bias (favoring longer answers), and self-preference bias (favoring outputs that look like the judge's own style). Mitigations include randomizing option order, using multiple judges, and anchoring the rubric with concrete scored examples.
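The pairwise mode with two of those mitigations (randomized option order and multiple judges voting) can be sketched as follows. This is a minimal illustration, not a library API: each judge is assumed to be a callable `(first, second) -> 0 or 1`, where `0` means the first-presented output wins; in practice that callable would wrap a prompt to an actual judge model.

```python
import random

def pairwise_vote(judges, output_a, output_b, seed=None):
    """Pairwise LLM-as-judge with randomized A/B ordering and majority vote.

    Each judge is a callable (first, second) -> 0 or 1, where 0 means the
    first-presented output is better. The order in which A and B are shown
    is re-randomized per judge to mitigate position bias; the majority of
    judge votes decides the winner.
    """
    rng = random.Random(seed)
    votes_for_a = 0
    for judge in judges:
        a_shown_first = rng.random() < 0.5
        first, second = (output_a, output_b) if a_shown_first else (output_b, output_a)
        first_wins = judge(first, second) == 0
        # Map the positional verdict back to the underlying output.
        if first_wins == a_shown_first:
            votes_for_a += 1
    return "A" if 2 * votes_for_a > len(judges) else "B"
```

Because the A/B order is shuffled independently for each judge, a judge that simply favors whichever option appears first contributes noise rather than a systematic tilt toward one output.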

Example

A team evaluating summarization quality runs pairwise LLM-as-judge with randomized A/B ordering and three judge models voting. The rubric defines each score level with a worked example ("5 = covers all key facts, no hallucinations, under 100 words; 3 = covers most key facts, 1 minor hallucination; 1 = misses primary fact or fabricates a figure"). Scores only count when at least 2 of 3 judges agree; disagreements are routed to human review.
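The agreement gate in that workflow (accept a score only when at least 2 of 3 judges concur, otherwise escalate to a human) reduces to a small consensus check. A minimal sketch, assuming per-item judge scores are already collected as plain integers:

```python
from collections import Counter

def consensus_score(scores, min_agree=2):
    """Accept a score only when at least `min_agree` judges gave the same
    value; otherwise return None to flag the item for human review.

    `scores` is the list of per-judge scores for one item, e.g. [5, 5, 3].
    """
    value, count = Counter(scores).most_common(1)[0]
    return value if count >= min_agree else None
```

With three judges, `consensus_score([5, 5, 3])` accepts 5, while `consensus_score([5, 3, 1])` returns `None` and the item is routed to a human reviewer.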
