LLM-as-Judge

LLM-as-judge is an evaluation pattern in which an LLM scores the outputs of another model against a rubric. Two common modes are pointwise (score each output 1–N on each criterion) and pairwise (given outputs A and B, pick the better one). The approach is attractive because it scales cheaply compared to human evaluation, but it has known failure modes: position bias (favoring the first or last option), verbosity bias (favoring longer answers), and self-preference bias (favoring outputs that look like the judge's own style). Mitigations include randomizing option order, using multiple judges, and anchoring the rubric with concrete scored examples.
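The pairwise mode with two of those mitigations (randomized option order and multiple judges voting) can be sketched as follows. This is a minimal illustration, not a library API: each judge is assumed to be a callable `(first, second) -> 0 or 1`, where `0` means the first-presented output wins; in practice that callable would wrap a prompt to an actual judge model.

```python
import random

def pairwise_vote(judges, output_a, output_b, seed=None):
    """Pairwise LLM-as-judge with randomized A/B ordering and majority vote.

    Each judge is a callable (first, second) -> 0 or 1, where 0 means the
    first-presented output is better. The order in which A and B are shown
    is re-randomized per judge to mitigate position bias; the majority of
    judge votes decides the winner.
    """
    rng = random.Random(seed)
    votes_for_a = 0
    for judge in judges:
        a_shown_first = rng.random() < 0.5
        first, second = (output_a, output_b) if a_shown_first else (output_b, output_a)
        first_wins = judge(first, second) == 0
        # Map the positional verdict back to the underlying output.
        if first_wins == a_shown_first:
            votes_for_a += 1
    return "A" if 2 * votes_for_a > len(judges) else "B"
```

Because the A/B order is shuffled independently for each judge, a judge that simply favors whichever option appears first contributes noise rather than a systematic tilt toward one output.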

Example

A team evaluating summarization quality runs pairwise LLM-as-judge with randomized A/B ordering and three judge models voting. The rubric defines each score level with a worked example ("5 = covers all key facts, no hallucinations, under 100 words; 3 = covers most key facts, 1 minor hallucination; 1 = misses primary fact or fabricates a figure"). Scores only count when at least 2 of 3 judges agree; disagreements are routed to human review.
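The agreement gate in that workflow (accept a score only when at least 2 of 3 judges concur, otherwise escalate to a human) reduces to a small consensus check. A minimal sketch, assuming per-item judge scores are already collected as plain integers:

```python
from collections import Counter

def consensus_score(scores, min_agree=2):
    """Accept a score only when at least `min_agree` judges gave the same
    value; otherwise return None to flag the item for human review.

    `scores` is the list of per-judge scores for one item, e.g. [5, 5, 3].
    """
    value, count = Counter(scores).most_common(1)[0]
    return value if count >= min_agree else None
```

With three judges, `consensus_score([5, 5, 3])` accepts 5, while `consensus_score([5, 3, 1])` returns `None` and the item is routed to a human reviewer.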
