LLM-as-Judge¶
Using an LLM to grade LLM outputs. It scales evaluation of open-ended quality far beyond what humans can review by hand โ but only if you validate the judge itself.
Overview¶
Some outputs can't be graded by exact match โ "is this summary good?", "is this answer helpful and grounded?" Human review is the gold standard but slow and expensive. LLM-as-judge uses a model to score outputs against a rubric, letting you evaluate thousands of cases cheaply. The catch: the judge is itself a fallible LLM, so an unvalidated judge can give you confident, systematically wrong scores. This page shows how to build a judge you can actually trust.
Learning Objectives¶
By the end of this page you will be able to:
- Write an effective judge prompt with a clear rubric.
- Choose between pointwise scoring and pairwise comparison.
- Recognize and mitigate judge biases.
- Validate a judge against human labels before trusting it.
Theory¶
Two ways to judge¶
flowchart TB
subgraph "Pointwise (score one output)"
A[Output + rubric] --> B[Judge] --> C[Score / pass-fail + reason]
end
subgraph "Pairwise (compare two)"
D[Output A vs Output B] --> E[Judge] --> F[Which is better + why]
end
- Pointwise: score a single output (e.g. faithful? 1โ5 helpfulness). Simple; good for pass/fail gates and tracking a metric over time.
- Pairwise: ask which of two outputs is better. Often more reliable than absolute scores โ models are better at comparison than calibration โ and ideal for comparing two prompts/models.
Known judge biases (and fixes)¶
| Bias | What happens | Mitigation |
|---|---|---|
| Position bias | Favors the first (or second) option in pairwise | Run both orders, average |
| Verbosity bias | Prefers longer answers | Rubric emphasizes correctness, not length |
| Self-preference | A model favors its own style | Use a different model as judge when possible |
| Leniency | Everything "looks fine" โ scores cluster high | Force a rubric + require justification |
The rubric is everything¶
Vague prompts ("rate 1โ10") produce noisy, meaningless scores. A good judge prompt has: a specific criterion, explicit levels, a request for a short justification, and a structured output you can parse.
Practical Example¶
A faithfulness judge (pointwise, structured)¶
import json
from anthropic import Anthropic
from pydantic import BaseModel
client = Anthropic()
class Verdict(BaseModel):
faithful: bool
reason: str
JUDGE = """You evaluate whether an ANSWER is fully supported by the CONTEXT.
Rules:
- "faithful" is true ONLY if every factual claim in the answer is supported by the context.
- If the answer adds unsupported facts, faithful is false.
- Be strict. Judge support, not writing quality.
CONTEXT:
{context}
ANSWER:
{answer}"""
def judge_faithfulness(context: str, answer: str) -> Verdict:
resp = client.messages.create(
model="claude-sonnet-5", max_tokens=250, temperature=0, # deterministic judging
tools=[{"name": "verdict", "description": "Record the verdict.",
"input_schema": Verdict.model_json_schema()}],
tool_choice={"type": "tool", "name": "verdict"},
messages=[{"role": "user",
"content": JUDGE.format(context=context, answer=answer)}],
)
tool_use = next(b for b in resp.content if b.type == "tool_use")
return Verdict.model_validate(tool_use.input)
print(judge_faithfulness(
context="The refund window is 30 days.",
answer="You can get a refund within 30 days, and shipping is always free.",
)) # faithful=False โ the shipping claim isn't in the context.
Pairwise comparison with position-bias control¶
def better_of(question: str, a: str, b: str) -> str:
"""Return 'A', 'B', or 'tie', averaging both orders to cancel position bias."""
def ask(first, second):
prompt = (f"Question: {question}\n\nResponse 1:\n{first}\n\nResponse 2:\n{second}\n\n"
"Which response better answers the question? Reply only '1', '2', or 'tie'.")
r = client.messages.create(model="claude-sonnet-5", max_tokens=10, temperature=0,
messages=[{"role": "user", "content": prompt}])
return r.content[0].text.strip().lower()
first = ask(a, b) # A shown first
second = ask(b, a) # B shown first (order flipped)
# Reconcile: if the winner flips with order, it's effectively a tie.
return "tie" if {"1", "2"} == {first, second} else ("A" if "1" in first else "B")
Validate the judge before you trust it
Hand-label 30โ50 examples yourself, then run the judge on the same set. Measure agreement (e.g. accuracy vs. your labels). If the judge doesn't agree with humans, fix the rubric โ do not build your evals on an unvalidated judge.
Best Practices¶
- โ Give a specific rubric with explicit levels; require a written justification.
- โ
Use
temperature=0and structured output for consistent, parseable verdicts. - โ Prefer pairwise for comparisons; control position bias by averaging both orders.
- โ Use a different model as judge than the one being judged when you can.
- โ Validate the judge against human labels; re-validate if you change the rubric.
Common Mistakes¶
- โ "Rate this 1โ10" with no rubric โ noise dressed up as a metric.
- โ Trusting judge scores without ever checking them against humans.
- โ Ignoring position/verbosity bias in pairwise setups.
- โ Judging with the same model/prompt family and mistaking self-preference for quality.
- โ Non-zero temperature making verdicts irreproducible.
Exercises¶
- Write a helpfulness judge with a 1โ4 rubric and justifications. Run it on 20 answers; read the justifications โ do you agree?
- Build a pairwise comparison and deliberately make one answer longer but worse. Does verbosity bias show up? Does averaging both orders help?
- Hand-label 30 outputs, run your judge, and compute agreement. Improve the rubric until agreement is high.