Evaluation
How do you know if your AI system is good, and whether your latest change made it better or worse? Evaluation is the discipline that answers that.
Overview
Traditional software has deterministic tests: given input X, assert output Y. LLMs break that model: the same prompt can produce different, equally valid outputs. Evaluation is how you regain confidence in a non-deterministic system.
Without evals, you're "vibe-checking": changing a prompt, eyeballing a few outputs, and hoping. With evals, every change is measured against a baseline.
flowchart LR
D[Eval dataset<br/>inputs + criteria] --> R[Run system]
R --> S[Score outputs]
S --> M[Metrics<br/>accuracy, faithfulnessβ¦]
M --> C{Better than<br/>baseline?}
C -->|yes| Ship
C -->|no| Iterate
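The loop in the flowchart can be sketched in a few lines of Python. This is a minimal illustration, not a reference harness: `EvalCase`, `run_eval`, `decide`, and `toy_system` are hypothetical names introduced here, the scorer is a simple normalized exact match, and the baseline value is made up for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str    # input sent to the system
    expected: str  # criterion; here, a single expected answer


def run_eval(system: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run every case through the system and return accuracy in [0, 1]."""
    passed = sum(
        system(case.prompt).strip().lower() == case.expected.strip().lower()
        for case in cases
    )
    return passed / len(cases)


def decide(candidate: float, baseline: float) -> str:
    """Mirror the last branch of the flowchart: ship only if strictly better."""
    return "ship" if candidate > baseline else "iterate"


# Hypothetical stand-in for a real model call.
def toy_system(prompt: str) -> str:
    answers = {"2+2": "4", "capital of France": "Paris"}
    return answers.get(prompt, "unknown")


cases = [
    EvalCase("2+2", "4"),
    EvalCase("capital of France", "paris"),
    EvalCase("capital of Peru", "Lima"),  # the toy system gets this one wrong
]
score = run_eval(toy_system, cases)
print(f"accuracy={score:.2f} -> {decide(score, baseline=0.5)}")
```

In a real setup the baseline would come from a stored run of the previous system version on the same eval set, not a hard-coded number.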
Learning Objectives
By the end of this section you will be able to:
- Build an evaluation dataset that reflects real usage.
- Choose the right scoring method (exact match, rubric, LLM-as-judge).
- Detect regressions before they reach users.
- Evaluate RAG and agents, not just single completions.
The three ways to score
| Method | Good for | Watch out for |
|---|---|---|
| Exact / rule-based | Classification, extraction, format checks | Too rigid for open-ended text |
| LLM-as-judge | Open-ended quality, helpfulness, tone | Judge bias; needs its own validation |
| Human review | Ground truth, high-stakes | Slow and expensive; use sparingly |
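To make the first row of the table concrete, here is a sketch of three rule-based scorers using only the Python standard library. The function names and the example strings are illustrative assumptions, not part of any particular eval framework. The second `exact_match` call shows why this method is too rigid for open-ended text.

```python
import json
import re


def exact_match(output: str, expected: str) -> bool:
    """Classification-style check: normalized strings must be identical."""
    return output.strip().lower() == expected.strip().lower()


def is_valid_json_with_keys(output: str, required_keys: set[str]) -> bool:
    """Format check: output parses as a JSON object with the required keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


def extracts_date(output: str, expected_iso_date: str) -> bool:
    """Extraction check: the expected ISO date appears in the output."""
    return expected_iso_date in re.findall(r"\d{4}-\d{2}-\d{2}", output)


print(exact_match("Paris", " paris "))              # True
print(exact_match("The answer is Paris.", "Paris"))  # False: correct, but phrased differently
print(is_valid_json_with_keys('{"label": "spam", "score": 0.9}', {"label"}))  # True
print(extracts_date("Invoice dated 2024-03-15, due later.", "2024-03-15"))    # True
```

Checks like these are cheap and deterministic, so they are a reasonable first layer before reaching for an LLM judge or human review.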
Best Practices
- Build your eval set from real (or realistic) inputs, including edge cases and failures.
- Version your eval set and track scores over time; treat it like a test suite.
- Validate your LLM judge against human labels before trusting it.
- Measure what matters to users (task success), not just proxy metrics.
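One way to validate a judge against human labels is to compare its verdicts with human verdicts on the same outputs and compute both raw agreement and Cohen's kappa, which corrects for agreement expected by chance. The sketch below assumes categorical labels such as "pass" and "fail"; the label lists are invented for illustration, and what counts as an acceptable kappa is a judgment call for your project.

```python
from collections import Counter


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Chance-corrected agreement between two label sequences."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_counts = Counter(judge)
    human_counts = Counter(human)
    # Probability that both raters pick the same label by chance.
    expected = sum(
        judge_counts[label] * human_counts[label] for label in judge_counts
    ) / (n * n)
    if expected == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up labels for eight outputs scored by a human and by the LLM judge.
human_labels = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
judge_labels = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]

agreement = sum(j == h for j, h in zip(judge_labels, human_labels)) / len(human_labels)
kappa = cohens_kappa(judge_labels, human_labels)
print(f"raw agreement={agreement:.2f}, kappa={kappa:.2f}")
```

Here raw agreement is 0.75 but kappa is noticeably lower, because two raters who both say "pass" most of the time would often agree by chance alone.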
Common Mistakes
- Shipping prompt changes with no measurement ("it looked better").
- Overfitting to a tiny eval set that doesn't represent production.
- Trusting an LLM judge without checking it agrees with humans.
- Only measuring averages; the tail (worst cases) is often what hurts users.
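The last point is easy to demonstrate. The sketch below reports the mean alongside two tail statistics; `summarize` is a hypothetical helper, the per-case scores are invented, and "worst 10%" is an arbitrary cut-off chosen for the example.

```python
import statistics


def summarize(scores: list[float]) -> dict[str, float]:
    """Report the average alongside worst-case statistics."""
    if not scores:
        raise ValueError("scores must be non-empty")
    ordered = sorted(scores)
    worst_k = max(1, len(ordered) // 10)  # worst 10% of cases, at least one
    return {
        "mean": statistics.fmean(ordered),
        "min": ordered[0],
        "worst_10pct_mean": statistics.fmean(ordered[:worst_k]),
    }


# Invented per-case scores in [0, 1] for two versions of a system.
baseline = summarize([0.8] * 10)
candidate = summarize([0.9] * 9 + [0.1])  # better on most cases, one severe failure

print("baseline ", baseline)
print("candidate", candidate)
```

The candidate wins on the mean yet has a far worse minimum, so a check that looks only at averages would approve a change that introduces a severe failure.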
Help build this section
Claim a topic by opening an issue:
- [WANTED] Building an eval harness: datasets, runners, tracking
- LLM-as-Judge: rubrics, pairwise, bias, validation
- [WANTED] Detecting hallucinations: groundedness/faithfulness metrics
- [WANTED] Benchmarks explained: MMLU, GSM8K, and their limits
- See also RAG's own evaluation guide.