# Evaluating AI Systems (Evals)
Production AI systems need systematic evaluation. "Evals" are repeatable tests that measure the quality of model outputs against defined criteria.
## Types of Evaluation
| Type | Method | Best For |
|---|---|---|
| Automated evals | Code checks exact match, regex, or structured output | Classification, extraction |
| LLM-as-judge | Another LLM rates the output on criteria | Open-ended text quality |
| Human review | Human raters score outputs | High-stakes decisions |
| A/B testing | Compare two prompt versions on real users | Production improvements |
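The automated row above is the cheapest to implement. A minimal sketch in Python, assuming a hypothetical labelled golden set and a stand-in model function (both invented here for illustration):

```python
import re

# Hypothetical golden examples for a sentiment-classification eval:
# each pairs an input with the expected label.
GOLDEN = [
    {"input": "I love this product", "expected": "positive"},
    {"input": "Terrible experience", "expected": "negative"},
]

def fake_model(text: str) -> str:
    # Stand-in for a real model call.
    return "positive" if "love" in text else "negative"

def exact_match_eval(model, examples) -> float:
    """Score a model by exact-match accuracy on labelled examples."""
    correct = sum(
        model(ex["input"]).strip().lower() == ex["expected"]
        for ex in examples
    )
    return correct / len(examples)

def regex_eval(output: str, pattern: str) -> bool:
    """Check that an output has the required structure, e.g. an ISO date."""
    return re.fullmatch(pattern, output) is not None
```

Exact match suits classification; the regex check suits extraction tasks where any output matching the expected shape (a date, an ID, a JSON field) counts as correct.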
## Key Metrics
- accuracy — % of correct answers on a test set
- hallucination rate — % of responses containing fabricated information
- latency p95 — 95th percentile response time
- pass@k — probability that at least one of k sampled attempts solves the problem
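The first and last metrics above are straightforward to compute. A short sketch: accuracy over a boolean result list, and the standard unbiased pass@k estimator used in the code-generation eval literature (draw k of n samples, c of which are correct):

```python
from math import comb

def accuracy(results: list[bool]) -> float:
    """Fraction of correct answers on a test set."""
    return sum(results) / len(results)

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without
    replacement from n total samples (c correct), is correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 samples of which 5 are correct, pass@1 is 0.5; as k grows toward n, pass@k approaches 1 whenever any sample is correct.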
## Useful Phrases
- "We run a suite of 500 golden examples as regression evals on every model update."
- "We use an LLM-as-judge to rate response helpfulness on a 1–5 scale."
- "Our eval pipeline caught a 12% accuracy regression before we deployed the new prompt."
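The regression-eval pattern in the first and third phrases can be sketched as a deploy gate: run the golden set against the candidate, and fail if accuracy drops past a threshold. All names and the 2% threshold here are hypothetical:

```python
def run_regression_evals(model, golden_set, baseline_accuracy, max_drop=0.02):
    """Fail the deploy if accuracy drops more than max_drop vs. baseline.

    golden_set is a list of {"input": ..., "expected": ...} dicts;
    model is any callable from input text to output text.
    """
    correct = sum(model(ex["input"]) == ex["expected"] for ex in golden_set)
    acc = correct / len(golden_set)
    if acc < baseline_accuracy - max_drop:
        raise RuntimeError(
            f"Regression: accuracy {acc:.1%} vs baseline {baseline_accuracy:.1%}"
        )
    return acc
```

Wiring this into CI on every model or prompt update is what catches a regression like the 12% drop mentioned above before it reaches users.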