// LESSON 44

Evaluating AI Systems (Evals)

B2

Production AI systems need systematic evaluation. "Evals" are tests that measure model output quality.
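
As a concrete illustration, here is a minimal automated eval loop in Python. It is a sketch only: call_model and the two test cases are hypothetical placeholders, so swap in your real model client and your golden examples.

    # Minimal automated eval: exact-match scoring over a small test set.
    def call_model(prompt: str) -> str:
        # Hypothetical stand-in for a real model client; a toy rule here
        # so the sketch runs end-to-end.
        return "positive" if "love" in prompt else "negative"

    TEST_CASES = [
        {"input": "Classify the sentiment: 'I love this phone.'", "expected": "positive"},
        {"input": "Classify the sentiment: 'Worst purchase ever.'", "expected": "negative"},
    ]

    def run_evals(cases) -> float:
        # Fraction of cases where the output exactly matches the label.
        passed = sum(
            call_model(case["input"]).strip().lower() == case["expected"]
            for case in cases
        )
        return passed / len(cases)

    print(f"accuracy: {run_evals(TEST_CASES):.0%}")  # accuracy: 100%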

Types of Evaluation

Type            | Method                                                | Best For
----------------|-------------------------------------------------------|----------------------------
Automated evals | Code checks: exact match, regex, or structured output | Classification, extraction
LLM-as-judge    | Another LLM rates the output on criteria              | Open-ended text quality
Human review    | Human raters score outputs                            | High-stakes decisions
A/B testing     | Compare two prompt versions on real users             | Production improvements
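
To make the LLM-as-judge row concrete, here is a hedged sketch in which a second model rates an answer on a 1-5 scale and the score is parsed from its reply. judge_model is a hypothetical placeholder, not a real library call.

    # LLM-as-judge sketch: a second model rates an output on a 1-5 scale.
    JUDGE_PROMPT = (
        "Rate the following answer for helpfulness on a scale of 1-5.\n"
        "Reply with a single digit only.\n\n"
        "Question: {question}\n"
        "Answer: {answer}"
    )

    def judge_model(prompt: str) -> str:
        # Hypothetical stand-in for a real LLM client; returns a canned
        # reply so the sketch runs end-to-end.
        return "4"

    def rate_helpfulness(question: str, answer: str) -> int:
        reply = judge_model(JUDGE_PROMPT.format(question=question, answer=answer))
        score = int(reply.strip()[0])  # expect a single leading digit
        if not 1 <= score <= 5:
            raise ValueError(f"judge returned an out-of-range score: {reply!r}")
        return score

    print(rate_helpfulness("What is p95 latency?",
                           "The response time that 95% of requests beat."))  # 4

Parsing and validating the judge's reply matters in practice, since a free-form model can return anything.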

Key Metrics

  • accuracy — % of correct answers on a test set
  • hallucination rate — % of responses containing fabricated information
  • latency p95 — 95th percentile response time
  • pass@k — the probability that at least one of k sampled attempts solves the problem (see the sketch after this list)
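
Once per-example results are logged, these metrics are simple to compute. The sketch below shows accuracy, p95 latency, and the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n samples were drawn and c passed; the function names are my own.

    import math

    def accuracy(correct_flags) -> float:
        # Fraction of test-set answers marked correct (True/False per example).
        return sum(correct_flags) / len(correct_flags)

    def latency_p95(latencies_ms) -> float:
        # 95th percentile by rank over the sorted observations.
        ordered = sorted(latencies_ms)
        return ordered[max(0, math.ceil(0.95 * len(ordered)) - 1)]

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Probability that at least one of k samples, drawn without
        # replacement from n attempts (c of them correct), passes.
        if n - c < k:
            return 1.0  # fewer failures than draws, so a pass is guaranteed
        return 1.0 - math.comb(n - c, k) / math.comb(n, k)

    print(accuracy([True, True, False, True]))    # 0.75
    print(latency_p95([120, 95, 300, 180, 110]))  # 300
    print(round(pass_at_k(n=10, c=3, k=5), 3))    # 0.917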

Useful Phrases

  • "We run a suite of 500 golden examples as regression evals on every model update."
  • "We use an LLM-as-judge to rate response helpfulness on a 1–5 scale."
  • "Our eval pipeline caught a 12% accuracy regression before we deployed the new prompt."
// TERMINAL CHALLENGE

Test Yourself

Q1. What is 'LLM-as-judge'?
Q2. What are 'golden examples' in an eval suite?
Q3. Complete: 'Our eval pipeline caught a 12% accuracy ___ before we deployed the new prompt.'
Q4. What does 'hallucination rate' measure in an AI evaluation?
Q5. What is A/B testing in the context of prompt engineering?