Phase 11 - Lesson 10

Evaluation & Testing LLM Applications


You would never deploy a web app without tests. You would never ship a database migration without a rollback plan. But right now, most teams ship LLM applications by reading 10 outputs and saying "yeah, looks good." That is not evaluation. That is hope. Hope is not an engineering practice. Every prompt change, every model swap, every temperature tweak changes your output distribution in ways you cannot predict by reading a handful of examples. Evaluation is the only thing standing between your application and silent degradation.

Type: Build
Languages: Python
Prerequisites: Phase 11 Lesson 01 (Prompt Engineering), Lesson 09 (Function Calling)
Time: ~45 minutes
Related: Phase 5 · 27 (LLM Evaluation: RAGAS, DeepEval, G-Eval) covers the framework-level concepts (NLI-based faithfulness, judge calibration, the RAG four). Phase 5 · 28 (Long-Context Evaluation) covers NIAH / RULER / LongBench / MRCR for context-length regression. This lesson focuses on what is specific to LLM engineering: CI/CD integration, cost-gated eval runs, and regression dashboards.

Learning Objectives

Build an evaluation dataset with input-output pairs, rubrics, and edge cases specific to your LLM application
Implement automated scoring using LLM-as-judge, regex matching, and deterministic assertion checks
Set up regression testing that detects quality degradation when prompts, models, or parameters change
Design evaluation metrics that capture what matters for your use case (correctness, tone, format compliance, latency)

The Problem

You build a RAG chatbot for customer support. It works great in your demos. You ship it. Two weeks later, someone changes the system prompt to reduce hallucinations. The change works -- hallucination rate drops. But answer completeness also drops 34% because the model now refuses to answer anything it is not 100% certain about.

Nobody noticed for 11 days. Revenue from the self-service channel fell. Support tickets spiked.

This is the default outcome when you evaluate by vibes. You check a few examples, they look fine, you merge. But LLM outputs are stochastic. A prompt that works on 5 test cases can fail on the 6th. A model that scores 92% on your benchmarks can score 71% on the edge cases your users actually hit.

The fix is not "be more careful." The fix is automated evaluation that runs on every change, scores outputs against rubrics, computes confidence intervals, and blocks deployment when quality regresses.

Evaluation is not a nice-to-have. It is table stakes. Shipping without evals is deploying blind.

The Concept

The Eval Taxonomy

There are three categories of LLM evaluation. Each has a role. None is sufficient alone.

graph TD
    E[LLM Evaluation] --> A[Automated Metrics]
    E --> L[LLM-as-Judge]
    E --> H[Human Evaluation]

    A --> A1[BLEU]
    A --> A2[ROUGE]
    A --> A3[BERTScore]
    A --> A4[Exact Match]

    L --> L1[Single Grader]
    L --> L2[Pairwise Comparison]
    L --> L3[Best-of-N]

    H --> H1[Expert Review]
    H --> H2[User Feedback]
    H --> H3[A/B Testing]

    style A fill:#e8e8e8,stroke:#333
    style L fill:#e8e8e8,stroke:#333
    style H fill:#e8e8e8,stroke:#333

Automated metrics compare output text against reference answers using algorithms. BLEU measures n-gram precision: how many of the output's n-grams also appear in the reference (originally for machine translation). ROUGE measures recall: how many of the reference's n-grams appear in the output (originally for summarization). BERTScore compares contextual BERT embeddings of the two texts to measure semantic similarity. These are fast and cheap -- you can score 10,000 outputs in seconds. But they miss nuance. Two answers can have zero word overlap and both be correct. One answer can have high ROUGE and be completely wrong in context.
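To make the overlap-based metrics concrete, here is a minimal sketch of exact match and ROUGE-L (the longest-common-subsequence variant) in plain Python. It assumes naive whitespace tokenization and lowercasing, and the function names are illustrative, not taken from any library. A production setup would more likely use an established package.

```python
from __future__ import annotations


def exact_match(output: str, reference: str) -> bool:
    """Exact match after lowercasing and collapsing whitespace."""
    return " ".join(output.lower().split()) == " ".join(reference.lower().split())


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(output: str, reference: str) -> dict[str, float]:
    """ROUGE-L precision, recall, and F1 over whitespace tokens (an assumption)."""
    out_toks = output.lower().split()
    ref_toks = reference.lower().split()
    if not out_toks or not ref_toks:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    lcs = lcs_length(out_toks, ref_toks)
    precision = lcs / len(out_toks)
    recall = lcs / len(ref_toks)
    f1 = 0.0 if lcs == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


reference = "refunds are processed within 5 business days"
paraphrase = "you will get your money back in about a week"
# The paraphrase shares no tokens with the reference, so ROUGE-L recall is 0
# even though a human would likely accept the answer.
paraphrase_scores = rouge_l(paraphrase, reference)
```

The paraphrase case illustrates the limitation described above: a correct answer with no shared words scores zero, which is why string metrics work better as a cheap first screen than as the only signal.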

LLM-as-judge uses a strong model (GPT-5, Claude Opus 4.7, Gemini 3 Pro) to grade outputs against a rubric. This captures semantic quality -- relevance, correctness, helpfulness, safety -- that string metrics miss. It costs money (~$8 per 1,000 judge calls with GPT-5-mini, ~

Method	Speed	Cost per 1K evals	Correlation with humans	Best for
BLEU/ROUGE	<1 sec	$0	40-60%	Translation, summarization baselines
BERTScore	~30 sec	$0	55-70%	Semantic similarity screening
LLM-as-judge (GPT-5-mini)	~3 min	~$8	82-86%	Default CI judge; cheap, fast, calibrated
LLM-as-judge (Claude Opus 4.7)	~5 min	~ 5	85-88%	High-stakes scoring, safety, refusals
LLM-as-judge (Gemini 3 Flash)	~2 min	~$3	80-84%	Highest-throughput judge; for 1M+ eval pass
RAGAS (NLI faithfulness + judge)	~5 min	~ 2	85%	RAG-specific metrics (see Phase 5 · 27)
DeepEval (G-Eval + Pytest)	~4 min	depends on judge	80-88%	CI-native, per-PR regression gates
Human expert	~2 hours	~$500	100% (by definition)	Calibration, edge cases, policy
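The LLM-as-judge rows in the table all follow the same pattern: build a prompt from a rubric, send it to a strong model, and parse a score out of the reply. The sketch below shows that pattern under some assumptions: the rubric text is an invented example, and `call_model` is a hypothetical callable you would supply by wrapping whichever provider SDK you use. No real API is called here.

```python
from __future__ import annotations

import json
import re
from typing import Callable, Optional

# Example rubric with anchored score levels (illustrative wording only).
RUBRIC = """Score the answer for correctness on a 1-5 scale.
5 = fully correct and complete
3 = partially correct, or correct but missing key details
1 = incorrect or contradicts the reference
Reply with JSON only: {"score": <int>, "reason": "<one sentence>"}"""


def build_judge_prompt(question: str, answer: str, reference: str,
                       rubric: str = RUBRIC) -> str:
    """Assemble the grading prompt sent to the judge model."""
    return (
        f"{rubric}\n\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Answer to grade: {answer}"
    )


def parse_judge_reply(reply: str) -> Optional[int]:
    """Extract an integer score in [1, 5]; return None if the reply is unusable."""
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    score = data.get("score")
    if isinstance(score, int) and not isinstance(score, bool) and 1 <= score <= 5:
        return score
    return None


def judge(question: str, answer: str, reference: str,
          call_model: Callable[[str], str]) -> Optional[int]:
    """Grade one answer. `call_model` is a hypothetical prompt -> reply function."""
    return parse_judge_reply(call_model(build_judge_prompt(question, answer, reference)))


# Usage with a stub in place of a real model call.
def fake_model(prompt: str) -> str:
    return 'Here is my grade: {"score": 4, "reason": "Mostly complete."}'


score = judge("How long do refunds take?", "About a week.",
              "Refunds are processed within 5 business days.", fake_model)
```

Returning `None` for unparseable replies, rather than a default score, keeps judge failures visible so they can be retried or counted separately instead of silently skewing the average.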

Sample Size and Confidence

Test cases	Observed accuracy	95% CI width	Can detect 5% regression?
50	90%	19 points	No
100	90%	12 points	Barely
200	90%	9 points	Yes
500	90%	5 points	Confidently
1000	90%	3 points	Precisely
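The table's pattern, where the interval narrows as the test set grows, can be reproduced with a Wilson score interval for a pass rate. This is a sketch using the standard Wilson formula with z = 1.96 for 95% confidence. The source does not state which method produced the table, so the widths computed here may differ from the table by a point or two.

```python
from __future__ import annotations

import math


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; z=1.96 corresponds to ~95% confidence."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passes <= total:
        raise ValueError("passes must be between 0 and total")
    p = passes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def ci_width_points(passes: int, total: int) -> float:
    """Interval width in percentage points."""
    lo, hi = wilson_interval(passes, total)
    return (hi - lo) * 100


# Widths at 90% observed accuracy for the sample sizes in the table.
widths = {n: ci_width_points(round(0.9 * n), n) for n in (50, 100, 200, 500, 1000)}
```

A regression smaller than roughly half the interval width is hard to distinguish from sampling noise, which is why the 50-case row cannot detect a 5-point drop.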

Cost of Evals

Eval size	GPT-5-mini judge	Claude Opus 4.7 judge	Gemini 3 Flash judge	Time
100 cases x 4 criteria	~$2	~$6	~$0.40	~2 min
200 cases x 4 criteria	~$4	~$12	~$0.80	~4 min
500 cases x 4 criteria	~$10	~$30	~$2	~10 min
1000 cases x 4 criteria	~$20	~$60	~$4	~20 min
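Costs in this table scale linearly with the number of cases, so a run's cost can be estimated before it starts and used as a gate. The sketch below assumes one judge call per case per criterion, and derives a per-call price of about $0.015 from the Claude Opus 4.7 column ($6 for 100 cases x 4 criteria). Both are assumptions for illustration, and the function names are hypothetical. Real prices depend on token counts and current provider pricing.

```python
def estimate_eval_cost(num_cases: int, num_criteria: int, cost_per_call_usd: float) -> float:
    """Estimated judge cost, assuming one judge call per (case, criterion) pair."""
    return num_cases * num_criteria * cost_per_call_usd


def within_budget(num_cases: int, num_criteria: int,
                  cost_per_call_usd: float, budget_usd: float) -> bool:
    """Cost gate: True if the estimated run cost fits the budget."""
    return estimate_eval_cost(num_cases, num_criteria, cost_per_call_usd) <= budget_usd


# Per-call price implied by the table row "100 cases x 4 criteria ~ $6" (assumption).
OPUS_COST_PER_CALL = 6.0 / (100 * 4)

full_run_cost = estimate_eval_cost(500, 4, OPUS_COST_PER_CALL)   # roughly $30
run_on_every_pr = within_budget(500, 4, OPUS_COST_PER_CALL, budget_usd=10.0)
```

With a $10 per-PR budget, the 500-case run with this judge fails the gate. That points toward using a cheaper judge or a smaller golden set on every PR, and reserving the full run for release candidates.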

Real Tools

Tool	What it does	Pricing
promptfoo	Open-source eval framework, YAML config, LLM-as-judge, CI integration	Free (OSS)
Braintrust	Eval platform with scoring, experiments, datasets, logging	Free tier, then usage-based
LangSmith	LangChain's eval/observability platform, tracing, datasets, annotation	Free tier, $39/mo+
DeepEval	Python eval framework, 14+ metrics, Pytest integration	Free (OSS)
Arize Phoenix	Open-source observability + evals, tracing, span-level scoring	Free (OSS)

Key Terms

Term	What people say	What it actually means
Eval	"Testing"	Systematically scoring LLM outputs against defined criteria using automated metrics, LLM judges, or human review
LLM-as-judge	"AI grading"	Using a strong model (GPT-4o, Claude) to score outputs against a rubric -- correlates 80-85% with human judgment
Rubric	"Scoring guide"	Anchored descriptions for each score level (1-5) that reduce judge variance by defining exactly what each score means
ROUGE-L	"Text overlap"	Longest Common Subsequence-based metric measuring how much of the reference appears in the output -- recall-oriented
Confidence interval	"Error bars"	A range around your measured score that tells you how much uncertainty remains -- wider with fewer test cases
Regression testing	"Before/after"	Running the same eval suite on old and new prompt versions to detect quality degradation before deployment
Golden test set	"Core evals"	Curated input-output pairs representing your most important use cases -- every change must pass these
Pairwise comparison	"A vs B"	Showing a judge two outputs and asking which is better -- eliminates scale calibration problems
Bootstrap	"Resampling"	Estimating confidence intervals by repeatedly sampling from your scores with replacement -- works with any distribution
Wilson interval	"Proportion CI"	A confidence interval for pass/fail rates that works correctly even with small sample sizes or extreme proportions
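The bootstrap and regression-testing entries combine naturally: score the same test cases with the old and new prompt, then bootstrap the per-case differences to see whether the drop is larger than noise. The following is a minimal sketch using only the standard library. The function names, the 2,000-resample default, and the decision rule (flag a regression when the whole interval sits below zero) are assumptions for illustration.

```python
from __future__ import annotations

import random


def bootstrap_ci(values: list[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean, resampling with replacement."""
    if not values:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)  # fixed seed so CI runs are reproducible
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def is_regression(baseline: list[float], candidate: list[float]) -> bool:
    """Paired comparison: True if the CI of (candidate - baseline) is entirely below 0."""
    if len(baseline) != len(candidate):
        raise ValueError("score lists must cover the same test cases")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    _, hi = bootstrap_ci(diffs)
    return hi < 0


# Pass/fail scores on the same 100 cases before and after a prompt change.
old_scores = [1.0] * 90 + [0.0] * 10
new_scores = [1.0] * 60 + [0.0] * 40
blocked = is_regression(old_scores, new_scores)
```

Pairing the scores by test case removes case-to-case difficulty from the comparison, so a smaller test set can detect the same regression than if the two runs were treated as independent samples.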
