What LLM Evaluation Is and Why It Is Different

What You Will Learn

You will understand what evaluation means for LLM-powered applications, why traditional software testing approaches do not transfer directly, and what it means to treat quality as a measurement problem.


Before you start: You should have built at least one LLM-powered feature or pipeline, even a simple one. This course assumes you are working with LLM APIs, not fine-tuning models.


Why Traditional Testing Breaks Down

A deterministic function has one correct output for a given input. You test it with unit tests. If the function returns the wrong value, the test fails. The problem is clear, and so is the fix.

LLMs are different. For the same input:

  • The output varies between runs
  • Multiple outputs can be equally correct
  • An output can be technically correct but wrong in tone, length, or format
  • An output can be confident and completely fabricated

You cannot write a unit test that asserts output === expectedString and trust it. The model will pass today and produce a slightly different answer tomorrow after an API update. The answer might still be correct, but your test will fail.

This is not a bug. It is the nature of probabilistic systems. Evaluation is the discipline that replaces brittle string assertions with meaningful quality measurement.
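To make the brittleness concrete, here is a minimal sketch. The two "responses" are hand-written stand-ins for what a model might return on two different runs: a strict string assertion fails on a harmless paraphrase, while a check against the facts the answer must contain survives it.

```python
# Two hypothetical model responses to the same prompt, both correct.
run_1 = "The meeting is moved to 3 PM on Friday."
run_2 = "The meeting has been rescheduled to Friday at 3 PM."

expected = "The meeting is moved to 3 PM on Friday."

# Brittle: exact-match assertion passes today, fails tomorrow.
def exact_match(output: str, expected: str) -> bool:
    return output == expected

# More robust: assert on the facts the answer must contain,
# not on its exact wording.
def contains_facts(output: str, required: list[str]) -> bool:
    lowered = output.lower()
    return all(fact.lower() in lowered for fact in required)

print(exact_match(run_1, expected))               # True
print(exact_match(run_2, expected))               # False: same meaning, different wording
print(contains_facts(run_2, ["3 PM", "Friday"]))  # True
```

This is still a crude scorer, but it already tolerates the run-to-run variation that breaks exact-match tests.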


What Evaluation Actually Is

Evaluation is the systematic process of measuring whether an LLM-powered system produces outputs that meet your quality bar, consistently, across a representative set of inputs.

A complete eval system has three parts:

1. A test set. A collection of inputs that represents the range of real inputs your system will see. Includes happy-path cases, edge cases, and adversarial cases.

2. A scoring method. A way to compare the system's output to the expected outcome. Exact match for structured outputs. LLM-as-judge for open-ended text. Retrieval metrics for RAG pipelines.

3. A runner. Code that sends each input through your system, collects the output, scores it, and reports a pass rate or quality score.

When you run this against a change (a new prompt, a new model, a new retrieval strategy), you learn whether quality went up, down, or stayed the same.
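The three parts fit together in very little code. The sketch below uses a hypothetical deterministic stub in place of a real LLM call so it is runnable as-is; the exact-match scorer is appropriate for structured outputs, and you would swap in an LLM-as-judge or similarity scorer for open-ended text.

```python
# 1 + 2 + 3: test set, scoring method, runner.

def run_system(prompt: str) -> str:
    # Stand-in for your actual LLM pipeline. In a real eval this
    # would call an LLM API.
    canned = {
        "capital of France?": "Paris",
        "2 + 2?": "4",
        "largest ocean?": "Atlantic",  # deliberately wrong, to show a failure
    }
    return canned.get(prompt, "I don't know")

# 1. The test set: representative inputs paired with expected outcomes.
test_set = [
    {"input": "capital of France?", "expected": "Paris"},
    {"input": "2 + 2?", "expected": "4"},
    {"input": "largest ocean?", "expected": "Pacific"},
]

# 2. The scoring method: exact match here, because these outputs
#    are short and structured.
def score(output: str, expected: str) -> bool:
    return output.strip().lower() == expected.strip().lower()

# 3. The runner: send every input through the system, score each
#    output, and report a pass rate.
def run_eval(cases):
    results = [score(run_system(c["input"]), c["expected"]) for c in cases]
    return sum(results) / len(results), results

pass_rate, results = run_eval(test_set)
print(f"pass rate: {pass_rate:.0%}")  # pass rate: 67%
```

Run this before and after a change and the pass-rate delta tells you whether the change helped.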


Eval vs Testing vs Monitoring

These three things are related but different:

Eval: Offline measurement against a fixed test set before deploying a change. "Does this new prompt perform better than the old one?"

Testing: Automated checks in CI that catch regressions. "Did this code change break the known good cases?"

Monitoring: Online measurement of real user traffic in production. "Is quality degrading over time for real users?"

This course focuses on eval and testing. Monitoring is a production engineering concern and is covered lightly in the final module.
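As a sketch of the testing layer, here is what a CI regression check might look like with pytest. Everything here is illustrative: `classify_ticket` is a hypothetical stand-in for your pipeline, and the known-good cases are made up. Note that the assertion targets the label, not the exact wording of any free-text output.

```python
# Regression tests for known-good cases, run in CI on every change.
import pytest

def classify_ticket(text: str) -> str:
    # Stub: a real implementation would call your LLM pipeline
    # (or a cached version of it, to keep CI runs cheap).
    return "billing" if "invoice" in text.lower() else "other"

KNOWN_GOOD = [
    ("I was charged twice on my invoice", "billing"),
    ("Where is my invoice for March?", "billing"),
    ("The app crashes on startup", "other"),
]

@pytest.mark.parametrize("text,expected", KNOWN_GOOD)
def test_known_good_cases(text, expected):
    # Assert on the structured label, not on exact output text.
    assert classify_ticket(text) == expected
```

The distinction from eval: these cases are already known to work, and the test exists only to catch regressions, not to measure overall quality.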


The Cost of Skipping Evaluation

Without evaluation, you do not know:

  • Whether your prompts are actually good or just seem good in demos
  • Whether a model update broke something in your pipeline
  • Which edge cases your system handles poorly
  • Whether a refactoring improved or degraded quality

Teams that skip evaluation ship brittle AI features. They discover regressions when users complain, not before deployment.


Tools in This Space

Promptfoo is an open-source LLM testing framework. Run evals from a YAML config file. Compare models, prompts, and providers side by side.
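To give a feel for the config-driven style, a minimal Promptfoo setup looks roughly like this. This is a sketch, not a verified config: the prompt, provider name, and assertion values are made up, and you should check the Promptfoo documentation for the current schema.

```yaml
# promptfooconfig.yaml (illustrative sketch)
prompts:
  - "Summarize this support ticket in one sentence: {{ticket}}"

providers:
  - openai:gpt-4o-mini

tests:
  - vars:
      ticket: "My invoice shows a duplicate charge for March."
    assert:
      - type: contains
        value: "duplicate charge"
```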

DeepEval is a Python-based eval framework with built-in metrics for hallucination, answer relevancy, faithfulness, and more.

Ragas specializes in RAG pipeline evaluation. Measures context precision, context recall, faithfulness, and answer relevancy.

LangSmith provides tracing, eval, and dataset management for LangChain-based applications and any LLM app via SDK.

Phoenix by Arize is an open-source observability and eval tool. Strong for tracing and visualizing LLM traces.

TruLens is an open-source eval framework with feedback functions and a dashboard for tracking quality over runs.


Common Mistakes to Avoid

  • Evaluating only on the inputs you thought of. The test set must reflect the real diversity of what users actually send.
  • Confusing a high pass rate with high quality. A test set that is too easy gives you false confidence.
  • Running evals once and never again. Eval is only valuable as a recurring process, not a one-time check.

Next Step

In the next tutorial, you will learn where evaluation fits in the development lifecycle and how to build the eval habit into your workflow.
