Evaluating LLM Output in Production: A Practical Framework
Why Evaluation Is the Hard Part
Building a demo that works 90% of the time is easy. Building a production system that works reliably, catches regressions, and improves over time is hard. The gap is evaluation: a systematic way to measure whether your AI system is doing what you intend.
Without evaluation, you're flying blind. You make prompt changes and don't know if they helped or hurt. You switch models and can't tell if quality improved. A user complains about a bad answer and you have no way to know how common the problem is.
Evaluation is not glamorous. It's less exciting than prompt engineering or model selection. But it's the infrastructure that makes everything else sustainable.
The Three Layers of Evaluation
Layer 1: Reference-based evaluation. You have a dataset of (input, expected output) pairs. You run your system on each input and measure how well the output matches the expected output. This is the most rigorous form of evaluation, but it requires a labeled dataset, which is expensive to create and slow to update.
Best for: Well-defined tasks with clear correct answers. Classification, extraction, factual Q&A over a known knowledge base.
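The loop itself is simple. Here is a minimal sketch of reference-based evaluation, where `run_system` is a placeholder standing in for your actual LLM pipeline and the exact-match check would be swapped for whatever comparison suits your task:

```python
def run_system(user_input: str) -> str:
    # Stand-in for your real pipeline: normalize the input as a fake "answer".
    return user_input.strip().lower()

def exact_match_eval(dataset: list[tuple[str, str]]) -> float:
    """Fraction of inputs whose output exactly matches the reference."""
    hits = 0
    for user_input, expected in dataset:
        if run_system(user_input) == expected:
            hits += 1
    return hits / len(dataset)

dataset = [
    ("What is 2+2?  ", "what is 2+2?"),
    ("REFUND POLICY", "refund policy"),
]
print(exact_match_eval(dataset))  # → 1.0
```

For extraction or classification you might compare structured fields instead of strings; the shape of the loop stays the same.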
Layer 2: LLM-as-judge. You use a capable LLM (GPT-4, Claude) to evaluate the output of another LLM (possibly the same one). The judge evaluates on dimensions you specify: accuracy, helpfulness, conciseness, tone. This is cheaper than human evaluation and can run automatically.
Best for: Open-ended generation tasks where "correct" is multidimensional and human-defined. Summarization quality, writing style, response appropriateness.
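A judge is easiest to work with when it returns structured scores. The sketch below shows one way to frame the judge prompt and parse its response; `call_judge` is a placeholder you would replace with a real API call to your judge model, and the dimension names are illustrative:

```python
import json

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Rate each dimension from 1 (poor) to 5 (excellent).
Respond with JSON only: {{"accuracy": n, "helpfulness": n, "conciseness": n}}

Question: {question}
Answer: {answer}"""

def call_judge(prompt: str) -> str:
    # Placeholder: replace with a real call to your judge model's API.
    return '{"accuracy": 4, "helpfulness": 5, "conciseness": 3}'

def judge(question: str, answer: str) -> dict:
    raw = call_judge(JUDGE_PROMPT.format(question=question, answer=answer))
    return json.loads(raw)

scores = judge("What is our refund window?", "30 days from delivery.")
print(scores["helpfulness"])  # → 5
```

Asking for JSON makes scores easy to aggregate and track over time, though in practice you need a fallback for the occasional malformed judge response.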
Layer 3: User feedback. Thumbs up/down, explicit ratings, or behavioral signals (did the user accept the AI suggestion or rewrite it?). This is the ground truth of whether your system is actually useful—but it's noisy, slow, and hard to act on without aggregation.
Best for: Understanding overall system utility. Use it alongside layers 1 and 2, not instead of them.
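Because raw feedback is noisy, it becomes actionable only once aggregated into a trend. A minimal sketch of rolling up thumbs signals by week (the week labels and signals here are made up):

```python
from collections import Counter

# Raw (period, signal) events, e.g. from a feedback table.
feedback = [("wk1", "up"), ("wk1", "up"), ("wk1", "down"), ("wk2", "up")]

by_week: dict[str, Counter] = {}
for week, signal in feedback:
    by_week.setdefault(week, Counter())[signal] += 1

rates = {w: c["up"] / (c["up"] + c["down"]) for w, c in by_week.items()}
print(round(rates["wk1"], 2))  # → 0.67
```

Watching this rate over time tells you whether a change helped users, even when individual ratings disagree.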
Building Your Evaluation Dataset
The most important thing you can do for your AI system is build and maintain an evaluation dataset. Start with:
- 50-100 real user inputs from your production logs (or simulated ones if you don't have logs yet)
- Expected outputs written by your team or domain experts
- Edge cases: inputs that have caused problems, ambiguous inputs, adversarial inputs
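A simple JSONL file is often enough to hold such a dataset. The schema below is hypothetical; the useful properties are an `id` for tracking cases across runs and `tags` for slicing results by category:

```python
import json

cases = [
    {"id": "q-001", "input": "What is your refund window?",
     "expected": "30 days from delivery.", "tags": ["faq"]},
    {"id": "edge-014", "input": "refund??? NOW!!!",
     "expected": "30 days from delivery.", "tags": ["edge", "tone"]},
]

# One JSON object per line makes the file easy to diff and append to.
with open("evalset.jsonl", "w") as f:
    for case in cases:
        f.write(json.dumps(case) + "\n")

with open("evalset.jsonl") as f:
    loaded = [json.loads(line) for line in f]
print(len(loaded))  # → 2
```

Keeping the eval set in version control alongside your prompts makes its history reviewable like any other code change.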
Review and update quarterly. As your system evolves, new edge cases will emerge. Your eval set should reflect current failure modes, not historical ones.
Metrics That Matter
For classification tasks: precision, recall, F1, accuracy. These are well-understood and appropriate.
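These metrics are simple enough to compute by hand, which helps when debugging a mismatch with a library's numbers. A sketch for a single positive class (the `spam`/`ham` labels are illustrative):

```python
def precision_recall_f1(preds: list[str], labels: list[str],
                        positive: str) -> tuple[float, float, float]:
    tp = sum(1 for p, l in zip(preds, labels) if p == positive and l == positive)
    fp = sum(1 for p, l in zip(preds, labels) if p == positive and l != positive)
    fn = sum(1 for p, l in zip(preds, labels) if p != positive and l == positive)
    precision = tp / (tp + fp)          # of predicted positives, how many were right
    recall = tp / (tp + fn)             # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

preds = ["spam", "spam", "ham", "ham"]
labels = ["spam", "ham", "ham", "spam"]
p, r, f = precision_recall_f1(preds, labels, "spam")
print(p, r)  # → 0.5 0.5
```

In production code a library such as scikit-learn handles the edge cases (zero denominators, multi-class averaging) for you.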
For generation tasks, metrics are harder. Common ones:
- BLEU/ROUGE: Measure n-gram overlap with reference. Decent for summarization; poor for open-ended generation.
- Semantic similarity: Embed both output and reference, measure cosine similarity. Better than n-gram metrics for capturing meaning.
- LLM judge scores: Ask GPT-4 to rate on a 1-5 scale across dimensions you care about. Calibrate the judge by comparing its scores against scores from human evaluators on the same outputs.
- Task-specific metrics: For RAG, citation accuracy. For code generation, tests passing. For email drafting, human edit distance (how much did humans change the AI draft?).
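Of the metrics above, semantic similarity is the easiest to sketch end to end. Cosine similarity over embedding vectors is just a dot product over norms; the 3-dimensional vectors below are toy stand-ins for real embeddings, which would come from whatever embedding model you use:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy vectors standing in for real embedding outputs.
output_emb = [0.1, 0.9, 0.2]
reference_emb = [0.1, 0.8, 0.3]
print(round(cosine_similarity(output_emb, reference_emb), 2))  # → 0.99
```

Note that a high similarity score can still hide a factual error in a single changed word, which is one reason no single metric is sufficient.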
No single metric is sufficient. Use a combination that reflects your actual requirements.
The Regression Test Pattern
Every time you make a change to your system—prompt update, model version change, retrieval tuning—run your eval set before and after. Track:
- Did accuracy go up or down?
- Did the distribution of failure modes change?
- Did specific edge cases regress?
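The per-case diff matters as much as the headline number: accuracy can stay flat while individual cases regress. A minimal sketch of comparing two runs keyed by case id (the ids and pass/fail values are illustrative):

```python
def diff_runs(before: dict[str, bool], after: dict[str, bool]) -> dict:
    """Compare per-case pass/fail between two eval runs."""
    regressed = [k for k in before if before[k] and not after[k]]
    fixed = [k for k in before if not before[k] and after[k]]
    return {
        "regressed": regressed,
        "fixed": fixed,
        "before_acc": sum(before.values()) / len(before),
        "after_acc": sum(after.values()) / len(after),
    }

before = {"q-001": True, "q-002": True, "edge-014": False}
after = {"q-001": True, "q-002": False, "edge-014": True}
report = diff_runs(before, after)
print(report["regressed"])  # → ['q-002']
```

Here overall accuracy is unchanged (2/3 before and after), yet one case broke while another was fixed — exactly the failure-mode shift an aggregate score would hide.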
This is the only way to make changes confidently. Without a regression test, you're guessing. With it, you have data.
Practical Tooling
Promptfoo: Open source LLM testing framework. Define test cases in YAML, run evaluations, compare prompt variants. Good for regression testing and prompt development.
LangSmith: LangChain's evaluation and observability platform. Tracing, evaluation datasets, human feedback collection. Tight integration with LangChain applications.
Braintrust: Evaluation platform with a clean UI, good human review workflow, and LLM-as-judge support. Paid but polished.
Custom eval scripts: For simple use cases, a Python script that runs your test cases, calls your system, and logs results is entirely adequate. Don't over-engineer the tooling when a script does the job.
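Such a script can be very short. A sketch of the whole loop — load cases, run the system, log timestamped results — where `run_system` is again a placeholder for your real pipeline and the result schema is an assumption:

```python
import json
import time

def run_system(user_input: str) -> str:
    # Stand-in for your actual pipeline.
    return "30 days from delivery."

def run_eval(cases: list[dict]) -> list[dict]:
    results = []
    for case in cases:
        output = run_system(case["input"])
        results.append({"id": case["id"],
                        "passed": output == case["expected"],
                        "output": output})
    return results

cases = [{"id": "q-001", "input": "Refund window?",
          "expected": "30 days from delivery."}]
results = run_eval(cases)

# Log each run to its own file so runs can be diffed later.
with open(f"results-{int(time.time())}.jsonl", "w") as f:
    for r in results:
        f.write(json.dumps(r) + "\n")
print(sum(r["passed"] for r in results) / len(results))  # → 1.0
```

When this script outgrows itself — you need parallel runs, judge models, or a review UI — that is the point to reach for one of the tools above.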
The Evaluation Culture Problem
The biggest obstacle to good LLM evaluation isn't tools or techniques—it's culture. Teams ship prompts the same way they used to ship features without tests: quickly, based on intuition, with minimal review.
Treating prompt changes like code changes—with review, testing against an eval set, and measurement—is a cultural shift. The teams that build reliable AI systems are the ones that make this shift. The teams that don't will keep struggling with unexplained regressions and inconsistent quality.