How to Actually Evaluate LLM Outputs

At some point, everyone who builds with LLMs hits the same wall: you make a change, and you have no idea if things got better or worse. Your prompts have evolved, your model has changed, your use case has expanded, and you're still evaluating everything by feel.

That's fine early on. It stops being fine the moment you care about quality.

Why Vibes-Based Eval Fails

The problem with "does this look right?" evaluation:

It doesn't catch regressions. You improve output A while quietly breaking output B.
It's anchored to your current expectations, not your users' needs.
It's not reproducible. Different reviewers, different standards. Same reviewer on different days, different standards.
It doesn't scale. You can read 20 outputs. You can't read 2,000.

You need a system that measures what you actually care about, consistently, at scale.

The Four Axes of LLM Quality

For most applications, output quality lives on four axes:

1. Faithfulness — Does the output only use information it was given? This is the hallucination axis. An answer can be fluent and wrong.

2. Relevance — Does the output actually address the input? A perfectly accurate answer about Python won't help someone debugging Rust.

3. Completeness — Does it cover everything it should? A partial answer to a multi-part question is a failure even if the partial answer is correct.

4. Format — Is it in the right structure? Length, tone, schema—depending on your use case, format failures are as bad as content failures.

Not every axis matters equally for every use case. A customer support bot cares a lot about tone and format. A research summarizer cares about faithfulness and completeness. Know your priority order before you build your eval.
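One way to make a priority order explicit is to turn it into weights. The sketch below assumes each axis has already been scored in the range 0 to 1. The weight values and the names `SUPPORT_BOT_WEIGHTS`, `SUMMARIZER_WEIGHTS`, and `weighted_quality` are illustrative assumptions, not recommendations.

```python
# Hypothetical per-axis weights. The numbers are made up to illustrate
# two different priority orders, not tuned values.
SUPPORT_BOT_WEIGHTS = {
    "faithfulness": 0.2, "relevance": 0.2, "completeness": 0.1, "format": 0.5,
}
SUMMARIZER_WEIGHTS = {
    "faithfulness": 0.4, "relevance": 0.1, "completeness": 0.4, "format": 0.1,
}


def weighted_quality(axis_scores, weights):
    """Combine per-axis scores (each in [0, 1]) into one weighted score."""
    missing = set(weights) - set(axis_scores)
    if missing:
        raise ValueError(f"missing axis scores: {sorted(missing)}")
    total_weight = sum(weights.values())
    weighted_sum = sum(axis_scores[axis] * w for axis, w in weights.items())
    return weighted_sum / total_weight


# The same output scores differently depending on what the use case values.
scores = {"faithfulness": 0.9, "relevance": 0.8, "completeness": 0.5, "format": 1.0}
support_score = weighted_quality(scores, SUPPORT_BOT_WEIGHTS)
summarizer_score = weighted_quality(scores, SUMMARIZER_WEIGHTS)
```

A single number hides which axis moved, so it is probably worth keeping the per-axis scores next to the aggregate.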

Building Your Eval Suite

Golden datasets are non-negotiable. Collect 100-500 real inputs that cover the distribution of what your system will see. Include edge cases. Include failures from your current system. Annotate expected outputs or at least expected properties.
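As a rough sketch, a golden example can be a small record stored one per line in a JSONL file. The field names below (`example_id`, `must_include`, `tags`, and so on) are assumptions chosen for illustration. Adapt them to whatever properties you actually annotate.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class GoldenExample:
    """One annotated input. All field names here are hypothetical."""
    example_id: str
    input_text: str
    expected_output: Optional[str] = None      # full reference answer, if you have one
    must_include: list[str] = field(default_factory=list)      # expected properties
    must_not_include: list[str] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)              # e.g. "edge_case", "past_failure"


def dump_jsonl(examples):
    """Serialize examples as one JSON object per line."""
    return "\n".join(json.dumps(asdict(ex)) for ex in examples)


def load_jsonl(text):
    """Parse JSONL text back into GoldenExample records, skipping blank lines."""
    return [GoldenExample(**json.loads(line)) for line in text.splitlines() if line.strip()]


dataset = [
    GoldenExample("ex-001", "Summarize the attached release notes.", tags=["typical"]),
    GoldenExample("ex-002", "", must_not_include=["Traceback"], tags=["edge_case"]),
]
round_tripped = load_jsonl(dump_jsonl(dataset))
```

Tagging edge cases and past failures makes it possible to report scores per slice later instead of only one overall number.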

Human annotation sets the baseline. Before you automate anything, have humans rate a sample of outputs on your target axes. This gives you ground truth and lets you validate your automated metrics.
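Validating an automated metric against human ratings can start with simple agreement numbers. The sketch below assumes both sets of ratings are integers on the same scale and listed in the same example order. The function name `agreement_report` and the sample ratings are made up for illustration.

```python
def agreement_report(human, automated):
    """Compare two equal-length lists of integer ratings for the same outputs."""
    if not human or len(human) != len(automated):
        raise ValueError("need two non-empty rating lists of equal length")
    diffs = [abs(h - a) for h, a in zip(human, automated)]
    n = len(diffs)
    return {
        "exact": sum(d == 0 for d in diffs) / n,        # identical scores
        "within_one": sum(d <= 1 for d in diffs) / n,   # off by at most one point
        "mean_abs_error": sum(diffs) / n,
    }


# Illustrative ratings on a 1-5 scale (not real data).
human_ratings = [5, 4, 2, 1, 3, 4]
metric_ratings = [5, 3, 2, 3, 3, 4]
report = agreement_report(human_ratings, metric_ratings)
```

Chance-corrected statistics such as Cohen's kappa are a common next step once the sample is large enough for them to be meaningful.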

LLM-as-judge works surprisingly well. Use a strong model (GPT-4, Claude Opus) as an evaluator. Give it a rubric. Have it score outputs on specific dimensions, not overall quality. Ask for reasoning before the score so the score follows from a stated rationale rather than a snap judgment. If the judge compares two outputs side by side, swap their order across runs to reduce position bias. Cross-validate with human ratings to make sure your judge is calibrated.
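A minimal sketch of a single-dimension judge might look like the following. The prompt wording, the `SCORE:` output convention, and the `call_model` parameter are all assumptions: `call_model` stands in for whatever function sends a prompt to your evaluator model and returns its text reply.

```python
import re

# Hypothetical prompt template. It asks for reasoning first, then a score line.
JUDGE_TEMPLATE = """You are grading a model output on one dimension: {dimension}.

Rubric:
{rubric}

Input:
{input_text}

Output to grade:
{output_text}

First explain your reasoning in a few sentences.
Then finish with a final line of the form "SCORE: <integer from 1 to 5>"."""

# Matches a final "SCORE: n" line where n is a single digit from 1 to 5.
SCORE_PATTERN = re.compile(r"SCORE:\s*([1-5])\s*$")


def build_judge_prompt(dimension, rubric, input_text, output_text):
    return JUDGE_TEMPLATE.format(
        dimension=dimension, rubric=rubric,
        input_text=input_text, output_text=output_text,
    )


def parse_judgment(reply):
    """Split a judge reply into (reasoning, score); raise if no score line is found."""
    text = reply.strip()
    match = SCORE_PATTERN.search(text)
    if match is None:
        raise ValueError("judge reply did not end with a valid SCORE line")
    return text[:match.start()].strip(), int(match.group(1))


def judge(call_model, dimension, rubric, input_text, output_text):
    """call_model is an assumed callable: prompt string in, reply string out."""
    prompt = build_judge_prompt(dimension, rubric, input_text, output_text)
    return parse_judgment(call_model(prompt))


# Usage with a stand-in model, so the sketch runs without any API.
def fake_model(prompt):
    return "The output cites only the provided context.\nSCORE: 4"

reasoning, score = judge(
    fake_model, "faithfulness",
    "5 = every claim is supported by the input; 1 = mostly unsupported.",
    "Context: the release ships Tuesday.", "The release ships Tuesday.",
)
```

Raising on an unparseable reply, rather than defaulting to a score, keeps malformed judge output from silently skewing the averages.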

Deterministic checks are underused. Before reaching for LLM evaluation, check: Does the output contain required keywords? Is it valid JSON? Is it the right length? These are fast, cheap, and reliable.
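The three checks named above fit in a few lines of standard-library Python. This is a sketch under simple assumptions: keywords are matched case-insensitively as substrings, and length is measured in characters. The helper names are hypothetical.

```python
import json


def is_valid_json(text):
    """True if the text parses as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def missing_keywords(text, required):
    """Return the required keywords that do not appear (case-insensitive substring match)."""
    lowered = text.lower()
    return [kw for kw in required if kw.lower() not in lowered]


def run_checks(output, required_keywords, min_chars, max_chars, expect_json=False):
    """Return a list of human-readable failures; an empty list means all checks passed."""
    failures = []
    missing = missing_keywords(output, required_keywords)
    if missing:
        failures.append(f"missing keywords: {missing}")
    if not min_chars <= len(output) <= max_chars:
        failures.append(f"length {len(output)} outside [{min_chars}, {max_chars}]")
    if expect_json and not is_valid_json(output):
        failures.append("not valid JSON")
    return failures


good = run_checks('{"status": "refund issued"}', ["refund"], 1, 200, expect_json=True)
bad = run_checks("Sorry, I can't help.", ["refund"], 1, 10, expect_json=True)
```

Returning a list of failures instead of a single boolean makes it easier to see which failure mode is most common across a run.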

The Regression Trap

The hardest part of eval isn't building it. It's continuing to run it when you're tempted to skip it because you're "pretty sure" a change is fine.

It's almost never fine.

Run your eval suite on every significant change: prompt edits, model upgrades, retrieval changes, any change that touches the output path. Track scores over time. Set a threshold below which you won't ship.
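A shipping threshold can be enforced with a small gate that compares a candidate run against a stored baseline. The sketch below assumes each run is summarized as a dictionary of mean scores per metric. The function name `regression_gate`, the parameters `min_score` and `max_drop`, and the numbers are illustrative assumptions.

```python
def regression_gate(baseline, candidate, min_score, max_drop):
    """Return reasons to block shipping; an empty list means the candidate passes.

    baseline, candidate: dicts mapping metric name to mean score.
    min_score: absolute floor every metric must meet.
    max_drop: largest allowed decrease relative to the baseline.
    """
    problems = []
    for metric, base in baseline.items():
        if metric not in candidate:
            problems.append(f"{metric}: missing from candidate run")
            continue
        new = candidate[metric]
        if new < min_score:
            problems.append(f"{metric}: {new:.2f} is below the floor of {min_score:.2f}")
        if base - new > max_drop:
            problems.append(f"{metric}: dropped from {base:.2f} to {new:.2f}")
    return problems


# Made-up scores: faithfulness improved slightly while relevance regressed.
baseline_scores = {"faithfulness": 4.4, "relevance": 4.1}
candidate_scores = {"faithfulness": 4.5, "relevance": 3.6}
blockers = regression_gate(baseline_scores, candidate_scores, min_score=3.5, max_drop=0.3)
```

Checking each metric separately is what catches the case where one output type improves while another quietly breaks. An averaged score could hide it.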

This is boring. It's also the only way to know whether your system is getting better.

A Practical Starting Point

If you're starting from scratch:

1. Collect 50 representative inputs from your use case
2. Manually write expected outputs (or at minimum, "must include" and "must not include" criteria)
3. Build an LLM judge that scores faithfulness and relevance on a 1-5 scale with reasoning
4. Add deterministic checks for your most common failure modes
5. Run it before and after every change
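The "must include" and "must not include" criteria, together with the before-and-after run, can be sketched as a pass-rate comparison. Everything here is illustrative: the example inputs and phrases are made up, and `old_system` and `new_system` are stand-ins for two versions of whatever function produces your outputs.

```python
def meets_criteria(output, must_include, must_not_include):
    """True if every required phrase is present and no forbidden phrase is (case-insensitive)."""
    lowered = output.lower()
    has_all = all(phrase.lower() in lowered for phrase in must_include)
    has_none = not any(phrase.lower() in lowered for phrase in must_not_include)
    return has_all and has_none


def pass_rate(examples, generate):
    """Fraction of examples whose generated output meets its criteria."""
    passed = sum(
        meets_criteria(generate(ex["input"]), ex["must_include"], ex["must_not_include"])
        for ex in examples
    )
    return passed / len(examples)


# Made-up examples; a real set would hold around 50 of these.
examples = [
    {"input": "How do I reset my password?",
     "must_include": ["reset link"], "must_not_include": ["contact sales"]},
    {"input": "What is your refund window?",
     "must_include": ["30 days"], "must_not_include": []},
]


# Stand-ins for the system before and after a change.
def old_system(question):
    return "We will email you a reset link." if "password" in question else "Please contact support."


def new_system(question):
    return "We will email you a reset link." if "password" in question else "Refunds are accepted within 30 days."


before = pass_rate(examples, old_system)
after = pass_rate(examples, new_system)
```

Running the same examples through both versions gives a direct before-and-after comparison instead of an impression.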

Start simple. A 50-example eval you actually run is worth infinitely more than a 5,000-example eval you built and never maintained.