The dirty secret of most AI product teams in 2026: when someone asks “how do you know the new prompt is better?” the honest answer is “we ran it on a few examples and it felt better.” That is vibes-based QA. It works for a demo and collapses in production.
LLM features break in ways that unit tests do not catch. You change the prompt to fix one user’s complaint, and it silently regresses the output for 30 other inputs you never checked. You upgrade the model version and the tone shifts. You add a retrieved document to the context and the model stops following the output schema. None of these have stack traces. The failure arrives as a support ticket three days later.
Eval-driven development treats LLM behavior as something you can measure and gate on, not just observe and shrug at. This post starts from the naive approach every team starts with, shows exactly where it breaks, and builds up to a full eval pipeline you can wire into CI.
Why Vibes-Based QA Breaks
The typical lifecycle of an LLM feature:
- Engineer writes the prompt.
- Engineer tries 5-10 inputs manually. It looks good.
- Feature ships.
- Three weeks later: a regression nobody noticed, a user complaint about wrong answers, a cost spike from the model generating 3x the expected output.
Manual spot-checking fails for four reasons. It does not scale - you can check 10 inputs, not the 500 that represent your real distribution. It is not reproducible - two engineers running the same prompt manually will focus on different examples. It has no memory - you cannot tell whether today’s prompt is better or worse than last week’s without a fixed reference set. And it does not run automatically, so any deploy after the initial one can silently regress.
Skipping evals is not free. You pay for it later in slow debugging, surprise regressions, and feature rollbacks. Most teams discover this the hard way after their first serious incident.
Level 0 - Manual Spot-Checking (Where Everyone Starts)
# What most teams actually do
def test_my_feature():
    result = llm.generate(prompt="Summarize this document", text=SOME_DOC)
    print(result)  # engineer reads it and nods
This catches obvious catastrophic failures: the model returning nothing, the JSON being malformed, the output being in the wrong language. Everything subtler slips through. The engineer has read this document before and subconsciously knows what a good summary looks like. That context does not transfer to the 10,000 documents users will submit.
The fatal flaw is that spot-checking is not a regression gate. You can spot-check before a deploy, but if you have no baseline to compare against, you cannot tell whether things got better or worse - only whether the latest output seems fine right now.
Level 1 - Golden Datasets
A golden dataset is a fixed set of inputs with labeled expected outputs (or quality criteria). You run your LLM pipeline against it and score each output. The score becomes a number you can track over time and gate deploys on.
Building one is not glamorous but it is the foundation everything else builds on.
# golden_dataset.jsonl - each line is one example (pretty-printed here for readability)
{
  "id": "doc-summary-001",
  "input": {"text": "Apple reported Q4 revenue of $94.9B, up 6% year-over-year..."},
  "expected": {
    "must_contain": ["Q4", "$94.9B", "6%"],
    "must_not_contain": ["Q3", "loss"],
    "max_length": 200,
    "output_format": "bullet_list"
  }
}
For a dataset to be useful, it needs to cover your real input distribution - not just easy cases, but the long tail: short documents and very long ones, clear inputs and ambiguous ones, edge cases that broke you before. A dataset of only easy examples gives you a false sense of security.
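One rough way to see whether a golden set covers more than the easy middle is to bucket examples by input length and look for empty buckets. The sketch below assumes the `input.text` layout from the example above; the bucket boundaries and the `coverage_report` helper are hypothetical and should be tuned to your own traffic.

```python
from collections import Counter

# Hypothetical length buckets in characters; tune these to your own traffic.
BUCKETS = [(0, 500, "short"), (500, 5_000, "medium"), (5_000, float("inf"), "long")]


def length_bucket(text: str) -> str:
    """Return the name of the length bucket a document falls into."""
    n = len(text)
    for low, high, name in BUCKETS:
        if low <= n < high:
            return name
    return "long"


def coverage_report(examples: list[dict]) -> dict[str, int]:
    """Count golden examples per bucket, including buckets with zero examples."""
    counts = Counter(length_bucket(ex["input"]["text"]) for ex in examples)
    return {name: counts.get(name, 0) for _, _, name in BUCKETS}


examples = [
    {"id": "a", "input": {"text": "x" * 100}},
    {"id": "b", "input": {"text": "x" * 1_000}},
    {"id": "c", "input": {"text": "x" * 200}},
]
report = coverage_report(examples)
print(report)  # a bucket with a count of 0 is a gap in the golden set
```

Length is only one axis. The same counting approach works for any tag you attach to examples, such as "ambiguous" or "broke us before".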
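import json


def load_jsonl(path: str) -> list[dict]:
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


class GoldenEval:
    def __init__(self, dataset_path: str):
        self.examples = load_jsonl(dataset_path)

    def run(self, pipeline) -> dict:
        results = []
        for ex in self.examples:
            output = pipeline(ex["input"])
            score = self.score(output, ex["expected"])
            results.append({"id": ex["id"], "score": score, "output": output})
        # An example passes when at least 80% of its checks pass.
        passed = sum(1 for r in results if r["score"] >= 0.8)
        return {
            # Guard against an empty dataset instead of dividing by zero.
            "pass_rate": passed / len(results) if results else 0.0,
            "results": results,
        }

    def score(self, output: str, expected: dict) -> float:
        """Fraction of checks that pass; output_format is not checked here."""
        checks = []
        for term in expected.get("must_contain", []):
            checks.append(term.lower() in output.lower())
        for term in expected.get("must_not_contain", []):
            checks.append(term.lower() not in output.lower())
        if "max_length" in expected:
            # max_length is measured in characters, not tokens.
            checks.append(len(output) <= expected["max_length"])
        return sum(checks) / len(checks) if checks else 1.0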
Deterministic checks - format validation, required field presence, length bounds, regex matches - are cheap and fast. Run them first. They catch a wide class of regressions before you spend a dollar on LLM-based scoring.
| Check type | What it catches | Cost |
|---|---|---|
| Format (JSON, Markdown) | Schema violations, malformed output | Near zero |
| Must-contain / must-not | Hallucinated facts, missing required info | Near zero |
| Length bounds | Verbose drift, truncation | Near zero |
| Embedding similarity | Semantic drift from expected answer | Low |
| LLM-as-judge | Nuanced quality, tone, reasoning | Medium |
Start at the top of that table. Most regressions are caught by the cheap checks before you need the expensive ones.
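As a sketch of the cheapest rows in that table, the format checks can be a few standard-library functions. The function names here are hypothetical, and the bullet-list rule (every non-empty line starts with `-`, `*`, or `•`) is an assumption about what "bullet_list" means in the golden example above.

```python
import json
import re

# Assumed definition of a bullet line: optional indent, a marker, then text.
BULLET_RE = re.compile(r"^\s*[-*•]\s+\S")


def is_valid_json(output: str) -> bool:
    """True if the output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def has_required_keys(output: str, required: set[str]) -> bool:
    """True if the output is a JSON object containing every required key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required.issubset(data.keys())


def is_bullet_list(output: str) -> bool:
    """True if there is at least one line and every non-empty line is a bullet."""
    lines = [ln for ln in output.splitlines() if ln.strip()]
    return bool(lines) and all(BULLET_RE.match(ln) for ln in lines)


print(is_bullet_list("- Q4 revenue was $94.9B\n- Up 6% year-over-year"))
```

Checks like these run in microseconds, so they can run on every example before any judge call is made.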
Level 2 - LLM-as-Judge
Deterministic checks cannot tell you whether the answer is actually good - just whether it has the right shape. For quality judgments, you use another LLM call to score the output. This is LLM-as-judge.
The key insight is that judging is usually easier than generating. A weaker model can often tell you whether a summary is accurate even if it could not write that summary itself. This means you can usually use a cheaper, faster model as your judge without giving up much accuracy, provided you validate it against human labels first.
JUDGE_PROMPT = """You are evaluating the quality of an AI-generated summary.
Original document:
{document}
Summary to evaluate:
{summary}
Score the summary on these dimensions. Return JSON only.
{{
"factual_accuracy": <0.0-1.0, does it match the document facts?>,
"completeness": <0.0-1.0, does it cover the key points?>,
"conciseness": <0.0-1.0, is it appropriately brief without losing content?>,
"format_correct": <true/false, does it use bullet points as required?>,
"reasoning": "<one sentence explaining the scores>"
}}"""
class LLMJudge:
    def __init__(self, judge_model: str = "claude-haiku-4-5-20251001"):
        self.model = judge_model

    def score(self, document: str, summary: str) -> dict:
        # llm and parse_json are placeholders for your own client wrapper
        # and JSON parser; they are not defined in this post.
        response = llm.generate(
            model=self.model,
            prompt=JUDGE_PROMPT.format(document=document, summary=summary),
        )
        return parse_json(response.text)

    def aggregate(self, scores: list[dict]) -> float:
        """Mean weighted score across examples.

        format_correct is a boolean and is not part of the weighted score.
        """
        if not scores:
            return 0.0
        weights = {"factual_accuracy": 0.5, "completeness": 0.3, "conciseness": 0.2}
        return sum(
            s.get(k, 0) * w
            for s in scores
            for k, w in weights.items()
        ) / len(scores)
Three rules for LLM-as-judge that actually matter in production:
Anchor the rubric. Vague instructions like “is this a good summary?” produce noisy, inconsistent scores. Define each dimension with explicit criteria and numeric bounds. The more concrete the rubric, the more reproducible the scores.
Use reference answers where you have them. Scoring output against a reference answer is more reliable than scoring in isolation. “Is this summary as accurate as the reference?” is easier for a judge to answer than “is this summary accurate?”
Validate the judge itself. Run the judge against a small set of human-labeled examples and measure agreement. If the judge disagrees with humans 40% of the time, its scores are mostly noise. A Haiku-class model tuned to a concrete rubric typically agrees with human raters 75-85% of the time on well-defined tasks - good enough for regression detection.
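Measuring judge agreement can be sketched with plain pass/fail labels. The helpers below are hypothetical; they compute raw agreement plus Cohen's kappa, which corrects for the agreement two raters would reach by chance. The labels are made-up illustration data, not real results.

```python
def agreement_rate(judge: list[bool], human: list[bool]) -> float:
    """Fraction of examples where the judge and the human label match."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list[bool], human: list[bool]) -> float:
    """Agreement corrected for chance, for binary pass/fail labels."""
    n = len(judge)
    observed = agreement_rate(judge, human)
    p_judge = sum(judge) / n  # share of examples the judge passed
    p_human = sum(human) / n  # share of examples the human passed
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        # Both raters gave one identical constant label; kappa is undefined,
        # so treat it as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up labels for ten examples (True = "summary is acceptable").
judge_labels = [True, True, True, False, False, True, False, True, True, False]
human_labels = [True, True, False, False, False, True, True, True, True, False]

print(agreement_rate(judge_labels, human_labels))
print(round(cohens_kappa(judge_labels, human_labels), 3))
```

Raw agreement can look high when almost every example passes, which is why kappa is worth reporting next to it.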
Level 3 - Regression Suites in CI
An eval that only runs manually is not an eval - it is a tool you occasionally use and forget. Regression suites belong in CI, blocking deploys the same way unit tests do.
# .github/workflows/llm-evals.yml
name: LLM Eval Gate
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'src/llm/**'
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run eval suite
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: python evals/run.py --dataset golden_v2.jsonl --threshold 0.85
      - name: Compare to baseline
        run: python evals/compare.py --baseline main --current HEAD --fail-on-regression 0.03
The threshold and regression tolerance are the two decisions that matter:
- Absolute threshold (e.g., 85% pass rate) blocks a PR if quality drops below a minimum. Set this at your current baseline minus a small buffer, not at 100%.
- Regression delta (e.g., -3%) blocks a PR if scores drop more than this from main, even if above the absolute threshold. This catches silent degradation on PRs that look fine in isolation.
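The two rules above combine into a small decision function. This is a minimal sketch with hypothetical names (`check_gate`, `GateResult`), not the contents of the scripts the workflow calls; the defaults mirror the 0.85 threshold and 0.03 tolerance used there.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    passed: bool
    reasons: list[str]


def check_gate(
    current: float,
    baseline: float,
    threshold: float = 0.85,
    max_regression: float = 0.03,
) -> GateResult:
    """Block if the pass rate is below the floor or dropped too far from baseline."""
    reasons = []
    if current < threshold:
        reasons.append(f"pass rate {current:.2f} is below threshold {threshold:.2f}")
    if baseline - current > max_regression:
        reasons.append(
            f"pass rate dropped {baseline - current:.2f} from baseline "
            f"(tolerance {max_regression:.2f})"
        )
    return GateResult(passed=not reasons, reasons=reasons)


# Above the absolute floor, but a 7-point drop from main still blocks the PR.
result = check_gate(current=0.88, baseline=0.95)
print(result.passed, result.reasons)
```

In a CI script, the process would exit non-zero when `passed` is false so the job fails and the PR is blocked.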
┌─────────────────────────────────────────┐
│               CI Pipeline               │
│                                         │
│  PR opened                              │
│      |                                  │
│  Deterministic checks (format, schema)  │
│      |                                  │
│  Golden dataset run (LLM calls, ~$0.50) │
│      |                                  │
│  LLM-as-judge scoring                   │
│      |                                  │
│  Compare to baseline on main            │
│      |                                  │
│  Pass / Block / Warn                    │
└─────────────────────────────────────────┘
One practical concern: LLM eval suites have non-trivial cost and latency. A suite of 200 examples with LLM-as-judge scoring can cost $0.50-2.00 and take 2-4 minutes per run. That is acceptable for a PR gate, but too slow and costly to run on every commit to a feature branch. Trigger it only when a change touches prompts or LLM logic, as the paths filter in the workflow above does.
What to Actually Measure
This is where most guides stop at “measure accuracy” and leave you guessing. Here are the metrics that actually matter, grouped by what they protect you from.
| Metric | What goes wrong without it | How to measure |
|---|---|---|
| Task success rate | Silent regressions across input types | Golden dataset pass rate |
| Hallucination rate | Model inventing facts | LLM-as-judge factual accuracy, reference answers |
| Format compliance | Downstream parsing breaks | Schema validation, regex |
| Output length | Cost spikes, UX degradation | Token count distribution |
| Latency (p50, p99) | User experience, timeout failures | Timing in harness |
| Cost per call | Budget overruns on model upgrades | Token counts x price |
| Edge case handling | Crashes on real-world inputs | Adversarial subset of golden set |
The most undertracked metric is output length. A prompt change that makes the model generate 2x the tokens doubles your output-token cost, adds latency, and often degrades quality: verbosity usually signals a worse answer, not a better one. Track the token count distribution, not just task quality.
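Tracking length drift can be sketched with the standard library. Two things below are assumptions: whitespace-separated words stand in for tokens (use your model's tokenizer for real counts), and the price constant is a made-up figure for illustration only.

```python
import statistics

# Assumed price for illustration only; substitute your provider's real rate.
PRICE_PER_1K_OUTPUT_TOKENS = 0.005  # USD (hypothetical)


def approx_tokens(text: str) -> int:
    """Crude token proxy: the number of whitespace-separated words."""
    return len(text.split())


def length_stats(outputs: list[str]) -> dict:
    """Median, 95th percentile, and max of approximate output length."""
    counts = sorted(approx_tokens(o) for o in outputs)
    if len(counts) < 2:
        raise ValueError("need at least two outputs to compute percentiles")
    percentiles = statistics.quantiles(counts, n=100, method="inclusive")
    return {
        "p50": statistics.median(counts),
        "p95": percentiles[94],
        "max": counts[-1],
    }


def length_drift(baseline: list[str], current: list[str]) -> float:
    """Ratio of current median length to baseline median length."""
    base = statistics.median(approx_tokens(o) for o in baseline)
    if base == 0:
        raise ValueError("baseline median length is zero")
    return statistics.median(approx_tokens(o) for o in current) / base


def cost_per_call(output_tokens: float) -> float:
    """Output-token cost of one call at the assumed price."""
    return output_tokens / 1000 * PRICE_PER_1K_OUTPUT_TOKENS


baseline_outputs = ["a b c", "a b c d", "a b"]
current_outputs = ["a b c d e f", "a b c d e f g h", "a b c d"]
print(length_drift(baseline_outputs, current_outputs))  # 2.0 means outputs doubled
```

A drift ratio well above 1.0 on the golden set is worth blocking on even when the quality scores hold steady.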
Building the Dataset Incrementally
You do not need 500 examples to start. You need enough to detect the regressions that actually happen to you.
import json


# Log production inputs and outputs - these become your next golden set
class EvalLogger:
    def __init__(self, log_path: str):
        self.log = open(log_path, "a", encoding="utf-8")

    def record(self, input_data: dict, output: str, metadata: dict):
        entry = {
            "timestamp": metadata["ts"],
            "input": input_data,
            "output": output,
            "user_id": metadata.get("user_id"),
            "flagged": metadata.get("flagged", False),
        }
        self.log.write(json.dumps(entry) + "\n")
        self.log.flush()  # do not lose entries if the process dies

    def close(self):
        self.log.close()
Then periodically pull from this log to build your golden set. Prioritize: inputs users flagged as wrong, edge cases that hit error handlers, and inputs from the long tail of your distribution. Human-label a subset of these with expected outputs or quality ratings, and add them to the dataset. The golden set should grow with every incident.
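Pulling candidates from that log might look like the sketch below. It assumes the entry layout written by EvalLogger above; `select_candidates` and `to_golden_stub` are hypothetical helpers, and the selection rule (flagged entries first, then a random sample of the rest) is one possible policy rather than a recommendation from any framework.

```python
import json
import random


def load_log(path: str) -> list[dict]:
    """Read the JSONL log: one entry per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def select_candidates(entries: list[dict], limit: int = 50, seed: int = 0) -> list[dict]:
    """Pick entries to label: flagged ones first, then a random sample of the rest."""
    # Stable sort puts flagged entries first, so they win when inputs repeat.
    ordered = sorted(entries, key=lambda e: not e.get("flagged", False))
    seen, unique = set(), []
    for entry in ordered:
        key = json.dumps(entry["input"], sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    flagged = [e for e in unique if e.get("flagged")]
    rest = [e for e in unique if not e.get("flagged")]
    random.Random(seed).shuffle(rest)  # seeded so the selection is reproducible
    return (flagged + rest)[:limit]


def to_golden_stub(entry: dict, index: int) -> dict:
    """Turn a log entry into a golden-set row that still needs a human label."""
    return {
        "id": f"candidate-{index:04d}",
        "input": entry["input"],
        "observed_output": entry["output"],
        "expected": None,  # filled in by a human reviewer
    }


entries = [
    {"input": {"text": "a"}, "output": "x", "flagged": False},
    {"input": {"text": "b"}, "output": "y", "flagged": True},
    {"input": {"text": "a"}, "output": "x2", "flagged": True},
    {"input": {"text": "c"}, "output": "z", "flagged": False},
]
picked = select_candidates(entries, limit=3)
print([e["input"]["text"] for e in picked])
```

The stubs deliberately leave `expected` empty: the observed production output is a starting point for the reviewer, not a label.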
A 50-example dataset covering your actual failure modes is more useful than 500 generic examples.
Failure Modes of Eval Systems
Overfitting to the golden set. If you tune your prompt to pass specific golden examples, you are memorizing the test. Add new examples regularly - especially ones the model currently fails on - to keep the set honest.
Judge bias. LLM judges have systematic biases: they prefer longer answers, they favor confident-sounding text, and they tend to score outputs similar to their own generation style higher. Use multiple judges on the same rubric and flag large disagreements.
Eval cost kills adoption. If running evals costs $50 per PR, people start skipping them. Keep the default eval suite cheap (deterministic checks + a small golden set) and reserve the expensive LLM judge for weekly baseline runs.
Stale baselines. If your baseline is from three months ago, it may already include known regressions you accepted. Snapshot baselines deliberately - “this is the baseline after the v2 prompt launch” - with a reason and a date, so comparisons are meaningful.
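For the judge-bias point above, flagging large disagreements between judges can be sketched as a spread check per example. The function name, the input layout (judge name mapped to per-example scores), and the 0.3 spread limit are all assumptions for illustration.

```python
def flag_disagreements(
    scores_by_judge: dict[str, dict[str, float]],
    max_spread: float = 0.3,
) -> list[str]:
    """Return ids of examples where judges' scores differ by more than max_spread."""
    if len(scores_by_judge) < 2:
        raise ValueError("need scores from at least two judges")
    # Only compare examples that every judge has scored.
    shared_ids = set.intersection(*(set(s) for s in scores_by_judge.values()))
    flagged = []
    for example_id in sorted(shared_ids):
        values = [scores[example_id] for scores in scores_by_judge.values()]
        if max(values) - min(values) > max_spread:
            flagged.append(example_id)
    return flagged


# Made-up scores from two hypothetical judges.
scores = {
    "judge_a": {"ex1": 0.90, "ex2": 0.80, "ex3": 0.20},
    "judge_b": {"ex1": 0.85, "ex2": 0.30, "ex3": 0.25},
}
print(flag_disagreements(scores))
```

Flagged examples are good candidates for human review, since they are the ones where a single judge's bias would otherwise decide the score.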
The Honest Assessment
In 2026, eval-driven development is possible but not plug-and-play. The frameworks (LangSmith, Braintrust, PromptLayer) give you logging and a dashboard. They do not give you a rubric, a golden dataset, or a sensible CI gate - those are still engineering work you have to do.
What actually works: start with a small golden dataset of your real failure cases, add deterministic checks first (they catch 60% of regressions for near-zero cost), and only add LLM-as-judge for the quality dimensions that deterministic checks cannot reach. Wire the fast checks into every PR. Run the expensive judge weekly or on major prompt changes.
What does not work: buying an eval platform before you have a dataset, trying to cover every possible input before shipping, or setting a 100% pass threshold that nobody will maintain.
The minimum viable eval stack is 30 golden examples, a format check, and a CI step that blocks on a 10% regression from baseline. That beats “feels good” by an enormous margin and takes one engineer one day to build. Start there, then grow the dataset with every incident.