You're Shipping AI Features Blind - Eval-Driven Development in 2026

The dirty secret of most AI product teams in 2026: when someone asks “how do you know the new prompt is better?” the honest answer is “we ran it on a few examples and it felt better.” That is vibes-based QA. It works for a demo and collapses in production.

LLM features break in ways that unit tests do not catch. You change the prompt to fix one user’s complaint, and it silently regresses the output for 30 other inputs you never checked. You upgrade the model version and the tone shifts. You add a retrieved document to the context and the model stops following the output schema. None of these have stack traces. The failure arrives as a support ticket three days later.

Eval-driven development treats LLM behavior as something you can measure and gate on, not just observe and shrug at. This post starts from the naive approach every team starts with, shows exactly where it breaks, and builds up to a full eval pipeline you can wire into CI.

Why Vibes-Based QA Breaks

The typical lifecycle of an LLM feature:

1. Engineer writes the prompt.
2. Engineer tries 5-10 inputs manually. It looks good.
3. Feature ships.
4. Three weeks later: a regression nobody noticed, a user complaint about wrong answers, a cost spike from the model generating 3x the expected output.

Manual spot-checking fails for four reasons. It does not scale - you can check 10 inputs, not the 500 that represent your real distribution. It is not reproducible - two engineers running the same prompt manually will focus on different examples. It has no memory - you cannot tell whether today’s prompt is better or worse than last week’s without a fixed reference set. And it does not run automatically, so any deploy after the initial one can silently regress.

Skipping evals is not free. You pay for it later in slow debugging, surprise regressions, and feature rollbacks. Most teams discover this the hard way after their first serious incident.

Level 0 - Manual Spot-Checking (Where Everyone Starts)

# What most teams actually do
def test_my_feature():
    result = llm.generate(prompt="Summarize this document", text=SOME_DOC)
    print(result)  # engineer reads it and nods

This catches obvious catastrophic failures: the model returning nothing, the JSON being malformed, the output being in the wrong language. Everything subtler slips through. The engineer has read this document before and subconsciously knows what a good summary looks like. That context does not transfer to the 10,000 documents users will submit.

The fatal flaw is that spot-checking is not a regression gate. You can spot-check before a deploy, but if you have no baseline to compare against, you cannot tell whether things got better or worse - only whether the latest output seems fine right now.

Level 1 - Golden Datasets

A golden dataset is a fixed set of inputs with labeled expected outputs (or quality criteria). You run your LLM pipeline against it and score each output. The score becomes a number you can track over time and gate deploys on.

Building one is not glamorous but it is the foundation everything else builds on.

# golden_dataset.jsonl - one JSON object per line in the real file (pretty-printed here for readability)
{
  "id": "doc-summary-001",
  "input": {"text": "Apple reported Q4 revenue of $94.9B, up 6% year-over-year..."},
  "expected": {
    "must_contain": ["Q4", "$94.9B", "6%"],
    "must_not_contain": ["Q3", "loss"],
    "max_length": 200,
    "output_format": "bullet_list"
  }
}

For a dataset to be useful, it needs to cover your real input distribution - not just easy cases, but the long tail: short documents and very long ones, clear inputs and ambiguous ones, edge cases that broke you before. A dataset of only easy examples gives you a false sense of security.
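One way to sanity-check coverage is to bucket the golden set by a property you care about and look for empty buckets. The sketch below buckets by input length only. The bucket boundaries and the names `length_bucket` and `coverage_by_length` are assumptions for illustration, and it assumes each example stores its document under `input.text` as in the example above.

```python
from collections import Counter

# Hypothetical bucket boundaries in characters; tune these to your own inputs.
LENGTH_BUCKETS = [(0, 500, "short"), (500, 5000, "medium"), (5000, float("inf"), "long")]


def length_bucket(text: str) -> str:
    """Return the name of the length bucket a document falls into."""
    n = len(text)
    for low, high, name in LENGTH_BUCKETS:
        if low <= n < high:
            return name
    return "unknown"


def coverage_by_length(examples: list[dict]) -> dict[str, int]:
    """Count golden examples per length bucket, including empty buckets."""
    counts = Counter(length_bucket(ex["input"]["text"]) for ex in examples)
    return {name: counts.get(name, 0) for _, _, name in LENGTH_BUCKETS}


examples = [
    {"id": "a", "input": {"text": "x" * 100}},
    {"id": "b", "input": {"text": "x" * 200}},
    {"id": "c", "input": {"text": "x" * 6000}},
]
coverage = coverage_by_length(examples)
missing = [name for name, count in coverage.items() if count == 0]
print(coverage, "missing:", missing)
```

An empty bucket is a prompt to go find examples, not proof of a problem. The same pattern works for any other axis you care about, such as language, document type, or whether the input previously caused an incident.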

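import json


def load_jsonl(path: str) -> list[dict]:
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


class GoldenEval:
    def __init__(self, dataset_path: str):
        self.examples = load_jsonl(dataset_path)

    def run(self, pipeline) -> dict:
        """Run the pipeline on every example and report the overall pass rate."""
        results = []
        for ex in self.examples:
            output = pipeline(ex["input"])
            score = self.score(output, ex["expected"])
            results.append({"id": ex["id"], "score": score, "output": output})

        # An example passes when at least 80% of its checks succeed
        passed = sum(1 for r in results if r["score"] >= 0.8)
        return {
            # Guard against an empty dataset instead of dividing by zero
            "pass_rate": passed / len(results) if results else 0.0,
            "results": results
        }

    def score(self, output: str, expected: dict) -> float:
        """Fraction of deterministic checks that pass (1.0 if none are defined)."""
        checks = []
        for term in expected.get("must_contain", []):
            checks.append(term.lower() in output.lower())
        for term in expected.get("must_not_contain", []):
            checks.append(term.lower() not in output.lower())
        if "max_length" in expected:
            # max_length is measured in characters here, not tokens
            checks.append(len(output) <= expected["max_length"])
        # "output_format" is not checked here; it needs its own format check
        return sum(checks) / len(checks) if checks else 1.0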

Deterministic checks - format validation, required field presence, length bounds, regex matches - are cheap and fast. Run them first. They catch a wide class of regressions before you spend a dollar on LLM-based scoring.

Check type	What it catches	Cost
Format (JSON, Markdown)	Schema violations, malformed output	Near zero
Must-contain / must-not	Hallucinated facts, missing required info	Near zero
Length bounds	Verbose drift, truncation	Near zero
Embedding similarity	Semantic drift from expected answer	Low
LLM-as-judge	Nuanced quality, tone, reasoning	Medium

Start at the top of that table. Most regressions are caught by the cheap checks before you need the expensive ones.
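The format row of that table can be as simple as a couple of pure functions. Here is a minimal sketch, assuming the output is expected either as a bullet list (the `"output_format": "bullet_list"` case from the golden example) or as a JSON object with known keys. `is_bullet_list` and `has_required_keys` are hypothetical helper names, not part of any library.

```python
import json
import re

# A bullet line starts with "-", "*" or "•", then whitespace, then some text.
BULLET_RE = re.compile(r"^\s*[-*•]\s+\S")


def is_bullet_list(output: str) -> bool:
    """True if the output has at least one line and every non-empty line is a bullet."""
    lines = [line for line in output.splitlines() if line.strip()]
    return bool(lines) and all(BULLET_RE.match(line) for line in lines)


def has_required_keys(output: str, required: set[str]) -> bool:
    """True if the output parses as a JSON object containing all required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required.issubset(data.keys())


print(is_bullet_list("- Q4 revenue was $94.9B\n- Up 6% year-over-year"))
print(is_bullet_list("Revenue rose in Q4."))
print(has_required_keys('{"summary": "ok", "score": 1}', {"summary"}))
```

Checks like these return booleans, so they can be appended to the same `checks` list the scorer already builds.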

Level 2 - LLM-as-Judge

Deterministic checks cannot tell you whether the answer is actually good - just whether it has the right shape. For quality judgments, you use another LLM call to score the output. This is LLM-as-judge.

The key insight is that judging is often easier than generating. A weaker model can frequently tell you whether a summary is accurate, even if it could not write the summary itself. This means you can often use a cheaper, faster model as your judge, provided you check its scores against human labels first.

JUDGE_PROMPT = """You are evaluating the quality of an AI-generated summary.

Original document:
{document}

Summary to evaluate:
{summary}

Score the summary on these dimensions. Return JSON only.

{{
  "factual_accuracy": <0.0-1.0, does it match the document facts?>,
  "completeness": <0.0-1.0, does it cover the key points?>,
  "conciseness": <0.0-1.0, is it appropriately brief without losing content?>,
  "format_correct": <true/false, does it use bullet points as required?>,
  "reasoning": "<one sentence explaining the scores>"
}}"""

import json

class LLMJudge:
    def __init__(self, judge_model: str = "claude-haiku-4-5-20251001"):
        self.model = judge_model

    def score(self, document: str, summary: str) -> dict:
        # `llm` stands in for your own client wrapper, as in the earlier examples
        response = llm.generate(
            model=self.model,
            prompt=JUDGE_PROMPT.format(document=document, summary=summary)
        )
        # Assumes the judge returned bare JSON, as the prompt requests
        return json.loads(response.text)

    def aggregate(self, scores: list[dict]) -> float:
        """Mean weighted score across examples; format_correct is gated separately."""
        if not scores:
            return 0.0
        weights = {"factual_accuracy": 0.5, "completeness": 0.3, "conciseness": 0.2}
        return sum(
            s.get(k, 0) * w
            for s in scores
            for k, w in weights.items()
        ) / len(scores)

Three rules for LLM-as-judge that actually matter in production:

Anchor the rubric. Vague instructions like “is this a good summary?” produce noisy, inconsistent scores. Define each dimension with explicit criteria and numeric bounds. The more concrete the rubric, the more reproducible the scores.

Use reference answers where you have them. Scoring output against a reference answer is more reliable than scoring in isolation. “Is this summary as accurate as the reference?” is easier for a judge to answer than “is this summary accurate?”

Validate the judge itself. Run the judge against a small set of human-labeled examples and measure agreement. If the judge disagrees with humans 40% of the time, its scores are noise. A Haiku-class model tuned to a concrete rubric typically agrees with human raters at 75-85% for well-defined tasks - good enough for regression detection.
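Measuring agreement does not need a framework. The sketch below assumes you have reduced both the human labels and the judge scores to pass/fail per example. It reports raw agreement and, as an addition not taken from the text above, Cohen's kappa, because raw agreement can look high by chance when most examples pass. The function names are illustrative.

```python
def agreement_rate(human: list[bool], judge: list[bool]) -> float:
    """Fraction of examples where the judge and the human gave the same verdict."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[bool], judge: list[bool]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    p_observed = agreement_rate(human, judge)
    p_human = sum(human) / n   # share of examples the human passed
    p_judge = sum(judge) / n   # share of examples the judge passed
    p_chance = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    if p_chance == 1.0:
        return 1.0  # both raters gave one identical verdict for everything
    return (p_observed - p_chance) / (1 - p_chance)


# Ten hypothetical human-labeled examples (True = acceptable summary)
human = [True, True, True, False, False, True, False, True, True, False]
judge = [True, True, False, False, False, True, True, True, True, False]

print(f"agreement: {agreement_rate(human, judge):.2f}")
print(f"kappa:     {cohens_kappa(human, judge):.2f}")
```

Re-run this whenever the judge prompt or judge model changes, since either can shift agreement.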

Level 3 - Regression Suites in CI

An eval that only runs manually is not an eval - it is a tool you occasionally use and forget. Regression suites belong in CI, blocking deploys the same way unit tests do.

# .github/workflows/llm-evals.yml
name: LLM Eval Gate

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'src/llm/**'

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run eval suite
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: python evals/run.py --dataset golden_v2.jsonl --threshold 0.85
      - name: Compare to baseline
        run: python evals/compare.py --baseline main --current HEAD --fail-on-regression 0.03

The threshold and regression tolerance are the two decisions that matter:

Absolute threshold (e.g., 85% pass rate) blocks a PR if quality drops below a minimum. Set this at your current baseline minus a small buffer, not at 100%.
Regression delta (e.g., -3%) blocks a PR if scores drop more than this from main, even if above the absolute threshold. This catches silent degradation on PRs that look fine in isolation.
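The two rules combine into one small decision function. This is a sketch of what a script like `evals/compare.py` might do, not its actual contents. `GateResult` and `evaluate_gate` are hypothetical names, and the defaults mirror the 0.85 threshold and 0.03 tolerance used in the workflow above.

```python
from dataclasses import dataclass, field


@dataclass
class GateResult:
    passed: bool
    reasons: list[str] = field(default_factory=list)


def evaluate_gate(current: float, baseline: float,
                  threshold: float = 0.85, max_regression: float = 0.03) -> GateResult:
    """Block if the pass rate is below the floor or dropped too far from baseline."""
    reasons = []
    if current < threshold:
        reasons.append(f"pass rate {current:.1%} is below threshold {threshold:.1%}")
    drop = baseline - current
    # Small epsilon so float rounding does not block a drop of exactly max_regression
    if drop > max_regression + 1e-9:
        reasons.append(f"pass rate dropped {drop:.1%} from baseline {baseline:.1%}")
    return GateResult(passed=not reasons, reasons=reasons)


result = evaluate_gate(current=0.88, baseline=0.93)
print(result.passed, result.reasons)
```

In CI, the calling script would exit with a non-zero status when `passed` is false, which is what makes the job fail and block the PR.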

                ┌─────────────────────────────────────────┐
                │              CI Pipeline                 │
                │                                          │
                │  PR opened                               │
                │       |                                  │
                │  Deterministic checks (format, schema)   │
                │       |                                  │
                │  Golden dataset run (LLM calls, ~$0.50)  │
                │       |                                  │
                │  LLM-as-judge scoring                    │
                │       |                                  │
                │  Compare to baseline on main             │
                │       |                                  │
                │  Pass / Block / Warn                     │
                └─────────────────────────────────────────┘

One practical concern: LLM eval suites have non-trivial cost and latency. A suite of 200 examples with LLM-as-judge scoring can cost $0.50-2.00 and take 2-4 minutes. This is acceptable for PR gates but do not run it on every commit to a feature branch. Gate it on paths that include prompts or LLM logic.
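The cost figure is easy to estimate before you build anything. The sketch below is plain arithmetic. The token counts and per-million-token prices are placeholders chosen for illustration, not quotes from any provider, so substitute your own numbers.

```python
def call_cost(input_tokens: int, output_tokens: int,
              price_in_per_mtok: float, price_out_per_mtok: float) -> float:
    """Dollar cost of one LLM call, given prices per million tokens."""
    return (input_tokens * price_in_per_mtok
            + output_tokens * price_out_per_mtok) / 1_000_000


def suite_cost(n_examples: int, pipeline_call: float, judge_call: float) -> float:
    """Each example costs one pipeline call plus one judge call."""
    return n_examples * (pipeline_call + judge_call)


# Placeholder token counts and prices: replace with your own measurements
pipeline = call_cost(1500, 200, price_in_per_mtok=1.0, price_out_per_mtok=5.0)
judge = call_cost(1800, 100, price_in_per_mtok=1.0, price_out_per_mtok=5.0)
total = suite_cost(200, pipeline, judge)
print(f"${total:.2f} per eval run")
```

Multiply the result by the number of PRs per week that touch prompt or LLM paths to get a rough budget for the gate.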

What to Actually Measure

This is where most guides stop at “measure accuracy” and leave you guessing. Here are the metrics that actually matter, grouped by what they protect you from.

Metric	What goes wrong without it	How to measure
Task success rate	Silent regressions across input types	Golden dataset pass rate
Hallucination rate	Model inventing facts	LLM-as-judge factual accuracy, reference answers
Format compliance	Downstream parsing breaks	Schema validation, regex
Output length	Cost spikes, UX degradation	Token count distribution
Latency (p50, p99)	User experience, timeout failures	Timing in harness
Cost per call	Budget overruns on model upgrades	Token counts × price
Edge case handling	Crashes on real-world inputs	Adversarial subset of golden set

The most undertracked metric is output length. A prompt change that causes the model to generate 2x the tokens roughly doubles your output-token cost and often degrades quality, since extra verbosity usually signals a worse answer, not a better one. Track token count distribution, not just task quality.
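Tracking the distribution can be done with the standard library. The sketch below assumes you already have one output token count per golden example, for instance from your provider's usage metadata. The 1.25 ratio and the names `length_summary` and `length_regressed` are illustrative choices.

```python
import statistics


def length_summary(token_counts: list[int]) -> dict[str, float]:
    """Mean, median and 95th percentile of output token counts (needs 2+ values)."""
    # n=100 yields 99 cut points: index 49 is the median, index 94 is p95
    cuts = statistics.quantiles(token_counts, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(token_counts),
        "p50": cuts[49],
        "p95": cuts[94],
    }


def length_regressed(baseline: dict[str, float], current: dict[str, float],
                     max_ratio: float = 1.25) -> bool:
    """True if the median or the tail grew by more than max_ratio."""
    return (current["p50"] > baseline["p50"] * max_ratio
            or current["p95"] > baseline["p95"] * max_ratio)


baseline = length_summary([100, 100, 100, 100])
doubled = length_summary([200, 200, 200, 200])
print(baseline, length_regressed(baseline, doubled))
```

Checking the tail as well as the median matters because a prompt change can leave typical outputs unchanged while making a few outputs very long.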

Building the Dataset Incrementally

You do not need 500 examples to start. You need enough to detect the regressions that actually happen to you.

import json

# Log production inputs and outputs - these become your next golden set
class EvalLogger:
    def __init__(self, log_path: str):
        # Append mode so restarts do not overwrite earlier entries
        self.log = open(log_path, "a", encoding="utf-8")

    def record(self, input_data: dict, output: str, metadata: dict):
        entry = {
            "timestamp": metadata["ts"],
            "input": input_data,
            "output": output,
            "user_id": metadata.get("user_id"),
            "flagged": metadata.get("flagged", False)
        }
        self.log.write(json.dumps(entry) + "\n")
        self.log.flush()  # write each entry out without waiting for close()

    def close(self):
        self.log.close()

Then periodically pull from this log to build your golden set. Prioritize: inputs users flagged as wrong, edge cases that hit error handlers, and inputs from the long tail of your distribution. Human-label a subset of these with expected outputs or quality ratings, and add them to the dataset. The golden set should grow with every incident.

A 50-example dataset covering your actual failure modes is more useful than 500 generic examples.
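Pulling candidates from the log can start as a short script. The sketch below assumes log lines in the format written by `EvalLogger` above. It puts flagged entries first and skips inputs that are duplicates or already in the golden set. `select_candidates` and its `limit` parameter are hypothetical.

```python
import json


def select_candidates(log_lines: list[str], existing_inputs: list[dict],
                      limit: int = 20) -> list[dict]:
    """Pick logged entries to send for human labeling, flagged ones first."""
    # Canonical JSON strings make dict inputs hashable and comparable
    seen = {json.dumps(inp, sort_keys=True) for inp in existing_inputs}
    flagged, others = [], []
    for line in log_lines:
        if not line.strip():
            continue
        entry = json.loads(line)
        key = json.dumps(entry["input"], sort_keys=True)
        if key in seen:
            continue  # already in the golden set, or a repeat within the log
        seen.add(key)
        (flagged if entry.get("flagged") else others).append(entry)
    return (flagged + others)[:limit]


log_lines = [
    '{"timestamp": "t1", "input": {"text": "a"}, "output": "x", "flagged": false}',
    '{"timestamp": "t2", "input": {"text": "b"}, "output": "y", "flagged": true}',
    '{"timestamp": "t3", "input": {"text": "a"}, "output": "z", "flagged": false}',
    '{"timestamp": "t4", "input": {"text": "c"}, "output": "w", "flagged": false}',
]
candidates = select_candidates(log_lines, existing_inputs=[{"text": "c"}])
print([c["timestamp"] for c in candidates])
```

The output of this step is a labeling queue, not a golden set. Each candidate still needs a human to write the expected output or quality criteria.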

Failure Modes of Eval Systems

Overfitting to the golden set. If you tune your prompt to pass specific golden examples, you are memorizing the test. Add new examples regularly - especially ones the model currently fails on - to keep the set honest.

Judge bias. LLM judges have systematic biases: they prefer longer answers, they favor confident-sounding text, and they tend to score outputs similar to their own generation style higher. Use multiple judges on the same rubric and flag large disagreements.
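Flagging disagreement between judges is a small amount of code once each judge has produced a score per example. This sketch assumes scores are already aggregated to one number per example per judge. The 0.3 spread and the name `flag_disagreements` are illustrative.

```python
def flag_disagreements(scores_by_judge: dict[str, dict[str, float]],
                       max_spread: float = 0.3) -> list[str]:
    """Return ids of examples where judge scores differ by more than max_spread.

    scores_by_judge maps judge name -> {example_id: score in [0, 1]}.
    Only examples scored by every judge are compared.
    """
    if not scores_by_judge:
        return []
    shared_ids = set.intersection(*(set(s) for s in scores_by_judge.values()))
    flagged = []
    for ex_id in sorted(shared_ids):
        values = [scores[ex_id] for scores in scores_by_judge.values()]
        if max(values) - min(values) > max_spread:
            flagged.append(ex_id)
    return flagged


scores = {
    "judge_a": {"doc-001": 0.90, "doc-002": 0.80},
    "judge_b": {"doc-001": 0.85, "doc-002": 0.30},
}
needs_review = flag_disagreements(scores)
print(needs_review)
```

Flagged examples are good candidates for human review, since they are where at least one judge is likely to be wrong.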

Eval cost kills adoption. If running evals costs $50 per PR, people start skipping them. Keep the default eval suite cheap (deterministic checks + a small golden set) and reserve the expensive LLM judge for weekly baseline runs.

Stale baselines. If your baseline is from three months ago, it may already include known regressions you accepted. Snapshot baselines deliberately - “this is the baseline after the v2 prompt launch” - with a reason and a date, so comparisons are meaningful.
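A deliberate snapshot can be a small JSON file checked in next to the dataset. The sketch below is one possible shape. The `BaselineSnapshot` fields, the file name, and the example values are assumptions for illustration.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass


@dataclass
class BaselineSnapshot:
    name: str        # short label, e.g. "post-v2-prompt"
    created: str     # ISO date the snapshot was taken
    reason: str      # why this became the new reference point
    dataset: str     # which golden set the number was measured on
    pass_rate: float


def save_baseline(snapshot: BaselineSnapshot, path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump(asdict(snapshot), f, indent=2)


def load_baseline(path: str) -> BaselineSnapshot:
    with open(path, encoding="utf-8") as f:
        return BaselineSnapshot(**json.load(f))


# Hypothetical values, written to a temporary directory for the demo
snapshot = BaselineSnapshot(
    name="post-v2-prompt",
    created="2026-01-15",
    reason="Baseline after the v2 prompt launch",
    dataset="golden_v2.jsonl",
    pass_rate=0.91,
)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "baseline.json")
    save_baseline(snapshot, path)
    loaded = load_baseline(path)
print(loaded)
```

Keeping the reason and date in the file means a later comparison can say what it is being compared against, not just by how much it differs.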

The Honest Assessment

In 2026, eval-driven development is possible but not plug-and-play. The frameworks (LangSmith, Braintrust, PromptLayer) give you logging and a dashboard. They do not give you a rubric, a golden dataset, or a sensible CI gate - those are still engineering work you have to do.

What actually works: start with a small golden dataset of your real failure cases, add deterministic checks first (they catch most regressions for near-zero cost), and only add LLM-as-judge for the quality dimensions that deterministic checks cannot reach. Wire the fast checks into every PR. Run the expensive judge weekly or on major prompt changes.

What does not work: buying an eval platform before you have a dataset, trying to cover every possible input before shipping, or setting a 100% pass threshold that nobody will maintain.

The minimum viable eval stack is 30 golden examples, a format check, and a CI step that blocks on a 10% regression from baseline. That beats “feels good” by an enormous margin and takes one engineer one day to build. Start there, then grow the dataset with every incident.
