The most productive engineers I know stopped bragging about how fast they type. When a coding agent can produce 400 lines of correct code from a paragraph, typing speed is not the bottleneck anymore. The bottleneck is the paragraph. If the paragraph is vague, you get 400 lines of confidently wrong code, fast. If the paragraph is precise, you get something you can ship. The spec is now the leverage point, and most people are still treating it like a throwaway comment.

I have been building with agents for long enough to notice the pattern: in my experience, the quality of the output tracks the quality of the spec closely, and it tracks the quality of the model much less than you would expect. Swapping to a better model buys you a bit. Writing a better spec buys you a lot. This post is about how to actually do that, and where it stops working.

The Naive Loop and Why It Stalls

Here is how almost everyone starts with a coding agent.

You: "Add rate limiting to the API"
Agent: <writes a token bucket in memory, hardcoded 100 req/min>
You: "No, it needs to be per-user"
Agent: <rewrites with a per-user dict>
You: "It has to survive a restart and work across 3 instances"
Agent: <rewrites again with Redis, guesses the key format>
You: "The limit is different for free vs paid users"
Agent: <rewrites a fourth time>

Four rounds, four full rewrites, and you still have not talked about what happens when Redis is down. This is not the agent being dumb. The agent solved exactly the problem you stated on each turn. The problem is that you were writing the spec one constraint at a time, in the worst possible order, and paying for a rewrite after every reveal.

The naive loop feels productive because something is always happening. But you are doing spec discovery through trial and error, using the most expensive possible feedback mechanism: reading generated code and noticing what is wrong. You would never design a real feature this way with a human. You would write the requirements down first.

Move the Thinking Before the Code

Spec-driven development just moves that discovery to the front, into a document, before a single line is generated. The spec is the artifact you iterate on. The code becomes a compilation target.

The reframe that makes this click: the spec is your source code, the generated implementation is the build output. You do not hand-edit the build output of a compiler; you fix the source and rebuild. Same idea. When the code is wrong, the correct instinct is not to patch the code, it is to ask “what did my spec fail to say?” and fix that.

That single habit changes everything downstream. Bugs become spec gaps. Code review becomes spec review. Regenerating is cheap, so a bad spec costs you a rerun, not a rewrite.

Here is the same rate-limiting task as a spec first.

# Feature: Per-User API Rate Limiting

## Goal
Limit each authenticated user to their tier's request limit (see Tiers) per rolling 60s window.
Reject over-limit requests with HTTP 429 and a Retry-After header.

## Tiers
- free: 60 req/min
- paid: 600 req/min
- Tier comes from user.plan (already on the auth context).

## Constraints
- Must work across N app instances (shared state).
- Must survive an app restart (state lives in Redis, not memory).
- Redis key: "rl:{user_id}", sliding window via sorted set of timestamps.
- Redis is the source of truth; no in-process cache.

## Failure Mode
- If Redis is unreachable: FAIL OPEN (allow the request), log a
  warning, increment a metric `ratelimit_backend_error`. We would
  rather serve traffic than hard-fail on an infra blip.

## Out of Scope
- Per-endpoint limits (later).
- IP-based limits for unauthenticated traffic (separate feature).

## Acceptance
- 61st request within 60s for a free user returns 429.
- After 60s the window frees up.
- Killing Redis mid-test lets requests through and logs the warning.

Hand an agent that spec and you get one implementation that is close to shippable, because every decision you would otherwise have surfaced through four rewrites is already answered. The “fail open” line alone is worth the whole document. Nobody guesses that correctly, and it is exactly the kind of thing that only surfaces in a 2 a.m. incident if you leave it implicit.
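To make the spec's decisions concrete, here is a minimal sketch of the check it describes: a sliding window over timestamps, keyed by `rl:{user_id}`, that fails open when the backend is unreachable. This is an illustration under stated assumptions, not the implementation an agent would necessarily produce. `InMemoryStore`, `BackendError`, and `check_rate_limit` are hypothetical names, and the in-memory store only stands in for the Redis sorted set so the example runs without a server. A real multi-instance version would also need the trim, count, and add steps to be atomic, which this sketch does not address.

```python
import logging
import math
import time
import uuid
from collections import defaultdict

log = logging.getLogger("ratelimit")

LIMITS = {"free": 60, "paid": 600}  # requests per window, from the Tiers section
WINDOW_SECONDS = 60


class BackendError(Exception):
    """Raised when the shared store (Redis in the spec) is unreachable."""


class InMemoryStore:
    """Hypothetical stand-in for Redis sorted sets, for illustration only."""

    def __init__(self):
        self.sets = defaultdict(dict)  # key -> {member: score}
        self.down = False              # flip to simulate an outage

    def _require_up(self):
        if self.down:
            raise BackendError("store unreachable")

    def remove_up_to(self, key, cutoff):
        """Drop members scored <= cutoff (the role ZREMRANGEBYSCORE would play)."""
        self._require_up()
        self.sets[key] = {m: s for m, s in self.sets[key].items() if s > cutoff}

    def count(self, key):
        self._require_up()
        return len(self.sets[key])

    def add(self, key, member, score):
        self._require_up()
        self.sets[key][member] = score

    def oldest_score(self, key):
        self._require_up()
        return min(self.sets[key].values())


def check_rate_limit(store, metrics, user_id, plan, now=None):
    """Return (allowed, retry_after_seconds) for one request."""
    now = time.time() if now is None else now
    key = f"rl:{user_id}"  # key format taken from the spec
    limit = LIMITS[plan]   # unknown plans raise KeyError; the spec is silent on them
    try:
        store.remove_up_to(key, now - WINDOW_SECONDS)
        if store.count(key) >= limit:
            # Seconds until the oldest request leaves the window.
            wait = store.oldest_score(key) + WINDOW_SECONDS - now
            return False, max(1, math.ceil(wait))
        store.add(key, uuid.uuid4().hex, now)
        return True, 0
    except BackendError:
        # Failure mode from the spec: fail open, warn, count the error.
        log.warning("rate limit backend unreachable; allowing request")
        metrics["ratelimit_backend_error"] += 1
        return True, 0
```

A caller would turn `(False, retry_after)` into an HTTP 429 with a Retry-After header. `metrics` is assumed to be any counter-like mapping, such as `collections.Counter`. Note that even this short sketch hits a question the spec leaves open: what to do with a plan that is neither free nor paid.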

What Actually Goes in a Good Agent Spec

A spec for an agent is not a PRD and it is not a formal specification language. It is the set of decisions a competent engineer would need before writing the code, written down so the agent does not have to guess them. After a lot of iterations, this is the skeleton I keep coming back to.

| Section | What it pins down | Why the agent needs it |
| --- | --- | --- |
| Goal | One or two sentences, the observable outcome | Stops the agent optimizing for the wrong thing |
| Inputs / Outputs | Concrete types, shapes, example values | Removes the biggest source of guesswork |
| Constraints | Perf, concurrency, persistence, security | These drive the whole architecture |
| Failure modes | What to do when each dependency fails | The part humans forget and agents cannot invent |
| Out of scope | What NOT to build | Prevents scope creep and over-engineering |
| Acceptance | Checks that prove it works | Gives the agent a target and you a review checklist |

The two sections that carry the most weight are failure modes and out of scope, and they are the two that people skip.
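One way to keep yourself honest about the skipped sections is to treat the skeleton as a data structure and check it before handing it to an agent. The sketch below is purely illustrative: `AgentSpec` and `missing_sections` are hypothetical names, and the field names simply mirror the six sections of the skeleton.

```python
from dataclasses import dataclass, field, fields


@dataclass
class AgentSpec:
    """The six-section skeleton, one field per section."""
    goal: str = ""
    inputs_outputs: list = field(default_factory=list)
    constraints: list = field(default_factory=list)
    failure_modes: list = field(default_factory=list)
    out_of_scope: list = field(default_factory=list)
    acceptance: list = field(default_factory=list)


def missing_sections(spec):
    """Return the names of sections left empty, in skeleton order."""
    return [f.name for f in fields(spec) if not getattr(spec, f.name)]


draft = AgentSpec(
    goal="Limit each authenticated user to their tier's limit per 60s window.",
    constraints=["Shared state across instances", "Survives a restart"],
    acceptance=["61st request within 60s for a free user returns 429"],
)
print(missing_sections(draft))
```

A check like this only catches empty sections, not weak ones. Whether the failure modes you wrote are the right ones is still a judgment call.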

Failure modes matter because an agent will happily produce a happy-path implementation that looks complete and passes a casual read. It will not decide, on its own, whether a payment retry should be idempotent, whether a partial write should roll back, or whether a timeout should surface an error or serve a stale cached value. Those are product decisions dressed as engineering details. If you do not state them, you get whatever was statistically common in the training data, which is rarely what your system needs.

Out of scope matters because agents over-build. Ask for a config loader and you may get a plugin system with hot reload and schema validation you never wanted. A blunt “out of scope: no hot reload, no plugins, single YAML file, fail hard on a bad key” keeps the output small enough to actually review.

Be Concrete, Not Complete

A good spec is specific where it matters and silent where it does not. You do not need to specify variable names or which loop construct to use. You need to nail the interfaces, the invariants, and the edges. Compare:

# Weak
Parse the uploaded CSV and store the records.

# Strong
Parse an uploaded CSV of transactions.
- Columns (exact headers): date (ISO 8601), amount (decimal,
  2 places), currency (ISO 4217), description (free text).
- Reject the whole file if any row is malformed; return row
  numbers of the failures. No partial imports.
- Duplicate detection: (date, amount, description) tuple. Skip
  duplicates silently, report the count skipped.
- Max file size 10 MB, max 50k rows; reject with a clear error
  above either.
- Store into `transactions` table; the import is one DB transaction.

The strong version does not describe the algorithm. It describes the contract and the boundaries. The agent fills in the algorithm, which is the part it is genuinely good at.
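To show how directly the strong version maps to code, here is a minimal sketch of the parsing and validation half of that contract, using only the Python standard library. It is one possible reading of the spec, not a definitive one, and it makes several assumptions the spec does not settle: dates are plain calendar dates, "2 places" means exactly two decimal places, the currency check only tests for three uppercase letters rather than a real ISO 4217 list, 10 MB is taken as 10 * 1024 * 1024 bytes, row numbers count data rows from 1, and duplicates are detected within the file only. The names `parse_transactions` and `ImportRejected` are hypothetical, and storing into the `transactions` table is left out.

```python
import csv
import io
import re
from datetime import date
from decimal import Decimal, InvalidOperation

EXPECTED_HEADERS = ["date", "amount", "currency", "description"]
MAX_BYTES = 10 * 1024 * 1024  # assumption: the spec's "10 MB" read as MiB
MAX_ROWS = 50_000


class ImportRejected(Exception):
    """The whole file is rejected; bad_rows lists malformed data rows."""

    def __init__(self, message, bad_rows=()):
        super().__init__(message)
        self.bad_rows = list(bad_rows)


def _parse_row(row):
    """Return (date, amount, currency, description) or raise ValueError."""
    if None in row or None in row.values():
        raise ValueError("wrong number of columns")
    day = date.fromisoformat(row["date"])  # raises ValueError on a bad date
    try:
        amount = Decimal(row["amount"])
    except InvalidOperation:
        raise ValueError("amount is not a decimal")
    if not amount.is_finite() or amount.as_tuple().exponent != -2:
        raise ValueError("amount must have exactly two decimal places")
    if not re.fullmatch(r"[A-Z]{3}", row["currency"]):
        raise ValueError("currency must be a three-letter code")
    return day, amount, row["currency"], row["description"]


def parse_transactions(raw):
    """Return (records, duplicates_skipped) or raise ImportRejected."""
    if len(raw) > MAX_BYTES:
        raise ImportRejected("file exceeds 10 MB")
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        raise ImportRejected("file is not valid UTF-8")
    reader = csv.DictReader(io.StringIO(text, newline=""))
    if reader.fieldnames != EXPECTED_HEADERS:
        raise ImportRejected("unexpected headers")

    records, seen, bad_rows, skipped = [], set(), [], 0
    for number, row in enumerate(reader, start=1):
        if number > MAX_ROWS:
            raise ImportRejected("file exceeds 50k rows")
        try:
            record = _parse_row(row)
        except ValueError:
            bad_rows.append(number)
            continue
        key = (record[0], record[1], record[3])  # (date, amount, description)
        if key in seen:
            skipped += 1
            continue
        seen.add(key)
        records.append(record)

    if bad_rows:  # no partial imports: one bad row rejects everything
        raise ImportRejected("malformed rows", bad_rows)
    return records, skipped
```

Every assumption listed in the lead-in is a place where the spec could still be tightened. That is the kind of gap a critique pass on the spec is meant to surface before any code exists.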

A Spec-First Workflow That Holds Up

The loop I use on real work has four phases, and the spec is the center of gravity in all of them.

   +-------------------+
   |  1. Draft spec    |  human writes intent, decisions, edges
   +-------------------+
            |
            v
   +-------------------+
   |  2. Agent critique|  "what's ambiguous or missing?"
   +-------------------+
            |
            v
   +-------------------+
   |  3. Generate code |  agent implements the frozen spec
   +-------------------+
            |
            v
   +-------------------+
   |  4. Review vs spec|  diff behavior against acceptance;
   +-------------------+  a gap = fix the spec, regenerate
            |
            +--> back to 1 when the spec was wrong,
                 not the code

Phase 1 is you. Write the goal, the interfaces, the constraints, the failure modes, the acceptance checks. This is the actual engineering work now. It is also where all the judgment lives, which is why it does not get automated away.

Phase 2 is underused. Before generating anything, hand the spec back to the agent and ask it to poke holes: “You are the engineer who has to build this. List every ambiguous or missing decision, and every place I have implied something without saying it.” Agents are much better at critiquing a spec than at reading your mind. This turns the four-rewrite discovery loop into a single question-and-answer pass on a cheap text document.

You: Here is the spec. Do not write code. List the decisions
     I left unspecified that would change the implementation.

Agent: 1. What timezone are the CSV dates in? UTC assumed?
       2. amount sign - are refunds negative or a separate column?
       3. On duplicate, which record wins if descriptions differ
          by whitespace? Do you trim before comparing?
       4. Is the 10MB limit pre or post decompression?
       5. Concurrent uploads of the same file - allowed?

Every one of those is a bug you did not ship. Answering five questions in the spec is minutes; discovering them from generated code across five rewrites is an afternoon.

Phase 3 is the cheap part. Generation is close to free now. Treat it that way. If the output is 80% right, do not start hand-patching it into shape. Ask why the spec allowed the 20%, fix the spec, and regenerate. Hand-patching generated code quietly moves the source of truth back into the code, and then your spec rots and the next regeneration wipes your fixes.

Phase 4 is review against acceptance, not against taste. Your acceptance checks are the contract. Walk them. Where behavior and spec disagree, one of them is wrong, and it is usually the spec being silent. Fix it there.

Where This Breaks Down

Spec-driven development is not a universal method, and pretending it is will burn you. Here is where it stops paying off.

Exploratory work. When you do not yet know what you are building, writing a detailed spec is fiction. You are guessing at requirements you have not discovered. For a spike, a proof of concept, or “I wonder if this API can even do X,” skip the ceremony. Prompt loosely, throw the code away, and write the spec afterward once you actually know the shape. Specs are for building the thing, not for discovering whether the thing is possible.

Large, cross-cutting changes. A spec that tries to describe a change touching 30 files across three services becomes a document longer than the code, and nobody keeps it in sync. Specs work best at the unit a single agent run can hold: a feature, a module, an endpoint, a migration. Above that, you are back to architecture and coordination, which is human work that specs do not replace.

Deep existing systems. The spec assumes the agent can see the constraints. In a large legacy codebase, half the real constraints are unwritten: the load-bearing side effect three call frames up, the migration that must run in a specific order, the queue that cannot tolerate reordering. The agent cannot spec what it cannot see, and neither can you from memory. Here the failure mode is a spec that reads perfectly and produces code that quietly violates an invariant nobody wrote down.

The spec-drift trap. The moment someone hand-edits generated code and does not backport the change into the spec, the discipline is broken. Now the spec lies, the next regeneration destroys real work, and everyone learns not to trust the loop. Spec-driven development is a team habit, not a personal one. If half the team patches code directly, the spec is just documentation, and documentation always rots.
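One way a team might make drift visible is to record a hash of the spec and of the generated file at generation time, then compare later. The workflow described here does not prescribe this. It is a minimal sketch under assumptions: one spec maps to one generated file, and the manifest format and the names `record_generation` and `check_drift` are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def _digest(path):
    """SHA-256 of a file's bytes, as a hex string."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def record_generation(spec_path, code_path, manifest_path):
    """Call right after generating code from the spec."""
    manifest = {"spec": _digest(spec_path), "code": _digest(code_path)}
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))


def check_drift(spec_path, code_path, manifest_path):
    """Compare current files with the recorded hashes.

    Returns one of: "in_sync", "spec_changed" (regenerate),
    "code_hand_edited" (backport the change into the spec),
    or "both_changed" (someone has to reconcile by hand).
    """
    manifest = json.loads(Path(manifest_path).read_text())
    spec_changed = _digest(spec_path) != manifest["spec"]
    code_changed = _digest(code_path) != manifest["code"]
    if spec_changed and code_changed:
        return "both_changed"
    if code_changed:
        return "code_hand_edited"
    if spec_changed:
        return "spec_changed"
    return "in_sync"
```

A check like this could run in CI and fail on "code_hand_edited". It only detects the drift; getting the change back into the spec is still the team habit the paragraph above describes.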

Specs can be wrong with total confidence. A precise spec generates a precise implementation of a bad idea. When you typed all the code yourself, the sheer effort of building the wrong thing sometimes made you stop and reconsider. Generation removes that speed bump, so a confidently written, internally consistent spec for the wrong design now produces the wrong thing faster than ever. The judgment about whether the spec is correct is entirely on you, and nothing in this workflow checks it.

The Honest Assessment

Spec-driven development is the highest-leverage change I have made to how I build software with agents, and it is not close. For bounded, well-understood features it turns a frustrating multi-rewrite grind into a tight loop where the thinking happens in a document I can review, version, and reason about. The failure-modes and out-of-scope sections alone have caught more real bugs than any linter I have run.

What it does not do: it does not replace judgment, it does not scale to giant cross-service changes, and it does not protect you from specifying the wrong thing with great precision. It also demands real discipline. The whole model collapses the instant the code becomes the source of truth again.

If you want to actually try it, do the smallest version first. The next time you build a feature with an agent, write the spec before you generate anything, spend one pass having the agent critique it, and when the code comes out wrong, resist patching the code and fix the spec instead. Do that for a week. The skill that grows is not prompting. It is the old skill of stating precisely what you want, which turns out to be the one that was always scarce.