Spec-driven AI coding treats the specification — not the chat prompt — as the primary artifact you write. You invest time in a precise spec that names the user story, acceptance criteria, out-of-scope items, and constraints, and the agent executes it. The spec is reviewable before any code runs, reusable across agents, and version-controllable. As agents get more autonomous and runs get longer, the spec — not the conversation — becomes the leverage point.
Why Specs Beat Chats for Autonomous Agents
A chat prompt is ephemeral. You type, the agent responds, both scroll away. Fine when the unit of work is a suggestion you accept in a second. Not fine when it is a twenty-minute autonomous run that edits a dozen files.
A spec is the opposite shape:
- Reviewable before the agent runs. You read it like a PR description, catch the misunderstanding, and fix it before a single file changes. A chat prompt only reveals misunderstandings in the output.
- Reusable across agents. The same spec can seed Claude Code, GitHub Copilot Workspace, Cursor, or an autonomous agent with different tooling. A chat transcript optimized for one tool does not transfer.
- Version-controllable. A spec lives in a file — in the repo, an issue, or a doc. You can diff it, comment on it, link it from the PR it produced.
- Cheaper to iterate on. Editing a spec costs seconds. Re-running an agent to patch a misunderstanding costs minutes and tokens.
The shift is from conversation as the primary interface to artifact as the primary interface. The conversation still happens, but around the spec instead of replacing it. See the pillar, The Complete Guide to Prompting AI Coding Agents. For the category, see agentic AI.
Anatomy of a Good Spec
The shape is not novel. Engineering teams have written specs for decades; spec-driven AI coding borrows the shape and tightens it around what an agent needs. A complete spec for an agent typically has five sections:
- User story or goal. What outcome does this change produce, and for whom?
- Acceptance criteria. Concrete, verifiable conditions that are true when the change is done.
- Out of scope. Tempting adjacent work the agent should not touch.
- Constraints. Technical and non-technical boundaries — stack, compatibility, budget, conventions.
- Context links. Pointers to files, docs, or tickets the agent should read before starting.
The first three are the irreducible core. Drop any of them and you are relying on the agent's defaults, which are better than they used to be but still not your defaults.
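Put together, the five sections form a short skeleton. The one below is illustrative, not a mandated format — the bracketed placeholders stand in for your content:

```
USER STORY
[Who benefits, and what outcome the change produces.]

ACCEPTANCE CRITERIA
1. [A concrete, verifiable condition — name a command, file, or behavior.]
2. [...]

OUT OF SCOPE
- [Adjacent work the agent must not touch.]

CONSTRAINTS
- [Stack, compatibility, budget, conventions.]

CONTEXT
- [Files, docs, or tickets to read before starting.]
```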
Where Specs Live
There is no single right answer, but there are useful defaults:
| Location | Best for | Trade-off |
|---|---|---|
| Inline in the prompt | One-off tasks, exploratory work | Not reusable, no review trail |
| CLAUDE.md or project context file | Stable conventions, repo-wide constraints | Too slow-moving for task-level specs |
| Issue tracker (GitHub, Linear) | Task-level specs for features and fixes | Requires tooling that reads the tracker |
| Dedicated spec doc in the repo | Larger features, multi-step work | Review overhead scales with spec size |
| PR template / scratch doc | Solo work with no tracker | Gets stale; hard to find later |
Most teams settle on a split: repo-wide conventions in CLAUDE.md, task-level specs in the issue tracker or a dedicated doc, inline prompts for tiny changes. Match the lifetime of the spec to the lifetime of the artifact it lives in.
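As a sketch of that split, a hypothetical CLAUDE.md might carry only the slow-moving, repo-wide half. Every convention below is illustrative, not a recommendation for your project:

```markdown
## Conventions (repo-wide)
- Package manager: pnpm. Run tests with `pnpm test <path>`.
- Do not add new dependencies without a spec that names them.
- Migrations are owned by the data team; never edit `db/migrations/`.

## Specs
- Task-level specs live in the issue tracker; link the issue from the PR.
```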
Writing Acceptance Criteria
Acceptance criteria are where most specs fail. The common failure is writing "it works" or "the feature functions correctly" — both unverifiable. An acceptance criterion is verifiable or it is not a criterion.
Three marks of a good one:
- Concrete. Names a specific test command, file, behavior, or output. "`pnpm test auth/session.test.ts` passes" beats "tests pass."
- Bounded. Says what must be true, not everything that could be true. "Returns 401 on an expired token" is tighter than "the auth flow is correct."
- Checkable without the agent. You — or CI — can verify it independently. If the only way to know it is done is to ask the agent, it is not a criterion.
A weak spec says: "Refresh the session token before expiry." A strong spec lists: "When the token is within 60 seconds of expiry, a new token is fetched before the next API call; `pnpm test lib/auth/refresh.test.ts` passes; no changes outside `lib/auth/`; the signature of `getSession()` is unchanged." The second is slightly longer and dramatically harder to misinterpret.
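The "verifiable or not a criterion" test can even be approximated mechanically. A toy sketch: flag criteria that use vague wording and name no command, path, or function. The vague-phrase list and concreteness markers are illustrative heuristics, nothing more:

```python
# Toy lint for acceptance criteria: flags wording that cannot be
# verified mechanically. Phrase lists are illustrative, not exhaustive.
VAGUE_PHRASES = ("works", "correct", "properly", "as expected", "good")

def unverifiable(criteria):
    """Return the criteria that sound vague and name nothing concrete."""
    flagged = []
    for c in criteria:
        lowered = c.lower()
        vague = any(p in lowered for p in VAGUE_PHRASES)
        # A path separator, a command prefix, or a call signature counts
        # as something a reviewer or CI could actually check.
        concrete = any(tok in c for tok in ("pnpm ", "git ", "/", "()"))
        if vague and not concrete:
            flagged.append(c)
    return flagged

print(unverifiable([
    "`pnpm test auth/session.test.ts` passes",
    "the auth flow works correctly",
]))
# → ['the auth flow works correctly']
```

A heuristic like this will miss plenty, but it catches the most common failure: "works correctly" with nothing checkable attached.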
Explicitly Naming Out-of-Scope
Agents drift. The training distribution rewards helpful-but-unasked-for work — renaming variables while fixing a bug, modernizing a pattern while adding a feature, touching unrelated files because "they were there." Some is useful; most is scope creep that makes the diff harder to review.
Out-of-scope sections are cheap insurance. A few lines that say "do not touch these files" keep the run bounded. Wording matters:
- "Do not edit migrations." — clear.
- "Avoid changes to migrations unless necessary." — invitation to decide it is necessary.
- "Migrations are owned by a different team and must not be touched." — unambiguous.
The list doubles as a review checklist. If `git diff --name-only` shows a file from it, the run violated the spec.
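That checklist is easy to mechanize. A minimal sketch, assuming the out-of-scope section is kept as a list of path prefixes (the file names and prefixes here are hypothetical):

```python
# Check a run's diff against an out-of-scope list of path prefixes.
# In practice, changed_files would come from `git diff --name-only`.
def scope_violations(changed_files, forbidden_prefixes):
    """Return the changed files that fall under a forbidden prefix."""
    return [
        f for f in changed_files
        if any(f.startswith(p) for p in forbidden_prefixes)
    ]

changed = ["lib/auth/refresh.ts", "db/migrations/0042_add_index.sql"]
print(scope_violations(changed, ["db/migrations/"]))
# → ['db/migrations/0042_add_index.sql']
```

Wired into CI, a non-empty result fails the build, which turns the out-of-scope list from prose into an enforced boundary.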
Stating Constraints
Constraints are the non-behavioral facts the agent needs. They do not describe what the feature does; they describe the world it must fit into.
Useful ones to name:
- Tech stack. Language version, framework, test runner, package manager. Stating them up front avoids a bad assumption in the first few minutes.
- Compatibility. "Must work on Node 18+" steers away from features that will bite you later.
- Budget. "Add no new dependencies" is common. So is "no new network calls in the hot path."
- Patterns. "Use the existing `db` client" keeps the change consistent with the codebase.
Constraints are the most commonly skipped section, and also the one that causes the subtlest failures — a PR that works but introduces a dependency you do not want, or uses a pattern the codebase is migrating away from.
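Some constraints are also mechanically checkable after the run. A sketch for "add no new dependencies," comparing the dependency maps of package.json before and after; the inline manifests stand in for reading the real files:

```python
import json

# Compare dependency keys before and after a run; any new key violates
# an "add no new dependencies" constraint. Manifests are illustrative.
def new_dependencies(before_json, after_json):
    before = json.loads(before_json).get("dependencies", {})
    after = json.loads(after_json).get("dependencies", {})
    return sorted(set(after) - set(before))

before = '{"dependencies": {"next": "14.2.0", "zod": "3.23.0"}}'
after = '{"dependencies": {"next": "14.2.0", "zod": "3.23.0", "jsonwebtoken": "9.0.2"}}'
print(new_dependencies(before, after))
# → ['jsonwebtoken']
```

The same before/after comparison works for lockfiles, config keys, or any other slow-moving artifact a constraint names.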
Spec Iteration — The First One Is Usually Wrong
First drafts are wrong the same way first drafts of anything are wrong: the writer knows more than the text does. You have context the spec does not name, assumptions it does not state, scope decisions you made without writing them down. An agent, which has none of that, executes exactly what is on the page.
Treat the first spec as a draft. Run the agent — or, if the tool supports it, just the spec-to-plan step — and read what comes back. Gaps show up as a plan that touches files you did not expect, criteria the agent weakened because the original was ambiguous, or assumptions surfaced that you did not know you had. The second iteration closes those gaps. Two passes is usually right. See plan-and-execute prompting for the same logic applied to the plan step.
A Good Spec Example (Hypothetical)
A hypothetical spec for a small but non-trivial change, shaped for an autonomous agent. Paths and commands are illustrative.
USER STORY
As a user of the password reset flow, I should receive an error
if I submit an expired reset token, not a silent redirect to the
login page that looks like success.
CONTEXT
- Relevant files:
  - app/api/auth/reset/route.ts (the handler to change)
  - lib/auth/tokens.ts (token validation lives here)
  - app/auth/reset/page.tsx (the client page)
  - app/api/auth/reset/route.test.ts (existing test file)
- The existing flow validates the token, but on failure calls
`redirect('/login')` instead of returning an error response.
ACCEPTANCE CRITERIA
1. Expired token returns HTTP 400 with { error: 'token_expired' }.
2. Invalid token returns HTTP 400 with { error: 'token_invalid' }.
3. A valid, unexpired token continues to work as before.
4. The client page displays a human-readable message per case.
5. `pnpm test app/api/auth/reset/route.test.ts` passes; new tests
cover cases 1, 2, and 3.
6. `pnpm typecheck` and `pnpm lint` pass.
7. `git diff --name-only` shows only the four files listed above.
OUT OF SCOPE
- Changing how reset tokens are generated or stored.
- Refactoring `lib/auth/tokens.ts` beyond what is needed.
- Any changes to the login page or session middleware.
- Adding a rate limit or lockout (separate task).
CONSTRAINTS
- No new dependencies.
- Error messages must be i18n-safe (use the existing `t()` helper).
- Response shape must match app/api/auth/login/route.ts.
Every section earns its place. The user story explains why; acceptance criteria say what "done" means; out-of-scope keeps the diff small; constraints keep the change consistent; the context list tells the agent where to start.
When Spec-Driven Is Overkill
Spec-driven coding is a discipline, not a dogma. There are cases where the overhead of writing a spec exceeds the cost of a bad run:
- Tiny changes. Renaming a variable, fixing a typo, adjusting a config value. A one-line prompt is faster and no worse.
- One-liners. "Remove the `console.log` on line 42" does not need a user story.
- Tight-loop debugging. Hypothesis and response cycling in seconds. Stopping to write a spec breaks the loop — see the agent debugging prompts guide.
- Exploratory work. You do not know what the change should be; the spec will emerge from the exploration.
Rule of thumb: if the run takes less time than writing the spec, skip the spec. If the run is autonomous, hard to restart, or touches files you cannot easily review, write it.
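The rule of thumb reduces to a small decision function. This is a toy encoding, not a formula — the inputs are rough estimates you make in your head:

```python
# Toy encoding of the rule of thumb: write a spec when the run is risky
# or when the run would cost more time than writing the spec.
def should_write_spec(run_minutes, spec_minutes,
                      autonomous=False, hard_to_review=False):
    if autonomous or hard_to_review:
        return True
    return run_minutes > spec_minutes

print(should_write_spec(2, 5))                    # typo fix: skip the spec
print(should_write_spec(20, 5, autonomous=True))  # long autonomous run: write it
```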
Common Anti-Patterns
- "Works correctly" as an acceptance criterion. Unverifiable, so effectively absent. Fix: name a command or behavior that proves it works.
- Implicit out-of-scope. You know what the agent should not touch; the spec does not. Fix: write the list. It is usually three lines.
- Spec that is actually an implementation. Step-by-step instructions remove the agent's room to think and remove your review surface. Fix: describe the outcome and constraints; let the plan decide the approach.
- Constraints as preferences. "Try to avoid new dependencies" is a wish. Fix: "Do not add new dependencies."
- No context links. The agent re-discovers the codebase every run. Fix: list the three to five files it must read first.
- Editing the spec mid-run. The agent has already planned against the old one; changes create drift. Fix: stop the run, edit, restart.
FAQ
Is spec-driven AI coding the same as formal spec-driven development?
No. Formal methods — TLA+, model checkers — prove properties mathematically. Spec-driven AI coding borrows the word but not the rigor. The spec here is closer to a tightened PR description: enough structure to be reviewable, not a formal proof.
Where should I put the spec — inline, in a file, or in the issue tracker?
Depends on reuse. One-off task, inline is fine. Team task or something you will review, the issue tracker or a dedicated doc travels better. Repo-wide conventions belong in CLAUDE.md or the equivalent, not in every prompt.
How do I write acceptance criteria when the task is exploratory?
You usually cannot, and that is a signal to use a different shape. Use the conversational mode until the shape is clear, then write the spec for the implementation pass. Forcing criteria onto an open question produces either trivial criteria or wrong ones.
Does spec-driven work for teams, or just solo developers?
It works better for teams, because the spec is a shared artifact. Two developers prompting with their own framing get divergent implementations; two developers reviewing the same spec before the run catch divergence before it ships. See the GitHub Copilot Workspace guide for a tool that builds this review surface in.
What happens to the spec after the code ships?
Link it from the PR so the reviewer sees the original intent. If the spec encodes a lasting rule — a convention, a constraint, a pattern — promote that rule to the persistent context file so the next task inherits it.