
Spec-Driven AI Coding: Writing Specs Agents Execute Well (2026)

How to write specs agents execute well — user story, acceptance criteria, out-of-scope, constraints. The spec is the prompt when agents run autonomously.

SurePrompts Team
April 20, 2026
11 min read

TL;DR

With autonomous agents, the spec is the prompt. A good spec has a user story, explicit acceptance criteria, out-of-scope items, and constraints — it's reviewable before the agent runs and reusable across agents.

Spec-driven AI coding treats the specification — not the chat prompt — as the primary artifact you write. You invest time in a precise spec that names the user story, acceptance criteria, out-of-scope items, and constraints, and the agent executes it. The spec is reviewable before any code runs, reusable across agents, and version-controllable. As agents get more autonomous and runs get longer, the spec — not the conversation — becomes the leverage point.

Why Specs Beat Chats for Autonomous Agents

A chat prompt is ephemeral. You type, the agent responds, both scroll away. Fine when the unit of work is a suggestion you accept in a second. Not fine when it is a twenty-minute autonomous run that edits a dozen files.

A spec is the opposite shape:

  • Reviewable before the agent runs. You read it like a PR description, catch the misunderstanding, and fix it before a single file changes. A chat prompt only reveals misunderstandings in the output.
  • Reusable across agents. The same spec can seed Claude Code, GitHub Copilot Workspace, Cursor, or an autonomous agent with different tooling. A chat transcript optimized for one tool does not transfer.
  • Version-controllable. A spec lives in a file — in the repo, an issue, or a doc. You can diff it, comment on it, link it from the PR it produced.
  • Cheaper to iterate on. Editing a spec costs seconds. Re-running an agent to patch a misunderstanding costs minutes and tokens.

The shift is from conversation as the primary interface to artifact as the primary interface. The conversation still happens, but around the spec instead of replacing it. See the pillar, The Complete Guide to Prompting AI Coding Agents. For the category, see agentic AI.

Anatomy of a Good Spec

The shape is not novel. Engineering teams have written specs for decades; spec-driven AI coding borrows the shape and tightens it around what an agent needs. A complete spec for an agent typically has five sections:

  • User story or goal. What outcome does this change produce, and for whom?
  • Acceptance criteria. Concrete, verifiable conditions that are true when the change is done.
  • Out of scope. Tempting adjacent work the agent should not touch.
  • Constraints. Technical and non-technical boundaries — stack, compatibility, budget, conventions.
  • Context links. Pointers to files, docs, or tickets the agent should read before starting.

The first three are the irreducible core. Drop any of them and you are relying on the agent's defaults, which are better than they used to be but still not your defaults.
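Laid out as a blank skeleton, the five sections look like this — placeholders in angle brackets are for you to fill in per task:

```
USER STORY
  As a <role>, I want <outcome>, so that <benefit>.

CONTEXT
  - <files, docs, or tickets the agent should read first>

ACCEPTANCE CRITERIA
  1. <concrete, verifiable condition — a command, behavior, or output>

OUT OF SCOPE
  - <adjacent work the agent must not touch>

CONSTRAINTS
  - <stack, compatibility, budget, or pattern boundaries>
```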

Where Specs Live

There is no single right answer, but there are useful defaults:

| Location | Best for | Trade-off |
| --- | --- | --- |
| Inline in the prompt | One-off tasks, exploratory work | Not reusable, no review trail |
| CLAUDE.md or project context file | Stable conventions, repo-wide constraints | Too slow-moving for task-level specs |
| Issue tracker (GitHub, Linear) | Task-level specs for features and fixes | Requires tooling that reads the tracker |
| Dedicated spec doc in the repo | Larger features, multi-step work | Review overhead scales with spec size |
| PR template / scratch doc | Solo work with no tracker | Gets stale; hard to find later |

Most teams settle on a split: repo-wide conventions in CLAUDE.md, task-level specs in the issue tracker or a dedicated doc, inline prompts for tiny changes. Match the lifetime of the spec to the lifetime of the artifact it lives in.

Writing Acceptance Criteria

Acceptance criteria are where most specs fail. The common failure is writing "it works" or "the feature functions correctly" — both unverifiable. An acceptance criterion is verifiable or it is not a criterion.

Three marks of a good one:

  • Concrete. Names a specific test command, file, behavior, or output. "pnpm test auth/session.test.ts passes" beats "tests pass."
  • Bounded. Says what must be true, not everything that could be true. "Returns 401 on an expired token" is tighter than "the auth flow is correct."
  • Checkable without the agent. You — or CI — can verify it independently. If the only way to know it is done is to ask the agent, it is not a criterion.

A weak spec says: "Refresh the session token before expiry." A strong spec lists: "When the token is within 60 seconds of expiry, a new token is fetched before the next API call; pnpm test lib/auth/refresh.test.ts passes; no changes outside lib/auth/; the signature of getSession() is unchanged." The second is slightly longer and dramatically harder to misinterpret.
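Criteria written this way map directly onto checks a script or CI job can run. A minimal sketch in Python — the command names and file paths are illustrative, not a real project's:

```python
import subprocess

def verify_criteria(criteria):
    """Run each criterion's check command; return the names of those that failed.

    `criteria` maps a human-readable criterion to the command
    (as an argv list) that verifies it.
    """
    failed = []
    for name, cmd in criteria.items():
        result = subprocess.run(cmd, capture_output=True)
        if result.returncode != 0:
            failed.append(name)
    return failed

# Hypothetical criteria from a spec like the one above; the commands
# are assumptions about the repo's tooling.
criteria = {
    "refresh tests pass": ["pnpm", "test", "lib/auth/refresh.test.ts"],
    "typecheck passes": ["pnpm", "typecheck"],
}
```

An empty return value means every criterion held; anything else names exactly what the run failed to deliver.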

Explicitly Naming Out-of-Scope

Agents drift. The training distribution rewards helpful-but-unasked-for work — renaming variables while fixing a bug, modernizing a pattern while adding a feature, touching unrelated files because "they were there." Some is useful; most is scope creep that makes the diff harder to review.

Out-of-scope sections are cheap insurance. A few lines that say "do not touch these files" keep the run bounded. Wording matters:

  • "Do not edit migrations." — clear.
  • "Avoid changes to migrations unless necessary." — invitation to decide it is necessary.
  • "Migrations are owned by a different team and must not be touched." — unambiguous.

The list doubles as a review checklist. If git diff --name-only shows a file from it, the run violated the spec.
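That checklist use can be automated. A small sketch, assuming the out-of-scope list has been turned into a set of forbidden path prefixes (the prefixes here are hypothetical):

```python
import subprocess

def scope_violations(changed_files, forbidden_prefixes):
    """Return the changed files that fall under an out-of-scope path prefix."""
    return sorted(
        f for f in changed_files
        if any(f.startswith(p) for p in forbidden_prefixes)
    )

# After a run (commented out so the sketch stands alone):
# changed = subprocess.check_output(
#     ["git", "diff", "--name-only"], text=True
# ).splitlines()
# violations = scope_violations(changed, {"migrations/", "app/login/"})
```

A non-empty result is a spec violation you can flag in review, or even fail CI on.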

Stating Constraints

Constraints are the non-behavioral facts the agent needs. They do not describe what the feature does; they describe the world it must fit into.

Useful ones to name:

  • Tech stack. Language version, framework, test runner, package manager. Stating them up front avoids a bad assumption in the first few minutes.
  • Compatibility. "Must work on Node 18+" steers away from features that will bite you later.
  • Budget. "Add no new dependencies" is common. So is "no new network calls in the hot path."
  • Patterns. "Use the existing db client" keeps the change consistent with the codebase.

Constraints are the most commonly skipped section, and also the one that causes the subtlest failures — a PR that works but introduces a dependency you do not want, or uses a pattern the codebase is migrating away from.
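The "no new dependencies" constraint, at least, is mechanically checkable: compare the dependency sets of package.json before and after the run. A sketch, assuming both files have already been parsed into dicts:

```python
import json

def new_dependencies(before, after):
    """Dependencies present in `after` but not `before`.

    Both arguments are parsed package.json dicts; devDependencies count too.
    """
    def deps(pkg):
        return set(pkg.get("dependencies", {})) | set(pkg.get("devDependencies", {}))
    return sorted(deps(after) - deps(before))

# Typical usage (paths are illustrative):
# before = json.load(open("package.json.orig"))
# after = json.load(open("package.json"))
# assert new_dependencies(before, after) == []
```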

Spec Iteration — The First One Is Usually Wrong

First drafts are wrong the same way first drafts of anything are wrong: the writer knows more than the text does. You have context the spec does not name, assumptions it does not state, scope decisions you made without writing them down. An agent, which has none of that, executes exactly what is on the page.

Treat the first spec as a draft. Run the agent — or, if the tool supports it, just the spec-to-plan step — and read what comes back. Gaps show up as a plan that touches files you did not expect, criteria the agent weakened because the original was ambiguous, or surfaced assumptions you did not know you had. The second iteration closes those gaps. Two passes is usually right. See plan-and-execute prompting for the same logic applied to the plan step.

A Good Spec Example (Hypothetical)

A hypothetical spec for a small but non-trivial change, shaped for an autonomous agent. Paths and commands are illustrative.

```
USER STORY
  As a user of the password reset flow, I should receive an error
  if I submit an expired reset token, not a silent redirect to the
  login page that looks like success.

CONTEXT
  - Relevant files:
      app/api/auth/reset/route.ts      (the handler to change)
      lib/auth/tokens.ts               (token validation lives here)
      app/auth/reset/page.tsx          (the client page)
      app/api/auth/reset/route.test.ts (existing test file)
  - The existing flow validates the token, but on failure calls
    `redirect('/login')` instead of returning an error response.

ACCEPTANCE CRITERIA
  1. Expired token returns HTTP 400 with { error: 'token_expired' }.
  2. Invalid token returns HTTP 400 with { error: 'token_invalid' }.
  3. A valid, unexpired token continues to work as before.
  4. The client page displays a human-readable message per case.
  5. `pnpm test app/api/auth/reset/route.test.ts` passes; new tests
     cover cases 1, 2, and 3.
  6. `pnpm typecheck` and `pnpm lint` pass.
  7. `git diff --name-only` shows only the four files listed above.

OUT OF SCOPE
  - Changing how reset tokens are generated or stored.
  - Refactoring `lib/auth/tokens.ts` beyond what is needed.
  - Any changes to the login page or session middleware.
  - Adding a rate limit or lockout (separate task).

CONSTRAINTS
  - No new dependencies.
  - Error messages must be i18n-safe (use the existing `t()` helper).
  - Response shape must match app/api/auth/login/route.ts.
```
Every section earns its place. The user story explains why; acceptance criteria say what "done" means; out-of-scope keeps the diff small; constraints keep the change consistent; the context list tells the agent where to start.

When Spec-Driven Is Overkill

Spec-driven coding is a discipline, not a dogma. In some cases the overhead of writing a spec exceeds the cost of a bad run:

  • Tiny changes. Renaming a variable, fixing a typo, adjusting a config value. A one-line prompt is faster and no worse.
  • One-liners. "Remove the console.log on line 42" does not need a user story.
  • Tight-loop debugging. Hypothesis and response cycle in seconds. Stopping to write a spec breaks the loop — see the agent debugging prompts guide.
  • Exploratory work. You do not know what the change should be; the spec will emerge from the exploration.

Rule of thumb: if the run takes less time than writing the spec, skip the spec. If the run is autonomous, hard to restart, or touches files you cannot easily review, write it.

Common Anti-Patterns

  • "Works correctly" as an acceptance criterion. Unverifiable, so effectively absent. Fix: name a command or behavior that proves it works.
  • Implicit out-of-scope. You know what the agent should not touch; the spec does not. Fix: write the list. It is usually three lines.
  • Spec that is actually an implementation. Step-by-step instructions remove the agent's room to think and remove your review surface. Fix: describe the outcome and constraints; let the plan decide the approach.
  • Constraints as preferences. "Try to avoid new dependencies" is a wish. Fix: "Do not add new dependencies."
  • No context links. The agent re-discovers the codebase every run. Fix: list the three to five files it must read first.
  • Editing the spec mid-run. The agent has already planned against the old one; changes create drift. Fix: stop the run, edit, restart.

FAQ

Is spec-driven AI coding the same as formal spec-driven development?

No. Formal methods — TLA+, model checkers — prove properties mathematically. Spec-driven AI coding borrows the word but not the rigor. The spec here is closer to a tightened PR description: enough structure to be reviewable, not a formal proof.

Where should I put the spec — inline, in a file, or in the issue tracker?

Depends on reuse. One-off task, inline is fine. Team task or something you will review, the issue tracker or a dedicated doc travels better. Repo-wide conventions belong in CLAUDE.md or the equivalent, not in every prompt.

How do I write acceptance criteria when the task is exploratory?

You usually cannot, and that is a signal to use a different shape. Use the conversational mode until the shape is clear, then write the spec for the implementation pass. Forcing criteria onto an open question produces either trivial criteria or wrong ones.

Does spec-driven work for teams, or just solo developers?

It works better for teams, because the spec is a shared artifact. Two developers prompting with their own framing get divergent implementations; two developers reviewing the same spec before the run catch divergence before it ships. See the GitHub Copilot Workspace guide for a tool that builds this review surface in.

What happens to the spec after the code ships?

Link it from the PR so the reviewer sees the original intent. If the spec encodes a lasting rule — a convention, a constraint, a pattern — promote that rule to the persistent context file so the next task inherits it.

See Developers Prompts