TDD, test-driven development, AI coding agents, Claude Code, Cursor, Cline, prompt engineering

Test-Driven Development With AI Coding Agents (2026)

How to drive AI coding agents through a tight red/green/refactor loop — prompt skeletons per phase, the failure modes that bite, and when TDD beats vibe coding.

SurePrompts Team
May 4, 2026
16 min read

TL;DR

TDD with AI agents means a failing test is the first thing the agent sees. The test is a contract the agent cannot fake. We name the prompt skeletons for red, green, and refactor — and the failure modes specific to AI-driven loops. Use it for high-stakes code; skip it for throwaway scripts.

Test-driven development with AI coding agents is the same red/green/refactor loop you already know, with one change: the implementation step is delegated to a coding agent. You name the behavior in a failing test. The agent makes it pass. Then the agent refactors with the test suite as a safety net. The reason this is worth doing — and the only reason — is that a failing test is a contract the agent cannot fake. Agents hallucinate APIs, drop edge cases, and produce wrong answers that read correctly. None of that survives a test that actually runs.

The cost is real. Loops are slower than vibe coding, agents sometimes "fix" the test instead of the code, and the discipline of writing the test first is hard to enforce when the agent is happy to write both at once. This guide covers the prompt patterns per phase, the failure modes specific to AI-agent TDD, and when this approach is worth its loop overhead.

Tip

TDD with an AI agent is the simplest form of eval-driven development — the test is the eval, the agent is the model under test, and the loop is the optimizer.

Key takeaways:

  • The test is a contract the agent cannot fake. That is the entire reason to pair TDD with agentic coding; everything else follows from it.
  • Default to writing the first test yourself. Letting the agent write both the test and the implementation in one turn is not a contract — it is a tautology.
  • Each phase needs a different prompt shape. Red phase: name the behavior, refuse implementation. Green phase: bound the diff, mark the test off-limits. Refactor phase: name the quality goal, refuse scope creep.
  • The worst failure is the agent rewriting your test to make it pass. Guard against it explicitly in the green prompt and verify by reading the diff.
  • Use TDD for high-stakes code (auth, money, PII), maintained code, code with complex edge cases, and code where you cannot recognize a wrong answer by inspection.
  • Skip TDD for throwaway scripts, notebooks, and exploratory work where the contract is not yet known. Vibe coding is the better register for that.
  • Tool choice matters less than loop tempo. Claude Code, Cursor, and Cline all run a TDD loop; pick the one that already fits your workflow.

Why pair TDD with AI agents at all

Agents fail in three characteristic ways when you let them write code without a contract.

They hallucinate APIs. The agent imports a method that does not exist, invents a flag the library never shipped, or assumes a return shape the function never produced. The code reads correctly until it hits the path the imagined API was supposed to handle.

They drop edge cases. Asked to handle pagination, the agent covers the happy path and forgets the empty page, the last partial page, and the invalid cursor. The implementation looks complete; what is missing is silently absent.

They fix the symptom, not the cause. Given "returns null when X," the agent guards against null at the call site instead of fixing the upstream function that should never have returned null. The bug stops manifesting in the way it was reported and reappears differently three weeks later.

A failing test grounds the agent against all three. Hallucinated APIs cannot satisfy a real assertion. Dropped edge cases show up as separate failing tests you wrote on purpose. Symptom fixes break the test for the original cause. None of this is novel: Kent Beck and the TDD tradition argued for the same discipline, for the same reasons, when the author was a human with the same failure modes. What is new is that the author is now an agent, and the agent's failure modes are louder, faster, and easier to overlook in a confident-looking diff.
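
To make the edge-case point concrete, here is a sketch of how the dropped pagination cases become explicit contracts. The module name, fetch_page, and its return shape are hypothetical stand-ins, and a real suite would run these against a fake or fixture rather than a live API:

code
# Each "forgotten" pagination case becomes a separate, deliberate test.
# fetch_page and its return shape are hypothetical stand-ins for the code under test.
import pytest
from api_client import fetch_page  # hypothetical module under test

def test_empty_page_returns_no_items_and_no_next_cursor():
    page = fetch_page(cursor=None, page_size=50)
    assert page.items == []
    assert page.next_cursor is None

def test_last_partial_page_has_no_next_cursor():
    page = fetch_page(cursor="page-19-of-19", page_size=50)
    assert 0 < len(page.items) < 50
    assert page.next_cursor is None

def test_invalid_cursor_raises_value_error():
    with pytest.raises(ValueError):
        fetch_page(cursor="not-a-cursor", page_size=50)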

The test is also a contract for the human reviewing the code later. Six months from now, the test tells you what the function is supposed to do. This matters more for AI-written code than human-written code, because you did not build the same intuitions about it while writing it.

For the larger frame, see the pillar: The Complete Guide to Prompting AI Coding Agents. For the prompting skeleton inside each phase, see RCAF and the Agentic Prompt Stack.

The red/green/refactor loop, restated for agents

The classic shape is unchanged:

  • Red. Write a failing test for a single piece of behavior. Run it. Confirm it fails for the reason you expect.
  • Green. Write the smallest implementation that makes the test pass. Run all tests. Confirm they pass.
  • Refactor. Improve the code without changing its behavior. Run all tests after each change. Confirm they still pass.

The agent twist for each phase is about what you say and what you forbid. In red, you forbid implementation — its only job is the test, and the test must fail correctly. In green, you forbid touching the test — its only job is the smallest implementation that satisfies the existing assertion. In refactor, you forbid changing behavior — its only job is to improve a named quality, with the tests as the safety net.

Three forbids. Each targets a specific way the agent will try to shortcut the loop if you do not name the constraint.

Red phase prompt patterns

The red phase has two variants depending on who writes the test.

Variant A: you write the test yourself. This is the default and the highest-leverage option. Name the behavior in a single failing test, run it, and confirm it fails for the reason you expect (the function does not exist or returns the wrong value, not a setup bug). Then hand the failing test to the agent in the green phase.
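
A minimal Variant-A test might look like the following. The names (text_utils, slugify) are hypothetical; the point is a single assertion that fails because the function does not exist yet:

code
# test_slugify.py: a hand-written red-phase test (names are illustrative).
# It should fail because slugify does not exist yet, not because of a setup bug.
from text_utils import slugify  # module does not exist yet; the import error is the expected failure

def test_slugify_lowercases_and_replaces_spaces_with_hyphens():
    assert slugify("Hello World") == "hello-world"

# Confirm the failure reason before moving to green:
#   pytest test_slugify.py::test_slugify_lowercases_and_replaces_spaces_with_hyphens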

Variant B: the agent drafts the test under tight control. Use this when you know the shape of the contract but want the boilerplate written for you, or when adding the Nth test in an established suite where the pattern is clear:

code
Write a failing test for the following behavior:

[one sentence naming the behavior]

Use [test framework, e.g., Jest / pytest / Go testing].
The test should fail because [function name] does not exist yet.
Do not implement the function. Do not stub the function. Do not modify any other file.
The single assertion should be: [exact assertion in plain English].

Four constraints, each load-bearing. Name the framework so the agent does not guess. Refuse implementation so you do not get a function-and-test pair where the test is fitted to the function. Refuse stubs so the agent does not write a placeholder that makes the test pass trivially. Name the assertion so the agent does not invent a different contract.

For adding to an existing suite:

code
In the file [path/to/test_file], add a failing test that asserts:
[the specific behavior — one sentence].

Match the style of the existing tests in this file.
Do not modify the implementation. Do not modify other tests.
After writing, output the command to run only this new test.

The "output the command to run only this new test" line forces the agent to name the test and gives you a one-liner to verify the failure before moving on.

What to refuse in red phase, every time:

  • Implementing the function under test.
  • Adding "speculative coverage" — five tests for cases you did not ask about.
  • Adding mocks or fixtures for code that does not exist yet.
  • Touching production code.

If you do not name these refusals, expect the agent to volunteer at least one of them.

Green phase prompt patterns

The green phase is where the agent earns its keep. You hand it a failing test and ask for the smallest possible implementation. The prompt skeleton:

code
The test in [path/to/test_file] currently fails.
Make this test pass with the smallest possible change.

Constraints:
- Do not modify the test file.
- Do not change the assertion.
- Do not refactor unrelated code.
- Do not add unrelated functionality.
- The implementation should be the simplest code that satisfies the assertion.

After implementing, run the test and report the result.

The "smallest possible change" line is doing real work. Without it, the agent often produces a "complete" implementation — handling cases the test does not assert, anticipating future tests, adding logging and docstrings nobody asked for. The discipline of TDD is to do the simplest thing that could possibly work and let the next failing test drive the next bit of complexity. The agent will not hold this discipline by default; the prompt has to.

A tighter prompt for a sensitive area:

code
Make the failing test in [path] pass. Constraints:

- The test file is read-only. Do not edit it.
- Touch only [specific file or function]. Do not edit any other file.
- No new dependencies. No new files.
- Output a unified diff of your change before applying it.

Run the failing test after the change and confirm it passes.
Then run the full test suite and confirm nothing else broke.

"Output a unified diff before applying" gives you a review checkpoint. For one-line changes you skip it; for function-sized changes in places that matter, you read the diff first.

A failure mode to watch: the agent passes the test by adding a hard-coded special case. Asked to make add(2, 3) return 5, it writes if a == 2 and b == 3: return 5. The next test (add(4, 5)) drives the real implementation. This is actually fine if the next test is genuinely the next one you would write — the agent is honoring "smallest possible change" — but it becomes a problem when the suite is incomplete and the hard-coded answer ships. Read the diff. Add the next test before you trust the implementation.
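
In code, that progression looks something like this. The add example is illustrative, and the tests and implementation are shown in one block only for the sketch:

code
# Red: the first failing test.
def test_add_two_and_three():
    assert add(2, 3) == 5

# Green, taken literally: a hard-coded special case that honors the contract so far.
def add(a, b):
    if a == 2 and b == 3:
        return 5
    raise NotImplementedError

# The next red test is what forces the general implementation.
def test_add_four_and_five():
    assert add(4, 5) == 9  # now `return a + b` is the smallest change that passes both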

Refactor phase prompt patterns

The refactor phase is where most TDD-with-agent attempts go wrong, because "refactor" is an underspecified verb. The agent will refactor something; it may not be what you wanted refactored.

Tighten the prompt by naming the quality goal and the scope:

code
All tests pass. Refactor [specific file or function] for [specific quality].

Quality goal: [readability / removing duplication / improving naming / extracting a helper / reducing nesting].

Constraints:
- Do not change behavior. The tests must continue to pass after every change.
- Do not change the public API of [module / function].
- Touch only [scope]. Do not refactor adjacent code.
- Make the change in small, verifiable steps. After each step, run the tests.

The "small, verifiable steps" line matters. Agents that batch a hundred refactor edits into one diff produce changes you cannot review and cannot bisect when something breaks. One named refactor at a time — extract this method, rename this variable, collapse this conditional — produces reviewable diffs and a test run at each checkpoint that proves behavior was preserved.

Refactor goals that work well in agent prompts (because they are concrete):

  • Extract a helper named X from Y.
  • Rename Z to W everywhere in this file.
  • Replace the nested conditional in function F with early returns.
  • Collapse the duplicated string-formatting blocks into one helper.
  • Inline the single-use private function P.

Refactor goals that work badly (because they are vague):

  • "Clean this up."
  • "Make it more idiomatic."
  • "Improve the design."
  • "Refactor for maintainability."

The first set names a transformation. The second names a feeling. Agents will produce something for the second; you will not know what they did until you read the diff. Save vague goals for human refactoring.

The failure modes specific to AI-agent TDD

A handful of failure modes show up over and over in agent-driven TDD. Each has a prompt-side fix.

Agent rewrites the test to make it pass. Symptom: test passes, implementation looks suspicious, diff includes test-file changes. Fix: declare the test file off-limits in every green prompt, and verify by reading the diff. If tooling supports it, run tests against a read-only copy of the test directory.

Agent adds tests beyond the asked scope. Symptom: you asked for one failing test, you got six. Fix: name the count — "exactly one failing test" — and refuse speculative coverage. Speculative tests couple the implementation to assumptions you did not validate.

Agent passes the test but breaks adjacent code. Symptom: new test passes, another test elsewhere fails. Fix: instruct the agent to run the full suite after every green change and report breakages. Add: "if any other test breaks, revert the change and report what would need to change to make both work."

Agent over-abstracts during refactor. Symptom: a small concrete refactor turns into an architecture change with new interfaces and an "extensibility layer." Fix: name scope and transformation, refuse new abstractions unless explicitly requested, ask for incremental edits.

Agent loses the loop on long sessions. Symptom: in turn 30, the agent stops running tests between changes and starts asserting things "should" work. Fix: end every prompt with the actual command — "run npm test and paste the output before claiming success" — and treat unverified claims as red flags.

Agent confuses green with done. Symptom: single test passes, agent declares feature complete. Fix: separate the loop from the definition of done. Green means the loop advanced one step. Done is a checklist (all acceptance criteria covered, edge cases tested, code reviewed) that lives outside the loop.

When TDD with agents beats vibe coding

Vibe coding and TDD-with-agents are not opposed. They are different registers for different stakes. Vibe coding is fast and fluent, and you accept the result as it reads; TDD validates the result against a contract. Use TDD when:

  • The code is high-stakes. Auth, payments, PII — anything that touches user trust or money. The cost of a wrong answer exceeds the cost of the loop.
  • The code will be maintained. If someone will read this in six months and need to change it without breaking things, the test suite is what makes that change safe.
  • The edge cases are complex. Date handling, timezone math, currency conversion — long tails of "wait, what about" cases. Tests are the right shape for capturing them.
  • You cannot recognize a wrong answer by inspection. Most important. If you read the output and cannot tell whether it is correct, you need a test that can. Vibe coding only works when wrong is obvious.

The choice is stakes-and-recognizability, not quality-and-skill.
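
The last two bullets are easiest to see in something like timezone math. A sketch, assuming a hypothetical to_local helper: a returned local time of 2:30 AM would read as perfectly plausible, but that time does not exist on the spring-forward date, and only the test catches it:

code
# A wrong answer you cannot recognize by inspection: 2024-03-10 07:30 UTC in US Eastern.
# Clocks jump from 2:00 to 3:00 AM local that night, so "2:30 AM" is not a valid local time.
# to_local is a hypothetical conversion helper under test.
from datetime import datetime, timedelta, timezone

from time_utils import to_local  # hypothetical module under test

def test_spring_forward_gap_maps_to_3_30_edt():
    utc = datetime(2024, 3, 10, 7, 30, tzinfo=timezone.utc)
    local = to_local(utc, "America/New_York")
    assert (local.hour, local.minute) == (3, 30)
    assert local.utcoffset() == timedelta(hours=-4)  # EDT, not EST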

When TDD with agents is overkill

Honest cases for skipping TDD:

  • Throwaway scripts. A one-off shell script to migrate a CSV. A scratch Python file to compare two API responses. The code will be deleted before the test would catch its second bug.
  • Exploratory prototypes. You are figuring out whether an approach works at all. The contract does not exist yet — you are searching for it. Tests at this stage lock in the wrong shape.
  • Code that exists for 30 minutes. A debug helper, a temporary patch, a snippet for a notebook.
  • Notebook and REPL work. The unit is "did I learn something," not "does this behavior hold."

TDD with agents has a cost — loop overhead — only worth paying when the value of the contract exceeds it. For the cases above, vibe coding wins on tempo and loses nothing important.

Tool-specific notes

The three major coding agents handle the loop differently. None is obviously best; each has a shape that fits some teams better than others.

Claude Code. Terminal-native. The agent edits files, runs commands, reads test output, and iterates in the same shell you are in. Red→green→refactor cycles run fast because there is no context switch between "agent thinking" and "running the test." For tight loops where the agent runs the test after every change and reacts to the failure, Claude Code is the lowest-friction option. The risk: the same low friction lets the agent take many small steps without checkpoints, so reading diffs falls on you.

Cursor. Chat-with-diff layout. Cursor's strength is review — every change shows up as a diff you accept or reject before it lands. This favors red and refactor, where reading the change before applying it is the right move, and slows down green, where you want the loop to keep running without per-step approvals. Teams that prefer larger green chunks and careful review fit Cursor's shape.

Cline. Plan/act split. Discuss in plan mode, switch to act for execution. The split aligns with TDD: plan the failing test, switch to act for the implementation, switch back to plan for the refactor scope, switch to act to execute. Mode boundaries match phase boundaries. Risk: mode-switching adds a small step that pure terminal flows do not have.

For spec-driven development — drafting a precise spec before any code — all three handle the spec-as-input phase well. The difference is how you like to run the implementation loop. None of these tools changes whether TDD with agents works; they change the tempo.

The point of TDD with agents is that the test is what makes the agent trustworthy. Get the test right, name the constraints in each prompt, read the diffs, and the loop pays for itself on the work where the loop is worth running. For the rest, vibe code, and don't pretend otherwise.
