
Agentic TDD: When the Failing Test Is the Spec

18 min read

Cover image: a single index card pinned to a workshop wall with seventeen lines of failing test handwritten on it, a metaphor for the failing test as the durable artefact for follow-up agent work.

Last week’s post was a four-document spec for a new feature. This week’s spec is seventeen lines of failing test. Same discipline, scoped to one file. Beck called TDD “a superpower for AI agents” in his June 2025 Pragmatic Engineer interview, then noted in the same conversation that agents will delete tests if you let them (Pragmatic Engineer, 2025). Both observations are true. Both resolve via the same move.

Without a discipline, follow-up work falls back to vibe coding, and vibe coding is exactly where the regression budget gets burned. CodeRabbit’s State of AI vs Human Code report measured AI-authored PRs at 1.7x the issue rate of human-authored ones (10.83 vs 6.45 per PR; logic and correctness defects up 75% across 5,000 reviewed PRs; CodeRabbit, 2025). That’s an empirical case for failing-test-first, not a stylistic one. This post walks one real follow-up fix end-to-end, with the failing test as the durable artefact, and names the dividing line between TDD and spec-driven so you pick the right tool the next time.

Key Takeaways

  • Spec-driven and agentic TDD are the same artefact-driven discipline at two weight classes; pick by scope and uncertainty, not preference. DORA 2025 calls TDD “more critical than ever” with AI in the loop (DORA, 2025).
  • The failing test as a durable file solves Beck’s “agents delete tests” without a single prompt change. The test is on disk; deletion becomes a destructive action a human reviews.
  • AI-authored PRs ship 1.7x more issues than human-authored ones (10.83 vs 6.45 per PR; logic defects +75%; CodeRabbit, 2025). A failing test is the cheapest filter for that.
  • Hamel Husain’s “TDD doesn’t work for LLMs” applies to AI features (LLM in the runtime), not to deterministic code written by an AI agent. The boundary is whether your test can pass or fail without a model call.

Why isn’t spec-driven the answer today?

Because today’s work is a tweak to yesterday’s feature, not a new feature. Brainstorm-design-plan is the right shape when the work is uncertain and large. When it’s bounded and small (one file, one behavior change), the failing test captures intent better than a 294-line design doc would, with one-tenth the artefact weight. The CodeRabbit numbers from the intro (AI-authored PRs at 1.7x the issue rate of human-authored ones; CodeRabbit, 2025) bite hardest here, because follow-up work without a discipline is exactly where the issue rate compounds.

The asymmetry of follow-up work is what makes the heavy ceremony hurt. The scope is small. The context is already in your head. Writing a brainstorm doc for a 17-line fix is theatre, and theatre erodes whatever discipline produced the original feature in the first place. The Beck tension up front (TDD as superpower vs agents-delete-tests) is a real one; the resolution lives in section four. For now the empirical case is enough: 84% of devs use AI, trust in its output dropped to 29% (down 11 points year over year), and “almost-right” output remains the top frustration (Stack Overflow Developer Survey, 2025). A failing test is the cheapest filter for almost-right.

Where does TDD fit in an agentic workflow?

Spec-driven and agentic TDD are the same artefact-driven discipline at two weight classes. Spec-driven asks “what are we building?” and produces brainstorm + design + plan as durable markdown. TDD asks “what specifically is broken or missing?” and produces a failing test as durable code. Both files outlast the chat session. Both are what the agent re-reads to stay aligned. DORA’s 2025 State of AI-assisted Software Development report frames it directly: AI is an amplifier, and TDD becomes “more critical than ever” because AI removes the friction TDD imposes on humans (DORA report, 2025; Google Cloud framing, 2025).

If you missed last week’s post, the punchline fits in a paragraph. Spec-driven agent development uses three durable markdown artefacts (brainstorm, design, plan) that the agent re-reads at every stage. The pylon PR-review feature surfaced a missing mcpServers option during the design step, the fix moved per-agent tool calls from 0.6 to 2.6, and the unchanged MCP=0 became the input to the next round. Today’s post is the lighter cousin: when the work doesn’t earn three artefacts, one failing test plays the same role. Both lanes slot into the broader team-member scaffolding that frames every artefact-driven discipline on this site.

What does the failing test capture that a chat message can’t?

Intent (what should be true), scope (one file, one behavior), and verification (bun test proves it on every save). A chat message captures none of those past the next compaction. AI-generated tests now reach 75% coverage on average versus 60% for manually written tests across the empirical study set, with the gap holding across multiple project sizes (Springer, 2025). The cost of writing the test first has dropped to nearly zero. The bottleneck has moved to deciding what the test should assert.

The test names the assertion. The test names the file. The test names the command that proves it. Three things in one artefact, all on disk, all readable by the agent on the next session. None of those properties survive in chat. A model that “knows” you wanted a behavior because you described it Tuesday will not know that when the context compacts on Friday; a test file at src/main/pr-review/__tests__/reviewer-prompt.test.ts will. The headline isn’t coverage. The headline is durability, and coverage is what durability buys you.

What the failing test does not capture also matters. Architectural decisions, cross-cutting trade-offs, and multi-file wiring don’t fit in one assertion. Those are spec-driven territory. Trying to encode “should we precompute the bundle or query MCP per agent?” as a test is how TDD earns its reputation for slowing teams down on uncertain work. The discipline picks one job and does it well. The dividing line between TDD and spec-driven is the same line that explains why one-file scope works for follow-up fixes and why multi-file features need the heavier artefact.

How does the superpowers TDD skill enforce the discipline?

The skill encodes one Iron Law in capital letters: “NO PRODUCTION CODE WITHOUT A FAILING TEST FIRST” (SKILL.md, 2026). The agent reads the skill, follows red -> green -> refactor as a state machine, and refuses to write production code while no test is failing. Superpowers as a framework hit 150,000 GitHub stars and was accepted into the official Anthropic Claude Code marketplace on January 15, 2026 (Superpowers GitHub, 2026; origin: obra blog, 2025). The skill ships as one of the framework’s installable workflows, alongside brainstorming, writing-plans, and executing-plans.

The Iron Law is mechanical, not stylistic. It’s the same shape as the spec-driven workflow’s contract that superpowers:executing-plans consumes the plan verbatim. Different artefact, same enforcement. Red: write a test that asserts the behavior you want and run it; it must fail. Green: write the smallest production-code change that makes the test pass. Refactor: clean up while the tests stay green. The agent never enters the green phase without a failing test in scope, because the skill checks for one. Advisory CLAUDE.md lines that say “write tests first” do not check; they hope.
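
The state machine itself is small enough to sketch. A minimal illustration follows; Phase, nextPhase, and the testsPass flag are names invented for this post, not the skill’s actual internals.

// Minimal sketch of the red -> green -> refactor gate. The names here are
// illustrative; the skill enforces the same transitions via its SKILL.md.
type Phase = "red" | "green" | "refactor";

function nextPhase(phase: Phase, testsPass: boolean): Phase {
  switch (phase) {
    case "red":
      // The Iron Law gate: no production code until at least one test fails.
      if (testsPass) throw new Error("Write a failing test before production code.");
      return "green";
    case "green":
      // Advance only once the smallest change has turned the failing test green.
      return testsPass ? "refactor" : "green";
    case "refactor":
      // Clean up only while the suite stays green; a red suite means revert.
      if (!testsPass) throw new Error("Refactor broke a test; revert before continuing.");
      return "red";
  }
}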

This is the move that resolves Beck’s tension. The failing test is a file on disk, not a transient instruction in the prompt. Deleting it is a destructive action: a tracked diff, a code review surface, a pre-commit hook trigger. Agents will still attempt it (they did during one of my early TDD sessions, removing an expect to make a green run “easier”), but the deletion now has shape. The Iron Law says no production code without a failing test. If the test goes, the production-code work has to revert too. The skill’s state machine treats deletion as the same defection as never writing the test. Both conditions fail the gate.
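
What “deletion has shape” can look like in practice: a small Bun script a pre-commit hook could run. The file path and the test-path pattern below are assumptions for illustration, not part of the superpowers skill.

// scripts/check-test-deletions.ts (hypothetical path). Wired into a pre-commit
// hook, this fails the commit when a staged change deletes a test file, so
// removing a test becomes a reviewed decision instead of a quiet shortcut.
const diff = Bun.spawnSync(["git", "diff", "--cached", "--name-status"]);
const deletedTests = diff.stdout
  .toString()
  .split("\n")
  .filter((line) => line.startsWith("D\t"))
  .map((line) => line.slice(2).trim())
  .filter((path) => /\.test\.tsx?$/.test(path) || path.includes("__tests__/"));

if (deletedTests.length > 0) {
  console.error(`Staged changes delete test files:\n  ${deletedTests.join("\n  ")}`);
  process.exit(1);
}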

Worked example: tightening the MCP fallback prompt

The spec-driven post’s second-order finding (bundle reads happen, MCP calls don’t) is this post’s worked example. The reviewer-prompt MCP fallback line read, paraphrased, “if a symbol has referencesTruncated: true, you may call find_references.” The may is doing too much work; agents skip the optional escape hatch and stop at the precomputed bundle. The fix is to tighten the language so the fallback fires whenever the truncation flag is true, listing the affected symbols by name. The failing test asserts that the prompt section appears whenever the flag is true and is absent otherwise. About seventeen lines, one file. The fix diff is smaller than the test.

The failing test, paraphrased, lives at src/main/pr-review/__tests__/reviewer-prompt.test.ts:

import { test, expect } from "bun:test";
import { buildReviewerPrompt } from "../reviewer-prompt";

test("MCP fallback section names truncated symbols when present", () => {
  const bundle = {
    symbols: [
      { name: "createSession", referencesTruncated: true, references: [] },
    ],
  };
  const prompt = buildReviewerPrompt({ bundle });
  expect(prompt).toMatch(/MUST call find_references for: createSession/);
});

test("MCP fallback section is absent when nothing is truncated", () => {
  const bundle = {
    symbols: [{ name: "createSession", referencesTruncated: false, references: [] }],
  };
  const prompt = buildReviewerPrompt({ bundle });
  expect(prompt).not.toMatch(/find_references/);
});

That’s seventeen lines including imports. The intent is explicit: when truncation happens, the prompt must instruct the agent to call find_references for the affected symbols by name. The MUST replaces the old may. The fix, paraphrased:

- if (anyTruncated) {
-   sections.push("If a symbol has referencesTruncated, you may call find_references.");
- }
+ const truncated = bundle.symbols.filter((s) => s.referencesTruncated);
+ if (truncated.length > 0) {
+   const names = truncated.map((s) => s.name).join(", ");
+   sections.push(`MUST call find_references for: ${names}.`);
+ }

Six lines, smaller than the test. That runs counter to the “tests are a tax” intuition: here the test is bigger than the fix, and that’s the discipline working. The test encodes intent the fix cannot encode by itself; the fix encodes behavior the test cannot encode by itself. The refactor pass extracts the truncation filter into a named helper (truncatedSymbols(bundle)) and the tests stay green. All four steps live as durable files. None of them live in chat, which is also why the next session can pick up the work without re-explaining the change.
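
The refactored shape, paraphrased like the rest of the example. The ReviewSymbol and ReviewBundle type names are stand-ins inferred from the test above; truncatedSymbols is the helper named in the refactor pass.

type ReviewSymbol = { name: string; referencesTruncated: boolean; references: unknown[] };
type ReviewBundle = { symbols: ReviewSymbol[] };

export function truncatedSymbols(bundle: ReviewBundle): ReviewSymbol[] {
  return bundle.symbols.filter((s) => s.referencesTruncated);
}

// Inside buildReviewerPrompt, the fallback section now reads:
//   const truncated = truncatedSymbols(bundle);
//   if (truncated.length > 0) {
//     sections.push(`MUST call find_references for: ${truncated.map((s) => s.name).join(", ")}.`);
//   }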

Does Hamel’s eval-driven counter break TDD for AI?

Hamel Husain’s “TDD doesn’t work for LLMs because there is no single correct output” is true for AI features (LLM in the runtime, output non-deterministic, evals are the right discipline; Hamel Husain, 2025). It is not true for deterministic code written by an AI agent (the agent is the author, the runtime is bun test, output is a pass/fail boolean). The dividing line is “what is being tested?” not “who wrote the code?”

The practical heuristic fits in one sentence. If your test can pass or fail without a model call, you are in TDD’s zone; if it can’t, you’re in eval-driven territory and TDD is the wrong tool. The MCP-fallback example above passes that test cleanly: buildReviewerPrompt is a pure function over a bundle and its output is a string the test can grep. Most agentic-coding follow-up work lives in this zone, because the agent’s output is code, and code runs in a deterministic runtime. The cases where Hamel’s objection bites are real (LLM-as-judge layers, content-generation features, the AI-feature side of the boundary), and they need evals. The boundary doesn’t dissolve TDD; it locates it.
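
One way to see the heuristic in code: a determinism check that could sit in the same test file as the worked example. It passes or fails with no model call anywhere in the loop, which is exactly what keeps it in TDD’s zone.

import { test, expect } from "bun:test";
import { buildReviewerPrompt } from "../reviewer-prompt";

// Same bundle in, same prompt string out: pass/fail is a boolean the runtime
// computes on its own, so the assertion never needs a model in the loop.
test("buildReviewerPrompt is deterministic", () => {
  const bundle = {
    symbols: [{ name: "createSession", referencesTruncated: true, references: [] }],
  };
  expect(buildReviewerPrompt({ bundle })).toBe(buildReviewerPrompt({ bundle }));
});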

When is TDD wrong, and you should reach for spec-driven?

TDD breaks down when the work is too large for one failing test to capture (a multi-file feature with non-obvious wiring), when architectural trade-offs need to be named before any code is written, or when the cost of getting the wiring wrong is hours of throwaway test rewrites. The spec-driven post’s pylon worked example is the canonical instance: the bug was a missing mcpServers option that no single test could have surfaced, because the question was “what should the session interface look like?” not “what should this function return?”

A three-line decision heuristic does most of the routing. One file or many? One decision or many? Cost of a wiring mistake? If any answer is “many” or “high,” reach for the heavier cousin. If all three are “one” or “low,” write the failing test. The upgrade path is honest: when your test starts wanting to assert too many things at once, that’s the design begging to be written down. The discipline isn’t loyalty to one tool; it’s matching artefact weight to scope and uncertainty.

Frequently Asked Questions

Doesn’t AI handle TDD by itself? Why bother with the discipline?

No, and the data is empirical, not stylistic. CodeRabbit’s State of AI vs Human Code report measured AI-authored PRs at 1.7x the issue rate of human-authored ones (10.83 vs 6.45 per PR; logic defects up 75%; CodeRabbit, 2025). DORA 2025 frames TDD as “more critical than ever” with AI in the loop precisely because AI removes the friction that pushed humans away from TDD (DORA, 2025).

How do I stop the agent from deleting my tests?

Make the failing test a durable file before the agent writes production code, and use the superpowers:test-driven-development skill so the Iron Law (“NO PRODUCTION CODE WITHOUT A FAILING TEST FIRST”; SKILL.md, 2026) is enforced as a state machine, not a prompt suggestion. Beck observed that TDD is a superpower for AI agents and that agents will delete tests if you let them (Pragmatic Engineer, 2025); both observations resolve via the same move.

Spec-driven or TDD: how do I pick?

Three lines. One file or many? One decision or many? Cost of getting the wiring wrong? If any answer is “many” or “high,” spec-driven. If all three are “one” or “low,” TDD. New features almost always trip the spec-driven side; bug fixes, small adds, and refactors almost always trip the TDD side. When a test starts wanting to assert too many unrelated things, that’s the design begging to be written down.

Won’t writing the test first slow me down?

AI-generated tests now reach 75% coverage on average versus 60% for manually written tests in the published empirical study set (Springer, 2025), so the cost of writing the test first is close to zero. The bottleneck has moved to deciding what the test should assert; that decision is the artefact, and agentic TDD is what makes it durable instead of transient. The “slow” feeling people remember from manual TDD doesn’t survive the cost shift.

The Real Argument

The arc in one sentence: the spec-driven post’s MCP=0 finding became this post’s worked example, with a seventeen-line failing test and a fix smaller than the test. That last clause is the part that changes the intuition. Tests are not a tax on the work; they are the work’s intent rendered durable. Spec-driven encodes that intent across three artefacts because the work is uncertain. Agentic TDD encodes it in one because the work is bounded. Both lanes ship durable files the agent re-reads on the next session, and both lanes refuse to let the prompt be the only place the intent lives. The same artefact discipline scales out to parallel subagent plan execution once the plan is wide enough to fan out, and feeds the time-axis regression discipline once the failing test enters the suite.

Beck and Hamel are not arguing opposite positions. Beck is talking about deterministic code written by an AI agent, where TDD’s preconditions hold and the failing test does the heavy lifting. Hamel is talking about AI features with the LLM in the runtime, where TDD’s preconditions fail and evals do the work TDD would have done. The boundary is “model call in the runtime?”, and once you draw it, both observations stand without contradiction. The Iron Law sits inside the Beck zone. The Hamel zone needs a different tool, and that tool is the next post in this verification arc.

If you take one thing from this post: write the failing test before you open chat. Make it a file. Run it red. Then let the agent ship the smallest fix that turns it green, and let the refactor pass leave the tests green. If the test wants to assert too many unrelated things at once, stop and write the design instead. The next post in the verification arc is on agent evals (Mon 2026-05-18); together with this post and the spec-driven prequel, that closes the verification pillar.
