
Spec-Driven Agent Development: Brainstorm, Design, Plan

21 min read

[Hero image: a workshop bench with three labelled drawers (brainstorm, design, plan) feeding a single output; a metaphor for the durable artefact pipeline that produces an agent-executable plan.]

PR 23 told its reviewer agents to use the code-intelligence MCP server. They didn’t. PR 21 made five tool calls across five agents and zero MCP calls. Both prompts said the same thing. Both runs ignored it the same way. The shape of the failure was invisible until I sat down and wrote the design doc that told me to trace the wiring instead of the prompt.

What the design surfaced was small and stupid: sessionManager.createSession did not pass an mcpServers option for PR-review sessions. The reviewer agents physically had no MCP servers attached. The prompt had been lying about a capability that did not exist at runtime, for weeks, across every review the app shipped. Chat iteration had not produced the trace; a 294-line markdown file with a “Wire-up” heading did.

This post is about the discipline that produced the trace. The superpowers plugin calls it brainstorm, design, plan (GitHub, 2026). Kiro calls it user stories, technical design, tasks (Kiro, 2026). The Anthropic-marketplace skills bundle ships them as separate, installable workflows. The names matter less than the shape: three durable markdown artefacts the agent re-reads at every stage, one stage that locks scope by tracing wiring, and a measurement step at the end that produces a question the next round starts from.

Key Takeaways

  • Vibe coding stalls because prompts can’t describe wiring the agent does not have. ~43% of AI-generated code changes need manual debugging in production after staging passes (Lightrun via VentureBeat, 2026).
  • Three artefacts, three jobs: brainstorm explores intent, design locks scope and traces wiring, plan turns scope into checkboxed tasks. The superpowers plugin (150,000 stars, accepted into the Anthropic Claude Code marketplace 2026-01-15) ships them as installable skills.
  • The pylon PR-review design caught a missing mcpServers option that prompts had hidden for weeks. After the fix, per-agent tool activity moved from 0.6 to 2.6 calls, about a 4x lift.
  • MCP usage stayed at zero on both sides of the merge. That is a finding, not a failure, and it is the input to the next design iteration. Vibe coding cannot produce that finding.

Why does vibe coding stall on real features?

Vibe coding stalls when the prompt describes behaviour the runtime cannot produce. Lightrun’s 2026 State of AI-Powered Engineering report found that ~43% of AI-generated code changes still need manual debugging in production after staging passes (VentureBeat, 2026). The model isn’t the bottleneck. The wiring you can’t see is.

The default workflow (open chat, paste a feature description, watch the agent generate code, merge what looks right) works for one-file changes. It breaks the moment the work depends on a capability the agent thinks it has and doesn’t. Stack Overflow’s January 2026 piece on AI coding agents catalogued this exactly: “almost-right output” is the dominant frustration, and the gap between what the prompt asserts and what the runtime delivers is where the bugs live (Stack Overflow, 2026).

Three things keep failing in vibe-coded features. The agent confidently uses a tool that isn’t attached. The agent edits a generated file and the next codegen run wipes it. The agent assumes a contract the codebase doesn’t actually expose. None of the three is a model defect. All three are wiring defects, and a chat session cannot surface them because chat is working memory; it can only describe what the agent thinks is true. It cannot trace what the codebase actually permits.

The audit trail makes the gap concrete. PR 23 had a perfectly serviceable reviewer prompt that named six MCP tools by name. The agents read the prompt, agreed they should use those tools, and made one Read call between them. There was no model failure. There was a runtime that didn’t have the tools, and a prompt that asserted otherwise. The fix wasn’t a smarter prompt. The fix was a design doc that forced a trace from prompt down to session creation.

What jobs do brainstorm, design, and plan actually do?

Brainstorm explores intent and surfaces hidden requirements. Design locks scope, names trade-offs, and traces wiring. Plan turns scope into checkboxed tasks the agent executes against. Each is a durable markdown file the agent re-reads at every stage. The superpowers plugin (150,000 GitHub stars, accepted into the Anthropic Claude Code marketplace on 2026-01-15) ships them as separate, installable skills called brainstorming, writing-plans, executing-plans, and subagent-driven-development (GitHub, 2026).

The triad is not bureaucracy. Each artefact does a different job and serves a different audience. Brainstorm output is messy, and that’s the point: it ends in a one-page concept the human signs off on. Design output is exact, with file paths, options, and call sites; it traces wiring chat cannot describe. Plan output is mechanical, with checkboxes and verification commands; it’s what superpowers:executing-plans and :subagent-driven-development consume verbatim.

The feedback arrow is the part vibe coding cannot produce: measurement after execution feeds the next brainstorm with a finding chat couldn’t have surfaced. Skip the design step and the loop loses its only mechanism for discovering things the prompt doesn’t already know. Anthropic’s 2026 marketplace acceptance and Kiro’s three-phase IDE workflow both validate the same shape, and Redmonk’s list of features developers want from agentic IDEs places spec-driven workflows in the top ten (Redmonk: Holterhoff, 2025). The discipline is not the differentiator anymore; the question is which artefacts you keep durable.

The triad slots inside the broader team-member scaffolding. Specs are the per-feature tier; skills are the per-capability tier; CLAUDE.md is the per-repo tier. Each is a different durability tier, and conflating them is what produces the bureaucracy people fear.

What did the PR-review feature look like before we wrote anything down?

Pylon’s reviewer prompt told every specialist agent to use the code-intelligence MCP server. The pre-merge audit tells you exactly how much of that prompt the runtime delivered. Two real review sessions, transcripts pulled from ~/.claude/projects/-Users-dikrana--pylon-worktrees-*/*.jsonl. PR 23 (2026-04-20), four reviewer agents, one tool call total (a single Read), zero MCP calls. PR 21 (2026-04-15), five reviewer agents, five tool calls total (three Grep, two Read), zero MCP calls.

That works out to about 0.6 tool calls per agent on average, and exactly zero of them on the tools the prompt named. The reviewers were producing reviews; the reviews were diff-only. They confidently named symbols, claimed blast-radius reasoning they couldn’t actually perform, and shipped. Nobody noticed because the reviews still read plausibly. That’s the failure mode. Almost-right output never produces a loud error.
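For anyone who wants to reproduce the audit, the tallies come straight out of the transcripts. Here is a minimal counting sketch, assuming each JSONL entry nests an assistant message whose content array carries tool_use blocks with a name field; those field paths are an assumption about the transcript format, so adjust them to whatever your files actually contain.

```ts
// count-tool-calls.ts: tally tool calls across review-session transcripts.
// Run with: bun count-tool-calls.ts ~/.claude/projects/<project>/*.jsonl
import { readFileSync } from "node:fs";

const counts: Record<string, number> = {};

for (const path of process.argv.slice(2)) {
  for (const line of readFileSync(path, "utf8").split("\n")) {
    if (!line.trim()) continue;
    let entry: { message?: { content?: unknown } };
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // skip truncated or non-JSON lines
    }
    const blocks = entry.message?.content;
    if (!Array.isArray(blocks)) continue;
    for (const block of blocks) {
      // Each tool invocation shows up as a tool_use content block;
      // MCP tools would appear here with an mcp__-prefixed name.
      if (block?.type === "tool_use" && typeof block.name === "string") {
        counts[block.name] = (counts[block.name] ?? 0) + 1;
      }
    }
  }
}

console.table(counts);
```

Zero mcp__ entries across nine agents is the kind of number you only get by counting, not by reading the reviews.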

The temptation, working in chat, was to remind the agents harder. Add a “you MUST use the MCP tools” line. Reorder the prompt so the MCP instruction came first. Try a stronger model. None of those would have moved the number, because the number wasn’t a prompt-quality problem. It was a wiring problem the prompt could not describe. Prompt iteration alone is structurally incapable of distinguishing “agent ignored the instruction” from “agent never had the capability.” The audit was the input. The trace had to come from somewhere else.

How did the design catch the bug?

The design doc forced a wiring trace, and the trace surfaced the bug. The 294-line file at ~/Documents/workspace/pylon/docs/superpowers/specs/2026-04-20-pr-review-code-intelligence-design.md opened with a “Summary” section and an “Audit findings that motivate this” section before any architecture. Writing those sections forced me to read the actual session transcripts and the actual session-creation code in the same sitting. That’s the move chat doesn’t make.
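For orientation, here is the doc’s spine, reconstructed from the headings quoted in this post; the full 294-line file has more sections between these, but these four carry the argument.

```markdown
# PR-review code intelligence: design

## Summary
One paragraph: what changes and why.

## Audit findings that motivate this
Numbers from real sessions, with transcript paths.

## Wire-up
The trace from reviewer prompt down to session creation.

## Why pre-compute instead of letting each agent query
The trade-off that has to survive in a durable artefact.
```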

The design’s audit-findings section, paraphrased:

```
PR 23 (4 reviewer agents): 1 tool call total (Read); zero MCP calls.
PR 21 (5 reviewer agents): 5 tool calls total (3 Grep, 2 Read); zero MCP calls.

PR review sessions are created via:
  sessionManager.createSession(cwd, undefined, undefined, 'pr-review')
with no mcpServers override.

test-manager.ts already shows the pattern for passing mcpServers
to the SDK (used by pylon-goal-analysis and playwright); the session
plumbing exists, it just is not used for PR review.
```

That’s the bug, in eight lines. The reviewer prompt had been instructing agents to call MCP tools that the session API hadn’t been configured to attach. Another part of the same codebase (test-manager.ts) already passed mcpServers to the SDK for unrelated workflows; the plumbing existed, it just wasn’t wired into PR review. Chat iteration would not have produced that comparison. It would have kept asking the agents harder.
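Sketched in code, the fix is one option at one call site. The four-argument call is quoted from the design; the fifth options argument and the server config shape are assumptions mirroring what the design says test-manager.ts already passes to the SDK, not pylon’s actual signature.

```ts
// Before: reviewer sessions come up with no MCP servers attached, so the
// prompt's six named MCP tools simply do not exist at runtime.
sessionManager.createSession(cwd, undefined, undefined, 'pr-review');

// After (sketch): attach the code-intelligence server the same way
// test-manager.ts does for pylon-goal-analysis and playwright.
// The options parameter and config shape here are hypothetical; copy
// whatever test-manager.ts actually hands to the SDK.
sessionManager.createSession(cwd, undefined, undefined, 'pr-review', {
  mcpServers: {
    'code-intelligence': {
      command: 'bun',                        // hypothetical launch command
      args: ['run', 'mcp:code-intelligence'],
    },
  },
});
```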

The design also had a section called “Why pre-compute instead of letting each agent query.” That section captured a non-obvious decision the chat session never would: even with MCP attached, agents almost never call MCP tools. So the design routed expensive resolution through a single pre-computed bundle (<worktree>/.pylon/pr-context.json) and granted MCP access as an escape hatch for long tails. That trade-off is the kind of thing that has to live in a durable artefact; it’s the part of the work the agent re-reads weeks later when an edge case surfaces and someone asks why the architecture is the shape it is. The audit was a concrete instance of the codebase-shape thesis: the codebase silently disabled the agent, and only a structural trace could see it.
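A sketch of what that bundle might carry: PrContextBundle, the output path, and the referencesTruncated flag are quoted from the plan and the prompt; every other field name is a hypothetical rendering of “pre-computed resolution with a per-symbol cap.”

```ts
// Written to <worktree>/.pylon/pr-context.json before any reviewer agent
// is dispatched. Field names beyond the quoted ones are illustrative.
interface SymbolContext {
  symbol: string;               // e.g. "createSession"
  definedAt: string;            // file:line of the defining site
  references: string[];         // call sites, capped per symbol
  referencesTruncated: boolean; // true when the cap was hit; the prompt's
                                // MCP fallback is keyed off this flag
}

interface PrContextBundle {
  pr: number;
  headSha: string;
  changedFiles: string[];
  symbols: SymbolContext[];
}
```

The truncation flag is the escape hatch’s trigger: cheap bundle for the common case, live MCP query for the long tail.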

What turned the design into something an agent could execute?

The plan turned the design’s decisions into checkboxed tasks with file paths, verification commands, and a pre-flight section. The accompanying file (2,587 lines, same path family) opens with a “Pre-flight” block that names the exact commands to run before any task starts, then walks task-by-task with frozen inputs and outputs. That artefact is what superpowers:executing-plans and :subagent-driven-development consume verbatim (blog.fsck.com, 2025). The plan is the prompt. The whole skill loop reads from it.

Each plan task carries six fields: name, owner, frozen inputs, frozen outputs, tools allowed, and stop condition. The frozen fields are the contract. Subagents commit to them before dispatch, which is the same discipline that makes wave-based subagent execution survive contact with a real codebase. Without a frozen plan, parallel work disagrees with itself silently. With one, the integration check has something to compare against.

```markdown
## Pre-flight

- bun install
- bun run typecheck
- bun run lint
- bun test src/main/pr-context

## Task: pr-context-builder

- Inputs (frozen): existing PrReviewManager interface; mcpServers option
  pattern from test-manager.ts.
- Outputs (frozen): src/main/pr-context/pr-context-builder.ts;
  PrContextBundle written to <worktree>/.pylon/pr-context.json.
- Tools: Read, Write, Edit, Bash (test-only).
- Stop: bun test pr-context-builder green; <= 20s build budget.
```

That’s the shape, paraphrased. Six fields per task, two of them frozen, a verification command per task. The stop condition matters as much as the inputs. A task without a stop condition turns into open-ended work, which is what makes “agent went off and did stuff” a thing. Stop conditions make the work checkable.
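The plan files are markdown, but the six fields are regular enough to type. A hedged sketch of the contract a subagent commits to; this is one way to make “frozen” machine-checkable, not something superpowers itself ships.

```ts
// One plan task, as a post-wave integration check could see it.
// "Frozen" means committed before dispatch: the check diffs what the
// subagent actually produced against frozenOutputs instead of trusting
// the subagent's own report.
interface PlanTask {
  name: string;            // "pr-context-builder"
  owner: string;           // which subagent the task is dispatched to
  frozenInputs: string[];  // contracts the task may read but not change
  frozenOutputs: string[]; // files the task must produce, exactly
  toolsAllowed: string[];  // e.g. ["Read", "Write", "Edit", "Bash (test-only)"]
  stopCondition: string;   // a runnable check, e.g. "bun test pr-context-builder"
}
```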

Did the numbers actually move?

The fix landed 2026-04-24 in commit aa80b78. Two post-merge sessions. PR 49 (2026-04-26), six reviewer agents, 16 tool calls total, zero MCP calls. PR 31 (2026-04-30), six reviewer agents, 15 tool calls total, zero MCP calls. Per-agent tool activity moved from ~0.6 (pre) to ~2.6 (post), about a 4x lift. Bundle reads of <worktree>/.pylon/pr-context.json show up in every session, five to eight per review.

A 4x activity bump is real, but the more interesting number is the unchanged one. MCP usage stayed at zero in all four sessions, on both sides of the merge. The wiring fix made MCP genuinely reachable; the agents still didn’t call it. They read the bundle and stopped. PR 29 (Apr 24, ~2h after merge) is excluded from the comparison as a same-day-as-merge edge case; the agent processes were likely spawned before the running app picked up the new code path, and per-agent activity dropped to 0.7 calls, which fits that interpretation.

Why does writing it down beat prompt-twiddling?

The unchanged MCP=0 is a finding, not a failure, and it is the input to the next design iteration. The reviewer prompt’s MCP fallback line reads, paraphrased, “if a symbol has referencesTruncated: true, you may call find_references.” That “may” is doing too much work, buried late in a long prompt, against a precomputed bundle that’s almost always sufficient. The agents read the bundle, decide it answers the question, and skip the optional escape hatch. The bundle is sufficient; the fallback prompt is too weak. That’s a prescription for the next brainstorm.

That sentence is only available because the wiring got fixed. Vibe coding cannot distinguish “can’t reach” from “won’t use.” It would have kept saying “use MCP,” kept observing zero MCP calls, and never been able to act on the difference. The design+wire+measure path produces the second-order finding; prompt iteration alone produces a louder version of the first-order failure. That asymmetry is the strongest argument for keeping the artefacts durable.

The next round opens with a different question. Not “why don’t the agents use MCP?” but “what change to the prompt or the bundle would make the fallback fire on the long-tail symbols where the bundle’s per-symbol cap matters?” That question lands inside the brainstorm skill. The skill produces a new design. The design produces a new plan. The plan produces a new measurement. The loop only closes because the artefacts are durable enough for the next round to read them.

When does the spec ossify?

The discipline fails when the artefact is treated as immutable. Designs and plans should be amended in flight when execution surfaces a question. Refusing to amend turns the artefact into ceremony, and ceremony is what people mean when they say “specs slow us down.” The right rhythm is replan between waves, not patch within a wave. If the integration check finds a contract is wrong, edit the design before dispatching the next batch of work, not while the current one is mid-flight.

The symptoms of ossification are easy to name. People stop reading the design. The plan diverges from the code without anyone updating either. The brainstorm becomes a junk drawer of half-formed ideas that never resolve into a sign-off. When any of those show up, the discipline isn’t paying its keep, and the cure is to delete and restart the artefact rather than maintain a fiction.

There is also a real boundary case where the discipline is genuinely overkill. A bug fix that’s smaller than the spec describing it is a tell; a one-file refactor that doesn’t cross a contract is another. Hamel Husain’s longstanding objection that “TDD doesn’t work for LLMs because there is no single correct output” applies to a related class of problems: when the agent itself is the deliverable, formal contracts mismatch the work shape (hamel.dev, 2025). For deterministic code shipped through an agent, the discipline still holds; for the agent’s own behaviour, evals do the job specs would do for code. The right move on follow-up work where the contract is one file is to let the failing test play the role the design plays here. That is the lighter cousin: agentic TDD, the next post in this verification arc.

Where do specs sit relative to skills, CLAUDE.md, and hooks?

Specs are per-feature; skills are per-capability; CLAUDE.md is per-repo; hooks are deterministic substrate. Each is a different durability tier, and conflating them is what produces the bureaucracy people fear when “process” enters the conversation. The shortest version of the rule: behaviour that must run every time is a hook, behaviour that runs on demand is a skill, behaviour that scopes one feature is a spec, behaviour that bounds the project is CLAUDE.md.

Specs live under docs/ or wherever the team puts feature artefacts. They’re ephemeral on the project timescale: a design from six months ago is reference material, not active context. Skills live in .claude/skills/ and load progressively, on demand at the moment the agent needs them rather than at the start of every session. CLAUDE.md sits at the repo root and gets loaded at the start of every session. Hooks sit in .claude/settings.json and run on lifecycle events, regardless of what the prompt says.

The misuse pattern is dumping spec content into CLAUDE.md, where it bloats the always-loaded context, or hand-rolling a skill for something that’s actually a one-shot feature artefact. The corrective is to ask which durability tier the content belongs in, then put it there. Brainstorm output is per-feature: spec. A “how we run tests” workflow is per-capability: skill. “Use the warm-neutral palette” is per-repo: CLAUDE.md. “Block writes outside the worktree” is deterministic: hook. The triad in this post is the spec tier; the rest of the stack already has its own homes. Allocation discipline at the feature level is the same move as allocation discipline at the surface level, one floor down.
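For the hook tier, a minimal sketch of the “block writes outside the worktree” example, assuming the standard Claude Code hooks schema in .claude/settings.json; the guard script is hypothetical, and the idea is that it inspects the tool input and exits with a blocking status when the target path escapes the worktree.

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          { "type": "command", "command": "bun run scripts/guard-worktree.ts" }
        ]
      }
    ]
  }
}
```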

Frequently Asked Questions

Do I need the superpowers plugin to do spec-driven agent development?

No. Superpowers ships the workflow as installable skills (brainstorming, writing-plans, executing-plans, subagent-driven-development); it has 150,000 GitHub stars and was accepted into the Anthropic Claude Code marketplace on 2026-01-15 (GitHub, 2026). The discipline (durable markdown artefacts the agent re-reads) does not require any plugin; the skills just enforce it consistently. Kiro implements the same triad as a hosted IDE on AWS (Kiro, 2026).

How long should a design doc be?

Long enough to surface non-obvious trade-offs and trace wiring. The pylon PR-review design was 294 lines; the plan was 2,587. Designs that fit on a Slack message rarely catch wiring bugs; designs that read like an RFC and never get re-read also fail. A useful test: does the design contain at least one decision that would not have appeared in chat? If not, the design isn’t earning its keep and the work probably belonged in the chat window in the first place.

What does brainstorm output actually look like?

Messy. The point is to surface intent, hidden requirements, and the design space, not to be readable. The superpowers:brainstorming skill runs as a Socratic dialogue that ends in a sign-off; the artefact saved at the end is concise and consumed downstream by the design step. The mess is upstream. If your brainstorm output reads like a polished spec, you skipped the part where you discover what you didn’t know you needed.

Why does this beat dictating tasks in chat?

Because chat is working memory; specs are durable. The agent re-reads a design at every step of a multi-session feature; it cannot re-read a chat message you sent on Tuesday. Augment Code’s spec-driven analysis found teams covering all six core spec areas saw dramatically fewer post-deployment bugs, while teams covering fewer than four dropped below the human-written baseline (Augment Code, 2026). Coverage matters, and durability is what enables coverage.

When should I skip spec-driven and run lighter?

When the failing test is the spec. Bug fixes, small adds, and refactors fit better with agentic TDD: the test is the durable artefact, scoped to one file. Spec-driven shines on multi-session features where wiring needs to be traced before any code is written. The honest test is whether the design would catch something the chat session wouldn’t. If no, ship from the test. If yes, write the spec.

The Real Argument

The pylon arc, in one sentence: a design doc surfaced a missing mcpServers option, the fix moved per-agent tool activity from 0.6 to 2.6, and the unchanged MCP=0 became the input to the next design iteration. That last clause is the part that doesn’t fit on a vendor slide. It is the part that justifies the artefacts.

Vibe coding is not wrong about the model. It is wrong about the workflow. The model is good enough; the workflow is what fails when the work crosses more than one file. The fix isn’t a smarter prompt or a bigger context window. It is three pieces of durable markdown that the agent re-reads at every stage and a measurement step that produces a question the next round starts from. Three artefacts, three jobs, one loop.

If you take one thing from this post: write the design before you open chat. Trace the wiring on paper. Read the actual session transcripts of the last time the workflow ran. The bug your prompt has been hiding is probably one section heading away.

Pick one feature you’re about to vibe-code. Write a 200-line design instead. Run it through the plan skill. Execute. Measure. Compare what you found to what your prompts said. If the design surfaced something chat would have missed, the discipline earned its keep. If not, the work probably belonged in chat all along, and you’ve just produced the cheapest possible test of which one this was.
