Agent Evals: A Test Suite for Your Claude Code Setup

Your hook script logs every tool call to a JSONL file. You can grep it. You can throw a dashboard on top of it. None of that tells you whether the right tool was called. That gap is the difference between observability and evals, and it is wider than most teams notice. LangChain’s State of Agent Engineering 2026 measured 89% of organisations shipping some form of agent observability against 52.4% running offline evals on test sets and 37.3% running online evals (LangChain, 2026). Quality is the production blocker for 32% of teams in the same survey. The 37-point adoption gap is where silent regressions live.

Most evals writing covers output quality: was the answer good, did it match the golden, did the LLM judge correlate with the human. Claude Code makes a second axis auditable. Did the right skill trigger? Did the right subagent spawn? Did the hook actually fire and record the side effect it claimed? Was the MCP server reachable, and did the agent reach for it when the prompt said it should? Four assertions, all deterministic, none of which needs a model call. That second axis is the post. The plan: name the four control-plane evals, wire each to the Claude Code primitive it asserts on, and close the verification arc with a regression eval against the pylon worked example from the previous two posts.

Key Takeaways

89% of teams ship observability, 52.4% ship offline evals, and 37.3% ship online evals; the 37-point gap between the first two is where silent regressions live, and quality is the production blocker for 32% of teams (LangChain, 2026).

Output-quality evals score the agent’s answer; control-plane evals score whether the right machinery fired. Anthropic’s “grade what the agent produced, not the path it took” frames axis A; this post is about axis B (Anthropic Engineering, 2026).

Four control-plane evals cover Claude Code: skill trigger, subagent spawn, hook firing, MCP reachability. None need a model call. Anthropic’s Skill-Creator 2.0 release reported “improved triggering on 5 out of 6 public skills” using exactly this pattern (Anthropic, 2026).

The pylon arc closes here. Spec-driven shipped intent in markdown, agentic-tdd shipped a failing test, evals make the regression durable across N runs.

Why does observability stop short, and where do evals start?

Observability records events. Evals record judgements. “I have logs” is not “I have evals” because the log is data and the eval is a pass/fail decision over data. LangChain’s 2026 survey separates the two cleanly: 89% of organisations ship some form of agent observability, 52.4% run offline evals against test sets, and 37.3% run online evals in production (LangChain, 2026). The 37-point gap is not a knowledge gap. Observability ships in two days (one hook, one JSONL file, one dashboard), and evals ship in two months (a corpus, ground truth, a harness, a regression discipline).
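The log-versus-eval distinction fits in a few lines. This is a minimal sketch, assuming a hypothetical log schema where each JSONL row carries a `session` and a `tool` field; your own hook log will have its own shape. The first function is observability, the second is an eval.

```typescript
// Hypothetical log row shape; a real hook log will have its own schema.
type ToolCallRow = { session: string; tool: string };

// Observability: parse the JSONL log into rows. This is data, not a verdict.
function parseJsonl(log: string): ToolCallRow[] {
  return log
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as ToolCallRow);
}

// Eval: a pass/fail decision over that data.
// Passes only if the expected tool was called in the given session.
function expectedToolWasCalled(
  rows: ToolCallRow[],
  session: string,
  expectedTool: string,
): boolean {
  return rows.some((r) => r.session === session && r.tool === expectedTool);
}

// Two made-up rows standing in for a real hook log.
const sampleLog = [
  '{"session":"s1","tool":"Read"}',
  '{"session":"s1","tool":"Bash"}',
].join("\n");

const obsRows = parseJsonl(sampleLog);
const grepPassed = expectedToolWasCalled(obsRows, "s1", "Grep");
```

The log parses cleanly and would look healthy on a dashboard. The eval still fails, because the prompt in this made-up case called for `Grep` and `Grep` never ran.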

The production cost is real and getting more expensive. Datadog’s State of AI Engineering 2026 reports that agent framework adoption doubled year over year and that around 5% of AI model requests fail in production, with about 60% of those failures driven by capacity limits (Datadog, 2026). Gartner predicts that more than 40% of agentic AI projects will be cancelled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls in a poll of 3,400-plus organisations (Gartner, 2025). The through line is that teams cancel agent work when they can’t demonstrate it stays correct. Cleanlab’s 2025 survey found that fewer than one in three production teams are satisfied with their observability and guardrails, and that 62% plan to improve observability in the next year (Cleanlab, 2025). The next year is now.

How is a control-plane eval different from an output-quality eval?

Output-quality evals score the agent’s answer. Control-plane evals score whether the right machinery fired. Both are evals; only the first is what existing literature covers. Anthropic’s “Demystifying evals for AI agents” makes a precise case for “grading what the agent produced, not the path it took” (Anthropic Engineering, 2026), and that framing is correct for end-user-visible answer quality. It is incomplete for an instrumented runtime where the path is precisely what the user is paying for: the right skill, the right subagent, the right hook, the right tool. In a Claude Code setup, the path is the product.

Why Claude Code makes the second axis uniquely auditable: skills are JSON, subagent dispatches are JSONL events, hooks are scripts with stdout, MCP calls are JSON-RPC payloads. Every machinery surface has a structured payload an assertion can read. The eval is unit-test-shaped because the runtime already emits the data the assertion needs. Output-quality evals don’t get that gift. Shopify Sidekick’s eval team moved an LLM judge’s Cohen’s kappa from 0.02 (barely better than random) to 0.61 against a human baseline of 0.69, and lifted post-fix syntax validation from roughly 93% to 99% along the way (Shopify Engineering, 2025). Calibration cost is real, the discipline is sound, and “Vibe testing, or creating a ‘Vibe LLM Judge’ that’s like ‘Rate this 0-10’, is not going to cut it” is their summary line, not mine. Control-plane evals avoid that calibration entirely, because the assertion is deterministic.

This post is about axis B. Axis A is well-served already, and the right pattern there is the Shopify-style judge-with-rubric calibrated against a human baseline. Control-plane eval is a working definition, not a term of art: an assertion over a structured payload the runtime already emits, where pass/fail is a boolean and the model is not in the loop. The four-eval list in the next section makes that concrete.

What are the four control-plane evals for Claude Code?

Four assertions cover the Claude Code control plane. Skill trigger: when prompt class P arrives, skill S activates with precision above threshold T. Subagent spawn: when intent I appears, subagent A is dispatched and the parent reads back a sane summary on SubagentStop. Hook firing: a PreToolUse or PostToolUse hook actually fired and recorded the expected side effect (a row in JSONL, a denied tool call, a redacted payload). MCP reachability: the named server is up, the named tool is in the catalog, and under condition C the agent actually called it. None of the four needs a model call. All of them are pass/fail over data the runtime already writes.

Skills are JSON. Subagents are JSONL. Hooks are scripts. The repetition matters because each eval reads from a different surface and you want the surface to be the first thing the eval grabs. The skill-trigger eval reads the activated-skills list. The subagent-spawn eval reads the SubagentStop event payload. The hook-firing eval reads the hook’s own log line or exit status. The MCP-reachability eval reads the JSON-RPC catalog and the per-session call tally. Sample size scales with how rare the event is. Skill triggers happen many times per session, so 30-50 prompts is enough. Subagent spawns happen a few times per feature, so 10-15 is enough. Hook firing is per-tool-call, so 10-20 trials cover the matrix. MCP reachability is a singleton ping plus a per-session call rate, so the corpus is small but the cadence is high.
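One way to keep surface and sample size next to each other is a small plan table checked in beside the tests. This is a sketch, not a prescribed format: the type names and the `corpusIsLargeEnough` helper are my own invention, and the numbers restate the ranges above. MCP reachability is left out because it has a cadence rather than a corpus size.

```typescript
// Eval names, surfaces and ranges restate the paragraph above;
// the table structure itself is an assumption, not a Claude Code format.
type EvalName = "skill-trigger" | "subagent-spawn" | "hook-firing";

type EvalPlan = { surface: string; minCases: number; maxCases: number };

const evalPlans: Record<EvalName, EvalPlan> = {
  "skill-trigger": { surface: "activated-skills list", minCases: 30, maxCases: 50 },
  "subagent-spawn": { surface: "SubagentStop event payload", minCases: 10, maxCases: 15 },
  "hook-firing": { surface: "hook log line or exit status", minCases: 10, maxCases: 20 },
};

// True when the corpus reaches the lower bound for that eval.
function corpusIsLargeEnough(evalName: EvalName, corpusSize: number): boolean {
  return corpusSize >= evalPlans[evalName].minCases;
}
```

A corpus of 12 prompts clears the bar for subagent spawn and falls short for skill trigger, which is the point of scaling the sample to how rare the event is.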

There is existing-pattern proof that the first eval works at scale. Anthropic’s Skill-Creator 2.0 release ran a description-optimisation analysis on six public skills and reported “improved triggering on 5 out of 6 public skills” after the change (Anthropic, Improving skill-creator, 2026). That is a control-plane eval, run at scale, with a measurable lift. The closest open-source prior art is TribeAI’s claude-evals, which wires the eval pattern to PreToolUse, PostToolUse, and SubagentStop hook events (TribeAI, 2026). The repo ships the wiring; what’s missing in public writing is the taxonomy. The next four sections name one eval each and ship the assertion shape.

Eval 1: How do you measure skill trigger precision?

Prepare a labelled corpus of 30 to 50 prompts with intended-skill ground truth, run them through the agent in a clean session, record which skill activated, and compute precision, recall, and a confusion matrix. Anthropic’s Skill-Creator 2.0 used exactly this pattern across six public skills, and the description-optimisation analysis improved triggering on five of six (Anthropic, 2026). The labelling step is the only manual piece. Each prompt carries an intent tag and an expected_skill. Ground truth is the human’s call, not the agent’s output.

The clean-session requirement is non-negotiable. Running this against my own claude-blog skill set, the precision read only stabilised after I forced a fresh session per prompt; once context bled across prompts, the eval started measuring session priming, not trigger fitness. Precision matters more than recall for skills that have side effects (a wrongly fired git-commit skill is worse than a missed one). For read-only skills, recall matters more (code-search not firing when it should is the failure mode). The iteration loop is short: re-run after every description change. Anthropic’s “20-50 simple tasks drawn from real failures is a great start” applies directly to the corpus question (Anthropic Engineering, 2026), and the failures come from your own session JSONLs, not synthetic ones.

The assertion shape is small. A paraphrase against the local claude-blog:blog-write skill on this site:

import { test, expect } from "bun:test";
import { runPromptInCleanSession, activatedSkills } from "./harness";

const corpus = [
  { prompt: "write a blog about agent evals, slug claude-code-agent-evals",
    intent: "blog-draft", expected_skill: "claude-blog:blog-write" },
  { prompt: "outline a new post on memory architecture",
    intent: "blog-outline", expected_skill: "claude-blog:blog-outline" },
  // 28 more rows
];

for (const row of corpus) {
  test(`skill trigger: ${row.intent}`, async () => {
    const session = await runPromptInCleanSession(row.prompt);
    expect(activatedSkills(session)).toContain(row.expected_skill);
  });
}

Three things to notice. The harness does the heavy lifting (clean session, JSONL parsing); the assertion is one line. The corpus is checked into the repo alongside the test, so future re-runs are reproducible. The expected_skill field is the ground truth, not the metric: precision and recall are computed over the corpus once all rows have run.
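Once every row has run, the metrics are plain counting. A minimal sketch of the per-skill precision, recall, and confusion matrix, assuming the harness hands back one result per prompt with the expected and the activated skill; the `TriggerResult` shape and the sample data are illustrative, not taken from a real run.

```typescript
// One row per corpus prompt after the run. `activated` is null when no skill fired.
// Field names are illustrative; adapt them to whatever your harness returns.
type TriggerResult = { expected: string; activated: string | null };

type SkillScore = { precision: number; recall: number };

// Confusion matrix keyed as matrix[expected][activated] = count.
function confusionMatrix(
  results: TriggerResult[],
): Record<string, Record<string, number>> {
  const matrix: Record<string, Record<string, number>> = {};
  for (const r of results) {
    const got = r.activated ?? "(none)";
    const counts = matrix[r.expected] ?? (matrix[r.expected] = {});
    counts[got] = (counts[got] ?? 0) + 1;
  }
  return matrix;
}

// Per-skill precision = TP / (TP + FP); recall = TP / (TP + FN).
// Both default to 0 when the denominator is empty.
function scoreSkill(results: TriggerResult[], skill: string): SkillScore {
  const tp = results.filter((r) => r.expected === skill && r.activated === skill).length;
  const fp = results.filter((r) => r.expected !== skill && r.activated === skill).length;
  const fn = results.filter((r) => r.expected === skill && r.activated !== skill).length;
  return {
    precision: tp + fp === 0 ? 0 : tp / (tp + fp),
    recall: tp + fn === 0 ? 0 : tp / (tp + fn),
  };
}

// Made-up results for four prompts.
const triggerResults: TriggerResult[] = [
  { expected: "claude-blog:blog-write", activated: "claude-blog:blog-write" },
  { expected: "claude-blog:blog-write", activated: "claude-blog:blog-outline" },
  { expected: "claude-blog:blog-outline", activated: "claude-blog:blog-outline" },
  { expected: "claude-blog:blog-outline", activated: null },
];

const writeScore = scoreSkill(triggerResults, "claude-blog:blog-write");
const outlineScore = scoreSkill(triggerResults, "claude-blog:blog-outline");
```

On this toy data, blog-write scores precision 1 and recall 0.5 (it never fired wrongly, but missed one prompt), while blog-outline scores 0.5 on both because it stole one blog-write prompt and missed one of its own. The confusion matrix shows which skill did the stealing, which is the cell you fix in the description.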

Eval 2: How do you assert a subagent was spawned correctly?

When a prompt carries an intent that should dispatch a subagent, assert that the subagent name in the Task call matches the expected name and that the parent reads back a non-empty summary on SubagentStop. The eval surface is the SubagentStop event payload, which is structured JSON with the subagent’s stdout already captured. Three failure modes drive three assertions. The parent dispatches the wrong subagent: compare expected vs actual subagent name. The parent dispatches no subagent and inlines the work: assert the Task tool was called at all. The parent dispatches the right subagent but ignores the result: assert the parent’s next message references the subagent’s output by content, not just by id.

Sample size shrinks because subagent spawns are rarer. Ten to fifteen prompts cover the matrix for most setups, and the per-case stakes are higher because a wrongly dispatched subagent typically blows the cost or quality budget for a feature, not for a session. Hamel Husain frames the underlying point cleanly: “improving the infrastructure around the agent mattered more than improving the model” (Hamel Husain, Evals Skills for Coding Agents, 2026). Subagent spawn is dispatch infrastructure. The eval is about whether the dispatch landed where the intent said it should, not whether the model could reason its way to the right pick.

A paraphrase reading the SubagentStop payload:

import { test, expect } from "bun:test";
import { runPromptInCleanSession } from "./harness";

test("research intent dispatches blog-researcher subagent", async () => {
  const session = await runPromptInCleanSession(
    "research stats for the agent evals post"
  );
  // modes 1 and 2: a subagent was dispatched, and it was the right one
  const stops = session.events.filter((e) => e.type === "SubagentStop");
  expect(stops.length).toBeGreaterThan(0);
  expect(stops[0].subagent).toBe("claude-blog:blog-researcher");
  expect(stops[0].summary.length).toBeGreaterThan(200);
  // mode 3: parent's next message references the summary
  const next = session.parentMessageAfter(stops[0]);
  expect(next).toMatch(/Tier ?1|verified stats|sample size/i);
});

The third assertion catches the silent failure mode that breaks dashboards. A subagent runs, returns a summary, and the parent ignores it; no event flags this, but the eval does. The downstream behaviour is what matters; the dispatch and stop are infrastructure for that behaviour. The same artefact discipline scales out into parallel subagent execution once the parent is dispatching multiple subagents per wave, and the spawn-vs-stay decision is the upstream call that this eval measures the correctness of.
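When the test goes red, it helps to know which of the three failure modes fired. A small sketch of a classifier that names the mode; the `SpawnObservation` shape is a simplified stand-in for whatever the harness extracts from the session, and the evidence pattern is whatever content you expect the parent to quote.

```typescript
// Simplified, hypothetical view of one session; real payloads carry more fields.
type SpawnObservation = {
  dispatchedSubagent: string | null; // null when the parent inlined the work
  summary: string;                   // summary returned on SubagentStop
  parentNextMessage: string;         // parent's first message after the stop
};

type SpawnVerdict = "pass" | "no-dispatch" | "wrong-subagent" | "ignored-result";

// `evidence` must not use the global flag, so repeated test() calls stay stateless.
function classifySpawn(
  obs: SpawnObservation,
  expectedSubagent: string,
  evidence: RegExp,
): SpawnVerdict {
  if (obs.dispatchedSubagent === null) return "no-dispatch";
  if (obs.dispatchedSubagent !== expectedSubagent) return "wrong-subagent";
  // Right subagent, but an empty summary or a parent that never cites it.
  if (obs.summary.length === 0 || !evidence.test(obs.parentNextMessage)) {
    return "ignored-result";
  }
  return "pass";
}

// Made-up observation: right subagent, useful summary, parent moves on regardless.
const ignoredCase: SpawnObservation = {
  dispatchedSubagent: "claude-blog:blog-researcher",
  summary: "Verified stats: three Tier 1 sources found.",
  parentNextMessage: "Moving on to the outline.",
};

const spawnVerdict = classifySpawn(
  ignoredCase,
  "claude-blog:blog-researcher",
  /Tier ?1|verified stats/i,
);
```

Here the verdict is `ignored-result`: dispatch and stop both look fine in the event stream, and only the content check on the parent's next message exposes the problem.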

Eval 3: How do you assert a hook actually fired?

Hooks are the only deterministic substrate in Claude Code. A CLAUDE.md line is advisory; a hook is code that runs every time the lifecycle event fires. Asserting a hook fired is closer to a unit test than a probabilistic judgement: check the side effect (a row in the hook’s log, a non-zero exit code, a mutated payload). The hook either ran or it didn’t, and the side effect either landed or it didn’t.

PreToolUse assertions cover the cases that protect production. A denylist hook actually blocked the tool when triggered. A logger hook actually wrote a row before the call. A budget hook actually counted the tokens. PostToolUse assertions cover the cases that shape the result. A redaction hook actually mutated the result. A notification hook actually fired with the right payload. A summary hook actually wrote the JSONL row the parent will read. SubagentStop is covered in eval 2 and reuses that surface; don’t double-count it here.
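For the denylist case, the side effect to check is a pair: a deny decision was recorded before the call, and the tool never ran. A minimal sketch, assuming a hypothetical hook log in which every row carries a `callId`, the lifecycle `event`, and (for the denylist hook) a `decision`; none of these field names come from Claude Code itself.

```typescript
// Hypothetical hook log row; the field names are assumptions for illustration.
type HookLogRow = {
  callId: string;
  event: "PreToolUse" | "PostToolUse";
  tool: string;
  decision?: "allow" | "deny"; // written only by the PreToolUse denylist hook
};

// A blocked call leaves a PreToolUse "deny" row and no PostToolUse row,
// because the tool never ran.
function callWasBlocked(rows: HookLogRow[], callId: string): boolean {
  const denied = rows.some(
    (r) => r.callId === callId && r.event === "PreToolUse" && r.decision === "deny",
  );
  const ran = rows.some((r) => r.callId === callId && r.event === "PostToolUse");
  return denied && !ran;
}

// Made-up log: call c1 was denied, call c2 was allowed and completed.
const hookLog: HookLogRow[] = [
  { callId: "c1", event: "PreToolUse", tool: "Bash", decision: "deny" },
  { callId: "c2", event: "PreToolUse", tool: "Read", decision: "allow" },
  { callId: "c2", event: "PostToolUse", tool: "Read" },
];

const c1Blocked = callWasBlocked(hookLog, "c1");
const c2Blocked = callWasBlocked(hookLog, "c2");
```

Checking both halves matters. A deny row followed by a PostToolUse row for the same call means the hook logged a decision it did not enforce, and that is the regression this eval exists to catch.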

A paraphrase asserting a redaction PostToolUse hook fired and rewrote a payload:

import { test, expect } from "bun:test";
import { runToolCallInCleanSession } from "./harness";

test("redaction hook scrubs API keys from tool results", async () => {
  const session = await runToolCallInCleanSession({
    tool: "Bash",
    args: { command: "echo SECRET_KEY=sk-test_42abc..." },
  });
  // exactly one PostToolUse event, from the redaction hook, with a mutated result
  const events = session.events.filter((e) => e.type === "PostToolUse");
  expect(events.length).toBe(1);
  expect(events[0].hook).toBe("redact-secrets");
  expect(events[0].mutated).toBe(true);
  expect(events[0].toolResult).not.toMatch(/sk-test_42abc/);
});

The deeper substrate argument lives in the next post in this queue. The illustrative gap, framed as colour rather than a citation anchor: a community write-up reports that without hooks, safety rules in CLAUDE.md get followed about 70% of the time, and with the hook the same rules block 100% of the time (ofox.ai, 2026). Treat that as direction, not data. The eval-level argument doesn’t need a percentage to land; it needs the assertion shape, which is the four lines of expect above.

Eval 4: Is your MCP server actually reachable, and did the agent call it?

MCP reachability is two checks. The cheap one asks whether the named server is up and the named tool is in the advertised catalog: a bare ping eval, run on every test session boot. The interesting one asks whether the agent actually reached for the tool when the prompt should have triggered it. The cheap check fails closed when the wiring breaks (this is what the spec-driven post’s MCP=0 finding was hiding). The interesting check fails open when the wiring works but the prompt fallback is too weak (this is what the agentic-tdd post fixed). Both checks belong in the eval, because a green ping with a zero call rate is still a regression and the spec-driven post made that visible without needing the eval.
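The two checks combine into one verdict. A sketch under stated assumptions: the `McpCheckInput` shape is hypothetical, and the harness is assumed to have already pinged the server, listed its tools, and counted the sessions where the prompt should have triggered the call.

```typescript
// Hypothetical input; the harness is assumed to have gathered these values.
type McpCheckInput = {
  serverUp: boolean;
  catalog: string[];        // tool names the server advertises
  eligibleSessions: number; // sessions where the prompt should have triggered the tool
  sessionsWithCall: number; // of those, sessions where the tool was actually called
};

type McpVerdict = "pass" | "unreachable" | "under-called";

function checkMcp(
  input: McpCheckInput,
  toolName: string,
  minCallRate: number,
): McpVerdict {
  // Cheap check: server is up and the tool is in the advertised catalog.
  if (!input.serverUp || !input.catalog.includes(toolName)) return "unreachable";
  // Interesting check: the agent reached for the tool often enough.
  const callRate =
    input.eligibleSessions === 0 ? 0 : input.sessionsWithCall / input.eligibleSessions;
  return callRate >= minCallRate ? "pass" : "under-called";
}

// Made-up case: the ping is green, but the agent never calls the tool.
const greenPingZeroCalls = checkMcp(
  { serverUp: true, catalog: ["find_references"], eligibleSessions: 10, sessionsWithCall: 0 },
  "find_references",
  0.5,
);
```

The made-up case returns `under-called`, which is the green-ping-zero-call-rate regression described above. The 0.5 threshold is arbitrary; pick the floor that fits your own prompt contract.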

The worked regression closes the verification arc. Against the pylon pr-review-code-intelligence reviewer, sample N reviewer sessions where referencesTruncated: true appears in the bundle, and assert that find_references is called at least once on each truncated symbol. Run the eval on the spec-driven snapshot (post-merge, pre-prompt-tightening): the call rate sits at zero. Bundle reads happen in every session, MCP calls do not. Run the eval after the agentic-tdd fix lands (the failing test that tightened the prompt replaced “you may call find_references” with “MUST call find_references for: {names}”): the call rate moves to a non-zero baseline. The chart is a dial that closes the loop the spec-driven post opened.
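The regression assertion itself is per symbol, then per session, then a rate. A minimal sketch, assuming each sampled reviewer session has been reduced to two lists; the `ReviewerSession` shape and the symbol names are hypothetical, and the sample data is made up rather than taken from the pylon runs.

```typescript
// Hypothetical reduction of one reviewer session.
type ReviewerSession = {
  truncatedSymbols: string[];    // symbols whose bundle had referencesTruncated: true
  findReferencesCalls: string[]; // symbol names passed to find_references
};

// A session passes when every truncated symbol got at least one find_references call.
function sessionCoversTruncated(s: ReviewerSession): boolean {
  return s.truncatedSymbols.every((sym) => s.findReferencesCalls.includes(sym));
}

// Share of eligible sessions that pass. Sessions with no truncated symbols
// are excluded, since they cannot exercise the behaviour.
function truncatedCallRate(sessions: ReviewerSession[]): number {
  const eligible = sessions.filter((s) => s.truncatedSymbols.length > 0);
  if (eligible.length === 0) return 0;
  return eligible.filter(sessionCoversTruncated).length / eligible.length;
}

// Made-up sample: one full pass, one partial miss, one ineligible session.
const reviewerSessions: ReviewerSession[] = [
  { truncatedSymbols: ["parseBundle"], findReferencesCalls: ["parseBundle"] },
  { truncatedSymbols: ["parseBundle", "loadIndex"], findReferencesCalls: ["loadIndex"] },
  { truncatedSymbols: [], findReferencesCalls: [] },
];

const observedRate = truncatedCallRate(reviewerSessions);
```

On this toy sample the rate is 0.5: one of the two eligible sessions covered every truncated symbol. A session that calls `find_references` on some truncated symbols but not all still fails, which is stricter than counting calls per session.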

The verification-arc thesis lands once, and only here. Spec-driven captured intent in markdown. Agentic-tdd captured one behaviour change in a failing test. Evals capture correctness across N runs. Three durability tiers, one verification pillar, one continuous artefact thread. The post-fix target is drawn as an outlined pass zone because the prompt-tightening PR is still rolling; the eval itself is real and runnable, and the chart will be re-run with an observed bar once N reaches 30 truncated cases. Even with the target shown as a zone, the discipline is the same. The cost-side counterpart of this discipline (asserting agent runs stay inside their token budget) is what the prior cost-side experiment makes durable, and the regression discipline on the time axis is what backtesting ships.

Frequently Asked Questions

Aren’t observability and evals the same thing?

No. Observability records what happened (a JSONL row, a span, a metric); evals are a pass/fail judgement over what happened. LangChain’s State of Agent Engineering 2026 separates them in the survey: 89% of organisations ship observability, 52.4% ship offline evals, 37.3% ship online evals (LangChain, 2026). The gap exists because the second one is harder to ship, not because nobody knows the difference.

How big should the eval corpus be?

Anthropic’s “Demystifying evals for AI agents” recommends “20-50 simple tasks drawn from real failures” as a starting point (Anthropic Engineering, 2026). For control-plane evals, sample size scales with how rare the event is. Skill triggers happen many times per session and need 30 to 50 prompts. Subagent spawns happen a few times per feature and 10 to 15 is enough. MCP reachability is a singleton ping; one is enough.

Do I need an LLM judge for any of this?

No, not for the four control-plane evals. Skill activation, subagent dispatch, hook firing, and MCP reachability are all deterministic events with structured payloads, and the assertions are unit-test-shaped. LLM judges are the right discipline for output-quality evals. Shopify’s Sidekick team moved a judge from Cohen’s kappa 0.02 to 0.61 against a human baseline of 0.69 (Shopify Engineering, 2025); the calibration cost is real. Skip the judge until the four deterministic evals have shipped.

What does this look like glued to a hook?

Each control-plane eval is a hook script that reads the relevant payload (the prompt, the SubagentStop event, the PostToolUse event, the MCP advertise call) and emits a pass/fail row to a JSONL file. TribeAI’s claude-evals repo wires the pattern to PreToolUse, PostToolUse, and SubagentStop today (TribeAI, 2026); the wiring is reusable. The contribution of this post is the taxonomy that says which assertion to write at each event.
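The core of such a hook script is small. A sketch of the judging half only, assuming a simplified SubagentStop payload with `subagent` and `summary` fields; the payload shape, the `EvalRow` format, and the function names are all hypothetical, and reading the payload and appending to the file are left out.

```typescript
// Hypothetical shapes; a real hook payload will carry more fields.
type StopPayload = { subagent: string; summary: string };
type EvalRow = { eval: string; passed: boolean; detail: string };

// Judge one SubagentStop payload against the expected subagent name.
function evalSubagentStop(payload: StopPayload, expectedSubagent: string): EvalRow {
  const rightName = payload.subagent === expectedSubagent;
  const hasSummary = payload.summary.trim().length > 0;
  return {
    eval: "subagent-spawn",
    passed: rightName && hasSummary,
    detail: !rightName
      ? `expected ${expectedSubagent}, got ${payload.subagent}`
      : hasSummary
        ? "ok"
        : "empty summary",
  };
}

// One JSON object per line, newline-terminated, ready to append to a JSONL file.
function toJsonlLine(row: EvalRow): string {
  return JSON.stringify(row) + "\n";
}

const evalLine = toJsonlLine(
  evalSubagentStop(
    { subagent: "claude-blog:blog-researcher", summary: "Three sources verified." },
    "claude-blog:blog-researcher",
  ),
);
```

Keeping the judgement a pure function means the same code runs inside the hook and inside the test suite, so the eval row in the JSONL file and the assertion in CI cannot drift apart.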

The Real Argument

Observability and evals are different things, and most teams ship only the first; the 37-point adoption gap is where silent regressions live. Output-quality evals (judges, golden answers, kappa) score the agent’s answer; control-plane evals score the agent’s machinery. Four control-plane evals cover the Claude Code surface: skill trigger, subagent spawn, hook firing, MCP reachability. None of them needs a model call, because the runtime already writes the data the assertion reads. The pylon arc closes here: spec-driven captured intent in markdown, agentic-tdd captured one behaviour change in a failing test, and the regression eval captures correctness across N runs.

The arc in one sentence: spec-driven, agentic-tdd, and evals are three durability tiers of the same artefact-driven discipline, scaled to the work shape they fit. A multi-session feature earns three artefacts. A one-file follow-up earns one failing test. A behaviour you need to keep correct across N runs earns an eval. None of the three tiers is a coding style; all of them are pass/fail gates the agent can re-read on the next session. The same engineering discipline survives the framework churn that paradigm-chasers spend their time on, because the artefact is the durable thing, not the prompt.

If you take one thing from this post: pick one of the four control-plane evals, wire it to the matching hook event, run it on your last 30 sessions, and ship the result as a row in your repo. Don’t reach for the LLM judge yet. The deterministic evals are cheaper, faster, and they catch the failure modes most setups silently ship. The next post in the queue (agent-memory-architecture, Thu 2026-05-21) steps off the verification pillar to the context pillar. This one closes the verification arc with a runnable eval against the same pylon thread that started it.