
Does Your CLAUDE.md Actually Help? The Research Says Maybe Not

21 min read


Every serious team now has a CLAUDE.md. Anthropic recommends one. OpenAI standardised AGENTS.md. In December 2025 the Linux Foundation made the format a founding pillar of a new foundation. Then someone ran the controlled study, and the file made the agents worse.

A team from ETH Zurich and LogicStar.ai tested four frontier coding agents under three conditions: no context file, an LLM-generated one, and a developer-written one (arXiv:2602.11988, Gloaguen, Mündler, Müller, Raychev, Vechev, Feb 2026). Across SWE-bench tasks and a new 138-issue benchmark, LLM-generated context files reduced average resolution rate by 0.5% on SWE-bench Lite and 2% on the new benchmark, while raising inference cost 20% and 23%. Developer-written files did better, but improved no agent by much, and improved nothing for Claude Code. The paper’s own verdict: “unnecessary requirements from context files make tasks harder.”

That is awkward, because adoption is enormous and a different peer-reviewed study disagrees. AGENTS.md is committed in more than 60,000 open-source projects (Linux Foundation, Dec 2025). And a field study of 124 pull requests found AGENTS.md cut median runtime 28.64% and output tokens 16.58% at comparable completion (arXiv:2601.20404, 2026). So which is it?

Both, and the resolution is the whole point. Success rate and operational efficiency are different ledgers. The file changes the agent’s behaviour, not just its prompt. And a context file earns its tokens only when it carries leverage the agent cannot cheaply recover for itself. This post sits with both numbers, names the mechanism, gives a four-test rule you can run line by line, and ends with an honest audit of the advice on this very blog.

Key Takeaways

  • A controlled study found LLM-generated context files cut success ~0.5% (SWE-bench Lite) and ~2% (the new AGENTbench) and raised cost 20-23% (arXiv:2602.11988, Feb 2026).
  • A field study found AGENTS.md cut median runtime 28.64% and output tokens 16.58% at comparable completion (arXiv:2601.20404, 124 PRs). Different ledger, not a contradiction.
  • The cost is behavioural: context files add 2.45-3.92 extra steps per task (more testing, more traversal), not just prompt tokens (arXiv:2602.11988).
  • The one exception: repositories with no existing documentation gained +2.7% on average from auto-generated context. A context file is a compensation mechanism, not a booster.

A controlled study found your context file makes agents worse

In the most rigorous test to date, LLM-generated repository context files reduced coding-agent task success relative to no context at all, while raising inference cost by 20-23% (arXiv:2602.11988, Feb 2026). That is the uncomfortable finding, stated without softening. It is also narrow, and the next section holds that boundary. First, the numbers.

The setup was clean. ETH Zurich and LogicStar.ai ran four frontier coding agents across three conditions (no context, LLM-generated context, developer-written context) and two task sources: established SWE-bench tasks with generated files following published agent-developer recommendations, and a new benchmark, AGENTbench, of 138 issues curated from 5,694 PRs across 12 repositories that had already shipped developer-committed context files. The second source matters. It tests the file a real team actually wrote, on a real issue from that team’s repo.

The result, exact: LLM-generated context “reduced [resolution rate] by 0.5% and 2% on average on SWE-bench Lite and AGENTbench, respectively” and “leads to a cost increase of 20% and 23% on average, respectively” (arXiv:2602.11988). Developer-written files “outperform the LLM-generated ones for all four agents” and beat no-context “for all agents but Claude Code.” Read that last clause twice. For the agent most readers here run daily, the carefully human-written file produced no measurable gain.

The honest framing is a scope statement, not a headline. This measures task success on issue-resolution benchmarks. It does not measure onboarding speed, convention adherence on greenfield code, or whether a human teammate finds the file useful. Those are real and out of scope. What the study kills is the lazy assumption that adding a context file is free or strictly positive. It is neither.

What did the study actually measure?

The finding is real but bounded. It measures issue-resolution success and inference cost on Python software-engineering tasks, under conditions where a context file’s structural value is low because the agent can recover most of what the file says on its own (arXiv:2602.11988). Read it as a result about this kind of file on this kind of task, not a verdict on all documentation.

The benchmark design earns the result. AGENTbench draws its 138 issues from repositories that already had committed context files, so the test is honest: does the file the team actually shipped help the agent close this specific issue? That is the real-world question, not a synthetic one. The agents read the files. They followed them. They did more work because of them, and more work was not better.

The specific dead weight is named in the paper: “detailed directory enumerations … don’t meaningfully reduce the number of steps before agents reach the relevant files.” The map does not speed up the journey. A capable agent navigates a codebase fine without a hand-drawn tree, and it still pays to read the tree you wrote.

What the study does not measure deserves equal airtime, because pretending otherwise would be the same overreach the post is arguing against. It says nothing about human onboarding, about brand and convention consistency on net-new code, about multi-session continuity, or about the value of an AGENTS.md as a human-readable contract between teammates. Those benefits are real. They are simply not what this benchmark scored, and a context file that exists only for those reasons is defensible on those grounds.

Why doesn’t the counter-study contradict it?

It looks like a contradiction and is not. A second peer-reviewed study found AGENTS.md cut median runtime 28.64% and output tokens 16.58% at comparable completion (arXiv:2601.20404, 124 PRs across 10 repositories). The reconciliation is the key insight of this whole post: the two studies measure different ledgers. One weighs correctness; the other weighs efficiency. A file can move them in opposite directions at once.

The counter-study is “On the Impact of AGENTS.md Files on the Efficiency of AI Coding Agents” (arXiv:2601.20404; Lulla, Mohsenimofidi, Galster, Zhang, Baltes, Treude; submitted Jan 2026, JAWs@ICSE 2026). It ran agents with and without AGENTS.md and reported that “the presence of AGENTS.md is associated with a lower median runtime (Δ28.64%) and reduced output token consumption (Δ16.58%), while maintaining a comparable task completion behavior.” Faster, cheaper, same completion.

Here is why both hold. Study A (2602.11988) asks “did the task get solved correctly?” and reads off resolution rate and total cost. Study B (2601.20404) asks “how long, and how many tokens, to reach comparable completion?” and reads off runtime and output tokens. Those are different questions with different denominators.

Even the cost numbers point in opposite directions because they count different things: Study A counts the extra exploration steps the file induces, so total inference cost rises; Study B counts output tokens at a fixed completion bar, so per-unit output falls. A file can make an agent finish faster while finishing correctly slightly less often. Nothing in that sentence is paradoxical once you stop treating “success” and “speed” as one axis.
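To make the two-ledger point concrete, here is a minimal sketch that scores the same set of runs both ways. Every number in it is invented for illustration and comes from neither paper; the `Run` fields, the function names, and the exaggerated four-run samples are all assumptions chosen so the opposite movements are visible at a glance.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Run:
    resolved: bool      # did the agent's patch resolve the issue?
    runtime_s: float    # wall-clock seconds for the run
    output_tokens: int  # tokens the agent generated
    cost_cents: int     # total inference cost, input tokens included


def correctness_ledger(runs):
    """The kind of numbers Study A reads off: resolution rate and total cost."""
    return {
        "resolution_rate": sum(r.resolved for r in runs) / len(runs),
        "total_cost_cents": sum(r.cost_cents for r in runs),
    }


def efficiency_ledger(runs):
    """The kind of numbers Study B reads off: median runtime and output tokens."""
    return {
        "median_runtime_s": median(r.runtime_s for r in runs),
        "median_output_tokens": median(r.output_tokens for r in runs),
    }


# Toy data, NOT from either study. In this invented sample the runs with a
# context file finish faster and emit fewer output tokens, but read more
# input (so they cost more) and resolve fewer issues.
TOY_WITHOUT_FILE = [
    Run(True, 100.0, 1000, 50),
    Run(True, 120.0, 1200, 60),
    Run(True, 140.0, 1400, 70),
    Run(False, 160.0, 1600, 80),
]
TOY_WITH_FILE = [
    Run(True, 80.0, 900, 70),
    Run(True, 90.0, 1000, 75),
    Run(False, 100.0, 1100, 80),
    Run(False, 110.0, 1200, 85),
]

for label, runs in (("without file", TOY_WITHOUT_FILE), ("with file", TOY_WITH_FILE)):
    print(label, correctness_ledger(runs), efficiency_ledger(runs))
```

In this toy sample the efficiency ledger improves while the correctness ledger gets worse, which is the shape of the apparent disagreement between the two studies. The magnitudes here are far larger than anything either paper reports.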

The practical read is a fork. If your bottleneck is wall-clock and token spend on tasks that complete anyway, the field data says a tight AGENTS.md can help. If your bottleneck is correctness on genuinely hard issues, the controlled data says a generated file can quietly cost you. Most teams have both bottlenecks at once, which is exactly why the rule later in this post is per-line, not per-file.

How does a context file change agent behaviour?

The reason a context file can hurt is not that it adds tokens to the prompt. It is that it changes what the agent does. Context files induce more exploration: 2.45 to 3.92 extra steps per task on average, more file traversal, more test runs, more reasoning output (arXiv:2602.11988). The agent treats a description of the codebase as an instruction to verify.

One number needs care, because the copy-paste tier got it wrong. The paper reports that context files “increase the # steps in every setting on average by 2.45 and 3.92 steps.” That is an absolute increase of two-to-four steps, not a 2.45x multiplier. Some secondary write-ups reported a “factor of 2.45-3.92.” They misread the paper. The real effect is smaller and more interesting: a handful of extra actions per task, repeated across thousands of tasks, is where the 20-23% cost lives.
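The gap between the two readings is easy to see with a little arithmetic. The sketch below applies the reported values both ways; the baseline step counts are hypothetical, since the figures quoted here give only the increase and not the baseline.

```python
# Reported average increases in steps per task (arXiv:2602.11988).
EXTRA_STEPS = (2.45, 3.92)

# Hypothetical baseline step counts, chosen only to illustrate the gap.
HYPOTHETICAL_BASELINES = (10, 20, 40)


def absolute_reading(baseline_steps, extra_steps):
    """Correct reading: the file adds a fixed number of steps."""
    return baseline_steps + extra_steps


def multiplier_reading(baseline_steps, factor):
    """The misreading: treating the same number as a multiplier."""
    return baseline_steps * factor


for baseline in HYPOTHETICAL_BASELINES:
    for value in EXTRA_STEPS:
        print(
            f"baseline {baseline:>2} steps, value {value}: "
            f"absolute -> {absolute_reading(baseline, value):.2f} steps, "
            f"misread as multiplier -> {multiplier_reading(baseline, value):.2f} steps"
        )
```

For any baseline above a few steps the misreading overstates the effect several times over, and the overstatement grows with the length of the run.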

Why does exploration cost so much? Because every added step is tokens, latency, and a fresh chance to go wrong. Call it cost without leverage. When the file contains information the agent can recover cheaply (where files live, what the directory tree looks like), the agent pays to read the description and then re-derives the same fact anyway, because the description was never load-bearing for the decision in front of it.

An agent that reads “this project uses strict typing and 90% coverage” does not simply know that and move on. It goes and checks. It runs the suite. It traverses to confirm. The aspirational context file is a to-do list the agent dutifully works through.

This reframes the standard “keep it short” advice. Short is not about prompt economy; a few hundred tokens of input is rounding error against a multi-turn agent run. Short is about behaviour control. Every line is a potential exploration trigger. The question per line is not “does this cost tokens to include” but “will this make the agent do more, and is that more worth it?”

When does a context file clearly help?

There is one documented exception, and it defines what a context file is actually for. Repositories with no existing documentation gained +2.7% on average from auto-generated context (arXiv:2602.11988). The sign flips precisely where there was nothing for the agent to recover on its own. A context file is a compensation mechanism for missing information, not a performance booster layered on top of a repo that already explains itself.

The paper is direct about the zero-docs case: there, LLM-generated context files “consistently improve performance by 2.7% on average.” Think about why. In an undocumented repo, even a generated overview adds signal the agent could otherwise get only by expensive exploration. In a well-documented repo, the same overview duplicates what the agent reads for free from the code and the existing docs, and the duplication is what costs steps.

That is the bridge to the rule. The exception is not “use generated files in some repos and not others.” It is sharper: a context file is worth its tokens exactly to the degree it carries leverage the agent cannot cheaply recover. The zero-docs repo is the clean instance of that condition. The four-test rule in the next section generalises it to every line of every file.

The same principle explains why llms.txt nulls out, and the contrast sharpens both. An llms.txt is a passive file published in the hope a crawler fetches it. SE Ranking studied roughly 300,000 domains and found no measurable effect on AI citation frequency, 10.13% adoption, and only 1 of the 50 most-cited domains carrying the file (SE Ranking, 2025); Trakkr’s study of 37,894 domains found 6.8 citations with the file versus 6.7 without, p=0.85, identical median (Trakkr, 2026).

A file a crawler might read is a different artifact from a file loaded into an active reasoning loop on every request. Same “publish markdown for the machines” instinct, opposite efficacy conditions. The mistake is treating documentation as a thing you write once for any machine, rather than asking which machine, in which loop, needs which fact.

When does your CLAUDE.md earn its tokens?

Stop asking “should I have a CLAUDE.md.” Ask, per line, whether the line carries leverage the agent cannot cheaply recover, as a minimal requirement, that will not go stale, at a per-request cost worth its frequency. Four tests. A line passes all four or it comes out. Generated files fail tests one and two by default, which is why they were net-negative everywhere except the zero-docs case (arXiv:2602.11988).

Test 1, leverage. Can the agent recover this in one or two tool calls? Directory maps, “auth lives in src/auth,” file-tree dumps: the agent finds these for free, and the study showed directory enumerations “don’t meaningfully reduce the number of steps.” If the agent can derive it cheaply, the line is cost without leverage. Cut it.

Test 2, minimal requirement. Is this a constraint the task genuinely needs, or an aspiration? “Always write tests for X” is a requirement the agent will go satisfy and verify, adding steps. The paper’s conclusion is blunt: “unnecessary requirements from context files make tasks harder, and human-written context files should describe only minimal requirements.” If a line is not load-bearing for correctness, it is exploration bait.

Test 3, recoverability and staleness. Will this line drift out of truth? A path that gets renamed turns the file into active misdirection, and stale leverage is worse than no leverage. Prefer invariants (commands that rarely change, hard constraints, the one gotcha that bit you twice) over coordinates (paths, line numbers, anything that moves when the code moves).

Test 4, frequency versus per-request cost. Every line is paid on every request, relevant or not. A line that is load-bearing 1% of the time still taxes 100% of requests. High-frequency leverage stays. Rare-but-critical leverage probably belongs in a progressively disclosed skill or a per-task handoff, not the always-loaded file. This is the eviction criterion behind the promotion ladder for prompts, skills, hooks, and tools: the four-test rule decides what leaves the file, and agent skills with progressive disclosure are where the evicted-but-real leverage goes.
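The four tests can be written down as a small record per line plus a verdict function. This is a sketch of one way to keep the audit honest, not a tool from either paper: the `LineAudit` fields, the `verdict` function, the 0.25 frequency threshold, and the example lines (including the `make test-fast` command) are all hypothetical, and the answers to each test still come from human judgement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineAudit:
    text: str
    cheaply_recoverable: bool  # Test 1: agent can find it in one or two tool calls
    minimal_requirement: bool  # Test 2: the task genuinely needs this constraint
    likely_to_go_stale: bool   # Test 3: a coordinate that moves when the code moves
    relevant_fraction: float   # Test 4: share of requests where the line matters


def verdict(audit, min_relevant_fraction=0.25):
    """Return 'cut', 'demote to skill', or 'keep' for one line.

    The threshold is an assumed placeholder; pick one that fits your workload.
    """
    if audit.cheaply_recoverable:
        return "cut"  # cost without leverage
    if not audit.minimal_requirement:
        return "cut"  # aspiration, not a requirement: exploration bait
    if audit.likely_to_go_stale:
        return "cut"  # stale leverage is worse than none
    if audit.relevant_fraction < min_relevant_fraction:
        return "demote to skill"  # real leverage, too rare for the always-loaded file
    return "keep"


# Hypothetical example lines with hand-assigned answers to the four tests.
EXAMPLES = [
    LineAudit("Auth lives in src/auth", True, True, True, 0.9),
    LineAudit("Always aim for 90% coverage", False, False, False, 0.9),
    LineAudit("Use make test-fast; make test needs a paid API key", False, True, False, 0.8),
    LineAudit("Release builds must be signed on the release machine", False, True, False, 0.02),
]

for example in EXAMPLES:
    print(f"{verdict(example):>15}  {example.text}")
```

The ordering of the checks mirrors the rule: a line has to survive leverage, minimality, and staleness before frequency decides between the always-loaded file and a skill.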

The verdict on generated files is the clearest result in the data. Never ship an LLM-generated CLAUDE.md as-is. Generation defaults to exactly the directory-map and aspirational-overview content that fails tests one and two, which is why the generated condition lost almost everywhere. Draft with the model if it helps you start, then delete every line that fails the four tests. Usually that is most of them. What survives is short: mostly invariants and gotchas, almost no maps, no aspirations, no duplication of recoverable facts. That matches the qualitative consensus from GitHub’s review of 2,500-plus AGENTS.md files (commands early, examples over explanations, boundaries that pair a prohibition with the concrete alternative), now with a falsifiable test behind each cut (GitHub, 2026).
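One part of that deletion pass can be partly automated: spotting coordinates that have already gone stale. The sketch below pulls relative path-like tokens out of a context file and reports the ones that no longer exist in the repository. The regular expression is a rough heuristic and an assumption of this example; it skips URLs and absolute paths, and it will produce false positives on things like "and/or" or slash-separated dates, so treat its output as a list to review rather than a list to delete.

```python
import re
from pathlib import Path

# Heuristic (an assumption of this sketch): a relative path is one or more
# "name/" segments followed by an optional final name. The lookbehind stops
# matches from starting inside a URL, an absolute path, or a longer token.
PATH_PATTERN = re.compile(r"(?<![\w./-])(?:[\w.-]+/)+[\w.-]*")


def find_stale_paths(context_text: str, repo_root: Path) -> list[str]:
    """Return path-like tokens from the text that do not exist under repo_root."""
    stale = []
    for match in PATH_PATTERN.finditer(context_text):
        # Drop sentence punctuation that the pattern may have swallowed.
        candidate = match.group(0).rstrip(".,;:)")
        if not (repo_root / candidate).exists():
            stale.append(candidate)
    return stale
```

Run from the repository root, `find_stale_paths(Path("CLAUDE.md").read_text(), Path("."))` lists the candidates. A path that shows up here fails test 3 today; a path that does not show up can still fail it after the next refactor, which is the argument for preferring invariants in the first place.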

What this means for the rest of the stack

This blog has argued for context files as infrastructure. The data does not retract that; it sharpens it. The failure mode the studies caught is precisely the artifact those posts warned against without the numbers to prove it: the auto-generated, never-pruned, directory-mapping CLAUDE.md (arXiv:2602.11988). Honest cluster maintenance means saying that plainly.

Treat AI as a Team Member put the constitution, skills, memory, and MCP stack forward as the scaffolding. The four-test rule is the missing eviction policy for the constitution layer. A teammate’s onboarding doc that lists where the bathroom is wastes everyone’s attention; the same is true of a CLAUDE.md that maps a tree the agent can ls in one call.

Agent Memory Architecture classified CLAUDE.md as one of four memory types. This post is the efficacy layer that one deferred: not “what kind of memory is it” but “does this entry pay rent.” That post is taxonomy; this is the audit. And Context Engineering in Practice gave the per-surface placement framework for what belongs in CLAUDE.md versus skills versus MCP. This post supplies the per-line test that framework needs to decide what stays on the always-loaded surface at all: that post is where context goes, this is whether a given line earns its always-loaded slot.

The through-line with Engineering That Outlasts the Paradigm is the point of the whole exercise. The discipline is measurement, not faith. We recommended context files; we also ran the studies down and changed the recommendation from “write one” to “write one, then audit every line against the data.” That is the team-member discipline this blog keeps coming back to.

FAQ

Does CLAUDE.md actually improve agent performance?

Not by default, and often the reverse. A controlled study found LLM-generated context files reduced success ~0.5% on SWE-bench Lite and ~2% on AGENTbench and raised cost 20-23%; human-written files beat no context for every agent except Claude Code (arXiv:2602.11988, 2026). A context file helps only when it carries leverage the agent cannot cheaply recover for itself.

Should I let the AI generate my AGENTS.md?

Not as-is. Generated files were net-negative everywhere except undocumented repos, because they default to directory maps and overviews the agent recovers for free (arXiv:2602.11988, 2026). Draft with the model if it helps, then delete every line that fails the leverage and minimal-requirement tests. Usually that is most of the file.

Why does AGENTS.md reduce success rate?

It changes behaviour, not just the prompt. Context files add 2.45-3.92 extra steps per task (more testing, more file traversal), and the agent treats codebase descriptions as instructions to verify (arXiv:2602.11988, 2026). Unnecessary requirements make tasks harder; the cost is the induced exploration, not the file’s input tokens.

But another study said AGENTS.md saves time. Which is right?

Both. The field study found 28.64% lower runtime and 16.58% lower output tokens at comparable completion, which is the efficiency ledger (arXiv:2601.20404, 2026). The controlled study found lower success, which is the correctness ledger (arXiv:2602.11988, 2026). A file can finish faster while finishing correctly slightly less often.

Is llms.txt the same idea as AGENTS.md?

No, and conflating them is the mistake. llms.txt is a passive file a crawler may fetch; studies of ~300,000 (SE Ranking) and 37,894 domains (Trakkr, 6.8 versus 6.7 citations, p=0.85) found no measurable effect. AGENTS.md is loaded into an active reasoning loop every request. Different artifacts, opposite efficacy conditions.

What I changed in my own CLAUDE.md

The resolved tension is simple to state and harder to live. The controlled study and the field study are both right because they weigh different ledgers. The file that hurts is the generated, bloated, map-heavy one. The file that helps carries, as minimal requirements, leverage the agent could otherwise recover only at real cost. The rule is per line, not per file. Leverage, minimality, recoverability, frequency. Pass four and stay; fail one and get cut or demoted to a skill.

So I ran the four tests on my own files. The directory maps went first: every one failed test 1, and deleting them changed nothing except the token count. The aspirational “always” lines went next under test 2. What survived was a short list of invariants and two gotchas that had bitten me twice each. The honest close is that this blog recommended context files and still does, with the data-backed caveat that most of what people put in them should not be there. Treat the file like a teammate’s onboarding doc, not a junk drawer. Open your CLAUDE.md tonight, run the four tests, measure tokens-per-session before and after, and keep only the lines that earn the slot.
