
Agent Skills: Progressive Disclosure That Actually Scales

21 min read

A wooden card-index drawer on a warm linen surface, packed with cream archival cards and a single card tilted forward, a physical metaphor for progressive disclosure

Most engineers read “skills” as another folder of markdown to maintain. That misses the leverage. The file format is the boring part. The interesting part is the load model: skill descriptions sit in context full-time, but the bodies only load when a description matches the work in front of the agent. Done right, that turns a multi-skill catalog from a context disaster into something you barely feel.

Done wrong, you get the same failure as a giant CLAUDE.md, with extra steps. Most “my skill never triggers” complaints I’ve debugged are not body problems. They’re description problems, and the description is the only part the agent always sees.

This post is about that architecture. The pattern, the token math from a real catalog I maintain, and the authoring mistakes that make a skill catalog cost more than it earns.

Key Takeaways

  • Skills look like markdown folders. The actual leverage is the progressive-disclosure architecture: only the description sits in context full-time, the body loads on trigger.
  • On my 34-skill catalog, descriptions cost ~3,900 tokens always loaded; bodies would cost ~87,000 tokens if loaded naively. That’s roughly a 22x reduction in always-loaded skill cost, deferred until a description actually matches.
  • Three flavors earn their place: process discipline (Superpowers), codebase-specific implementation, and domain thinking (the senior-engineer judgment patterns from Ousterhout, Parnas, Feathers, Brooks).
  • The skill description is the only part always in context. Most “skill never triggers” failures are description failures, not body failures.
  • The single highest-leverage move is pruning skills whose description is too vague or too generic to fire. A small precise catalog beats a big vague one every time.

Why are skills suddenly the center of agent capability?

Skills moved from “interesting Claude Code feature” to “industry pattern” in about six months. Anthropic launched Agent Skills as a Claude Code feature on October 16, 2025, and re-released them as an open cross-platform standard on December 18, 2025 (Anthropic, 2025). Anthropic’s 2026 Agentic Coding Trends Report frames orchestration (skills, context curation, subagent dispatch) as the load-bearing 2026 skill shift, not raw model capability (Anthropic, 2026).

The cross-vendor convergence is the part that surprised me. Spring AI shipped a generic agent-skills primitive on January 13, 2026, modeled on the same metadata-first, body-on-trigger approach, but designed to run across OpenAI, Anthropic, and Gemini without vendor lock-in (Spring, 2026). Microsoft documented the underlying progressive-disclosure pattern in their agent-skills repo, with a concrete token-cost framing: the metadata for ~133 skills consumes 7,000 to 13,000 tokens; loading every body up front would consume “hundreds of thousands” (Microsoft Agent Skills, 2026).

What everyone settled on is the same shape. A skill is a folder. The folder has a frontmatter description. The description sits in the model’s system prompt all session long. The full body, the scripts, the bundled reference files, none of those load until the agent decides the description matches the task at hand.

That is not the part most posts about skills explain. Most posts treat skills as “well-named prompts in folders.” The vendors converged on the load model, not the format. The format is downstream.

A useful framing: prompts are what you write per turn, CLAUDE.md is what’s loaded every turn, and skills are what’s available every turn but only paid for on match. The progressive disclosure is the design law. The Markdown is just where it happens to live. The broader layered scaffolding around the model places skills as Layer 2 in a six-layer stack; this post is the deep dive on that one layer.

What is a skill, technically?

The Anthropic specification names three loading tiers, and each one has a different cost (Anthropic, 2026).

Tier 1 (always loaded). The skill’s name plus its description, drawn from the SKILL.md frontmatter. Roughly 50 to 100 tokens per skill (Microsoft Agent Skills, 2026). This is the only tier the agent reads on every turn. It’s also the only tier that decides whether the rest of the skill ever sees the model.

Tier 2 (loads on trigger). The full body of SKILL.md, read into context only when the agent decides the description matches. Typical bodies run 500 to 2,000 tokens, though disciplined ones go higher. The body carries the doctrine: the steps, the rules, the worked example. The body never participates in the agent’s “is this relevant?” decision. The description does that work alone.

Tier 3 (loads on demand). Bundled files referenced inside the body: reference.md, scripts, taxonomies, longer worked examples. These don’t cost anything until the body explicitly asks for them. There’s no practical limit on Tier 3 size. A skill can ship a 50-page reference doc and still cost ~75 tokens until invoked.
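
To make the tiers concrete, here’s the folder shape a skill takes, annotated by tier. The seam-finder name comes from the catalog discussed below; the comments are my annotations, not spec text:

```
seam-finder/
├── SKILL.md        # frontmatter (name + description): Tier 1, always loaded
│                   # body below the frontmatter:       Tier 2, loads on match
└── reference.md    # Tier 3: costs nothing until the body says "read this"
```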

The asymmetry is the design’s superpower. A 100-token description can gate a 5,000-token body and a 50,000-token reference pack. The agent pays one cost up front (deciding which skill might apply) and a second cost only when the answer is “this one.” Everything else is silence.

This is also why a skill is structurally different from a section of a system prompt. A system prompt section costs full tokens on every turn. A skill body costs tokens only on the turns where the description matches. That’s not a marginal difference; it’s an order-of-magnitude one. Both Claude Sonnet 4.6 and Opus 4.7 now default to a 1,000,000-token context window at the same per-token price as the previous 200K cap (Anthropic, 2026), so raw token budget is rarely what kills you anymore. What kills you is attention dilution as the always-loaded surface grows; that cost is quadratic, not linear, and progressive disclosure is the architectural answer.

Why does progressive disclosure change the cost model?

Here’s the math from my own catalog. I maintain @iceinvein/agent-skills, a public pack of skills distilled from foundational software engineering texts: complexity-accountant (after Ousterhout), module-secret-auditor (after Parnas), seam-finder (after Feathers), design-review (after Brooks), plus a couple dozen more. As of this week, the catalog holds 34 skills. The numbers are revealing:

  • Sum of all description fields: ~15,800 characters, roughly 3,900 tokens.
  • Sum of all SKILL.md bodies: ~349,000 characters, roughly 87,000 tokens.
  • Average body: ~10,300 characters per skill (~2,575 tokens).
  • Largest body: ~17,800 characters (~4,450 tokens).

If I loaded every body on every session, that’s 87,000 tokens of always-loaded skill bloat. Progressive disclosure pays 3,900 tokens for the metadata and defers the other 83,000 tokens until a description actually matches. That’s a ~22x reduction in always-loaded cost, and it scales linearly with catalog size. The wider the catalog, the bigger the win.
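
The arithmetic is easy to sanity-check. A minimal sketch, assuming the rough four-characters-per-token heuristic; the character counts are the measurements above:

```python
# Back-of-envelope skill-catalog cost model (assumes ~4 chars per token).
CHARS_PER_TOKEN = 4

description_chars = 15_800  # sum of all frontmatter descriptions
body_chars = 349_000        # sum of all SKILL.md bodies

always_loaded = description_chars / CHARS_PER_TOKEN  # ~3,950 tokens
deferred = body_chars / CHARS_PER_TOKEN              # ~87,250 tokens

print(f"always loaded: ~{always_loaded:,.0f} tokens")
print(f"deferred until trigger: ~{deferred:,.0f} tokens")
print(f"reduction in always-loaded cost: ~{deferred / always_loaded:.0f}x")
```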

The token-budget framing used to be the headline argument. With Sonnet 4.6 and Opus 4.7 both shipping at 1,000,000-token defaults, 87,000 tokens of body bloat fits comfortably; you’re not blowing the window. The real cost has shifted to attention quality, not budget headroom. Naive loading turns the entire window into a noisier place to reason in, and the noise tax compounds with every additional skill body that didn’t need to be there.

The token saving is the most legible part of the argument. The harder part to see is the attention saving. Anthropic’s effective-context-engineering write-up notes that context exhibits n-squared pairwise relationships at the attention layer, so longer context isn’t free; the cost grows quadratically (Anthropic, 2025). Even if the bodies fit, they degrade the model’s ability to reason about whatever’s left.

Chroma’s research on 18 frontier models in 2025 makes the consequence concrete. Performance on long-context tasks degrades at every input-length increment, well before the window fills. Their LongMemEval data shows that a focused ~300-token prompt outperforms a ~113K-token version of the same prompt, with Claude models showing the most pronounced gap (Chroma Research, 2025). Naive skill loading isn’t only expensive in tokens. It’s expensive in attention quality. Progressive disclosure is the architectural answer to a real degradation curve, not a clever optimization for token bills.

Which three flavors of skill earn their place?

Not every skill plays the same role. Across the projects I run this stack on, three distinct flavors keep showing up, and they have different jobs.

Process skills. Brainstorming before building. Systematic debugging before fixing. Verification before claiming “done.” Test-driven development before writing implementation. These are about how the agent works, not what it works on. The Superpowers plugin is the one I’d install first on any new project, because it ships those disciplines as enforced workflows the agent has to walk through, not advice it can ignore. The discipline ones matter most, because the failure mode of unsupervised AI is rarely “wrong code.” It’s “skipping the planning step under time pressure.” A skill the harness loads before the agent acts is the cheapest way to remove that failure mode.

Implementation skills. How your ORM expects migrations. How your release pipeline runs. How your feature flags are wired. These are codebase-specific workflows that you wouldn’t expect any pre-trained model to know. They earn their place when “explain it again every session” becomes more expensive than maintaining a skill. Don’t write these speculatively; write them after the second time you’ve explained the same thing to a fresh agent.

Domain-thinking skills. This is the layer most teams skip and the one I’ve found delivers the most leverage. The senior-engineer move on a hard design problem isn’t “follow a checklist,” it’s “apply the relevant frame.” A complexity audit (after Ousterhout’s A Philosophy of Software Design) asks whether each abstraction is deep (simple interface, rich functionality) or shallow. A module-secret audit (after Parnas) asks what single decision each module hides. A seam-finder (after Feathers) locates the minimal incision before any change to legacy code. A design review (after Brooks’s The Design of Design) interrogates a proposed design for conceptual integrity, constraint exploitation, and scope control before it ships.

The reason this third flavor matters: a lot of senior judgment is already written down in book form. Most agents never apply it because nobody packaged the frame as a triggerable skill. Most of my catalog is this flavor. Each one is a frame the agent applies when the description matches; the body walks through the analysis the way the source author would. The frames work better than generic “review this code” prompts because the agent isn’t being asked to summon judgment; it’s being asked to follow a specific dialectic. The same compounding logic that makes an LLM wiki work, synthesize on write so you never re-derive on read, applies here: a domain-thinking skill is a one-time synthesis of senior judgment that pays back on every trigger.

The three flavors compose. Process skills enforce the discipline; implementation skills carry the codebase rules; domain-thinking skills carry the senior frame. Most teams I see have flavor one (badly), no flavor two (everything is in CLAUDE.md), and zero of flavor three. That’s the order of leverage in reverse.

How do you write a description the agent will actually find?

The description is the only part the agent reads on every turn. Get it right and the rest of the catalog scales. Get it wrong and the body never loads, no matter how good it is.

The pattern that works for me is three sentences, in this order: trigger, scope, and output. Trigger names the situation that should fire the skill. Scope names what the skill will and won’t cover. Output names the artifact or judgment the skill produces. A description that does all three runs ~80 to 150 tokens, above the 50-to-100-per-skill figure cited earlier (Microsoft Agent Skills, 2026) but well worth the spend. A description that does only one of the three usually doesn’t fire reliably.

The most common failure mode is descriptions that read like marketing copy: “useful when you want to think carefully about your code.” Nothing in that sentence tells the agent what changed in the work that should pull the skill in. Compare it to: “Use when reviewing a proposed design before approval. Audits conceptual integrity, constraint exploitation, and scope. Outputs a verdict and a remaining-risk list.” The second one fires on a recognizable trigger, names a recognizable scope, and produces a recognizable output. The agent can route on it.
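
As frontmatter, the version that fires looks like this. A sketch using the design-review skill from my catalog; the YAML comment labels which sentence does which job:

```yaml
---
name: design-review
# Sentence 1 = trigger, sentence 2 = scope, sentence 3 = output.
description: >-
  Use when reviewing a proposed design before approval.
  Audits conceptual integrity, constraint exploitation, and scope.
  Outputs a verdict and a remaining-risk list.
---
```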

The economics of getting the description right are asymmetric, and the asymmetry is what makes precision worth paying for. A description that doubles in length to gain real precision costs roughly 75 extra tokens, every session, forever. The body it gates can be ten or twenty times that size, and pays nothing until trigger. There’s almost no description-length budget worth saving if the spend buys the agent a clean routing decision. The skills in my catalog with the longest descriptions are the ones that fire most reliably. The skills with the shortest descriptions are the ones I caught not firing when they should have.

One more authoring rule worth naming: don’t put doctrine in the description. The description is the routing layer. The body is the doctrine layer. If your description starts explaining how to do the work, you’ve already lost; the agent’s already routed past the description by the time the doctrine matters.

When should you NOT write a skill?

A skill is a specific shape. Not every piece of context should take that shape. Three categories don’t belong in skills, and putting them there is the most common authoring mistake I see.

Always-relevant rules belong in CLAUDE.md. “Use Bun, not Node.” “Never write ! non-null assertions.” “Run bun test before claiming a feature works.” These rules apply to every session regardless of trigger; routing them through a skill description means they only fire when the description matches, which is a strictly worse outcome than always loading them. CLAUDE.md is the right surface. The context surface decision matrix explains the full allocation.
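
Concretely, those rules are just always-loaded prose with no trigger machinery at all; a sketch of the CLAUDE.md shape, using the rules above:

```markdown
# CLAUDE.md — loaded every session, no trigger required
- Use Bun, not Node.
- Never write `!` non-null assertions.
- Run `bun test` before claiming a feature works.
```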

Codebase-specific structural data belongs in MCP. “What’s the call hierarchy of handleAuth?” “Who imports this module?” “What’s the blast radius of renaming this function?” These can’t be pre-loaded because the answer changes between turns. Routing them through a skill body means writing stale facts down. Routing them through an MCP server like a local code intelligence layer means the agent fetches a live answer at query time. That’s the right shape.

One-off observations belong in episodic memory. “On 2026-04-12 we decided to drop feature X because of customer feedback.” That’s a historical fact, not a discipline. It belongs in a searchable memory layer, not in a skill that has to compete for trigger attention. A skill is doctrine; episodic memory is history. Don’t conflate them.

A useful test before writing any skill: ask whether the content has a trigger (a recognizable situation that should pull it in) and a body worth more than 200 tokens. If there’s no clear trigger, the rule probably belongs in CLAUDE.md. If the body is shorter than the description, the skill is masquerading as a paragraph; promote it to a CLAUDE.md rule and drop the wrapper. If the data is dynamic, route it through MCP. The skill surface is for triggered discipline, and not much else.

The other category worth naming: subagent system prompts are not skills. A subagent has its own constitution; it can load skills the same way the main agent does, but the subagent’s role belongs in the subagent definition, not in a skill the main agent has to route to. The decision of when to spawn vs when to stay in-context is upstream of skills, not parallel to them.

Anti-patterns and how to fix them

Five failure modes show up reliably enough that they’re worth naming. Each one has a fix that’s smaller than the problem looks.

The vague description. “Use this skill for thinking about complex problems.” The agent never figures out what counts as complex enough. Fix: rewrite as trigger + scope + output. “Use when reviewing a proposed system design before approval. Audits for conceptual integrity, scope creep, and constraint exploitation. Outputs a verdict and a remaining-risk list.” Specific descriptions fire; vague ones don’t.

The body that’s a lecture. The skill body re-derives first principles for 4,000 tokens before saying what to do. The agent loads it, struggles to extract the action, and the skill costs more than it earns. Fix: open with the procedure, not the rationale. Move the philosophy to a Tier 3 reference file the body can link to. Keep the body short enough to feel like a checklist with prose between the items, not an essay.
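
A sketch of a body that follows this shape, loosely after the complexity-accountant skill; the steps and section names are illustrative, not the published skill:

```markdown
## Procedure
1. List every abstraction the change touches.
2. For each, ask: simple interface, rich functionality underneath?
   Flag anything shallow (wide interface, thin functionality).
3. Output a table: abstraction, depth verdict, suggested deepening.

## When NOT to apply
Throwaway scripts, generated code, vendored dependencies.

Full rationale and worked examples: ./reference.md (Tier 3, loads on demand).
```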

The skill that should have been a CLAUDE.md rule. A “skill” that’s three sentences and applies to every session. Fix: promote it to CLAUDE.md. You’ll save the routing overhead and gain reliability. The reverse promotion is also valid: a CLAUDE.md rule that only applies to specific situations should get demoted into a skill so it stops crowding the always-loaded surface.

The catalog with no taxonomy. Skills accumulate without a shape, descriptions overlap, the agent picks the wrong one because two skills could each plausibly fire. Fix: skim the catalog every quarter and look for description collisions. If two descriptions could plausibly route the same task, merge or differentiate. Microsoft’s progressive-disclosure work hints at this with their tiered framing: when many skills fire on similar triggers, the metadata layer becomes noisier and the routing gets worse, even though no individual skill has changed (Microsoft Agent Skills, 2026).

No audit cadence. Nobody re-reads the catalog. Stale skills route on stale triggers, descriptions no longer reflect what the body does, bodies reference APIs that have since been renamed. The fix is the same as for CLAUDE.md: a 30-day audit cadence. Open every SKILL.md, ask whether the description still matches the body, the body still matches the codebase, and the skill itself still earns its place. Cut what doesn’t.
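
The audit is mechanical enough to script. A minimal sketch, assuming skills live at skills/*/SKILL.md with standard YAML frontmatter; the path, the thresholds, and the “use when” trigger check are my conventions, not part of any spec:

```python
"""Flag skills whose shape breaks the rules above. Requires PyYAML."""
from pathlib import Path

import yaml

CHARS_PER_TOKEN = 4  # rough heuristic

for skill_md in sorted(Path("skills").glob("*/SKILL.md")):
    text = skill_md.read_text(encoding="utf-8")
    try:
        # SKILL.md opens with ---\n<frontmatter>\n---\n<body>
        _, front, body = text.split("---", 2)
    except ValueError:
        print(f"{skill_md}: missing frontmatter")
        continue

    desc = (yaml.safe_load(front) or {}).get("description", "")
    desc_tokens = len(desc) / CHARS_PER_TOKEN
    body_tokens = len(body) / CHARS_PER_TOKEN

    flags = []
    if body_tokens < desc_tokens:
        flags.append("body shorter than description -> promote to CLAUDE.md")
    elif body_tokens < 200:
        flags.append("body under ~200 tokens -> probably not worth a trigger")
    if not desc.lstrip().lower().startswith("use when"):
        flags.append("description may not name a trigger (loose heuristic)")

    verdict = "; ".join(flags) or "ok"
    print(f"{skill_md.parent.name}: desc ~{desc_tokens:.0f}t, "
          f"body ~{body_tokens:.0f}t -> {verdict}")
```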

The shape of all five failures is the same: the surface drifts from the doctrine. Either the description over-promises or under-specifies relative to the body, or the body over-explains relative to the work, or the catalog grows without anyone keeping shape on it. The discipline is reading the catalog the way you’d read documentation: reviewing what’s there as often as you author what’s new.

Frequently Asked Questions

Is this just Claude-specific, or does the pattern port?

The pattern ports cleanly. Spring AI shipped a generic agent-skills primitive in January 2026 that supports OpenAI, Anthropic, and Gemini models from the same skill folder (Spring, 2026). Anthropic re-released Skills as an open cross-platform standard in December 2025 (Anthropic, 2025). The file format and the load-tier vocabulary are the same. Only the runtime that interprets them differs.

What’s the right size for a skill catalog?

There isn’t a hard ceiling, but there’s a sensible curve. A 5-skill catalog (process discipline only) is the right starting point. A 20-to-50-skill catalog covers process plus the codebase-specific workflows a real project needs. Past 50, the metadata budget starts to feel real (Microsoft’s 133-skill example sits at 7,000 to 13,000 tokens of always-loaded metadata) and the discipline of pruning low-value skills becomes the bottleneck, not authoring new ones (Microsoft Agent Skills, 2026).

How is a skill different from a section in the system prompt?

A system prompt section costs full tokens on every turn. A skill body costs tokens only on the turns where the description matches. The description is what’s in the system prompt for skills; the body waits in storage. That asymmetry is the entire point. If your skill body would have been fine in the system prompt, you didn’t need progressive disclosure for it.

Why not just put everything in MCP tools?

MCP is the right surface for dynamic context that has to be fetched at query time: call hierarchies, blast-radius queries, current state. Skills are the right surface for stable context that has a clear trigger but doesn’t change between turns. A migration playbook doesn’t need to be re-fetched per turn; it needs to be loaded when the agent reaches for it. That’s a skill, not a tool. The two surfaces compose, they don’t substitute.

Can the agent compose skills?

Yes, and it does in practice. A complex task often triggers two or three skills in sequence: brainstorm-the-design (process), then run-a-design-review (domain), then write-the-migration (implementation). Each one loads on its own trigger. The bodies don’t all sit in context simultaneously; they roll in and roll out as the agent moves between phases of the work. That’s progressive disclosure compounding, and it’s why a well-curated catalog of 30 small skills outperforms one giant “do everything carefully” skill. The same compounding shows up at the subagent layer: when AI runs the first line of PR review, each specialized reviewer loads its own skill set independently, and the main agent only ever sees the synthesized findings.

The Real Argument

The pattern beats the format. You can implement progressive disclosure in any agent runtime that supports metadata-first routing; you can write a SKILL.md that doesn’t get the leverage because the description is too vague, the body is too long, or the rule should have lived somewhere else. The win isn’t the file extension. It’s honoring “metadata in context, body on trigger” as a design law and pruning anything that doesn’t fit that shape.

If you take one move from this post, take this. Open every SKILL.md you’ve authored. Check whether the description names a trigger, a scope, and an output. Cut the ones that don’t. Promote anything that’s truly always relevant into CLAUDE.md, demote anything that’s dynamic into MCP, and route anything historical into episodic memory. What’s left should fit in your head as a curated catalog, not a folder full of intentions. That’s the version that scales.
