We Tried to Cut Claude's Output 50%. We Got 5%. So Did Anthropic.
On April 16, 2026, Anthropic added one line to Claude Code’s system prompt: keep final responses to 100 words or fewer unless the task requires more detail. By April 20, they had removed it. Their internal evals showed a 3% intelligence drop on both Opus 4.6 and 4.7 under the cap (Anthropic postmortem, April 23, 2026).
We had been running our own version of the same experiment. The benchmark we ran four days later told us why.
TL;DR: We aimed for 50% Claude output compression. Our benchmark showed 4.7%. Simple-fact answers compressed 62%; tradeoff answers expanded 14%. Anthropic hit the same wall four days earlier with a 100-word cap, measured a 3% intelligence drop on Opus 4.6 and 4.7, and reverted (postmortem, April 23, 2026). The failure mode we name: recommendation reversal.
Key Takeaways
- The achievable compression ceiling for general-purpose LLM output sits closer to 20-30% than the 50%+ figures circulating in skill READMEs.
- The dominant failure mode below that ceiling is recommendation reversal on tradeoff tasks: shortened answers leave only the supporting reasoning for the rejected option, and readers infer the opposite recommendation.
- The structural fix is “recommendation in sentence one.” The mechanistic frame is Liu et al.’s “Lost in the Middle” (TACL 2024) generalized from input attention to output structure.
- Independent measurement matters: a model lab and a single builder hit the same wall in the same week. Anthropic reverted in four days; we reverted after one benchmark.
- Output tokens cost 5x input tokens on Opus, Sonnet, and Haiku (Anthropic pricing). The right place to apply the discount is after the answer exists, gated by task type.
Why did our 50% goal become 4.7%?
Our terseness skill targeted 50-60% output reduction across a 5-eval benchmark. The actual measured reduction was 4.7%, and the per-task breakdown told a more interesting story than the average.
The skill ships three intensity levels (clean, tight, sharp) and activates at session start. Five evals were chosen to span the kinds of work Claude Code does day to day: a factual lookup (binary-search complexity), a code debugging case (an unnecessary React re-render), one technical explanation (database connection pooling), and two opinionated tradeoffs (GraphQL vs REST and monorepo vs polyrepo). Each eval was run single-shot in two configurations: the working-tree skill against a HEAD snapshot. Both runs were graded against objective assertions: recommendation preservation, no banned narration openers, and length caps where applicable. The shape borrows from the record-and-replay pattern I use for agent regression testing.
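For concreteness, here is the shape of that harness as a minimal Python sketch. The eval IDs are the real ones; the banned-opener examples and helper names are illustrative stand-ins, not the actual benchmark code.

```python
# Minimal sketch of the two-configuration, single-shot harness.
# Eval IDs match the benchmark; the banned-opener examples and
# function names are stand-ins, not the real harness code.

EVAL_IDS = [
    "binary-search",
    "react-rerender",
    "db-pooling",
    "graphql-vs-rest",
    "monorepo-vs-polyrepo",
]

# Illustrative examples of narration openers the grader rejects.
BANNED_OPENERS = ("Great question", "Sure!", "Certainly")

def passes_assertions(answer: str, required_recommendation: str | None) -> bool:
    """Objective grading: no banned narration openers, recommendation
    preserved. Length caps apply only on evals that specify one."""
    if answer.startswith(BANNED_OPENERS):
        return False
    return required_recommendation is None or required_recommendation in answer

def compression_delta(head_answer: str, skill_answer: str) -> float:
    """Per-eval word delta as a percentage; negative means shorter."""
    before, after = len(head_answer.split()), len(skill_answer.split())
    return (after - before) / before * 100
```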
Total output went from 488 to 465 words across all five evals, a net delta of -4.7%.
Both versions hit 100% on the objective assertions. The compression delta is the only differentiated signal between them, and the per-eval breakdown looks nothing like the headline:

| Eval | Task type | Words (HEAD → skill) | Delta |
| --- | --- | --- | --- |
| binary-search | factual lookup | 21 → 8 | -62% |
| react-rerender | debugging | n/a | -60% or better |
| db-pooling | explanation | n/a | +4% |
| graphql-vs-rest | tradeoff | 108 → 123 | +14% |
| monorepo-vs-polyrepo | tradeoff | n/a | +14% |
| all five | | 488 → 465 | -4.7% |

The two compression-friendly evals (binary-search and react-rerender) cleared 60% on their own. The two tradeoff evals (graphql-vs-rest and monorepo-vs-polyrepo) each moved in the wrong direction at +14%. Database pooling barely moved at +4%. The “average” is an artifact of summing wins and losses from completely different families of task. There is no single number you can quote here without lying about one of the families.
Citation capsule. On a 5-eval benchmark of a Claude Code terseness skill targeting 50-60% output compression, the measured reduction was 4.7% across 488 → 465 total words. Simple-fact answers compressed 62%; tradeoff answers expanded 14%. Both skill versions passed 100% of objective assertions. (Source: this post’s benchmark.)
Anthropic hit the same wall four days earlier
On April 16, 2026, Anthropic shipped a Claude Code system-prompt change with two length rules: 25 words or fewer between tool calls, and 100 words or fewer for final responses unless the task requires more detail. On April 20, they reverted it. The April 23 postmortem cites a 3% intelligence drop on both Opus 4.6 and 4.7 (Anthropic, April 23 2026).
That was their published number on the cap, measured against their internal evaluation suite, not against ours. The intent was the same as ours: cut output. The mechanism was different in two ways. Theirs was a system-prompt rule shipped to every Claude Code install; ours is a skill that can be turned off mid-conversation. Theirs forced compression on every response; ours fires on a session-start activation and exempts code, error messages, and irreversible-action confirmations.
Both mechanisms hit a wall in the same week. A model lab measuring against an internal eval suite, and a single builder measuring against five hand-picked evals, converged on the same conclusion: the achievable compression ceiling for general-purpose LLM output sits much lower than the 50% figures circulating in skill READMEs. Anthropic backed off in four days. We backed off after one benchmark.
The postmortem is the most credible external evidence of a compression ceiling to appear this year. It is also the only source to date that names a measured intelligence cost (3%) for a published compression target (100 words). Most of the other “make Claude shorter” content sources its reduction numbers from before-and-after token counts on a single sample task and never mentions the quality side of the ledger.
Citation capsule. Anthropic’s April 23, 2026 postmortem confirms that a Claude Code system-prompt change capping final responses at 100 words caused a 3% intelligence drop on Opus 4.6 and 4.7. The change shipped April 16 and was reverted in the April 20 release, a four-day window. (Source: anthropic.com/engineering/april-23-postmortem.)
Why do some answers compress 60% and others expand 14%?
Compression works on simple-fact answers and backfires on tradeoff answers. Our binary-search complexity question dropped from 21 words to 8 (-62%), preserving the answer. Our GraphQL-vs-REST question grew from 108 words to 123 (+14%), because locking the recommendation into sentence one expanded the supporting reasoning rather than shortening it.
Simple facts have one load-bearing token; everything else is filler. In “Binary search is O(log n),” nothing is negotiable, so trimming the surrounding text down to that core is genuinely lossless. A factual lookup that started at 21 words ends at 8 because 13 of the original 21 were padding.
Tradeoff answers have one load-bearing position whose justification is the whole point. “Use REST” is the answer; the reason it is a defensible answer is the four sentences that follow it. Compress those four sentences and you have not produced a shorter answer; you have produced a different (and often opposite) answer. The supporting reasoning is the answer when the task is a recommendation.
Chen et al. (arXiv:2412.21187) measured 48.6% token cuts on MATH500 with no accuracy loss for self-trained chain-of-thought compression. Their result does not contradict ours. It confirms that the compressible regime exists; it just does not extend to tradeoff tasks.
Citation capsule. Compression behaves asymmetrically across task types: simple-fact LLM answers tolerate 60%+ reduction (Chen et al. 2024 measured 48.6% token cuts on MATH500 with no accuracy loss); recommendation answers expand under structural compression rules because the conclusion’s supporting reasoning is itself load-bearing. (Source: arXiv:2412.21187 plus this post’s benchmark.)
What is recommendation reversal?
Recommendation reversal is what happens when a tradeoff answer is compressed evenly across its conclusion and its supporting reasoning. The conclusion gets cut at the same rate as the rest of the text; the supporting reasoning for the rejected option survives in fragments; the reader infers the wrong recommendation. We are coining the term in this post. It does not appear in any practitioner or academic source surfaced during research.
Concretely, here is what we saw on the GraphQL-vs-REST eval before the patch.
Full uncompressed answer:

> Use REST for your case. GraphQL’s main wins (multi-client federation, deep nesting, flexible queries) do not apply at your scale; you pay the resolver and caching cost without the benefit. Stick to REST until your client surface or query shape forces the change.

Aggressively compressed early version (excerpt):

> GraphQL solves over-fetching. Single endpoint, flexible queries. Federation for many clients.
The recommendation is gone. What survives reads like a GraphQL endorsement because the supporting reasoning for the rejected option (GraphQL has features) is dense, factual, and easy to compress. The reasoning for the accepted option (none of those features apply at your scale) is conditional, comparative, and the first thing a “be terse” rule cuts.
A Royal Society Open Science paper finds that 26% to 73% of LLM summaries overgeneralize, with Claude documented as the exception (PMC12042776). That paper measures a related family of failure (semantic drift under compression) at the summarization scope. Recommendation reversal is the same family applied at the response scope, and the surface symptom is sharper: ours flips a single load-bearing word (“REST” to “GraphQL”) rather than blurring a paragraph.
We observed this informally during development, patched the rule that allowed it, and could not measure the patch’s effect at single-shot scale because both versions of the skill cleared the assertion checks in our objective grader. The empirical validation requires N=5+ runs per tradeoff eval and was not in scope for iteration 1. Calling out the limitation here is part of the result.
Citation capsule. Recommendation reversal is a compression failure mode where shortening a tradeoff answer leaves only the supporting reasoning for the rejected option, causing the reader to infer the opposite recommendation. The structural fix is to place the recommendation in the first sentence so it sits in the high-attention slot. (Source: this post.)
Why does putting the conclusion first fix compression?
Liu et al.’s “Lost in the Middle” (TACL 2024 / arXiv:2307.03172) documents a U-shaped attention pattern: information at the start and end of a context window gets attended to far more than information in the middle. The same effect operates inside a model’s own response. Putting the conclusion in sentence one places it in the high-attention slot, where compression cannot dilute it.
The framing matters because “recommendation in sentence one” is not a stylistic preference; it is a structural defense. Move load-bearing content out of the compression zone and you can compress aggressively elsewhere without losing the load-bearing content. Leave it in the middle and any “be terse” rule eats it first.
This is the same principle that already underpins prompt engineering on the input side. Anthropic’s own docs recommend leading with the answer in system prompts and tool descriptions; we are extending the principle to output structure. The prompt-engineering rule and the output-structure rule are the same rule with different subjects.
The patch we shipped to the skill makes the recommendation-first rule explicit and ties it to opinion-or-tradeoff prompts via trigger phrases. Compression continues to fire on filler, hedge stacking, narration openers, and trailing summaries; it stops firing on the first sentence of any response that contains a load-bearing position.
Citation capsule. Liu et al.’s “Lost in the Middle” (TACL 2024) shows U-shaped attention in long contexts, with information at the start and end attended to more than information in the middle. Applied to output structure, the rule generalizes: recommendations placed in the first sentence resist compression-induced dilution. (Source: arXiv:2307.03172.)
What actually worked, and what did not?
Three rules made measurable differences. Put the recommendation in sentence one. Gate structural rewrites by response length, applying them only when the response exceeds about 100 tokens. Cap the compression target at 20-30% rather than 50-60%. The rule that did not work: “just be terse.”
The most concrete change was the skill description:
```diff
- Professional output compression. Cuts ~50-60% of output tokens.
+ Professional output compression. Cuts ~20-30% of output tokens
+ on average. More on factual lookups, less on tradeoff answers.
```
That is the line the model reads when deciding when and how aggressively to fire the skill, and overstating the target there pushed compression into territory the eval set then flagged. The honest target moved the average expectation closer to what the benchmark actually delivers.
The new “Semantic Preservation” section adds the recommendation-first rule:
```
For tradeoff prompts, the recommendation belongs in sentence one.
Detect: question contains "vs", "or", "should I", "which".
Action: state the recommendation explicitly in the first sentence.
Compress justifications, not the recommendation.
```
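Translated into code, the rule is a few lines. A minimal sketch, assuming regex trigger matching and a word count as the token proxy; the trigger list comes from the Detect line above, everything else is illustrative rather than the skill’s actual implementation:

```python
import re

# Trigger phrases from the skill's Detect line; the function shapes
# below are illustrative, not the skill's actual implementation.
TRADEOFF_TRIGGERS = (r"\bvs\b", r"\bor\b", r"\bshould i\b", r"\bwhich\b")

def is_tradeoff(prompt: str) -> bool:
    p = prompt.lower()
    return any(re.search(t, p) for t in TRADEOFF_TRIGGERS)

def split_for_compression(prompt: str, response: str) -> tuple[str, str]:
    """Return (protected, compressible). The first sentence of a
    tradeoff answer is protected verbatim; responses under ~100 words
    skip the structural rewrite entirely (the length gate)."""
    if len(response.split()) < 100:
        return response, ""            # short response: leave it alone
    if is_tradeoff(prompt):
        cut = response.find(". ")
        if cut == -1:
            return response, ""        # no sentence boundary found
        return response[: cut + 1], response[cut + 2:]
    return "", response                # nothing load-bearing up front
```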
Generic “be terse” instructions, aggressive synonym substitution, and sentence-length minimums all moved the headline number on factual evals and did nothing useful on tradeoff evals. The structural rules did the opposite: small effect on factual evals, large qualitative effect on tradeoff evals. The two sets of changes complement each other rather than competing.
When should you compress LLM output?
Compress factual lookups, debugging diagnoses, and explanations of well-known concepts. Do not compress recommendations, tradeoff analyses, or any answer where the order of the words carries the meaning.
The decision is task-shaped, not user-shaped:

| Task type | Benchmark example | Safe compression |
| --- | --- | --- |
| Factual lookup | binary-search complexity | 40-60% |
| Debugging diagnosis | React re-render | 40-60% |
| Explanation | connection pooling | 10-30% |
| Tradeoff / recommendation | GraphQL vs REST | 0-15%, then reversal risk |
A practical workflow falls out of the rubric. Write the answer at full length first, then edit. Compressing at draft time is the prose equivalent of premature optimization: you push for short before you know what the answer needs to say, and the structural information you cut is exactly what the model will need to reach the right recommendation. Get the answer right at full length, then compress what is not load-bearing.
The 5x output-to-input cost ratio on Opus, Sonnet, and Haiku (Anthropic pricing) makes the temptation to compress at draft time understandable. It is the wrong place to apply the discount. Apply it after the answer exists; gate it by task type; expect your savings to skew heavily toward factual work, and keep an eye on per-session token spend so the verbose-prose tax stays legible against actual usage.
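As a gate, that policy is a lookup table, not a model call. A minimal sketch using the rubric’s bands; the task-type labels and the upstream classification step are assumptions, not part of the shipped skill:

```python
# Safe compression bands per task type, taken from the rubric above.
# Classifying the task is assumed to happen upstream; the bands are
# ceilings, not targets.
SAFE_BANDS = {
    "factual": (0.40, 0.60),
    "debug": (0.40, 0.60),
    "explain": (0.10, 0.30),
    "tradeoff": (0.00, 0.00),  # leave alone: recommendation-reversal risk
}

def compression_target(task_type: str) -> float:
    """Pick a conservative target: the low end of the band, and zero
    for anything unrecognized. Apply it only after the full-length
    draft exists, never at draft time."""
    low, _high = SAFE_BANDS.get(task_type, (0.0, 0.0))
    return low
```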
Citation capsule. A practical rule for builders: compress factual and debugging answers freely (40-60% safe); compress explanations modestly (10-30% safe); leave tradeoff and recommendation answers alone or expect recommendation reversal. Always edit at full length first, then compress, rather than compressing at draft time. (Source: this post.)
Frequently Asked Questions
How much can I shorten Claude’s output without losing quality?
On factual or debugging tasks, 40-60% reductions are safe and routinely lossless. On tradeoff or recommendation tasks, expect 0-15% before semantic drift sets in. Our 5-eval benchmark and Anthropic’s April 23 postmortem reach the same conclusion at different scales: above roughly 30% net compression on mixed work, intelligence costs become measurable.
What is recommendation reversal?
Recommendation reversal is a compression failure mode where shortening a tradeoff answer leaves only the supporting reasoning for the rejected option, causing the reader to infer the opposite recommendation. We are coining the term in this post; it does not appear in prior literature. The structural fix is to place the recommendation in the first sentence.
Why did Anthropic remove the 100-word cap from Claude Code?
Anthropic’s internal evals showed a 3% intelligence drop on both Opus 4.6 and 4.7 under a system-prompt rule capping final Claude Code responses at 100 words. They reverted the change in the April 20 release, four days after shipping it. The full timeline is in the April 23 postmortem.
Should I use a terseness skill at all?
Yes, with calibrated targets. Aim for 20-30% reduction at the skill level, accept that 50%+ goals will either fail to measure or hide quality loss, and run a per-task benchmark before deploying. Use the open-source terse skill as a starting point, or build your own with the recommendation-first rule from this post.
What this means for builders
Four threads pull together. The achievable compression ceiling for general-purpose LLM output sits closer to 20-30% than the 50%+ figures circulating online. The dominant failure mode below that ceiling is recommendation reversal on tradeoff tasks. The structural fix is “recommendation in sentence one”; the mechanistic frame is Lost in the Middle generalized to output structure. Independent measurement matters: a model lab and a single builder hit the same wall in the same week.
If you build your own version of this skill (and you should: the surface is small enough for a weekend), run your own benchmark before quoting a reduction number. Five evals across factual, debug, explain, and tradeoff is enough to expose the asymmetry. The open-source terse skill in this repo is a starting point. If you run a benchmark against it, share the numbers; the data is the most useful thing this corner of practitioner work could produce.
If this was useful, pass it along.