Cache-Aware Prompting: Engineering for 90%+ Hit Rate

23 min read

On 2026-03-06 Anthropic silently changed the default cache TTL from 1 hour to 5 minutes. One developer’s bill rose 17.1% across 119,866 API calls before they noticed (GitHub anthropics/claude-code#46829, 2026). The issue stayed open for five weeks; the fix was a config flag, not a refund. A 90% discount that depends on a five-minute clock is not a feature you toggle. It is a discipline you engineer for, every prompt, at design time.

Anthropic’s prompt cache pays a 90% read discount and a 25% write surcharge on the 5-minute tier (Anthropic prompt caching docs, 2026). The write pays for itself by the second read (1.39 reads, rounded up). Most agentic workloads reuse a stable prefix dozens of times a session, so the math should land at a 70 to 80% effective discount. It usually lands at 30%. The gap is structural, not bug-shaped: prompts that interleave dynamic content with stable content; agent loops that pause longer than TTL; cache breakpoints placed downstream of a single volatile field that invalidates everything after it. This post names the discipline, calibrates the targets, and walks the math.

Key Takeaways

  • The 90% read discount is silent when it fails. ProjectDiscovery moved a production agent from a 7% to an 84% cache hit rate by relocating one timestamp out of the system prompt; total input cost fell 59% across 9.8 billion cached tokens (ProjectDiscovery, 2026).
  • The discipline has a name. Stable prefix, ephemeral suffix: every byte that is identical across calls within TTL belongs at the front; every byte that varies belongs at the back. The break is one dynamic field placed mid-prefix; the cost is everything below it.
  • The arXiv “Don’t Break the Cache” study measured 41 to 80% cost reduction across OpenAI, Anthropic, and Google on agentic benchmarks when the prefix was placed correctly (Lumer et al, January 2026). Naive full-context caching can paradoxically increase latency.
  • Five named failure modes cover most production cache misses, each with a distinct diagnostic: late breakpoint, prefix drift, tool definition instability, TTL mismatch, and 20-block lookback miss.
  • TTL-aware scheduling closes the loop. Under 4 minutes inter-turn use 5-minute TTL; 4 to 55 minutes use 1-hour TTL; over 55 minutes pre-warm at session start.

The 90% that most teams pay 60% of

Anthropic publishes a 90% read discount: cache reads cost 0.1x base input price (Anthropic pricing, 2026). Cache writes carry a 25% surcharge on the 5-minute TTL and a 100% surcharge on the 1-hour TTL. Break-even is 1.39 reads on the short tier and 2.22 reads on the long. Past that, every cache hit pays back 0.9x base. The arXiv “Don’t Break the Cache” study (Lumer et al, January 2026) measured 41 to 80% cost reduction across OpenAI, Anthropic, and Google on agentic benchmarks when the prefix was placed correctly. Most teams measure something closer to 30%. The delta is structural, not bug-shaped.

The arithmetic for one Anthropic-stated example tells the rest of the story. On Opus 4.7, 40,000 of 50,000 input tokens cached (an 80% cache ratio) drops the input bill from $0.25 to $0.07 per request: a 72% reduction on input cost (Anthropic Managed Agents docs, 2026). Most teams running long Claude Code sessions sit nowhere near that ratio. The bill stays high silently. The fields needed to detect the gap are already in your JSONL today: cache_read_input_tokens and cache_creation_input_tokens ride every API response. If you have not built the personal version of this measurement yet, start with the JSONL substrate; the team-level pipeline assumes you understand the shape it scales out of.
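The arithmetic behind that example can be reproduced in a few lines. This is a sketch under one assumption: a base input price of $5 per million tokens, which is implied by the $0.25 figure for 50,000 tokens rather than quoted from a price sheet. It covers cache reads only and ignores the write surcharge on the first call.

```python
def input_cost(total_tokens: int, cached_tokens: int,
               base_usd_per_mtok: float, read_multiplier: float = 0.1) -> float:
    """Input cost in USD for one request when `cached_tokens` are billed as cache reads."""
    if not 0 <= cached_tokens <= total_tokens:
        raise ValueError("cached_tokens must be between 0 and total_tokens")
    per_token = base_usd_per_mtok / 1_000_000
    uncached_tokens = total_tokens - cached_tokens
    return uncached_tokens * per_token + cached_tokens * per_token * read_multiplier


# Assumed base price, implied by $0.25 for 50,000 input tokens.
BASE_USD_PER_MTOK = 5.0

without_cache = input_cost(50_000, 0, BASE_USD_PER_MTOK)
with_cache = input_cost(50_000, 40_000, BASE_USD_PER_MTOK)
reduction = 1 - with_cache / without_cache
print(f"${without_cache:.2f} -> ${with_cache:.2f} ({reduction:.0%} lower)")
```

Changing `cached_tokens` shows how quickly the reduction shrinks as the cached share of the prompt falls.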

The stable prefix, ephemeral suffix discipline

Anthropic’s prompt cache works on a strict ordering hierarchy: tools, then system, then messages. A change in any tier invalidates that tier and everything downstream (Anthropic prompt caching docs, 2026). The discipline that follows is what I will call “stable prefix, ephemeral suffix”: every byte of content that is identical across calls within TTL belongs at the front; every byte that varies belongs at the back. The naming is deliberate. It is a constraint on the shape of context itself, on par with the context engineering frame Simon Willison coined in June 2025 (simonwillison.net, 2025), and the rest of this post defines it precisely.

The rigorous definition has three parts. Stable prefix is content that is byte-identical across N calls within TTL: system prompt, tool definitions, few-shot examples, static persona blocks, and any stable memory injection (Anthropic’s Memory tool, GA March 2026, can produce stable injections if the blocks themselves are constant). Cache breakpoints belong at the end of the longest stable run, not somewhere in the middle. Ephemeral suffix is content that varies per call: turn history beyond the cached window, runtime variables, timestamps, dynamic tool results, user input. It belongs strictly after the last breakpoint. The break is what fails most workloads. One dynamic field placed mid-prefix invalidates everything below it. ProjectDiscovery’s 7%-to-84% jump came from moving a single timestamp out of the system prompt and into a user message at the prompt tail.
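A local simulation makes the break visible. The sketch below hashes the blocks that sit before the breakpoint as a stand-in for a cache key; the hash is illustrative only and is not a claim about how Anthropic computes keys. The prompt text and both layout functions are hypothetical.

```python
import hashlib
import json


def prefix_fingerprint(blocks: list[str]) -> str:
    """Hash the pre-breakpoint blocks (a local stand-in for a cache key)."""
    return hashlib.sha256(json.dumps(blocks).encode("utf-8")).hexdigest()


def bad_layout(timestamp: str, question: str) -> tuple[list[str], list[str]]:
    # The timestamp is embedded in the system prompt, upstream of the breakpoint.
    prefix = [f"You are a security agent. Current time: {timestamp}",
              "tool definitions placeholder"]
    return prefix, [question]


def good_layout(timestamp: str, question: str) -> tuple[list[str], list[str]]:
    # The timestamp moves into the final user message, downstream of the breakpoint.
    prefix = ["You are a security agent.", "tool definitions placeholder"]
    return prefix, [f"[time: {timestamp}] {question}"]


calls = [("2026-04-01T10:00:00Z", "Scan host A"),
         ("2026-04-01T10:00:07Z", "Scan host B")]

# Count distinct prefix fingerprints across the two calls for each layout.
bad_prefixes = {prefix_fingerprint(bad_layout(t, q)[0]) for t, q in calls}
good_prefixes = {prefix_fingerprint(good_layout(t, q)[0]) for t, q in calls}
print(len(bad_prefixes), len(good_prefixes))
```

In the bad layout every call produces a new prefix, so nothing is ever reused; in the good layout the prefix is identical and only the tail varies.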

The four-breakpoint budget is the lever the discipline uses. Anthropic allows up to four explicit cache breakpoints per request, plus one auto-cache slot. The canonical sliding layout for a multi-turn agent uses all four, in the same order as the cache hierarchy: BP1 at the end of static tool definitions, BP2 at the end of static system, BP3 at the end of few-shot examples, and BP4 sliding inside the messages array up to the compaction window. A single breakpoint at the end of the system prompt is rarely enough for sessions past 20 turns; the 20-block lookback runs out before that breakpoint reaches the back of the conversation. Anthropic’s “Effective harnesses for long-running agents” (November 2025) treats context resets as mandatory at 24-hour scale; resets are misses on turn history but hits on a stable system prompt, if and only if the prefix is structured for it.
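A four-breakpoint layout might be assembled like this. It is a minimal sketch that only builds the request body as a plain dict, following the tools, system, messages ordering. The `cache_control` markers follow the shape in Anthropic's prompt caching docs; `MODEL_ID`, the sample tool, and the helper name `build_request` are placeholders, and treating few-shot examples as leading messages is an assumption.

```python
import copy
import json

STATIC = {"type": "ephemeral", "ttl": "1h"}  # long-lived blocks
SLIDING = {"type": "ephemeral"}              # default 5-minute tier


def text_message(role: str, text: str) -> dict:
    return {"role": role, "content": [{"type": "text", "text": text}]}


def build_request(tools: list[dict], system_text: str, few_shot: list[dict],
                  history: list[dict], user_input: str) -> dict:
    """Place breakpoints at tools end, system end, examples end, and history end."""
    tools, few_shot, history = (copy.deepcopy(x) for x in (tools, few_shot, history))
    if tools:
        tools[-1]["cache_control"] = STATIC                      # BP1: tools end
    system = [{"type": "text", "text": system_text,
               "cache_control": STATIC}]                         # BP2: system end
    if few_shot:
        few_shot[-1]["content"][-1]["cache_control"] = STATIC    # BP3: examples end
    if history:
        history[-1]["content"][-1]["cache_control"] = SLIDING    # BP4: sliding
    # The new user turn stays after the last breakpoint: it is the ephemeral suffix.
    messages = few_shot + history + [text_message("user", user_input)]
    return {"model": "MODEL_ID", "max_tokens": 1024,
            "tools": tools, "system": system, "messages": messages}


tools = [{"name": "get_weather", "description": "Look up weather",
          "input_schema": {"type": "object", "properties": {}}}]
few_shot = [text_message("user", "Example question"),
            text_message("assistant", "Example answer")]
history = [text_message("user", "Earlier turn"),
           text_message("assistant", "Earlier reply")]

request = build_request(tools, "You are a helpful agent.", few_shot, history, "New question")
breakpoints = json.dumps(request).count('"cache_control"')
print(breakpoints)
```

The 1-hour markers come before the 5-minute one, so the longer-lived blocks sit upstream of the sliding block.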

The practical rule is design-time. Architect the context for cacheability first, then add features. Reordering a finished prompt almost always breaks an existing cache that took weeks to land at its current hit rate. Reordering a draft costs nothing.

A workload taxonomy with calibrated hit rates

No public source gives engineers a calibration baseline for what hit rate to expect from their architecture. The Anthropic Claude Code team blog cites 92% for its own production fleet (claude.com/blog, 2026). The arXiv paper reports 41 to 80% cost reduction across agentic benchmarks. ProjectDiscovery shows task-complexity correlation: 1-step tasks land at ~35% hit rate, 20-step tasks at ~74%. A workload taxonomy makes those numbers actionable. Four archetypes cover most production agent and API workloads, each with a realistic target band and an explanation for why it lands there.

Claude Code agentic session, 50+ turns with compaction

Target 85 to 95% on the stable prefix, 70 to 85% on the messages tier. The prefix mass dominates (system prompt plus tool definitions plus few-shot), so the prefix ceiling is high. Compaction cycles regularly invalidate the messages cache, so the messages floor is lower. The Anthropic team blog’s 92% number comes from this archetype with explicit breakpoints and 1-hour TTL on the static blocks.

API integration with stable system and variable user input

Target 75 to 90% overall. A single explicit breakpoint at the end of the system prompt covers 80%+ of the total token mass; the only source of misses is user-message variability. Auto-caching on its own gets close to the lower band; an explicit breakpoint at the end of system reaches the upper.

Batch summarization or extraction pipeline

Target 90 to 98%. Identical system, tool definitions, and few-shot every call; only the document varies. The highest expected hit rate of any archetype because the prefix-to-suffix ratio is most favourable. Use 1-hour TTL with pre-warming (a max_tokens: 0 request) at the start of each batch run.

Bursty interactive chat with 5-plus-minute idle gaps

The worst case for the 5-minute default. With 5-minute TTL, expect 30 to 60% as idle gaps blow the cache between sessions. With 1-hour TTL, expect 70 to 85%. The TTL mismatch dominates everything else; the failure mode is structural, not prompt-shaped.

The diagnostic for “I sit in archetype 1 but I am measuring 30%” is almost always a dynamic field mid-prefix; walk the failure-mode tree below. The diagnostic for “I sit in archetype 4 and I am measuring 30%” is TTL mismatch; switch to 1-hour or implement keep-alive. The LangChain State of Agent Engineering 2026 finding that cost dropped to 18.4% as a stated production blocker, behind reliability at 41% (LangChain, 2026), is consistent with teams getting observability without optimizing the cache. Visibility arrived. The discipline did not.
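The taxonomy can be turned into a small lookup so a dashboard can say where a measured hit rate sits. This is a sketch: the bands are the ones given above, while the archetype keys, the hint strings, and the `assess` helper are hypothetical names. For the agentic archetype it uses the stable-prefix band, and for bursty chat it uses the 1-hour band as the target.

```python
# Target bands (low, high) taken from the four archetypes above.
TARGET_BANDS = {
    "claude_code_agentic": (0.85, 0.95),  # stable-prefix band; messages tier is 0.70 to 0.85
    "api_stable_system": (0.75, 0.90),
    "batch_pipeline": (0.90, 0.98),
    "bursty_chat": (0.70, 0.85),          # with 1-hour TTL; 5-minute TTL lands at 0.30 to 0.60
}

# First thing to check when an archetype falls below its band.
HINTS = {
    "claude_code_agentic": "look for a dynamic field mid-prefix",
    "bursty_chat": "TTL mismatch: switch to 1-hour TTL or add a keep-alive",
}


def assess(archetype: str, measured_hit_rate: float) -> str:
    """Compare a measured hit rate (0.0 to 1.0) with the archetype's target band."""
    low, high = TARGET_BANDS[archetype]
    if measured_hit_rate >= low:
        return "within or above band"
    hint = HINTS.get(archetype, "walk the failure-mode tree")
    return f"below band ({low:.0%} to {high:.0%}): {hint}"


print(assess("claude_code_agentic", 0.30))
print(assess("bursty_chat", 0.30))
print(assess("batch_pipeline", 0.95))
```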

Five named failure modes

Cache misses are silent. No error fires. The bill just stays high. Five named modes cover most production failures, and each has a distinct diagnostic. startdebugging.net documents five failure modes (Apr 2026 measurement guide, 2026); mager.co documents seven patterns (mager.co, 2026). Neither distinguishes late-breakpoint placement from prefix drift, which need different fixes.

1. Late breakpoint placement. Symptom: cache_creation_input_tokens greater than zero on every call but cache_read_input_tokens always zero. Diagnosis: a dynamic block (timestamp, runtime variable, working memory) sits between the cache breakpoint and the start of the prompt. The cache key includes the volatile content, so it never matches. Fix: relocate the dynamic field downstream of the breakpoint, into a user message at the prompt tail. This is the relocation trick ProjectDiscovery used to go from 7% to 84%.

2. Prefix drift. Symptom: hit rate that erodes slowly over weeks without any obvious code change. Diagnosis: chart cache hit rate over calendar time alongside deploy timestamps. A small system-prompt edit that nobody flagged as cache-relevant broke a stable prefix that had cached well for months. Fix: lock the system prompt under change control; surface drift via a hit-rate alarm in the team-cost dashboard. Drift is the failure mode that survives every other fix. It deserves its own monitor.

3. Tool definition instability. Symptom: cache misses on tool-heavy workloads despite the system prompt being stable. Diagnosis: tool schemas are JSON-serialized non-deterministically (key order varies between calls) or include dynamic descriptions (timestamps, version strings). Fix: sorted-key JSON serialization (json.dumps(tools, sort_keys=True)); strip volatile fields from tool descriptions.

4. TTL mismatch. Symptom: cache miss spike on a regular interval, every Nth turn at fixed cadence. Diagnosis: the agent loop’s inter-turn pause exceeds TTL. Fix: switch to 1-hour TTL (write surcharge 100% but pays back at three reads), implement a keep-alive ping at TTL times 0.8 intervals, or pre-warm with max_tokens: 0 at session start.

5. 20-block lookback miss. Symptom: cache miss despite correct breakpoint placement on long conversations. Diagnosis: more than 20 content blocks sit between the most recent breakpoint and the start of the prompt; the lookback window runs out. Fix: add a second breakpoint inside the messages array. The lookback is per breakpoint, so a second breakpoint extends the reach.

The instrumentation that surfaces all five lives in the same place: a PostToolUse hook reading cache_read_input_tokens and cache_creation_input_tokens on every API response and emitting a metric. Five lines of hook code; five failure modes diagnosable from one chart.
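The core of that instrumentation is a classifier over the two usage counters. The sketch below works on plain usage dicts and leaves the hook wiring out, since that depends on your harness. The label names, the `min_calls` threshold, and both helper functions are assumptions, not part of any API.

```python
from collections import Counter


def classify_call(usage: dict) -> str:
    """Label one API response from its cache counters."""
    read = usage.get("cache_read_input_tokens", 0) or 0
    written = usage.get("cache_creation_input_tokens", 0) or 0
    if read > 0 and written == 0:
        return "hit"
    if read > 0 and written > 0:
        return "partial"      # part of the prefix was reused, the rest rewritten
    if written > 0:
        return "write_only"   # cache written but nothing read back
    return "uncached"


def late_breakpoint_suspected(usages: list[dict], min_calls: int = 5) -> bool:
    """Symptom of failure mode 1: every call writes the cache and none reads it."""
    labels = [classify_call(u) for u in usages]
    return len(labels) >= min_calls and all(label == "write_only" for label in labels)


session = [{"cache_read_input_tokens": 0, "cache_creation_input_tokens": 12_000}] * 6
label_counts = Counter(classify_call(u) for u in session)
print(label_counts, late_breakpoint_suspected(session))
```

Charting the label counts over time is one way to separate a steady write-only pattern (late breakpoint) from periodic misses (TTL mismatch).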

TTL-aware scheduling: the math nobody published

Anthropic ships two TTL tiers: 5 minutes (write surcharge 25%) and 1 hour (write surcharge 100%). The break-even math determines which tier to use based on agent loop cadence. Five-minute pays for the write at 1.39 reads; one-hour pays for the write at 2.22 reads. Past those thresholds, every cache hit pays back 0.9x base regardless of tier. The choice is therefore not about cost per hit; it is about whether the cache survives the next idle gap.
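The 1.39 and 2.22 figures fall out of dividing the full write multiplier (1.25x or 2.0x base) by the 0.9x saved on each read. The sketch below reproduces them and, as a comparison that is not in the figures above, also shows the lower number you get if only the surcharge over an ordinary uncached call is counted.

```python
import math

READ_MULTIPLIER = 0.1                       # cache reads cost 0.1x base
WRITE_MULTIPLIER = {"5m": 1.25, "1h": 2.0}  # base price plus the write surcharge


def reads_to_repay(ttl: str, full_write: bool = True) -> float:
    """Reads needed for per-read savings to cover the write.

    full_write=True counts the whole write (the convention used in this post);
    full_write=False counts only the surcharge above a normal uncached call.
    """
    saving_per_read = 1.0 - READ_MULTIPLIER
    cost = WRITE_MULTIPLIER[ttl] if full_write else WRITE_MULTIPLIER[ttl] - 1.0
    return cost / saving_per_read


for ttl in ("5m", "1h"):
    exact = reads_to_repay(ttl)
    surcharge_only = reads_to_repay(ttl, full_write=False)
    print(ttl, round(exact, 2), math.ceil(exact), round(surcharge_only, 2))
```

Rounding up to whole reads gives two reads on the 5-minute tier and three on the 1-hour tier.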

The decision rule has three bands. Under 4 minutes inter-turn, 5-minute TTL is correct: the cache is alive when the next call arrives. Between 4 and 55 minutes, the 5-minute cache is dead by the time the next turn fires; switch to 1-hour TTL or implement a keep-alive ping at TTL times 0.8 intervals. Over 55 minutes, the discipline shifts again: pre-warm at session start with a max_tokens: 0 request, and use 1-hour TTL throughout. This last band is where scheduled agents, cron-driven sessions, and overnight batch jobs live. The March 6 2026 silent default change from 1-hour to 5-minute hit this band hardest; bursty cron-driven workloads went from a 70%+ effective discount to a near-zero one with no announcement.
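The three bands reduce to a small function. This is a sketch of the rule as stated; `CachePlan`, `plan_for_gap`, and `keepalive_interval_seconds` are hypothetical names. The boundaries at exactly 4 and 55 minutes are assigned to the middle band, which is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachePlan:
    ttl: str        # "5m" or "1h"
    prewarm: bool   # whether to send a warm-up request at session start


def plan_for_gap(gap_minutes: float) -> CachePlan:
    """Pick a TTL tier from the typical pause between turns."""
    if gap_minutes < 0:
        raise ValueError("gap_minutes must be non-negative")
    if gap_minutes < 4:
        return CachePlan(ttl="5m", prewarm=False)
    if gap_minutes <= 55:
        return CachePlan(ttl="1h", prewarm=False)
    return CachePlan(ttl="1h", prewarm=True)


def keepalive_interval_seconds(ttl_seconds: float) -> float:
    """Keep-alive cadence at 0.8 times the TTL."""
    return 0.8 * ttl_seconds


print(plan_for_gap(2), plan_for_gap(30), plan_for_gap(120))
print(keepalive_interval_seconds(300), keepalive_interval_seconds(3600))
```

The 0.8 factor on a 5-minute TTL works out to a ping every 4 minutes, which is where the first band boundary sits.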

Two adjacencies sharpen the math. Anthropic’s “Effective harnesses for long-running agents” treats context resets as mandatory at 24-hour scale. A reset is a cache miss on turn history but a cache hit on the stable system prompt at 1-hour TTL: the harness pays for one write per reset and reads it back across the next hour of operation. The Claude Managed Agents beta charges $0.08 per session-hour on top of token rates (Anthropic Managed Agents, 2026); that flat fee makes cache hit rate a direct lever on the infrastructure bill, not just the token bill. Cache discipline pays back twice on long-running workloads: once at the API line, once at the harness line. The output-cutting work covered in we-tried-to-cut-claude-output is the demand-side counterpart; cache-aware prompting is the input-side complement.

A five-line discipline

Cache-aware prompting is a five-line discipline applied at design time, not an optimization applied after the fact. Apply it before the agent is written, before the prompt template is finalized, before the first call goes out. The five lines are:

  1. Place every byte that is identical across calls within TTL at the front. System prompt, tool definitions, few-shot examples, stable memory blocks. Order them by stability descending, longest stable run first.
  2. Place every byte that varies per call at the back. Turn history beyond the cached window, timestamps, runtime variables, dynamic tool results, user input.
  3. Put the cache breakpoint at the end of the longest stable run. Use up to four breakpoints; place them at tier boundaries (tools end, system end, examples end, sliding turn-history end).
  4. Pick the TTL by inter-turn cadence. Under 4 minutes, 5-minute. Four to 55 minutes, 1-hour. Over 55 minutes, 1-hour with pre-warm at session start.
  5. Measure the hit rate every call. Log cache_read_input_tokens / (cache_read_input_tokens + cache_creation_input_tokens), counting both 5-minute and 1-hour writes in the denominator; alert on regression past a threshold.

The connection to the subagent-driven development pattern is direct. Each subagent gets a fresh context window; the cache discipline applies per-window. The orchestrator’s prompt is usually the largest stable block in the system; orchestrator-level structure determines orchestrator-level hit rate. Subagents do not inherit the orchestrator’s cache, but each subagent can inherit the same discipline. The five lines are content-free about scope; they apply identically at every layer.

The production check is one count. Run the discipline against an existing prompt by counting bytes that vary call-to-call. If more than 5% of total tokens are in volatile blocks above the breakpoint, the prompt is broken regardless of what the dashboard says. Move the volatile content downstream and remeasure.
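One way to run that count is to diff the pre-breakpoint blocks of two consecutive calls. The sketch below uses bytes as a proxy for tokens, which is an assumption. Alongside the volatile share it reports the share from the first changed block onward, since one dynamic field invalidates everything below it. The function name and sample blocks are hypothetical.

```python
def volatile_shares(call_a: list[str], call_b: list[str]) -> tuple[float, float]:
    """Compare pre-breakpoint blocks of two calls.

    Returns (volatile_share, invalidated_share) as fractions of call_b's bytes:
    bytes in blocks that changed, and bytes from the first changed block onward.
    """
    sizes = [len(block.encode("utf-8")) for block in call_b]
    total = sum(sizes)
    if total == 0:
        return 0.0, 0.0
    changed = [i >= len(call_a) or call_a[i] != block
               for i, block in enumerate(call_b)]
    volatile = sum(size for size, is_changed in zip(sizes, changed) if is_changed)
    first_changed = changed.index(True) if any(changed) else len(call_b)
    invalidated = sum(sizes[first_changed:])
    return volatile / total, invalidated / total


call_1 = ["S" * 2000, "time: 10:00:00", "T" * 3000]
call_2 = ["S" * 2000, "time: 10:00:07", "T" * 3000]
volatile, invalidated = volatile_shares(call_1, call_2)
print(f"volatile {volatile:.1%}, invalidated {invalidated:.1%}")
```

In this sample the changed block is tiny, but because it sits mid-prefix the invalidated share is most of the prompt, so position is worth checking as well as size.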

When you should not engineer for 90%

Not every workload deserves the discipline. Two filters: monthly Anthropic spend exceeds the cost of one engineer-week (roughly $5,000 to $8,000 loaded), or the workload is in archetype 1 or 3 (Claude Code agentic, batch pipeline) where the prefix-to-suffix ratio rewards cache placement most. Below that bar, automatic caching is closer to “good enough”; the gain from explicit four-breakpoint sliding layouts is real but not load-bearing.

The contrarian case is an honest one. Implicit caching is the long-term direction: OpenAI shipped automatic prompt caching in October 2024; Gemini implicit caching arrived May 2025; Anthropic’s January 2025 simplification moved in the same direction. Explicit-breakpoint discipline is, in the long run, transitional. The answer is that the long run has not arrived. A naive prompt that interleaves dynamic with stable content breaks the cache silently regardless of whether the annotation is automatic; measuring hit rate remains the necessary feedback loop even when annotation is not. And many production workloads are constrained to Anthropic for reasons that have nothing to do with caching (model quality on agentic coding, vendor lock-in, contract terms). Until implicit caching is universal and reliable, the discipline is the lever.

Frequently Asked Questions

What hit rate should I expect for a Claude Code agentic session?

85 to 95% on the stable prefix and 70 to 85% on the messages tier, with explicit breakpoints and 1-hour TTL on the static blocks. The Anthropic Claude Code team blog cites 92% for its production fleet (claude.com/blog, 2026). Lower numbers usually indicate a dynamic field mid-prefix; walk the failure-mode tree, starting with cache_creation_input_tokens greater than zero on every call.

Why is my cache_creation_input_tokens always greater than zero?

Late breakpoint placement. A dynamic block sits between the cache breakpoint and the start of the prompt, so the cache key includes the volatile content and never matches. Fix: move the dynamic field downstream of the breakpoint into a user message at the prompt tail. This is the relocation trick ProjectDiscovery used to go from 7% to 84% hit rate (ProjectDiscovery, 2026).

When should I use the 1-hour TTL instead of the 5-minute default?

When inter-turn pause exceeds 4 minutes. Break-even is 1.39 reads on the 5-minute tier and 2.22 reads on the 1-hour tier (Anthropic pricing, 2026). Long-running agents, scheduled cron jobs, and batch pipelines should always use 1-hour. Note: on 2026-03-06 Anthropic silently changed the server-side default from 1-hour to 5-minute, costing one developer 17.1% across 119,866 calls before they noticed (GitHub #46829, 2026).

Does prompt caching break with tool changes?

Yes, in two ways. Toggling a tool (web search on or off, for instance) invalidates the system and messages caches because the tools tier changed. Non-deterministic JSON serialization (key order varies between calls) silently breaks the cache even when the tool set is logically identical. Fix: sorted-key serialization (json.dumps(tools, sort_keys=True)); avoid dynamic content in tool descriptions (Anthropic prompt caching docs, 2026).
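The serialization fix can be sketched as a small canonicalizer. The volatile key names below are hypothetical, and sorting tools by name is an extra step beyond sorted keys that assumes tool order carries no meaning in your setup.

```python
import json

# Hypothetical names for fields that change between calls.
VOLATILE_KEYS = {"generated_at", "build_version"}


def canonical_tools(tools: list[dict]) -> str:
    """Serialize tool definitions so that logically equal sets give equal bytes."""
    cleaned = [{k: v for k, v in tool.items() if k not in VOLATILE_KEYS}
               for tool in tools]
    cleaned.sort(key=lambda tool: tool["name"])  # stable tool order across calls
    return json.dumps(cleaned, sort_keys=True, separators=(",", ":"))


# Same tool, different key order and a different volatile field value.
call_a = [{"name": "search", "description": "Search the docs",
           "input_schema": {"type": "object", "properties": {"q": {"type": "string"}}},
           "generated_at": "2026-04-01T10:00:00Z"}]
call_b = [{"input_schema": {"properties": {"q": {"type": "string"}}, "type": "object"},
           "generated_at": "2026-04-01T10:00:07Z",
           "description": "Search the docs", "name": "search"}]

naive_equal = json.dumps(call_a) == json.dumps(call_b)
canonical_equal = canonical_tools(call_a) == canonical_tools(call_b)
print(naive_equal, canonical_equal)
```

If your client serializes dicts itself, passing `json.loads(canonical_tools(tools))` gives it objects whose keys are already in sorted insertion order.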

What now?

Three takeaways for the build conversation.

  • The 90% read discount is a discipline, not a feature. The bill stays high silently when prompts interleave dynamic content with stable content. The fields needed to detect the gap are already in your JSONL.
  • Stable prefix, ephemeral suffix is a context-engineering sub-discipline. Apply it at design time, not as an after-the-fact optimization. Five lines of structure change the bill 5x.
  • Five named failure modes cover most production cache misses. Each has a distinct diagnostic; one PostToolUse hook surfaces all five.

Measure your hit rate this week. Run the formula on your last 1,000 API calls: cache_read / (cache_read + cache_creation_5m + cache_creation_1h). If you sit below the upper band of your archetype, walk the failure-mode tree. The bill is already high. The discipline closes the gap.
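A minimal version of that measurement, assuming each JSONL line is a JSON object with a top-level `usage` object holding the two counters (adjust the lookup if your logs nest it differently). It uses `cache_creation_input_tokens` as the total of cache writes across both TTL tiers.

```python
import json
from typing import Iterable


def hit_rate(jsonl_lines: Iterable[str], last_n: int = 1000) -> float:
    """Cache reads as a share of cache reads plus cache writes over the last N calls."""
    records = [json.loads(line) for line in jsonl_lines if line.strip()]
    reads = writes = 0
    for record in records[-last_n:]:
        usage = record.get("usage") or {}
        reads += usage.get("cache_read_input_tokens", 0) or 0
        writes += usage.get("cache_creation_input_tokens", 0) or 0
    total = reads + writes
    return reads / total if total else 0.0


sample_log = [
    '{"usage": {"cache_read_input_tokens": 0, "cache_creation_input_tokens": 10000}}',
    '{"usage": {"cache_read_input_tokens": 10000, "cache_creation_input_tokens": 0}}',
    '{"usage": {"cache_read_input_tokens": 10000, "cache_creation_input_tokens": 0}}',
    '{"usage": {"cache_read_input_tokens": 10000, "cache_creation_input_tokens": 0}}',
]
rate = hit_rate(sample_log)
print(f"{rate:.0%}")
```

Because the function accepts any iterable of lines, an open file handle on your own log can be passed directly in place of `sample_log`.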
