Agent Memory Architecture: Four Memories, Four Fixes

22 min read

*Figure: a workshop wall with four labelled drawers (working, episodic, semantic, procedural) feeding a Claude Code session, a metaphor for the four-memory architecture mapped onto MEMORY.md, JSONL transcripts, the wiki, and skills.*

Your agent re-asks for the constraint you laid out two hours ago. It repeats a decision you made yesterday. It re-reads the same transcript to remember what got built. The complaint compresses to “my agent forgot,” and the fix is not a longer window. All eighteen frontier models tested in Chroma’s Context Rot study lose accuracy as input grows, and 200K-window models start degrading by 50K tokens (Chroma, 2025). Coding agents accumulate 80,000 to 150,000 tokens within roughly thirty-five minutes in the same study. Stuffing more bytes into one bucket does not make the bucket bigger; it makes the rot land sooner.

CoALA names four memory types for a reason: working, episodic, semantic, procedural (arXiv:2309.02427, Sumers et al., 2023). Each one has a different write rate, a different consolidation discipline, and a different retrieval shape. Conflating them is what causes the “my agent forgot” complaint even when the bytes are technically still in the context. Claude Code already ships a primitive for each of the four. .remember/now.md is working memory. Session JSONL transcripts are episodic. MEMORY.md and ~/.wiki are semantic. Skills and subagents are procedural. Anthropic’s Memory tool (GA 2026-04-23) is one slot in this map, not the map itself. This post lays the four-to-four mapping down end to end, and names what the Memory tool solves and what it does not.

Key Takeaways

  • All 18 frontier models tested by Chroma rot under longer input; 200K-window models degrade noticeably by 50K tokens, and coding agents hit 80K-150K tokens within ~35 minutes (Chroma, 2025).
  • CoALA names four memories (working, episodic, semantic, procedural); each has a different write rate, consolidation discipline, and retrieval shape (arXiv:2309.02427, 2023).
  • Claude Code already has a primitive for each slot: .remember/now.md (working), JSONL transcripts (episodic), MEMORY.md plus ~/.wiki (semantic), skills plus subagents (procedural).
  • Anthropic’s Memory tool (GA 2026-04-23, identifier memory_20250818) is the right primitive for the semantic slot and the wrong primitive for the other three (Anthropic Memory tool docs, 2026).

Why does context length stop being the answer?

Context length is necessary, not sufficient. Chroma’s Context Rot study tested eighteen frontier models across input lengths from roughly 100 tokens to nearly the full advertised window and found accuracy degradation in all of them; 200K-window models showed noticeable rot by 50K tokens, and the degradation was non-uniform, with semantic similarity between needle and question, distractor density, and haystack structure all moving the curve (Chroma, 2025). The window keeps shipping. The rot keeps shipping with it.

The position effect is foundational, not a single-paper artefact. Liu et al. showed that relevant content placed in the middle of the input is recovered worse than the same content at either end (arXiv:2307.03172, 2023; peer-published in TACL 2024). Anthropic’s own framing names the constraint directly: context exhibits roughly n-squared token-pair relationships and “context is a finite resource with diminishing marginal returns” (Anthropic Engineering, Effective context engineering for AI agents, 2025). The reframe matters because most teams reach for a longer window first and a memory architecture second, and the order should be reversed.

Memory is the discipline of writing less in and reading less out. It isn’t the discipline of buying a longer window, because the longer window doesn’t preserve the per-token attention quality that made the short window work.

The next four sections name the four jobs that “writing less in” decomposes into, and the four Claude Code primitives that already serve them. The whole point of separating jobs is that each gets its own write rate, its own consolidation step, and its own retrieval shape. None of that is possible inside one bucket, even when the bucket is large.

What are the four memories an agent actually needs?

CoALA names four memory types: working, episodic, semantic, procedural. Each has a different operational signature, and conflating them is what causes the “my agent forgot” complaint even when the bytes are technically still in context. The taxonomy is older than agents. Tulving named episodic and semantic in 1985. Anderson’s ACT-R named procedural and declarative. CoALA’s contribution is naming all four together as the substrate language model agents actually need (arXiv:2309.02427, Sumers et al., 2023).

Working is volatile. Episodic is append-only. Semantic is consolidated. Procedural is compiled. The repetition matters because each slot’s signature is what determines where Claude Code already serves it. Working memory is the live scratchpad the agent acts on right now: high write rate, no consolidation, lost at the end of the session by design. Episodic memory is the time-indexed record of what happened: structured events, ordered, never edited. Semantic memory is consolidated facts about the world: low write rate, key-value or graph, retrievable by phrase. Procedural memory is how-to: compiled into a routine, parameterised, retrievable by trigger.
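The four signatures can be written down as a small routing table. A sketch, assuming nothing beyond the mapping described above; the dict and helper are illustrative data structures, not anything Claude Code exposes:

```python
# Four CoALA slots mapped to the Claude Code primitives this post names.
# Each entry: (primitive, write_rate, consolidation, retrieval_shape).
MEMORY_GRID = {
    "working":    (".remember/now.md",    "high",   "rolled daily",    "read whole buffer"),
    "episodic":   ("session JSONL",       "append", "summarise later", "query by event"),
    "semantic":   ("MEMORY.md + ~/.wiki", "low",    "human-curated",   "lookup by phrase"),
    "procedural": ("skills + subagents",  "rare",   "compiled",        "trigger on condition"),
}

def slot_for(fact_kind: str) -> str:
    """Route a piece of information to its slot by the post's heuristics."""
    return {
        "in-flight":    "working",    # won't survive the next compact
        "event":        "episodic",   # ordering and forensics matter
        "durable-fact": "semantic",   # recovered by phrase
        "how-to":       "procedural", # executed, not retrieved
    }[fact_kind]
```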

The grid is the post in one image. The next four sections walk each cell and name what each primitive’s discipline actually looks like in a working repo. Two things to notice before that. First, the four cells aren’t interchangeable. Putting durable facts in working memory smears them across stale buffer state; replaying episodic memory raw into the live context drags the rot from the first H2 back at full strength.

Second, the parent framework that decides which bytes go into the window for this turn operates on top of these four storage layers. This post is the storage layer, and context engineering is the retrieval layer that runs on top.

Working memory is the live scratchpad, not a database

Working memory is the buffer the agent reads and writes during the current task. Volatile is the feature, not the bug. In a Claude Code setup, the file ~/.claude/projects/<project>/memory/.remember/now.md plays this role: a rolling timestamped log that captures the in-flight decision, the current branch, and the open question. Anthropic’s own context-engineering essay describes the same mechanic as note-taking in a sub-agent’s working buffer (Anthropic Engineering, 2025); the file is the durable version of that buffer that survives a single context compact without smearing into the system prompt.

The most common mistake is putting durable facts here. Working memory gets rewritten every session. A fact that lands in now.md and never rolls forward gets smeared across stale buffer state, and when the next session re-ingests it the agent treats stale state as canonical. The fix is consolidation, not retention. The four-tier pipeline this site actually runs:

```markdown
## 14:18 | master
Compressed Q2 calendar to Mon/Thu cadence; pivoted topic 0 to superpowers.
Brief locked, voice anchor pulled from prior arc.
## 14:31 | master
Validated brief stats against pylon PR audit (MCP=0); calendar updated.
## 15:12 | master
Outline shipped: 7 H2s, 3 charts (line, 2x2, lollipop). Ready for /blog write.
```

Each rolling entry is a compressed line or two under a ## HH:MM | branch heading. At end of day, now.md rolls into today-YYYY-MM-DD.md (per-day compression). Every seventh day, the daily files roll into recent.md (a 7-day window). Older still, recent.md rolls into archive.md (the long tail). Each step enforces a write-amplification budget on the way to the semantic store; the durable distillation surfaces in MEMORY.md only after it has survived the chain.

The heuristic is short: write here when the fact will not survive the next compact. If the fact should survive the next session, this is the wrong slot, and the fix is to send it down the consolidation pipeline. Failure mode when working memory leaks into the durable store: the agent re-ingests stale buffer contents as if they were canonical, and the rot from the first H2 hits even harder because the noise is high-relevance noise. The buffer isn’t the journal; the journal is what the buffer compresses into after the fact.
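The end-of-day roll can be sketched in a few lines. The tier names follow the pipeline above; the helper itself is illustrative, not part of Claude Code:

```python
from datetime import date
from pathlib import Path

def roll_working(mem: Path) -> Path:
    """End-of-day roll: append now.md's entries to a per-day file, then
    empty the buffer. Volatile is the feature: the next session starts
    with a clean scratchpad, and durability comes from the chain, not
    from retention in the buffer."""
    now = mem / ".remember" / "now.md"
    daily = mem / ".remember" / f"today-{date.today().isoformat()}.md"
    if now.exists() and now.read_text().strip():
        prev = daily.read_text() if daily.exists() else ""
        daily.write_text(prev + now.read_text())
        now.write_text("")  # cleared by design, not by accident
    return daily
```

The weekly and archive rolls are the same move at a coarser grain: concatenate, compress, clear.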

Episodic memory is the append-only log, not the summary

Episodic memory is the time-indexed record of what happened. In Claude Code that is the per-session JSONL transcript: every prompt, tool call, response, and hook event in order. The right operational shape is append-only, structured, queryable by event type, never edited in place. Letta’s MemGPT-style work names the role explicitly: the conversation log is the episodic store, and the consolidated profile is semantic (Letta, Benchmarking AI Agent Memory, 2025). The split is not site-specific. It is the operational standard.

The load-bearing distinction is between transcript and summary. The transcript is the only ground truth you have for “did the agent do that?” The summary is a compressed read of the transcript, and it lives in semantic memory, not episodic.

Two failure modes follow. Over-summarisation drops the only forensic record that can answer the question “which tool fired in which order, with what arguments, against what response.” The summary tells you what someone wanted to remember; the transcript tells you what actually happened. The audit story breaks the moment the two diverge, because there’s no replay to compare against. Under-replay is the opposite mistake. Reading the JSONL into the live context for decision support brings the rot back at full strength, this time as structured-event noise that’s high-relevance to the model and therefore extra-distracting.

The discipline that works: append every event, summarise selectively, replay almost never. The summary writes go to semantic memory under a stable key. The transcript stays where it is, queryable by tool type, session id, or event timestamp. Forensic value comes from the fidelity of the structured payload, not the prose summary on top. The same per-session transcript is also the eval surface for the verification half that just shipped: control-plane assertions read these payloads to decide pass or fail, and the assertions only work if the transcript’s intact. Treat episodic memory as a read-only history that other layers consume.
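"Queryable by tool type, session id, or event timestamp" is a one-pass scan over the JSONL. A sketch; the field names ("type", "tool") are illustrative, since real transcript schemas may differ, and nothing here edits the log or replays it into context:

```python
import json
from pathlib import Path

def events_by_tool(transcript: Path, tool_name: str):
    """Yield tool-call events for one tool from a per-session JSONL
    transcript. Read-only: the episodic store is history that other
    layers consume, never a buffer they mutate."""
    with transcript.open() as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            event = json.loads(line)
            if event.get("type") == "tool_call" and event.get("tool") == tool_name:
                yield event
```

A control-plane assertion is then a predicate over this iterator, not a prose summary.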

Where does semantic memory live in Claude Code?

Semantic memory is consolidated facts about the world, retrievable by key. In Claude Code that role is split between MEMORY.md (project-and-user-scoped, agent-mutable, lives in ~/.claude/projects/<project>/memory/) and ~/.wiki (cross-project, durable, query-shaped). Anthropic’s Memory tool, GA on 2026-04-23 with model identifier memory_20250818, is a third viable primitive for this slot: a sandboxed /memories/ directory the model can read, write, and delete, with built-in path traversal protection (Anthropic Memory tool docs, 2026). The right choice depends on how much consolidation you want the model to do versus the human.

The shape that travels well is an index plus pointers. Each entry is one line: a human-readable title, a file pointer, and a one-line hook describing why the entry exists. The index loads at session start; the pointed-to file loads only when the agent needs the body. Concretely:

```markdown
- [Stack conventions](stack.md) -- Bun over Node; Bun.serve, bun:sqlite, bun test by default
- [Voice anchor](voice.md) -- terse, declarative, no emdashes; cite as ([Source](url), YEAR)
- [Chart conventions](charts.md) -- inline SVG, data-XX attrs, dark-mode in global.css
- [Pylon audit data](pylon-audit.md) -- pre-merge MCP=0; post-merge per-agent 0.6 to 2.6
- [Q2 calendar](calendar-2026-q2.md) -- Mon/Thu cadence; verification arc closed 2026-05-18
```

The benchmarking evidence backs the compact-wins framing. Mem0 reaches 91.6 on the LoCoMo benchmark at fewer than 7,000 tokens against 25,000-plus for full-context baselines, with 91% lower p95 latency and 26% better LLM-as-a-judge accuracy (arXiv:2504.19413, Mem0, 2025). The intuition is straightforward: a flat key-value store the model retrieves by phrase keeps signal-to-noise far higher than a transcript dump that pretends every byte is equally relevant. The wiki side of the same slot covers cross-project recall; the query CLI behind it is what makes long-tail knowledge retrievable across repos. Heuristic for the slot: semantic memory is for facts you want recovered by phrase. If recency matters, episodic was the right slot. If "how-to" matters, procedural was.
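The index-plus-pointers shape is cheap to operate on. A minimal sketch of the two halves, assuming the one-line index format shown above; the parser is illustrative, not how Claude Code itself reads MEMORY.md:

```python
import re
from pathlib import Path

# Matches the one-line index shape: - [Title](pointer.md) -- why the entry exists
INDEX_LINE = re.compile(r"-\s+\[(?P<title>[^\]]+)\]\((?P<file>[^)]+)\)\s+--\s+(?P<hook>.+)")

def load_index(memory_md: Path) -> dict:
    """Parse MEMORY.md into {title: (pointer, hook)}. Only this index
    loads at session start; bodies stay on disk until needed."""
    entries = {}
    for line in memory_md.read_text().splitlines():
        m = INDEX_LINE.match(line.strip())
        if m:
            entries[m["title"]] = (m["file"], m["hook"])
    return entries

def load_body(memory_md: Path, entries: dict, title: str) -> str:
    """Dereference a pointer only when the agent needs the body."""
    pointer, _hook = entries[title]
    return (memory_md.parent / pointer).read_text()
```

The one-line hook is what keeps retrieval-by-phrase honest: the index answers "does an entry exist and why," and the body loads only on a hit.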

Procedural memory: skills and subagents

Procedural memory is how-to. The agent does not retrieve a fact; it executes a routine. In Claude Code, skills (progressive-disclosure-shaped definitions in ~/.claude/skills/ and plugins) and subagents (specialised executors invoked by the parent) play this role. Voyager’s procedural skill library is the canonical proof that procedural memory pays interest: a Minecraft agent that builds a library of reusable skills explores 3.3x more unique items and reaches milestones 15.3x faster than a memoryless baseline (arXiv:2305.16291, Voyager, Wang et al., 2023). Procedural memory compounds. Semantic memory does not.

The split that matters in practice is fact versus routine. “We use Bun, not Node” is a fact. It belongs in MEMORY.md or CLAUDE.md, not in a skill, and the agent never executes it. “Run /blog write against the agent-memory-architecture brief and outline, then build” is a routine. It belongs in a skill, because the agent executes it the same way every time given the same trigger. The triggering condition is metadata; the body is the executable instructions. Progressive disclosure is the mechanism that keeps both cheap to host. Skills surface metadata at zero context cost and load body on trigger, which matches how procedural memory works in cognition: you do not load every routine you know on every task; you trigger one when its conditions appear.

Subagents are procedural workers in a different shape. The parent’s procedural store is the catalogue of subagent types, each with a known dispatch contract: name, expected input shape, expected output shape, expected runtime behaviour. Triggering the right subagent is the same retrieval operation as triggering the right skill, just at a coarser grain (a feature instead of a step). Skills carry both procedural signal (the body) and semantic signal (the metadata that makes them retrievable); the dual nature is real and the procedural slot is dominant. Heuristic: if the answer is the same every time, write a skill. If the answer is a fact, write to MEMORY.md. If the answer is event ordering, read the JSONL.
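Progressive disclosure reduces to two operations: surface metadata for free, load body on trigger. A sketch under a deliberately simplified one-file-per-skill layout (the real skill format in ~/.claude/skills/ is richer; this only shows the mechanic):

```python
from pathlib import Path
from typing import Optional

def skill_catalogue(skills_dir: Path) -> dict:
    """Surface only each skill's first line (its trigger metadata) at
    zero body cost. Hypothetical layout: one .md file per skill whose
    first line describes the triggering condition."""
    catalogue = {}
    for path in sorted(skills_dir.glob("*.md")):
        with path.open() as f:
            catalogue[path.stem] = f.readline().strip()
    return catalogue

def trigger(skills_dir: Path, catalogue: dict, condition: str) -> Optional[str]:
    """Load a skill body only when its metadata matches the condition."""
    for name, meta in catalogue.items():
        if condition.lower() in meta.lower():
            return (skills_dir / f"{name}.md").read_text()
    return None
```

The catalogue is the semantic signal (retrievable metadata); the loaded body is the procedural signal (executable routine). The dual nature from the paragraph above is visible in the two functions.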

Where does the Anthropic Memory tool fit?

Anthropic shipped the Memory tool to GA on 2026-04-23 with model identifier memory_20250818, available across Claude Sonnet 4 and 4.5, Claude Opus 4, 4.1, and 4.5, and Claude Haiku 4.5. The tool exposes commands over a sandboxed /memories/ directory with path traversal protection, no end-of-context-window auto-clearing (it is a separate persistence layer the user controls), and is implemented client-side (Anthropic Memory tool docs, 2026). The tool is exactly the right shape for the semantic slot from the previous H2: a flat, agent-mutable, file-shaped store that survives the context window. It is the wrong shape for the other three slots, and treating it as the answer collapses four problems back into one.

The Memory tool persists. Working memory should not. The Memory tool is free-text-shaped. Episodic transcripts are structured-event-shaped. The Memory tool is flat-file-shaped. Procedural memory needs progressive-disclosure layering on top of a mutable file, which the tool doesn’t provide. Three lines, three reasons the same primitive cannot serve four jobs.

None of those is a defect of the tool. The docs are honest about scope, and the seven-command surface (view, create, str_replace, insert, delete, rename, plus a list-shaped view) is the right minimal API for managed semantic memory. The slot is correct; the architecture is the four-to-four mapping that wraps the slot.
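Every command on that surface routes through the same sandboxing step. A minimal sketch of the path-traversal guard the docs describe, assuming a /memories/-prefixed path convention; this is illustrative, not Anthropic's implementation:

```python
from pathlib import Path

def resolve_memory_path(path: str, sandbox: Path) -> Path:
    """Resolve a model-supplied path against the sandboxed memory root
    and reject anything that escapes it. Client-side, because the tool
    itself is implemented client-side."""
    relative = path.lstrip("/").removeprefix("memories/")
    candidate = (sandbox / relative).resolve()
    if not candidate.is_relative_to(sandbox.resolve()):
        raise PermissionError(f"path escapes memory sandbox: {path}")
    return candidate
```

Resolving before checking is the load-bearing order: a prefix check on the raw string would wave `../` sequences through.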

What the Memory tool solves well: long-running agents that need cross-conversation semantic recall without the user wiring MEMORY.md and the consolidation pipeline by hand. Managed sandbox, agent-mutable, no auto-clearing, predictable persistence semantics, shipped across the current Claude family. The trade is consolidation control. Hand-maintained MEMORY.md lets the human decide what becomes durable, when, and under what key. The Memory tool lets the model decide. Neither is universally better; they’re the same slot under different governance.

Anthropic’s own context-engineering essay frames the design principle that applies to both: “tools should encourage efficient context use and target a high signal-to-noise ratio” (Anthropic Engineering, 2025). The tool meets that bar for the semantic slot, and isn’t designed to meet it for four. The architecture is the four-to-four mapping. The tool is one of the four. The post ships the map.

Frequently Asked Questions

Is the Anthropic Memory tool the same as long-term memory?

Partly. The Memory tool is a managed semantic-memory primitive with model-mutable read, write, and delete on a sandboxed /memories/ directory, available across Claude Sonnet 4, 4.5, Claude Opus 4, 4.1, 4.5, and Claude Haiku 4.5 since 2026-04-23 (Anthropic Memory tool docs, 2026). It is one of the four CoALA slots (semantic). It is not designed to serve working, episodic, or procedural; using it that way collapses the taxonomy and reintroduces the context rot the architecture is meant to avoid.

How does this relate to context engineering?

Memory is the durable substrate context engineering operates on. Context engineering decides which bytes go into the window for this turn (Anthropic Engineering, 2025). Memory architecture decides where bytes live between turns. The four-memory split is the storage layer; context engineering is the retrieval layer that runs on top, and the parent framework that decides where each piece goes is the load-bearing companion to this post.

Is a vector store enough for semantic memory?

Sometimes. Mem0’s compact-semantic results show 91.6 on the LoCoMo benchmark at fewer than 7,000 tokens against 25,000-plus for full-context baselines, with 91% lower p95 latency and 26% better LLM-as-a-judge accuracy (arXiv:2504.19413, Mem0, 2025). Compact semantic structures beat full-context dumps. A flat key-value file like MEMORY.md plus a query-shaped wiki covers most working setups; a vector store buys retrieval over larger corpora at the cost of the consolidation discipline.

What about CLAUDE.md? Where does that fit?

CLAUDE.md is read at session start as part of the prompt, not as recall. It is closer to the system prompt than to semantic memory: standing instructions, not retrieved facts. Treat CLAUDE.md as the constitution and MEMORY.md as the journal. The constitution does not change every session; the journal does. The journal is the slot the Memory tool, the wiki, and a vector store are all candidates for. The constitution sits one layer above the architecture this post describes.

The Real Argument

One bucket cannot serve four jobs, and stuffing more bytes into the window doesn’t solve it because rot is real and shows up in every model tested. CoALA’s four memories (working, episodic, semantic, procedural) map cleanly onto Claude Code primitives that already exist: .remember/now.md, JSONL transcripts, MEMORY.md plus ~/.wiki, skills plus subagents. The Anthropic Memory tool is the right primitive for the semantic slot and the wrong primitive for the other three; treating it as the architecture collapses four problems back into one and reintroduces the rot.

The operational discipline is consolidation, not retention. Working rolls into daily, daily into weekly, weekly into archive, archive distills into semantic. Procedural compounds with use, and skills are the slot that pays interest. The pillar arc, in one sentence: verification asks “is the answer correct?”; memory architecture asks “is the substrate correct?”; both are the same engineering discipline at different layers, and both depend on the team-member framing that says agent infrastructure deserves the same care as any other production substrate.

Pick one of the four buckets. Audit which Claude Code primitive currently serves it. Write down the consolidation step that pushes it to the next tier, and ship one row in your repo. The next post in the queue (claude-code-hooks-substrate, Mon 2026-05-25) is on hooks: the only deterministic substrate Claude Code ships, and the layer that can enforce the consolidation discipline this post describes. Memory is the substrate. Hooks are the enforcement. Evals are the assertion. One pillar, three pieces.
