Long-Running Autonomous Agents: Drift, Checkpointing, Recovery
Your agent ran for 14 hours. The first 3 were perfect. The last 11 destroyed it. How would you know? METR’s January 2026 benchmark put Claude Opus 4.5 at a 5.3-hour 50%-time-horizon on software tasks; the February 2026 update on Opus 4.6 pushed the point estimate to 14.5 hours, with the 95% confidence interval stretching from 6 hours to 98 hours and the benchmark suite itself near saturation (METR, 2026-02-20). OpenAI ran Codex uninterrupted for about 25 hours over 13 million tokens that same month (OpenAI Developers, 2026).
Across 10 frontier models, aggregate pass@1 drops from 76.3% on short tasks to 52.1% on very-long tasks, a 24.3-point collapse driven by trajectory failures the paper names “meltdowns” (Khanal et al., arXiv:2603.29231, 2026). The failure shape changed; the harness discipline has not caught up. Through 2024, agent failure meant a wrong answer at one step. By 2026, agent failure means a right answer 11 hours ago and a drifted trajectory since. Drift is not a hallucination. Drift is not a crash. Drift is its own thing, and it needs its own discipline: drift as a structured eval over running state, checkpoint on threshold not on interval, and fork instead of restart, with a typed handoff artifact that survives the cut.
Key Takeaways
- METR’s Opus 4.6 50%-time-horizon point estimate is 14.5 hours with a 95% CI of 6 to 98 hours; the benchmark suite is near saturation, so the upper bound is noisy by design (METR, 2026-02-20).
- Aggregate pass@1 across 10 frontier models drops 24.3 points (76.3% to 52.1%) from short to very-long tasks; Software Engineering GDS collapses from 0.90 to 0.44 (Khanal et al., arXiv:2603.29231, 2026).
- Anthropic’s own engineering blog shipped two harness designs in five months and named “context anxiety” and “premature victory” as load-bearing failure modes (Anthropic, 2026-03-24).
- A 6-hour Opus 4.5 session in Anthropic’s harness experiments cost $200; a 3-hour-50-minute Opus 4.6 session cost $124.70. Without checkpointing, every trajectory failure means buying that envelope again (Anthropic, 2026-03-24).
How did the agent failure mode shift in 2025?
Through 2024, evaluation was offline and output-shaped: run a benchmark, score the patch, ship or do not. Failure meant a wrong patch. By 2025-2026 the failure vocabulary moved to trajectory-shape, with named modes like “context overflow,” “goal drift,” “compaction loss,” “premature victory,” “endless file reading,” and “context anxiety.” These are not new names for old hallucinations. They are correct-seeming steps that compound into wrong trajectories over hours.
SWE-bench Pro’s failure analysis makes the shift concrete. On Claude Sonnet 4, 35.6% of failures are context overflow and 17% are endless file reading; on Claude Opus 4.1, the primary mode is “Wrong Solution” (plausible patch, wrong problem) at 35.9% of failures, which the paper classifies as goal drift (arXiv:2509.16941, 2026). The “endless file reading” pattern is the giveaway: the agent senses it has lost state and re-reads the same files repeatedly, paying for tokens without making progress. Anthropic’s March 2026 harness design post coined “context anxiety” for the same family of failures, describing Sonnet 4.5 “wrapping up work prematurely as it approaches what it believes is its context limit” (Anthropic Engineering, 2026). Premature wrap-up is drift in operational form: the agent shifts from building to summarizing before the work is done.
The METR curve frames the timing. The post-2023 cohort doubling time is 131 days (roughly 4.3 months); the longer 2019-2025 series doubles every 7 months (METR TH1.1, 2026). GPT-4 1106 sat at 3.6 minutes. Claude Opus 4 reached 101 minutes. Claude Opus 4.5 reached 320 minutes. Opus 4.6’s point estimate is 14.5 hours, with a CI that spans an order of magnitude because the suite is near saturation. Capability moved faster than the harness discipline that has to catch the failures capability enables.
Cost stakes follow the capability curve. Anthropic’s harness experiments cost $200 for a 6-hour Opus 4.5 Retro Game Maker session and $124.70 for a 3-hour-50-minute Opus 4.6 Digital Audio Workstation session (Anthropic Engineering, 2026). Restart without a checkpoint means buying that envelope again. An agent that crashes is a solved problem; an agent that succeeds at the wrong task is not. Those two failure shapes need different primitives. The rest of the post names them.
Why is drift the third half of reliability?
Temporal’s April 2026 post divides agent reliability into two halves: model reliability (does the LLM produce correct output) and infrastructure reliability (does the system survive crashes). The post itself admits “we are still only solving half of it” (Temporal, 2026). Neither half catches an agent whose trajectory drifted while every model call returned reasonable output and every infrastructure layer stayed up. That gap is the third half of reliability: behavioral coherence over time, and it needs its own engineering primitives.
Drift is not a crash. Drift is not context overflow. Drift is not a hallucination. Crash recovery is solved by Temporal, Inngest, Restate, Cloudflare Fibers, and LangGraph PostgresSaver; none of these recover a drifted trajectory, they replay a crashed one. Context overflow is solved by compaction. Anthropic’s compaction beta (anthropic-beta: compact-2026-01-12, trigger threshold 150,000 input tokens by default) reduces window pressure but does not detect or recover from a drifted history; in practice compaction often hides drift because the compactor produces a clean-looking summary of an already-drifted trajectory (Anthropic compaction docs, 2026). Hallucinations are wrong facts at one step; drift is a chain of individually-plausible facts pointed at the wrong task.
The academic frame names three drift types. Semantic drift is the agent’s working definition of the goal shifting. Coordination drift is multi-agent or subagent communication degrading. Behavioral drift is action patterns deviating from the established trajectory. The Agent Stability Index measures these across 12 dimensions and reports a 42% relative drop in task success (87.3% to 50.6%, a fall of 36.7 percentage points) under sustained drift, with the degradation accelerating nonlinearly in later phases (arXiv:2601.04170, Rath, 2026). Twelve dimensions is the rigorous instrument. The next section ships the three-signal practitioner subset that runs in a hook.
The third-half framing matters operationally. If drift is not named, it gets misattributed, either to model reliability (“wait for a smarter model”) or to infrastructure reliability (“add another retry”). Neither response addresses the failure. Name the gap, instrument it, and ship recovery primitives that fit its shape rather than reusing primitives that fit the other two.
Drift as a structured eval over running state
A drift detector is not a prompt; it is an eval running against a slice of agent state. Anthropic’s own reference implementation in the Code with Claude 2026 workshop ships the shape. The Default-FAIL evaluator subagent is a separate agent with reduced tools (Read, Glob, Grep, Bash for git diff only), no visibility into how the work was built, and a strict output contract: return bare PASS or NEEDS_WORK on the first line for programmatic parsing. The instruction reads, “Plausibility is not correctness. A diff that looks reasonable paired with a screenshot that shows a broken layout is NEEDS_WORK” (anthropic/cwc-long-running-agents, 2026). That is drift-as-eval in operational form. Generalize it.
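The output contract is what makes the evaluator usable from a script. A minimal sketch of the parsing side, assuming the evaluator's reply arrives as a plain string; the function name `parse_verdict` is illustrative and not taken from the Anthropic repository. Anything that is not a bare PASS on the first line is treated as NEEDS_WORK, which is one way to read "Default-FAIL".

```python
from typing import Literal

Verdict = Literal["PASS", "NEEDS_WORK"]


def parse_verdict(evaluator_output: str) -> Verdict:
    """Return PASS only when the first line is exactly 'PASS'.

    Empty output, prose, or a malformed first line all fall through to
    NEEDS_WORK, so a confused evaluator can never wave work through.
    """
    lines = evaluator_output.splitlines()
    first_line = lines[0].strip() if lines else ""
    return "PASS" if first_line == "PASS" else "NEEDS_WORK"
```

The strictness is deliberate: a reply such as "Looks like a PASS to me" fails the check, because the contract asks for a bare token and not a sentence that contains one.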
Three signals, each cheap, each independently useful, each computable without an additional LLM call beyond a small embedding model:
1. Goal-adherence delta. Embed the session-start goal text once. At each checkpoint interval, embed the concatenation of the last K tool-call descriptions plus their inputs. Score the cosine similarity. Trigger inspection if similarity falls below 0.65 from a moving baseline. Append the score plus timestamp to a drift.jsonl file in the session workspace.
2. Environment-action coherence. Of the file-system mutations performed in window W, what fraction touch files within K commits of the goal’s stated scope? Compute “stated scope” once at session start as the set of paths the planner identified. Trigger inspection when coherence falls below 0.5. This signal catches the agent wandering into adjacent modules without justification, which is the failure shape behind the SWE-bench Pro “Wrong Solution” mode at 35.9% of Opus 4.1 failures (arXiv:2509.16941, 2026).
3. Re-entry repetition rate. Of the tool calls in window W, what fraction are exact-match (tool name plus normalized arguments) duplicates of calls in window W-1? Trigger inspection above 0.3. This catches the “endless file reading” failure mode where the agent senses lost state and re-reads the same files repeatedly. SWE-bench Pro logs this at 17% of Claude Sonnet 4 failures.
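# Helpers assumed to exist elsewhere: embed(text) -> vector, cosine(a, b) -> float,
# within(path, scope, k_commits) -> bool, and hash_call(call) -> a hashable key
# built from the tool name plus normalized arguments.

def goal_adherence(goal_emb, recent_calls, k=20):
    # Similarity between the session goal and the last k tool calls.
    recent_text = "\n".join(call.describe() for call in recent_calls[-k:])
    return cosine(goal_emb, embed(recent_text))

def env_coherence(touched_paths, planner_scope, k_commits=5):
    # Fraction of mutated paths that fall inside the planner's stated scope.
    if not touched_paths:
        return 1.0  # no mutations in the window, so nothing is out of scope
    in_scope = sum(1 for p in touched_paths if within(p, planner_scope, k_commits))
    return in_scope / len(touched_paths)

def reentry_rate(calls_w, calls_prev_w):
    # Fraction of calls in this window that exactly repeat a call from the previous one.
    prev = {hash_call(c) for c in calls_prev_w}
    repeats = sum(1 for c in calls_w if hash_call(c) in prev)
    return repeats / max(len(calls_w), 1)

def should_checkpoint(signals: dict, thresholds: dict) -> bool:
    # Goal and coherence breach below their bound; reentry breaches above it.
    breaches = sum(
        1
        for k, v in signals.items()
        if (k != "reentry" and v < thresholds[k])
        or (k == "reentry" and v > thresholds[k])
    )
    return breaches >= 2  # two-of-three triggers checkpoint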
The working defaults for coding agents are K = 5 commits and W = 20 tool calls. Tune from there. Signal interpretation is layered: one breach is a flag worth logging, two-of-three is a structured checkpoint, three-of-three is a fork candidate. The layering is not strict, though: a single high-confidence signal (the agent re-read the same auth helper 12 times in 20 calls) can still trigger inspection on its own.
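The layered reading above could be sketched as a small classifier. The threshold values are the ones given for the three signals; the `Action` names and the `REENTRY_OVERRIDE` bound of 0.6 (12 repeats in 20 calls) are assumptions made for illustration, and escalating the override to a checkpoint is only one interpretation of "trigger inspection".

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    LOG_FLAG = "log_flag"
    CHECKPOINT = "checkpoint"
    FORK_CANDIDATE = "fork_candidate"


# Bounds from the signal definitions; reentry breaches when ABOVE its bound.
THRESHOLDS = {"goal": 0.65, "coherence": 0.5, "reentry": 0.3}
# Hypothetical single-signal override, chosen to match the 12-in-20 example.
REENTRY_OVERRIDE = 0.6


def breached(name: str, value: float) -> bool:
    bound = THRESHOLDS[name]
    return value > bound if name == "reentry" else value < bound


def interpret(signals: dict) -> Action:
    """Map the three signal values onto the layered response."""
    count = sum(breached(name, value) for name, value in signals.items())
    if count >= 3:
        return Action.FORK_CANDIDATE
    if count == 2:
        return Action.CHECKPOINT
    if count == 1:
        # A single very strong signal is allowed to escalate on its own.
        if signals.get("reentry", 0.0) >= REENTRY_OVERRIDE:
            return Action.CHECKPOINT
        return Action.LOG_FLAG
    return Action.NONE
```

Keeping the mapping in one function means the thresholds and the escalation policy can be tuned separately from the code that computes the signals.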
Why eval rather than classifier? Replit’s Decision-Time Guidance uses a classifier and reports a 90% cost reduction versus dynamic prompts (Replit Engineering, 2026), but the classifier output is fed back as prompt input. Treating the same signals as eval outputs makes them first-class operational primitives: log them, alert on them, gate recovery on them. The output goes to disk, not to the next message. The same signal that gates a checkpoint can fire an alert, populate a dashboard, or seed an offline backtest. Replit-style guidance and drift-as-eval are complementary, not alternatives; the eval is the durable artifact, the guidance is the optional inline use of it.
Where the spec runs is a deployment question, not an architecture question. Inline as a PostToolUse hook for immediate intervention, or asynchronously on a sidecar (LangGraph PostgresSaver checkpoint hook) for continuous monitoring without latency cost. The choice depends on the cost of a false-positive intervention; the spec itself does not change.
Checkpoint on threshold, not on interval
Every durable-execution infrastructure piece checkpoints by step or activity boundary. Temporal records Event History on every Activity. LangGraph snapshots at every node. Inngest writes after every step.run. These checkpoints are necessary for crash recovery and insufficient for drift recovery. The recovery primitive long-horizon agents need is checkpoint-on-threshold: when the drift score crosses a bound, write a recovery-grade checkpoint with annotation, regardless of where in the interval the agent is.
Anthropic’s automatic context compaction beta is the right shape, generalized. The compaction trigger is a token threshold (default 150,000 input tokens; minimum 50,000), with pause_after_compaction: true for manual intervention via stop_reason: "compaction", behind the anthropic-beta: compact-2026-01-12 header (Anthropic compaction docs, 2026). A cookbook example reduced a 5-ticket workflow from 208,838 to 86,446 tokens (58.6% reduction) by triggering 2 compaction events. Drift checkpointing follows the same operational shape: signal-threshold trigger, recoverable pause, configurable continuation. The difference is the signal source. Compaction watches tokens; drift checkpointing watches behavioral coherence.
A recovery-grade checkpoint is not the same as an infrastructure replay record. The infrastructure record asks “if the process dies right now, where do we restart?” The recovery-grade record asks “if the trajectory is wrong, what does a recovery agent need to grade and steer?” Six things go in:
- Session ID and the Anthropic SDK resume=session_id handle (the durable resume primitive).
- Current drift signal values plus the rolling baseline.
- The goal embedding and the goal text as written by the planner.
- The last 3 to 5 high-signal tool calls and their results.
- Environment hashes (lockfile, last-committed test-suite pass rate, branch SHA).
- An explicit “what would good look like” rubric the recovery agent can grade against.
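One way to pin the list above down is a typed record that serializes to JSON. This is a sketch under assumptions: the class name `RecoveryCheckpoint` and every field name are hypothetical, chosen to mirror the list items, and nothing here is an SDK type.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RecoveryCheckpoint:
    session_id: str                   # value later passed as resume=session_id
    drift_signals: dict[str, float]   # current signal values
    drift_baseline: dict[str, float]  # rolling baseline for the same signals
    goal_text: str                    # as written by the planner
    goal_embedding: list[float]
    recent_tool_calls: list[dict]     # the last 3 to 5 high-signal calls and results
    environment: dict[str, str]       # lockfile hash, test pass rate, branch SHA
    rubric: str                       # "what would good look like"

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "RecoveryCheckpoint":
        return cls(**json.loads(raw))
```

A round trip through `to_json` and `from_json` gives the recovery agent the same record the hook wrote, which is the property the storage backend has to preserve.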
Trigger taxonomy belongs at the runtime layer, not the inference layer. Per-signal thresholds, aggregate weighted scores, or external triggers (test-suite regression, build break, human escalation) all compose. Where the checkpoint lives is operational, not architectural: Postgres via LangGraph PostgresSaver, or Anthropic-hosted via the Managed Agents event log. The interface contract matters more than the storage backend. Read the rubric; grade against it; emit a verdict; decide.
The cost case sharpens the design. Anthropic Managed Agents charges $0.08 per session-hour plus token costs (Anthropic pricing, 2026). At Anthropic’s own published session costs ($200 over 6 hours), restarting a drifted session is a $100+ decision, dwarfing the few cents of overhead the threshold checkpoint adds. The frame from the team-cost observability post carries directly: measurement gives you the ability to make the recovery decision; threshold checkpoints give you the artifact to recover from.
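The arithmetic behind that comparison, as a rough sketch. It assumes spend accrues linearly over the session, which the source does not claim, and the function names are illustrative.

```python
def sunk_cost_usd(session_cost_usd: float, session_hours: float, drift_hour: float) -> float:
    """Spend already committed when drift is caught, assuming linear accrual."""
    return session_cost_usd * drift_hour / session_hours


def session_hour_fee_usd(hours: float, rate_per_hour: float = 0.08) -> float:
    """Managed Agents runtime fee for a session of the given length."""
    return hours * rate_per_hour


# A $200, 6-hour session that drifts at the halfway mark.
print(sunk_cost_usd(200, 6, 3))            # 100.0
print(round(session_hour_fee_usd(6), 2))   # 0.48
```

Under that assumption, a restart at hour 3 re-buys $100 of work, while the session-hour fee for the whole run is under half a dollar.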
When should you fork an agent session instead of restart?
When the drift threshold breaches there are three recovery paths, not two. Restart discards the trajectory and pays for it again. Resume continues from the last checkpoint with the same agent, the same tool inventory, and the same compromised attention pattern. Fork preserves the pre-drift trajectory as one branch while spawning a corrected branch from the checkpoint, leaving the original intact for diff and audit. Fork is not “rewind.” Fork is “branch.”
The mechanic already exists. Claude Code’s session-resume API exposes the building blocks; the Anthropic Agent SDK exposes resume=session_id on ClaudeAgentOptions and a forked session can be started from any captured session ID. The pieces are in the SDK. The pattern does not yet have a name in published practitioner writing, and so engineers default to restart even when fork would preserve more value.
The fork procedure: at the threshold breach, snapshot the pre-drift state, fork from the checkpoint with corrected goal injection (“you were drifting toward X; the original goal was Y; here is the diff of what you changed; consider this rubric”), and run the fork in parallel with the original branch suspended. The original branch is preserved on disk as one row in the audit log. If the fork performs better, retire the original; if the fork performs worse, the original is still there to resume.
When does fork beat resume? When the trajectory contains valuable intermediate work but the recent direction is wrong (the 80% milestone case where the agent has built most of the right thing and is now drifting toward an adjacent wrong thing). When does fork beat restart? When the original goal is correct but the agent’s reading of it has drifted, which describes most multi-hour failures. When is restart still right? When the goal itself was mis-specified at session start; nothing in the trajectory is salvageable, and the next session needs a clean slate. When is resume still right? When the drift signal was a false positive and the trajectory passes external audit (test suite still green, environment-action coherence recovers without intervention).
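The four questions above reduce to a short decision function. This is a sketch: the three boolean inputs are hypothetical names for judgments a recovery agent or a human would have to supply, and the last branch (drift confirmed, goal correct, nothing worth keeping) is not covered by the text, so falling back to restart there is an assumption.

```python
from enum import Enum


class Recovery(Enum):
    RESTART = "restart"
    RESUME = "resume"
    FORK = "fork"


def choose_recovery(
    goal_well_specified: bool,
    drift_confirmed: bool,
    has_salvageable_work: bool,
) -> Recovery:
    if not goal_well_specified:
        # The goal was wrong from the start; the next session needs a clean slate.
        return Recovery.RESTART
    if not drift_confirmed:
        # False positive: the trajectory passes external audit, so keep going.
        return Recovery.RESUME
    if has_salvageable_work:
        # Right goal, drifted reading of it, valuable intermediate work.
        return Recovery.FORK
    # Assumption: with nothing to preserve, a fork buys no more than a restart.
    return Recovery.RESTART
```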
The handoff artifact schema is the explicit answer to “what exactly goes in the progress file?”, the question both Anthropic harness posts and the OpenAI Codex post leave open:
.session/
    goal.md            # goal text + rubric for "done"; written by planner; never edited mid-session
    progress.json      # tested-and-passing milestones; commit SHAs; test-suite hashes
    drift.jsonl        # per-checkpoint signal log; inherited by forks for trajectory comparison
    environment.lock   # dependency lockfile hash; last-known-green test pass rate; branch SHA
    open-questions.md  # decisions deferred to the next session; "the agent should ask" markers
Anthropic’s two harness posts ship feature-list.json and progress notes; OpenAI Codex uses spec plus plan plus status markdown. Neither specifies required fields, ordering rules, or validation gates. The schema above is one concrete answer. The contribution is making the schema discussable: a fork-and-resume mechanic that can be hardened, validated, and version-controlled. The spawn-vs-stay frame maps directly here: fork-on-drift is the spawn-vs-stay decision applied to the agent’s own future.
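A validation gate for that layout might start as small as the sketch below. It checks only that the five files exist and that the JSON-bearing ones parse; the function name and the decision to treat an empty file as "nothing recorded yet" are assumptions, and required fields inside each file are left out because the schema above does not define them.

```python
import json
from pathlib import Path

REQUIRED_FILES = (
    "goal.md",
    "progress.json",
    "drift.jsonl",
    "environment.lock",
    "open-questions.md",
)


def _json_problem(text: str) -> bool:
    """True when non-empty text fails to parse as JSON."""
    if not text.strip():
        return False  # an empty file is treated as 'nothing recorded yet'
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return True
    return False


def validate_handoff(session_dir: Path) -> list[str]:
    """Return a list of problems; an empty list means the gate passes."""
    problems = [
        f"missing {name}"
        for name in REQUIRED_FILES
        if not (session_dir / name).is_file()
    ]
    progress = session_dir / "progress.json"
    if progress.is_file() and _json_problem(progress.read_text()):
        problems.append("progress.json is not valid JSON")
    drift = session_dir / "drift.jsonl"
    if drift.is_file():
        for number, line in enumerate(drift.read_text().splitlines(), start=1):
            if _json_problem(line):
                problems.append(f"drift.jsonl line {number} is not valid JSON")
    return problems
```

Running the gate before a resume or fork, and refusing to continue on a non-empty list, is the cheapest form of hardening the artifact.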
A worked example, end to end
Walk one concrete scenario. A Claude Code session on Opus 4.6 starts at 09:00 with the goal “rewrite the auth middleware to handle SSO via SAML, keep existing API contracts.” The planner writes goal.md (“rewrite src/auth/middleware.ts and adjacent helpers to negotiate SAML assertions; keep public types in src/auth/types.ts byte-identical; rubric: existing API contract tests green, new SAML round-trip test green”) and progress.json (empty, ready to append). At 09:00, goal-adherence baselines at 1.0 and environment-action coherence baselines on the src/auth/* and tests/auth/* paths.
By hour 4, the agent is touching files in src/users/ because the user model imports the auth helper. Environment-action coherence drops to 0.71. Acceptable; the post-tool hook writes a drift.jsonl entry but does not pause. Goal-adherence sits at 0.91. By hour 7, the agent has refactored the user model, started writing migration scripts, and re-edited the auth helper to take a User object instead of a SAML assertion. Coherence drops to 0.42. Goal-adherence falls from 0.91 to 0.58. Re-entry repetition rises to 0.30 as the agent re-reads the same auth helper 6 times in 20 calls, which sits at the threshold without exceeding it. Coherence and goal-adherence are both past their bounds: a two-of-three breach.
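Replaying those numbers through the thresholds shows which signals actually fire. A small sketch, with two assumptions flagged in the comments: the hour-4 re-entry value is not stated in the walkthrough, and the 6 re-reads are counted as repeats of calls from the previous window.

```python
# Bounds from the signal definitions; reentry breaches only when ABOVE its bound.
THRESHOLDS = {"goal": 0.65, "coherence": 0.5, "reentry": 0.3}


def breaches(signals: dict) -> list[str]:
    """Names of the signals that are past their bound."""
    hit = []
    for name, value in signals.items():
        bound = THRESHOLDS[name]
        if (value > bound) if name == "reentry" else (value < bound):
            hit.append(name)
    return hit


# Assumption: re-entry at hour 4 is not given in the walkthrough; 0.0 is a placeholder.
hour_4 = {"goal": 0.91, "coherence": 0.71, "reentry": 0.0}
# Assumption: all 6 re-reads repeat calls from the previous 20-call window.
hour_7 = {"goal": 0.58, "coherence": 0.42, "reentry": 6 / 20}

for label, signals in (("hour 4", hour_4), ("hour 7", hour_7)):
    hit = breaches(signals)
    print(label, hit, "checkpoint" if len(hit) >= 2 else "log only")
# hour 4 [] log only
# hour 7 ['goal', 'coherence'] checkpoint
```

Re-entry lands exactly on 0.3 and does not count as a breach, so the checkpoint is carried by goal-adherence and coherence alone.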
The threshold checkpoint fires. The PostToolUse hook writes a drift.jsonl entry with all three signals plus a snapshot of the session via resume=session_id and the file diff since session start. The recovery agent, a separate evaluator subagent with read-only access, grades the trajectory against goal.md. The grade returns: NEEDS_WORK. Reason: “scope expanded from auth/* to include users/* refactor. Original goal does not require user model changes; rubric does not specify a User parameter shape.” The grade is a structured artifact, not a prose summary; the next agent reads it as JSON.
The fork executes. The post-hook script spawns a forked session from the hour-4 checkpoint where environment-action coherence was still 0.71. Goal injection: “you were drifting toward refactoring the user model. The original goal is SAML negotiation in src/auth/*, keeping API contracts. The user model is out of scope; if the user model needs to change, append to open-questions.md instead of editing.” The original branch is suspended but preserved on disk as session-abc123.original.
By hour 9 of fork time (hour 13 of total elapsed), SAML handling is implemented and tested. The fork replayed the most useful tool calls from the original (test runs, the SAML library investigation, the type-shape exploration) but did not touch users/*. The original branch’s 3 hours of users/* work are not discarded; they are filed as open-questions.md for the next session: “should the user model change to support SSO? Why or why not? Trace at session-abc123.original.” Two days later, when product confirms the user model is staying as-is and SSO will return assertions to the existing token shape, the open question closes without rework.
What this example demonstrates: drift is a measurable signal, not vibes. Threshold checkpoints are recoverable pause points, not transparent infrastructure. Fork preserves trajectory value and audit trail simultaneously. The recovery primitive is not “more autonomy” or “less autonomy”; it is structured intervention at the right point. The same drift detector is the runtime analogue of the offline backtest; the same handoff artifact is the substrate the memory architecture writes against between sessions.
What breaks this pattern
The pattern is not free. Three categories of cost and risk earn naming.
The detector itself drifts. Embeddings of the goal text change with model versions; the goal-adherence cosine baseline is not stable across embedding-model upgrades. Mitigation: version the embedding model alongside the goal, log embedding-model hashes in drift.jsonl, and re-baseline on model upgrades. This is real overhead. The cheapest reduction is to run drift detection on a quantized open-weights embedding model locally rather than against the latest hosted model; coupling the detector to the same vendor that ships the agent reduces drift signal stability for no operational reason.
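The versioning mitigation can be sketched as a fingerprint stored with every drift.jsonl entry and compared on startup. The `embedding_model` field name and both function names are hypothetical, and hashing the model name plus revision string is a stand-in for whatever identifier the embedding runtime really exposes.

```python
import hashlib
import json


def embedding_fingerprint(model_name: str, model_revision: str) -> str:
    """Short, stable identifier for the embedding model in use."""
    digest = hashlib.sha256(f"{model_name}@{model_revision}".encode("utf-8"))
    return digest.hexdigest()[:16]


def needs_rebaseline(drift_log_lines: list[str], current_fingerprint: str) -> bool:
    """True when the latest logged entry was scored with a different model."""
    for line in reversed(drift_log_lines):
        if line.strip():
            logged = json.loads(line).get("embedding_model")
            return logged != current_fingerprint
    return False  # empty log: there is no baseline to invalidate yet
```

When `needs_rebaseline` returns True, scores on either side of the upgrade are not comparable, so the moving baseline should be reset before any threshold is evaluated.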
False positives waste fork budgets. A correctly-scoped exploration looks like environment-action drift at the file-path level. Two-of-three signal breaches reduce false positives but do not eliminate them. Mitigation: keep the fork shallow at first (3 to 5 tool calls of corrected execution before deciding whether to retire the original branch); use an external eval (test-suite pass rate, build state) as the tiebreaker. The cost of an unnecessary fork is mostly token-budget, not lost work, because the original branch is preserved.
The handoff artifact rots. progress.json lies if the agent declared a milestone passing without running the tests; environment.lock lies if the agent committed a dependency change without updating the lockfile. Mitigation: validate the artifact at every session resume. The Anthropic cwc-long-running-agents evaluator subagent pattern, with restricted tools and a PASS or NEEDS_WORK output contract, is exactly this validation layer (anthropic/cwc-long-running-agents, 2026).
Two anti-trends are worth pushing back on directly. “More autonomy is always better” is the loudest; a 2025 Gartner survey found only 15% of IT leaders are considering, piloting, or deploying fully autonomous agents, and 71% of users prefer human-in-the-loop for high-stakes decisions (Strata.io, 2025). The post is not arguing for longer runs everywhere; it is arguing that long runs, where they are valuable, need a third reliability half. “Better models will solve drift” is the second; Opus 4.7 shows real coherence gains and the Khanal et al. aggregate still measures a 24.3-point pass@1 collapse on long tasks. Capability moves; the discipline needs to move with it. Drift is not eliminated by better models. It is detected sooner. The team-member framing carries: trust is calibrated through measurement and intervention, not through omitting either.
FAQ
What is the difference between context compaction and checkpointing?
Compaction is a token-management primitive that summarizes earlier turns to fit the model’s context window; Anthropic’s compaction beta triggers at a configurable token threshold (default 150,000 input tokens) and emits a single delta event (Anthropic compaction docs, 2026). Checkpointing is a recovery primitive that captures session state for resumption or forking; LangGraph PostgresSaver and the Anthropic Agent SDK resume=session_id are the canonical examples. They are complementary, not interchangeable. Compaction without checkpointing means the agent compresses away the very state a recovery would need. Checkpointing without compaction means the agent runs out of context before recovery is needed.
How do I implement a drift detector in Claude Code?
Three options. The fastest is a PostToolUse hook that logs to .session/drift.jsonl and pauses the session above a threshold; the hook is the deterministic substrate, so the signal logging happens regardless of model state. The most rigorous is an evaluator subagent invoked at session-resume points, modeled on the Default-FAIL pattern in anthropic/cwc-long-running-agents. The most production-grade is a sidecar process that reads the session JSONL stream and emits drift signals to your observability layer. Start with option 1 for speed, layer option 2 for rigor, move to option 3 when you have more than one engineer per agent.
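Option 1 might look roughly like the sketch below, covering only the re-entry signal. Several things here are assumptions to verify against the Claude Code hooks documentation: that the hook receives a JSON event on stdin carrying `tool_name` and `tool_input`, and that printing a JSON object with `continue` and `stopReason` halts the session. The per-call record format is a simplification, since the post describes drift.jsonl as a per-checkpoint log.

```python
import hashlib
import json
import sys
import time
from pathlib import Path

WINDOW = 20              # W: tool calls per window
REENTRY_THRESHOLD = 0.3
LOG_PATH = Path(".session/drift.jsonl")


def call_hash(tool_name: str, tool_input: dict) -> str:
    """Exact-match key: tool name plus arguments with sorted keys."""
    payload = json.dumps([tool_name, tool_input], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def record_call(event: dict, log_path: Path = LOG_PATH) -> dict:
    """Append one record for this tool call and return it."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    history = []
    if log_path.exists():
        for line in log_path.read_text().splitlines():
            if line.strip():
                logged = json.loads(line).get("call")
                if logged:
                    history.append(logged)
    history.append(call_hash(event.get("tool_name", ""), event.get("tool_input", {})))
    window = history[-WINDOW:]
    previous = set(history[-2 * WINDOW:-WINDOW])
    reentry = sum(1 for h in window if h in previous) / len(window)
    record = {"ts": time.time(), "call": history[-1], "reentry": reentry}
    with log_path.open("a") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


def main() -> None:
    """Entry point for the hook command; reads the event from stdin."""
    record = record_call(json.load(sys.stdin))
    if record["reentry"] > REENTRY_THRESHOLD:
        # Assumed hook output shape; check the hooks reference before relying on it.
        print(json.dumps({"continue": False, "stopReason": "re-entry repetition above threshold"}))
```

To use it as a hook script, call `main()` at the bottom of the file and register the script under PostToolUse. Because the log is rebuilt from disk on every invocation, the hook keeps no state of its own between tool calls.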
When should I fork instead of restart?
Fork when the original goal is correct but the trajectory has drifted, and the intermediate work has audit or partial-progress value (the 80% milestone case). Restart when the goal itself was mis-specified at session start and nothing in the trajectory is salvageable; the next session needs a clean slate. Resume when the drift signal was a false positive and the trajectory passes external audit. The default reach in published practice is restart; fork is the missing third path. The cost of fork is one extra session-hour of compute against an at-risk trajectory; the cost of restart is the entire envelope ($50 to $200, per Anthropic’s own published session costs) plus the lost audit trail.
How do I measure drift without a separate model?
All three signals are computable without a per-call LLM. Cosine similarity on embeddings needs one cheap embedding model call per checkpoint, not per tool call (small local models like the BGE family are sufficient). File-path set membership is pure Python over the planner’s stated scope; the planner produces that set once at session start. Exact-match tool-call deduplication is hash comparison over a rolling window, with no model involvement at all. The only expensive step is the recovery agent’s grade after a threshold breach, and that runs once per breach, not continuously.
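The two model-free pieces fit in a few lines of standard-library Python. A sketch under assumptions: planner scope is represented as glob patterns such as `src/auth/*`, which is one possible encoding and not something the post specifies, and the function names are illustrative. The K-commit proximity check is omitted here.

```python
import math
from fnmatch import fnmatchcase


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def in_scope(path: str, scope_patterns: list[str]) -> bool:
    # fnmatchcase lets '*' match across '/' so 'src/auth/*' covers nested files.
    return any(fnmatchcase(path, pattern) for pattern in scope_patterns)


def scope_coherence(touched_paths: list[str], scope_patterns: list[str]) -> float:
    """Fraction of touched paths that fall inside the planner's scope."""
    if not touched_paths:
        return 1.0  # nothing touched, so nothing is out of scope
    hits = sum(in_scope(p, scope_patterns) for p in touched_paths)
    return hits / len(touched_paths)
```

For example, `scope_coherence(["src/auth/middleware.ts", "src/users/model.ts"], ["src/auth/*", "tests/auth/*"])` gives 0.5, which sits right at the inspection bound without crossing it.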
Conclusion
The failure mode shifted in 2025. Agent failure now means a drifted trajectory, not a wrong step (Khanal et al., arXiv:2603.29231, 2026; SWE-bench Pro, arXiv:2509.16941, 2026). Drift is the third half of reliability between model and infrastructure, and Temporal itself admits the field is “still only solving half” of it (Temporal, 2026). Three named patterns close the gap. Drift is measurable as a structured eval over running state; three signals (goal-adherence delta, environment-action coherence, re-entry repetition rate) get a working detector in under 100 lines. Checkpoint on threshold, not on interval; the threshold checkpoint is the recoverable pause point the durable-execution layer does not provide. Fork-not-restart is the missing recovery primitive; preserve the pre-drift trajectory as audit, spawn a corrected branch with a typed handoff artifact (goal.md, progress.json, drift.jsonl, environment.lock, open-questions.md). Verification asks “is the answer correct?” Context engineering asks “is the substrate correct?” Autonomy asks “is the trajectory correct?” Same engineering discipline, three layers.
Instrument one signal this week. Goal-adherence delta is the cheapest. Add it to your existing PostToolUse hook stack, log to drift.jsonl, and alert on threshold breach. The human side of autonomy is the inverse of the machine side this post covers; expect the next piece to take the keyboard back rather than hand it over.