Engineering notes from the agent era
Most posts here are about what changes when the agent becomes a first-class collaborator: local code intelligence, testing systems that analyze behavior over time, knowledge tooling that compounds. The rest of the career (cloud, low-latency, mobile) shows up when something's worth writing down.
The Promotion Ladder: Prompt, Skill, Hook, Tool
Four rungs trade flexibility for determinism. AGENTIF puts the best model at under 30% adherence; a well-designed schema cuts MCP cost by 99.9%. Match the rung.
The Handoff Problem: When to Take the Keyboard Back
Developers use AI in 60% of work but fully delegate only 0 to 20% of tasks. Four named cues for the moment you reclaim the keyboard: drift, scope, novel error, 80%.
- autonomy
- agents
- claude-code
- AI pair programming
- agentic coding
Long-Running Autonomous Agents: Drift, Checkpointing, Recovery
METR pegs Opus 4.6 at a 14.5-hour 50% time horizon. Pass@1 collapses 24 points on long tasks. Drift as eval, checkpoint on threshold, fork not restart.
- AI agents
- Claude Code
- agent reliability
- long-horizon agents
- agentic coding
Cache-Aware Prompting: Engineering for 90%+ Hit Rate
ProjectDiscovery moved from 7% to 84% cache hit rate without changing the model. The discipline, the workload taxonomy, the five named failure modes.
- AI engineering
- prompt engineering
- developer productivity
- Anthropic
- cost optimization
Agent Cost Observability: From Personal Token Budget to Team-Wide
98% of FinOps teams now manage AI spend, up from 31% in 2024. Solo ccusage solved it for one developer; here is the three-axis, two-cap recipe for 200.
- FinOps
- AI engineering
- developer productivity
- observability
- platform engineering
Tool Design for Agents: Schema Is the Prompt
97% of MCP tool descriptions have at least one code smell. 56% fail to state purpose. The description is the prompt your agent reads to pick a tool. Here is the rubric.
- AI agents
- MCP
- developer tools
- prompt engineering
- platform engineering
The Project Graph: What Agents Need That Filesystems Can't Give
40 questions, two large repos, three LLM judges. Code-intelligence: judge 7.12 vs default 6.30 (+0.82), 29% faster, +8% tokens. Cites 50% vs CodeGraph's 32%.
- AI agents
- developer tools
- MCP
- code intelligence
- platform engineering
The Five Failure Modes of Autonomous Coding Agents
Five named failure modes for autonomous coding agents, each with a real 2026 incident, a detection signal, and a retro template you can drop into CLAUDE.md today.
- AI agents
- incident response
- Claude Code
- evals
- hooks
From Localhost to Production: The Handoff Brief for AI-Built Apps
45% of AI-generated code ships OWASP vulns. 380K vibe-coded apps are public right now. The seven-gap handoff brief for builders and engineers.
- AI engineering
- vibe coding
- production
- security
- developer experience
UI Libraries vs AI-Generated Components: The Tailwind Substrate
Tailwind 51%, v0 4M users, shadcn passed Chakra. The library-vs-AI debate is the wrong frame: substrate placement is the right one. The four-quadrant framework.
- frontend
- AI engineering
- Tailwind
- shadcn
- developer productivity
MCP Server for Your Codebase: Tool-Shape, Not API-Mirror
Cloudflare's first MCP server would have eaten 1.17M input tokens. Their redesign got it to roughly 1,000. Here is the framework, applied to a codebase server.
- MCP
- model context protocol
- AI engineering
- developer tools
- platform engineering
Claude Code Hooks: The Only Deterministic Substrate
The best frontier model follows under 30% of agentic instructions perfectly. Hooks run as code on every matched event regardless. Here is the substrate map.
- Claude Code
- AI agents
- hooks
- policy enforcement
- agentic coding
Agent Evals: A Test Suite for Your Claude Code Setup
Observability says what happened. Evals say if the right thing happened. 89% ship the first, 52% the second. Four control-plane evals for Claude Code.
- Claude Code
- AI agents
- evals
- agentic coding
- developer productivity
The Permission Prompt Is Dying in AI Coding Agents
Claude Code users approve 93% of prompts. For AI coding agents, prompt walls failed as governance; safety is policy: allow, gate, block, log.
- Claude Code
- AI agents
- permissions
- policy enforcement
- agentic coding
Stdio MCP Doesn't Scale: Dropping 3,662 Lines for a Daemon
Five subagents across three repos loaded 2.6 GB of duplicated embedding models. v4 deleted the stdio path; the daemon shares everything. Here is the migration.
- MCP
- model context protocol
- developer tools
- AI engineering
- platform engineering
Claude Mythos vs. the CVE Surge: AI Security in May 2026
On May 11 curl's Daniel Stenberg called Anthropic's Mythos report mostly marketing. The same six months delivered the CurXecute RCE, the Claude Code chain, and a 35-CVE March.
- AI security
- CVE
- Claude
- Copilot
- AppSec
- developer productivity
Agent Memory Architecture: Four Memories, Four Fixes
200K-context models rot by 50K tokens. Coding agents hit 150K in 35 minutes. Map four memories onto Claude Code: MEMORY.md, .remember, JSONL, skills.
- Claude Code
- AI agents
- memory architecture
- context engineering
- agentic coding
Anthropic Just Metered the Agent SDK: What Breaks on June 15
On May 13 Anthropic split Claude subscriptions into interactive and programmatic pools. Power users call it a 25x cost cut. Here is the strategic read.
- AI engineering
- Claude
- Codex
- agent SDK
- developer productivity
DORA in the Agent Era: Three Metrics Stop Measuring
DORA's four metrics measured human-paced delivery. With agents writing 46% of code and review time up 441% YoY, three no longer measure what they claim.
- DORA metrics
- AI engineering
- developer productivity
- DevEx
- engineering management
Agentic TDD: When the Failing Test Is the Spec
Spec-driven was last week's new feature. Today's spec: 17 lines of failing test. Artefact-driven TDD for follow-up agent work, against the 1.7x AI-issue rate.
- Claude Code
- AI agents
- TDD
- agentic coding
- developer productivity
Spec-Driven Agent Development: Brainstorm, Design, Plan
PR 23 told its reviewers to use MCP. They didn't. Per-agent tool calls jumped from 0.6 to 2.6 after a design doc surfaced the wiring bug prompts hid for weeks.
- Claude Code
- AI agents
- spec-driven development
- agentic coding
- developer productivity
Agent Skills: Progressive Disclosure That Actually Scales
Naive skill loading costs roughly 22x more tokens than progressive disclosure, and the attention math gets worse with every model upgrade. The pattern, the catalog math, and the authoring mistakes that break it.
- agent skills
- Claude Code
- context engineering
- progressive disclosure
- AI agents
Subagent-Driven Development: How to Fan Out a Feature Build
Subagents fan out feature builds at ~15x token cost. Wave dispatch, a frozen plan, and the five failure modes specific to subagent-driven development.
- Claude Code
- AI agents
- subagents
- developer productivity
- agentic coding
Engineering That Outlasts the Paradigm
Trust in AI accuracy hit 29% the same year vibe coding became Word of the Year. Both numbers describe the same mistake. Engineering outlasts the paradigm.
- AI agents
- software engineering
- agentic coding
- career
- thought leadership
Subagent Patterns: When to Spawn vs Stay In-Context
Multi-agent burns 15x more tokens than chat. Five-question decision tree, 2026 token math, and three reproducible failure modes for Claude Code subagents.
- Claude Code
- AI agents
- subagents
- developer productivity
- agentic coding
We Tried to Cut Claude's Output 50%. We Got 5%. So Did Anthropic.
We aimed for 50% Claude output compression. We hit 4.7%. Anthropic hit the same wall and reverted at 3%. Here is the data and the failure mode.
- claude-code
- llm-output-compression
- prompt-engineering
- claude-skills
- anthropic
Your Codebase Is the Agent's Operating Environment
Frontier agents hit 90% on SWE-Bench Verified and 21% on SWE-EVO. The variable is the shape of the codebase, not the size of the model.
- AI agents
- monorepo
- code intelligence
- code graph
- developer tooling
AI Reviews the Diff. Humans Review the Decision.
AI code-review adoption tripled to 51.4% in 2025, but 31% of PRs now merge unreviewed. Honest market scan, security posture, and a Claude Code DIY recipe.
- AI PR review
- AI code review
- Claude Code
- GitHub Actions
- developer tooling
Backtesting AI Agents: Replay to Catch Regressions
54% of enterprises ship AI agents in production. Most cannot tell when a CLAUDE.md edit silently regresses behavior. Backtesting is the missing discipline.
- agent evaluation
- backtesting
- Claude Code
- AI agents
- regression testing
- LLM as judge
Context Engineering in Practice: Where Does Each Piece Go?
Context engineering became the #1 2026 skill shift. Anthropic's research notes context exhibits n² token relationships. Here's the per-surface decision framework.
- context engineering
- Claude Code
- MCP
- AI agents
- developer productivity
Treat AI as a Team Member, Not a Chat Window
84% of developers use AI, 46% distrust it. The right scaffolding (constitution, skills, memory, MCP, subagents) turns an assistant into a team member.
- AI agents
- developer productivity
- Claude Code
- MCP
- team workflows
How to Track Claude Code 5-Hour Window Usage
40.8% of devs use Claude Code, but the 5-hour window is opaque. Build a local dashboard that parses transcripts, estimates your token budget, and rolls up team-wide cost via Grafana Loki.
- claude-code
- token-usage
- developer-tools
- ai-coding
- cost-tracking
Your AI Agent Is Flying Blind Without Local Code Intelligence
84% of developers use AI tools but 46% distrust the output. Three on-device models, 32 MCP tools, 9.93/10 relevance, and zero source code leaving your machine.
- local code intelligence
- AI agents
- MCP
- code search
- developer tools
Building an LLM Wiki: From Karpathy's Gist to a Working CLI
I turned Andrej Karpathy's LLM wiki concept into a Bun CLI (~500 lines of TypeScript) that automatically builds a persistent knowledge base from Claude Code sessions, files, and URLs.
- llm
- cli
- knowledge-management
- claude-code
- bun
How Do You Test Systems That Analyze Behavior Over Time?
Backtesting borrows from quant finance to catch temporal bugs unit tests miss. Poor US software quality costs $2.41T per year. Here's the technique.
- backtesting
- software-engineering
- data-pipelines
- temporal-data
- regression-testing
- synthetic-data
- developer-tooling