
DORA in the Agent Era: Three Metrics That Stopped Measuring What They Claim

23 min read
Illustration: a four-dial dashboard with three of the dials disconnected from the underlying meters they claim to read, a metaphor for DORA's four classic metrics drifting from the system they were built to measure.

DORA 2025 reports 90% of software professionals working with AI, two hours a day median, and deployment frequency rising across the board (DORA, 2025). Open the dashboard and the picture is pristine. The dashboard is the wrong instrument.

The four DORA metrics were built for a 2010s system where humans wrote every line, opened every PR, and read every diff. AI now writes 46% of the code in files Copilot touches (GitHub Octoverse, 2025) and a meaningful fraction of PRs ship with no human review at all. The metrics did not move because the work moved underneath them. This post decomposes each of the four classic metrics, names where the measurement breaks under AI, and proposes what to instrument instead. No vendor platform required.

Key Takeaways

  • DORA 2025: 90% AI adoption, +14 pts YoY, median 2 hours per day in an assistant. Four-tier cohorts retired in favour of seven archetypes; rework rate joined as the fifth metric (DORA, 2025; CD Foundation, 2025).
  • Lead time did not shrink. It shifted. Faros telemetry on the DORA 2025 cohort: median PR review time +441% YoY, PR size +51% to +154%, 31% of PRs merge with no human review (Faros, 2026).
  • Deployment frequency hides a churn tax. GitClear’s 211M-line corpus shows two-week churn rose from 3.1% to 5.7%, refactoring fell from 25% to under 10%, cloned code climbed from 8.3% to 12.3% (GitClear, 2026).
  • Change failure rate now reports test-gate sophistication as much as code quality. Production incidents per PR rose 242.7% while CFR stayed roughly flat. CodeRabbit measured AI-authored PRs at 1.7x the issue rate of human-authored ones (CodeRabbit, 2025).
  • MTTR is the metric AI improves least. Vendor 25-40% reductions reflect runbook automation, not novel-incident resolution (Rootly, 2025).
  • The five-move instrumentation kit: decompose lead time, pair rework rate with two-week in-repo churn, mutation-test AI-authored test files, separate coordination time from code-fix time, tag every metric with AI-authorship signals.

What did DORA 2025 actually find?

DORA 2025 measured a uniformly AI-augmented population. 90% of software professionals work with AI tools at work, up 14 points year over year, with median two hours a day spent in an assistant (DORA, 2025). The report retired the elite, high, medium, low cohort ladder that anchored DORA’s first decade and replaced it with seven archetypes (the Harmonious High Achiever, the Legacy Bottleneck, and so on). Rework rate joined the framework as the fifth metric, with most teams falling between 8% and 32% and elite teams reporting under 2% (CD Foundation, 2025).

The headline finding is the “amplifier” framing: AI strengthens disciplined teams and exposes fragmented ones. Every commentary post on the SERP repeats this framing, usually verbatim. Treat it as foundation, not analysis. The interesting question is not whether AI amplifies; it is which of the four classic metrics survive the amplification with their meaning intact.

The methodology change has a quiet cost. The four-tier ladder let DORA chart whether AI was widening or closing the elite-to-low gap year over year. The seven-archetype model breaks that longitudinal continuity. The 2025 report is a more nuanced cross-section and a less comparable time series. That trade-off matters when the next sections argue that the metrics are measuring something different than they did three years ago. We can no longer ask the cohort data the question it would most help us answer.

The rest of this post departs from the consensus framing. The four classic metrics tell a story about a system shipping more code with stable quality. Decompose them and a different story appears.

Lead time did not shrink. It shifted.

AI compressed the time it takes to write a pull request and handed every minute saved directly to the reviewer. Faros telemetry on the DORA 2025 cohort shows median PR review time grew 441% year over year, PR size grew between 51% and 154%, PRs per developer rose 98%, and 31% of PRs began merging with no human review at all (Faros, 2026). End-to-end lead time stayed within a familiar range across the cohort. Inside that range, the system flipped.

The numbers are the same number told four ways. AI is producing diffs faster than humans can read them. Each PR carries more lines, contains more potential defects, and waits longer for the small number of qualified eyes that can sign off on it. CodeRabbit’s December 2025 study of 470 open-source PRs measured AI-coauthored PRs at 1.7x the per-PR issue rate of human-only PRs (10.83 vs 6.45 findings, with logic defects up 75% and security issues up 174%) (CodeRabbit, 2025). Reviewers are reading larger diffs that contain more issues per line. The bottleneck did not disappear; it changed shape.

DORA’s lead-time metric measures the pipe end to end. It cannot see this internal redistribution. A team whose authorship time fell by 60% and whose review time grew by 400% looks identical, on the lead-time chart, to a team that did neither: 20 hours of authoring plus 3 of review becoming 8 plus 15 is still 23 hours end to end. The aggregate is the wrong instrument for an AI-augmented system because the cost moved between stages the metric does not separate.

This is also the reason “add more reviewers” stopped working. Senior attention is the constraint and senior attention does not scale. An earlier post argues that the volume problem is structural and makes the longer case for AI as the first line of review; for DORA’s purposes, the implication is narrower: lead time is now an aggregate that hides where the work actually lives. Track time-to-first-review, review duration, and review-to-merge as separate timelines. The aggregate has stopped telling the truth.

Deployment frequency hides a churn tax.

Deployments rose. The same code is being rewritten inside two weeks at twice the historical rate. GitClear’s longitudinal study of 211 million changed lines across Google, Microsoft, Meta, and enterprise repositories from 2020 through 2024 found two-week churn climbed from 3.1% to 5.7%, refactoring (lines moved or restructured) fell from 25% of changed lines to under 10%, and cloned code rose from 8.3% to 12.3%, with duplicated code blocks increasing roughly 4x in 2024 versus prior years (GitClear, 2026). The deployments are real. The durability of what is being deployed is not what it used to be.

Rework rate, DORA’s new fifth metric, partially closes this gap. Rework captures unplanned production deployments as a share of total: the “we shipped, then had to fix it shortly after” signal. The benchmark range is 8% to 32% for most teams, with elite teams reporting under 2% (CD Foundation, 2025). It is the metric most directly aimed at AI’s hidden cost, and it works for what it measures.

What rework rate does not measure is in-repo churn caught before production. A team that ships an AI-assisted feature, breaks a related component in pre-prod, fixes it through three more PRs over the following ten days, and never triggers an unplanned production deploy will report a clean rework rate while spending double the historical work to ship the same net change. GitClear’s two-week churn rate is precisely this signal, and it is invisible to all six DORA dimensions.

The protective discipline is the one DORA 2025 itself flags. The report calls test-driven development “more critical than ever” with AI in the loop, precisely because the agent removes the friction that used to push humans away from writing tests first. Spec-driven development is the upstream version of the same instinct: durable artefacts that the agent re-reads before each step, so the system stays aligned over multiple iterations rather than churning toward whatever the most recent prompt happened to ask for. Both practices reduce the kind of two-week rework GitClear is measuring. Neither shows up in deployment frequency.

The honest reading of DORA’s deployment-frequency rise is therefore: the system is shipping more, the system is also shipping shorter-lived code, and the metric is silent on the second half. Pair deployment frequency with two-week in-repo churn for a complete picture. One number alone is now the wrong instrument.

Is change failure rate measuring code quality, or your test gate?

Change failure rate did not spike in step with AI adoption. Production incidents per PR did. Faros telemetry shows incidents per PR up 242.7% on the DORA 2025 cohort, while CFR (the share of deployments that fail) stayed roughly flat (Faros, 2026). Two metrics that should have moved together moved apart. The most direct interpretation is that elite teams’ continuous-integration gates are absorbing AI defects before they reach production, holding CFR steady on the strength of the test suite rather than the strength of the code.

If that interpretation is right, CFR is no longer reporting what it claims to report. It is reporting test-gate sophistication on the elite-team side and code quality on every other side, mixed together into a single average. The teams that catch AI defects in CI look great on CFR. The teams that do not, do not. The aggregate flatness is a composition story, not a quality story.

The defect profile is uneven. CodeRabbit’s 1.7x overall multiple is the floor, not the ceiling. Logic and correctness defects come in at 1.75x, error handling around 2x, formatting at 2.66x, and security issues at 2.74x (CodeRabbit, 2025). Reading CFR as a single number flattens this distribution. A team whose CI tests are strong on logic but weak on input validation will catch the 1.75x logic problem and miss most of the 2.74x security problem, then report a clean CFR while shipping vulnerabilities downstream.

There is a second and harder failure mode. AI assistants are very good at writing assertion-light tests that pass coverage thresholds without exercising meaningful behaviour. A repository can be filled with such tests faster than a human reviewer can audit them. The tests pass; coverage looks healthy; CFR holds. The next production incident reveals what the tests did not check. DORA has no mutation-score dimension to detect this. CFR cannot distinguish a 95% coverage report built on assertion-rich tests from a 95% coverage report built on assertion-light tests written in an afternoon by an agent. The signal that would distinguish them sits outside the framework.

The companion discipline lives in eval-driven development. An eval discipline for agent pipelines treats output quality as a measurable property of the system, sampled and tracked the same way CFR is. Eval suites are to agent runs what CFR is to deployments: the rate at which the system produces failing output. On our own pipelines, agent-backtest treats the eval pass rate exactly this way: a single number that moves on every model bump and prompt change, sampled the way CFR is sampled. The deployments that ship those changes do not show the movement; the eval chart does. As more delivery happens at agent-run granularity, the eval-pass-rate chart becomes the CFR chart for that surface. DORA does not measure this yet. Most teams worth measuring already do.
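For teams without a harness of their own, a minimal sketch of what sampling an eval pass rate CFR-style can look like. This is not the author's agent-backtest tool; the one-JSON-file-per-run layout and field names are assumptions to adapt to whatever your pipeline already records.

```python
# Minimal sketch: eval pass rate per (model, prompt) version, sampled CFR-style.
# The results directory layout and field names are assumptions, not a harness API.
import json
from collections import defaultdict
from pathlib import Path


def eval_pass_rates(results_dir: str) -> dict[str, float]:
    """Pass rate per model@prompt_version, from one JSON record per eval run."""
    tally: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # key -> [passed, total]
    for path in Path(results_dir).glob("*.json"):
        run = json.loads(path.read_text())
        key = f'{run["model"]}@{run["prompt_version"]}'
        tally[key][0] += int(run["passed"])
        tally[key][1] += 1
    return {key: passed / total for key, (passed, total) in tally.items()}


if __name__ == "__main__":
    for version, rate in sorted(eval_pass_rates("eval-results").items()):
        print(f"{version}: {rate:.1%}")  # chart this the way you chart CFR
```

The point of the shape is the key: pass rate indexed by model and prompt version is the number that moves on every model bump and prompt change, which is exactly the movement the deployment-level CFR chart cannot show.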

Why does AI improve MTTR least?

MTTR is the metric AI improves least, and the reason is that incidents are coordination problems disguised as code problems. The time-consuming parts of a real incident are deciding who owns what, establishing blast radius across systems, choosing rollback versus patch-forward, and managing communication with downstream teams and customers. AI assists at the margin and cannot own any of these. It can draft a hotfix in two minutes; it cannot tell the platform team why the hotfix needs an urgent review or convince the SRE on call that a particular service degradation is in scope.

Vendor reductions of 25% to 40% in MTTR are real numbers measured against a narrower problem (Rootly, 2025). They reflect runbook automation: known failure modes, known remediation steps, scripted response. When the failure pattern matches a trained surface, an agent can dispatch the playbook faster than a human paged at 02:30. When the failure does not match, which is the case for the high-severity incidents that drive MTTR’s tail distribution, the agent assists the same way a junior engineer assists: helpfully, but without the context to make the load-bearing decision.

DORA 2025 frames MTTR as “under significant pressure” rather than as an AI-improvement area. That framing is itself a signal. If the report’s authors had seen MTTR move materially under AI adoption, they would have said so. Instead they describe pressure: more code shipping, more incidents to respond to, the same coordination cost per incident. The metric did not improve because the work it measures did not change.

The instrumentation move is to separate code-fix time from coordination time. Most incident-management tools record a single “time to resolution” timestamp; this aggregates the two. Decompose the timeline into detection time, coordination time, code-fix time, and verification time, and the AI lever is visible only on the third. A 25% improvement on the smallest of four components is not a 25% improvement on MTTR. Vendor claims that conflate them are claiming credit for a slice of the problem that is already the smallest one.

How do assistants, in-editor agents, and autonomous agents pressure the metrics differently?

DORA’s survey treats AI adoption as a single signal: do you use AI tools at work, yes or no. The reality in 2026 is three different operating modes that pressure each metric differently, and the differences matter for what you measure. Autocomplete (Copilot-style ghost text) compresses authorship at the line level without changing PR shape. In-editor agents (Cursor, Claude Code interactive) compress authorship at the feature level and shift cost into multi-turn iteration. Autonomous agents (background or scheduled) commit, push, and open PRs without a human in the editor at all, which is what makes deployment frequency mechanically gameable.

Three observations follow from the map. First, deployment frequency is the metric most exposed to Goodhart pressure. An autonomous agent can be prompted to ship in arbitrarily small batches: split a feature into eight commits, open eight PRs, merge them across a single afternoon, and the dashboard reports a healthy doubling of deployment frequency without a single underlying improvement. The metric measures a behaviour that is now mechanically configurable at the prompt level. Second, lead time and CFR are the metrics whose interpretation depends most on what mode produced the code; the same lead-time number means different things when authored by autocomplete versus by an autonomous agent, and DORA’s survey instrument cannot tell them apart. Third, MTTR is the metric on which all three modes converge to roughly the same answer: marginal improvement on known incidents, no help on novel ones.

The protective discipline against Goodhart is upstream of the metric. Spec-driven development and disciplined agent-fleet operation both rest on the same principle: durable artefacts (design documents, plan files, failing tests) that the agent re-reads before each step. A team that runs autonomous agents against a written specification produces a different distribution of PRs than a team that runs them against a chat thread, even though both teams report the same deployment-frequency number. DORA cannot see the difference. Your dashboards can, if you tag every PR with the mode and the artefact.

When we instrumented pylon’s PR-review agents before and after a design-phase rewiring (commit aa80b78, April 2026), per-agent tool-call rate moved from 0.6 to 2.6 across comparable PRs while deployment frequency for pylon itself was unchanged across the same window. DORA could not have told the two fleets apart on its own dashboards. A bundle-read covariate (.pylon/pr-context.json appearing in every session) and the per-agent tool-call delta were the signals that distinguished a fleet using its tools from one that was not. Both lived in the git history; neither would have shown up in a four-key DORA report.

A scope caveat is worth naming. METR’s randomized controlled trial of 16 experienced open-source developers across 246 tasks found the developers were 19% slower with AI tools than without, while predicting beforehand they would be 24% faster (METR / arXiv, 2025). The N is small and the tasks were real maintenance work, not synthetic benchmarks; both caveats matter. The finding is that AI productivity claims are routinely larger than AI productivity, especially among practitioners who use the tool most. The pressure map above is what AI does to the metrics; whether the underlying productivity story is real is a separate empirical question, and the answer is more complicated than the dashboards suggest.

What should you actually track on Monday?

Five concrete instrumentation moves, none of which require buying a vendor platform. Each uses data the reader already has: git, the GitHub API, CI logs, the incident-management tool of choice. The kit is deliberately vendor-neutral; the goal is to make DORA tell the truth again on a dashboard you already maintain.

Decompose lead time into three timelines. Track time-to-first-review (PR opened to first human review event), review duration (first review to approval), and review-to-merge (approval to merge) as separate metrics. The end-to-end aggregate stops being useful when the internal redistribution is the story; three metrics restore the resolution. GitHub’s PR API exposes all three timestamps already.
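A minimal sketch of the decomposition against the GitHub REST API. It assumes a GITHUB_TOKEN environment variable, a PR that has already merged, and treats the first APPROVED review as the approval event; the function name is illustrative.

```python
# Sketch: the three lead-time sub-timelines from the GitHub REST API.
# Assumes GITHUB_TOKEN is set and the PR has merged.
import os
from datetime import datetime

import requests

API = "https://api.github.com"
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}


def _ts(value: str) -> datetime:
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def lead_time_breakdown(owner: str, repo: str, number: int) -> dict[str, float]:
    pr = requests.get(f"{API}/repos/{owner}/{repo}/pulls/{number}", headers=HEADERS).json()
    reviews = requests.get(
        f"{API}/repos/{owner}/{repo}/pulls/{number}/reviews", headers=HEADERS
    ).json()

    opened, merged = _ts(pr["created_at"]), _ts(pr["merged_at"])
    review_times = [_ts(r["submitted_at"]) for r in reviews]
    approvals = [_ts(r["submitted_at"]) for r in reviews if r["state"] == "APPROVED"]
    first_review = min(review_times, default=merged)  # unreviewed PRs collapse to merge time
    approved = min(approvals, default=merged)

    hours = lambda a, b: (b - a).total_seconds() / 3600
    return {
        "time_to_first_review_h": hours(opened, first_review),
        "review_duration_h": hours(first_review, approved),
        "review_to_merge_h": hours(approved, merged),
    }
```

The unreviewed-PR fallback is itself a signal worth counting separately: every PR whose first-review timestamp collapses to the merge timestamp is one of the 31% merging without human review.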

Pair rework rate with two-week in-repo churn. Rework rate captures unplanned production deploys; two-week churn captures pre-production correction work that rework rate misses. Compute churn as the share of changed lines (versus the prior commit on the same file) that are themselves changed again within a 14-day window. GitClear’s methodology is the reference; a 50-line shell pipeline against git log --numstat covers most of it.
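Here is a hedged approximation in Python rather than shell. GitClear's metric is line-level; this sketch settles for a file-level proxy (a commit's changed lines count as churned when the same file changes again within 14 days), which overstates churn in hot files but fits on one screen and runs against plain git history.

```python
# File-level proxy for two-week in-repo churn. GitClear's measure is line-level,
# so treat this as a rough upper bound rather than a reimplementation.
import subprocess
from collections import defaultdict
from datetime import datetime, timedelta


def two_week_churn(repo: str, since: str = "6 months ago") -> float:
    log = subprocess.run(
        ["git", "-C", repo, "log", f"--since={since}", "--no-merges",
         "--numstat", "--pretty=format:--%H %ct"],
        capture_output=True, text=True, check=True,
    ).stdout

    # touches[path] -> list of (commit time, lines changed in that commit)
    touches: dict[str, list[tuple[datetime, int]]] = defaultdict(list)
    when = None
    for line in log.splitlines():
        if line.startswith("--"):                    # commit header from the custom format
            when = datetime.fromtimestamp(int(line.split()[1]))
        elif line.strip():                           # numstat line: added<TAB>deleted<TAB>path
            added, deleted, path = line.split("\t", 2)
            if added != "-":                         # skip binary files
                touches[path].append((when, int(added) + int(deleted)))

    churned = total = 0
    for history in touches.values():
        history.sort()
        for i, (ts, lines) in enumerate(history):
            total += lines
            if any(ts < later <= ts + timedelta(days=14) for later, _ in history[i + 1:]):
                churned += lines
    return churned / total if total else 0.0
```

A rising trend line matters more than the absolute value here; the proxy is biased upward, but the bias is roughly constant, so the direction of the chart is trustworthy.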

Mutation-test AI-authored test files. Pick a mutation-testing tool (Stryker for Node, mutmut for Python, PIT for Java) and run it weekly against test files that include AI authorship signals (Co-Authored-By, agent commit metadata, or a directory convention you adopt). The mutation score is the fraction of injected bugs that the tests detect. Coverage thresholds are gameable; mutation scores are much harder to game.
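A sketch of the selection step only: find the test files recently touched by commits carrying an AI co-authorship trailer, then wire that list into the mutation tool's own configuration. The trailer strings and the "test" path heuristic are conventions to adapt, not a standard.

```python
# Selection step: test files touched by commits with an AI co-authorship trailer.
# Feed the output into the Stryker / mutmut / PIT configuration of your choice.
import subprocess

AI_TRAILERS = ["Co-Authored-By: Claude", "Co-Authored-By: Copilot", "Co-Authored-By: Cursor"]


def _git(repo: str, *args: str) -> list[str]:
    out = subprocess.run(["git", "-C", repo, *args],
                         capture_output=True, text=True, check=True).stdout
    return [line for line in out.splitlines() if line.strip()]


def ai_authored_test_files(repo: str, since: str = "90 days ago") -> set[str]:
    grep = [arg for trailer in AI_TRAILERS for arg in ("--grep", trailer)]  # patterns are ORed
    commits = _git(repo, "log", f"--since={since}", "--format=%H", *grep)
    files: set[str] = set()
    for sha in commits:
        changed = _git(repo, "show", "--name-only", "--pretty=format:", sha)
        files.update(path for path in changed if "test" in path.lower())
    return files


if __name__ == "__main__":
    for path in sorted(ai_authored_test_files(".")):
        print(path)  # candidates for the weekly mutation run
```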

Separate incident coordination time from incident code-fix time. Most incident tools record one timestamp for resolution. Add two custom timestamps: when the responder finished diagnosing and started fixing, and when the fix was deployed. The delta between page time and “started fixing” is coordination; the delta between “started fixing” and resolution is code-fix. AI levers operate on the second; reporting both keeps vendor claims honest.
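One way to hold the decomposition, assuming your incident tool can export the two custom timestamps alongside the ones it already records. The field names are illustrative, not any vendor's schema.

```python
# Illustrative shape for the decomposed incident timeline. Map diagnosed_at and
# fix_deployed_at onto the custom fields your incident tool lets you add.
from dataclasses import dataclass
from datetime import datetime


@dataclass
class IncidentTimeline:
    failure_began_at: datetime  # best estimate of when the fault started
    detected_at: datetime       # alert fired / responder paged
    diagnosed_at: datetime      # custom: diagnosis finished, fix work started
    fix_deployed_at: datetime   # custom: fix shipped
    resolved_at: datetime       # incident closed after verification

    def phase_minutes(self) -> dict[str, float]:
        minutes = lambda a, b: (b - a).total_seconds() / 60
        return {
            "detection": minutes(self.failure_began_at, self.detected_at),
            "coordination": minutes(self.detected_at, self.diagnosed_at),
            "code_fix": minutes(self.diagnosed_at, self.fix_deployed_at),
            "verification": minutes(self.fix_deployed_at, self.resolved_at),
        }
```

Reporting the four phases side by side is what keeps vendor MTTR claims honest: a 25% improvement confined to the code-fix phase is visible as exactly that.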

Tag every metric with AI-authorship signals. Use Co-Authored-By on commits made by Claude Code, Cursor, Copilot, or any agent. Treat the signal as a covariate on every other DORA measurement. Charts become four lines: human authored, autocomplete assisted, in-editor agent, autonomous agent. The pressure-map distinctions in the previous section become observable in your own data instead of inferred from survey aggregates. The post on instrumentation for AI assistant usage walks through the lower-level mechanics for one such pipeline.
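A sketch of the covariate itself, inferring the mode from Co-Authored-By trailers. The marker-to-mode mapping is a house convention you would adopt (and the trailer alone cannot distinguish an interactive session from a background one, so fleets typically commit under a distinct bot identity); nothing below is a standard.

```python
# Covariate tag per commit: authorship mode inferred from Co-Authored-By trailers.
# The marker-to-mode mapping is a convention to define per team, not a standard.
import subprocess

MODE_BY_MARKER = {
    "copilot": "autocomplete assisted",
    "cursor": "in-editor agent",
    "claude": "in-editor agent",
    "agent-fleet": "autonomous agent",  # e.g. the bot identity your background fleet commits as
}


def authorship_mode(repo: str, sha: str) -> str:
    body = subprocess.run(["git", "-C", repo, "show", "-s", "--format=%B", sha],
                          capture_output=True, text=True, check=True).stdout.lower()
    for line in body.splitlines():
        if not line.startswith("co-authored-by:"):
            continue
        for marker, mode in MODE_BY_MARKER.items():
            if marker in line:
                return mode
    return "human authored"
```

Once every commit carries this tag, the other four moves get it for free: each decomposed lead-time, churn, mutation, and incident number can be split into the same four cohorts.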

None of these moves require buying anything. All five run on the data and tools a typical team already operates. The vendor platforms in the SERP do bundle them, and that bundling has value if your team has the budget and the appetite. If it does not, the difference between a DORA dashboard that is misleading and one that is useful is roughly a week of instrumentation work.

FAQ

Are DORA metrics still useful with AI? Yes for ordinal comparison; partially for absolute interpretation. The four classic metrics still rank teams against themselves over time, but absolute numbers reflect a different system than they did five years ago. Use them as trend signals and decompose them when you need to know what is actually moving (DORA, 2025; CD Foundation, 2025).

What is rework rate and how is it different from change failure rate? Rework rate measures unplanned production deployments as a share of total deployments; CFR measures the share of deployments that fail. Rework catches “we shipped, then had to fix it shortly after.” CFR catches “we shipped and it broke.” Both miss in-repo churn caught in pre-production, which is why pairing rework rate with two-week churn gives a fuller picture (Faros, 2026).

Why is PR review time growing under AI adoption? Because AI shifted cost from authorship to validation. CodeRabbit measured AI-authored PRs at 1.7x the issue rate of human-authored ones, with logic defects up 75% and security issues up 174%. Faros measured 441% year-over-year growth in median PR review time. The bottleneck moved from writing to judging (CodeRabbit, 2025; Faros, 2026).

Does AI improve MTTR? Marginally, on well-defined recurring incidents. Vendor claims of 25% to 40% reduction generally reflect runbook automation rather than novel-incident resolution. DORA 2025 lists MTTR as pressured, not improved, which is itself a tell (Rootly, 2025; DORA, 2025).

What now?

Three takeaways for the dashboard.

  • Lead time and deployment frequency are aggregates that hide internal redistribution. Decompose them. Three timelines for lead time; deployment frequency paired with two-week in-repo churn.
  • Change failure rate and MTTR are now reporting test-gate sophistication and coordination quality respectively. Instrument those underlying signals directly: mutation scores on AI-authored tests, coordination time separated from code-fix time.
  • Rework rate is the only classic metric that survives the shift with minimal redefinition. Track it. Pair it with two-week churn for the part it misses.

DORA is still the best ordinal comparison framework engineering organisations have. As an absolute description of how a system is performing under AI, it has stopped being self-sufficient. Run the five-move pass on your own dashboard this week. Keep using DORA for trend lines. Stop using it as the single number that explains the system. The dashboard did not lie. You stopped asking it the right question.
