AI Reviews the Diff. Humans Review the Decision.
AI code-review adoption rose from 14.8% in January 2025 to 51.4% in October (DORA via Faros, 2025). Over the same nine months median PR review time climbed 441%, PR size grew 51%, and 31% of PRs began merging with no human review at all. AI authoring is producing pull requests faster than humans can read them, and “add more reviewers” stopped working a long time ago.
So the case for AI as the first line of PR review writes itself. The case people skip is the harder one: what AI review structurally cannot do, why two confirmed 2025 incidents proved the reviewer itself is now an attack surface, and why Claude Pro and Max users have no first-party answer at all. This post is the opinionated map: the market in one table, the diligence checklist that comes from the incidents, a working DIY recipe on the Claude Code subscription you already pay for, and a clear line on where senior humans still belong in the loop.
Key Takeaways
- DORA 2025: AI code-review adoption rose from 14.8% to 51.4% in nine months, but 31% of PRs now merge with no human review and median review time is up 441% (DORA via Faros, 2025).
- 45% of AI-generated code introduces security flaws (Veracode, 2025). Asking the same class of system to review its own output is a closed loop.
- Two 2025 incidents proved the reviewer is itself an attack surface: Kudelski’s CodeRabbit RCE reached write access on roughly 1M repos, and the Comment-and-Control prompt injection leaked an Anthropic API key from claude-code-action (CVSS 9.4).
- Pro and Max users have no first-party managed PR review. The DIY path is gh pr diff | claude -p plus the official pr-review-toolkit subagents, wired into anthropics/claude-code-action@v1. Cost per medium PR: roughly $0.20 to $0.60 on Sonnet via API.
- AI is the first line, not the only line. Architecture judgment, calibration of the reviewer, accountability, and mentorship still belong to senior humans.
Why does PR review need a first line at all?
The volume problem is structural, not cultural. DORA’s 2025 cohort saw AI code-review adoption more than triple in nine months, while the median time a PR spends in review climbed 441% year over year, PR size grew 51%, and 31% of PRs began merging with no human review at all (DORA via Faros, 2025). Those four numbers are the same story told four ways: AI is now writing diffs faster than humans can read them.
“More reviewers” is not the fix. Senior attention does not scale. Every PR needs at least one of a small number of qualified eyes, so the review queue behaves like any queue with a fixed number of servers: as those reviewers approach full utilization, the backlog grows far faster than the arrival rate. So when the number of PRs triples, the queue does not triple; it explodes. AI as the first line is not a luxury feature; it is the only realistic answer to a queue that AI itself created.
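As a rough illustration of why the backlog outruns the arrival rate, the sketch below treats the whole reviewer pool as a single M/M/1 queue. The queueing model and the example rates are assumptions for illustration, not figures from DORA; real review queues have several reviewers and bursty arrivals.

```python
def mean_prs_waiting(arrivals_per_day: float, reviews_per_day: float) -> float:
    """Mean number of PRs waiting in an M/M/1 queue: Lq = rho^2 / (1 - rho)."""
    rho = arrivals_per_day / reviews_per_day  # reviewer utilization
    if rho >= 1.0:
        return float("inf")  # arrivals outpace capacity, so the backlog never drains
    return rho * rho / (1.0 - rho)


# Hypothetical team whose reviewers can clear 10 PRs per day in total.
for arrivals in (3.0, 6.0, 9.0):
    print(arrivals, round(mean_prs_waiting(arrivals, 10.0), 2))
```

Under these assumptions, tripling arrivals from 3 to 9 PRs per day moves the mean backlog from about 0.13 to about 8.1 waiting PRs, which is the non-linear blow-up the paragraph describes.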
The stake is not just throughput. Veracode’s 2025 GenAI Code Security Report tested 100+ models on 80 real coding tasks and found 45% of AI-generated code introduces security flaws, with Java failing 72% of the time and XSS-secure outputs landing at 12-13% (Veracode, 2025). Stack Overflow’s 2025 developer survey adds the quality stake: 66% of developers report spending more time fixing “almost-right” AI code, and trust in AI accuracy fell from 40% to 29% in a single year (Stack Overflow, 2025). What is being merged unreviewed is not just larger; it is increasingly AI-authored and increasingly flawed. The first line catches volume so the senior line can spend its attention where the model cannot help.
What is actually on the market in 2026?
67% of teams using an AI code reviewer use GitHub Copilot Review by default, with CodeRabbit a distant 12% and the rest split among Greptile, Qodo Merge, Bito, Sourcery, Cursor Bugbot, and Graphite Agent (DORA via Faros, 2025). Most of these are GitHub Apps with similar core mechanics. The differences are in pricing model, context depth, and trust posture, not in whether they can summarize a diff.
| Tool | Install | Pricing trap | Distinct strength | Weakness |
|---|---|---|---|---|
| GitHub Copilot Review | Native in PRs (GitHub only) | Premium-request metering on unlicensed-contributor PRs | Default-on; org admins can review every PR | Lower depth than dedicated tools; GitHub-only |
| CodeRabbit | GitHub App / GitLab / Azure DevOps | Pro $24-30/dev; Enterprise five figures monthly | Most mature feature surface; BYOK (Claude/Gemini/GPT) | Largest install base means largest blast radius |
| Greptile | GitHub / GitLab / Bitbucket | Flat $30/dev/mo (recently simplified) | Highest catch rate in independent evals; deep cross-repo context | Highest false-positive rate alongside that catch rate |
| Qodo Merge / PR-Agent | SaaS or self-host (Apache 2.0 core) | Teams $30; Enterprise $45 | Only credible OSS option; broadest VCS coverage | Self-host operational burden; UX behind CodeRabbit |
| Bito AI Code Review | GitHub / GitLab / Bitbucket | Team $15; Pro $25 | Validates PR against the linked Jira ticket | Vendor-supplied performance claims, not independently verified |
| Sourcery | GitHub / GitLab; self-host added 2025 | Pro $10 | Cheapest serious option; 30+ languages; learns from dismissals | Lighter on agentic features |
| Cursor Bugbot | GitHub only | $40/dev/mo (active-contributor billing) | Bug-focused; “70%+ of flags resolved pre-merge” (vendor) | Per-active-seat math punishes drive-by committers |
| Graphite Agent | GitHub only | Team $40 | Tight integration with stacking + merge queue | Recent rebrand churn (Diamond → Graphite Agent, Oct 2025) |
Three of these are worth knowing in detail.
Copilot Review is the share leader because it is bundled, not because it is best. The December 2025 changelog made Copilot Review usable on PRs from contributors without their own license, billed to the organisation as premium requests; if you turn it on org-wide, model the peak-load cost before the first invoice arrives (GitHub Changelog, 2025).
CodeRabbit is the most polished standalone reviewer, and its install base of 2M+ repos and 13M+ PRs (vendor-reported) is precisely what makes it the largest target on the market. Qodo Merge sits in the OSS lane with an Apache 2.0 PR-Agent core; it is the only realistic answer if your security team will not approve a hosted GitHub App with write scope.
The three you can usually ignore unless they fit your shape: Bito (only differentiator is Jira-aware review), Sourcery (lightweight, Python-first, cheap), and Bugbot (per-active-contributor billing punishes any open-source-flavoured workflow). Sweep AI was a notable PR-review entrant in 2024; in 2025 it pivoted to JetBrains autocomplete and is no longer in this category.
What should you look out for before buying one?
Two failure modes have already happened to AI PR reviewers in production: prompt injection through the PR surface itself, and the reviewer being compromised as a supply-chain dependency. Both shipped to real codebases in 2025, and both pay bounties at CVSS 9 or higher. The rest of this section is the diligence checklist that comes from those incidents.
Prompt injection via PR title, description, and diff. Johns Hopkins researchers (Guan, Liu, Zhong) disclosed a “Comment-and-Control” attack that hijacks the agent through attacker-controlled PR text, leaking an Anthropic API key from Anthropic’s own claude-code-action at CVSS 9.4; the same shape hit Gemini CLI Action and GitHub Copilot Agent (VentureBeat, 2026). It is the predictable outcome of running a tool-using LLM on attacker-controlled text, cataloged by OWASP as LLM01:2025 Prompt Injection (OWASP GenAI, 2025). Any reviewer that reads PR descriptions has the same surface.
Reviewer-as-supply-chain. Kudelski Security disclosed a chain in CodeRabbit in August 2025 that escalated from a crafted PR to remote code execution in the reviewer’s environment, ending with write access to roughly 1 million repositories the GitHub App was installed on (Kudelski, 2025). The exploit was patched. The pattern, “any GitHub App with write scope is a privileged supply-chain dependency,” is permanent. Diligence: SOC 2 Type II report, pen-test cadence, scoped tokens, customer-managed keys, and a clear answer to “what does write scope on our org get you in the worst case?”
Code retention and training. Read the Trust Center, not the marketing page. CodeRabbit, Qodo Enterprise, Bito, and Sourcery Team are explicit no-train and no-retain on customer code; free and Pro tiers of any tool are riskier and should be treated as if your code goes to a model provider, because it does. Copilot Business and Enterprise are also explicit; Copilot Free is not.
False-positive flooding. Greptile reports the highest catch rate in independent comparisons and the highest false-positive rate in the same comparisons; ProjectDiscovery’s recent benchmark logged 41 verified findings against 24 false positives on a single pass. The operational cost is review fatigue: developers dismiss everything and miss the legitimate signals in the same thread. Mitigations are scope-to-changed-files, severity ranking with hard suppression of style noise, and a senior owner who tunes the prompt every month. Without that owner, the tool degrades silently.
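The two mechanical mitigations (scope to changed files, severity ranking with hard suppression of style noise) can be sketched as a small post-processing step. This is a minimal sketch, not any vendor's implementation: the severity labels and the `category` field are assumptions, and the finding shape borrows the severity/file/line/message fields used later in this post.

```python
from dataclasses import dataclass

# Assumed severity labels, most to least urgent; adjust to what the reviewer emits.
SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass(frozen=True)
class Finding:
    severity: str
    file: str
    line: int
    message: str
    category: str = "bug"  # hypothetical field: "bug", "security", "style", ...


def triage(findings, changed_files, suppressed_categories=frozenset({"style"})):
    """Keep findings on changed files, drop suppressed categories, rank by severity."""
    kept = [
        f for f in findings
        if f.file in changed_files and f.category not in suppressed_categories
    ]
    # Unknown severity labels sort last instead of raising.
    return sorted(
        kept,
        key=lambda f: (SEVERITY_ORDER.get(f.severity, len(SEVERITY_ORDER)), f.file, f.line),
    )


findings = [
    Finding("low", "app/views.py", 10, "Unused import", category="style"),
    Finding("high", "app/auth.py", 42, "Token compared with ==", category="security"),
    Finding("medium", "vendor/lib.py", 7, "Possible None dereference"),
]
ranked = triage(findings, changed_files={"app/views.py", "app/auth.py"})
print(ranked)  # only the app/auth.py finding survives
```

The suppression set is where the monthly tuning by a senior owner would land; the code only enforces whatever that person decides.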
Pricing scaling traps. Two to model before the procurement deck. Copilot’s premium-request metering on unlicensed-contributor PRs can surprise an org that turns review on globally. Cursor Bugbot and Graphite Agent bill per active contributor, which punishes any workflow with drive-by committers (open source, contractor-heavy, or large-org review-as-a-service). Treat any per-review meter as a tax on the behaviour you want.
AI is the first line, not the only line
31% of PRs already merge with no human review (DORA via Faros, 2025). A noisy bot accelerates that trend; it does not fix it. There are four things AI reviewers structurally cannot do, and each one has a failure mode that compounds when teams pretend otherwise.
They cannot judge whether the feature should exist or fit the system. The model has yesterday’s CLAUDE.md, not yesterday’s hallway conversation. AI catches symptoms; principals catch architectural drift, cross-cutting concerns, and “this is solved three modules over.” Failure mode: codebase entropy compounds while every individual PR looks fine.
They cannot calibrate themselves. Which findings to suppress, which custom rules to add, when to override “Important” with “actually fine here.” Greptile’s catch-vs-noise tradeoff and ProjectDiscovery’s 41-vs-24 benchmark both end at the same place: someone senior owns the signal. Failure mode: review fatigue, devs auto-dismissing legitimate flags.
They cannot take responsibility. An AI flag is information, not accountability. Someone signs off on the merge, owns the incident, and is on the hook six months later. Rubber-stamping a bot’s “LGTM” is how 31% of PRs already merge without review. Failure mode: diffuse ownership, slower incident response.
They cannot mentor. Human review is teaching, not just gating. Juniors learn the codebase, the standards, and the politics through review comments. A bot can flag a missing test; it cannot say “we used to do it that way until the 2024 outage.” Failure mode: institutional knowledge stops compounding, juniors plateau.
The strongest argument for keeping seniors in the loop is the closed-loop trap. Veracode’s 45% AI-code-with-flaws and Stack Overflow’s 66%-fixing-almost-right combine into one sentence: AI is increasingly authoring the code AI is reviewing, and asking the same class of system to grade its own homework drives quality toward the model’s median, not toward the team’s bar. Senior humans are the only thing in that loop pulling the bar back up; one operational follow-on is to backtest the reviewer agent itself every time you change its prompt or model, so behaviour drift shows up before a human is the canary.
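One way to backtest the reviewer, sketched under assumptions: keep a small set of past PRs where a senior reviewer has confirmed the real issues, rerun the reviewer after each prompt or model change, and compare precision and recall. The data layout (PR id mapped to a set of file and line locations) is hypothetical, and exact line matching is deliberately crude.

```python
def backtest(expected: dict, reported: dict) -> dict:
    """Score a reviewer run against hand-labelled past PRs.

    Both arguments map a PR id to a set of (file, line) locations:
    `expected` holds issues a senior reviewer confirmed, `reported` holds
    what the reviewer flagged under the prompt or model being tested.
    PRs that appear only in `reported` are ignored.
    """
    tp = fp = fn = 0
    for pr_id, truth in expected.items():
        flagged = reported.get(pr_id, set())
        tp += len(truth & flagged)   # real issues the reviewer caught
        fp += len(flagged - truth)   # noise
        fn += len(truth - flagged)   # real issues it missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall}


expected = {"pr-1": {("a.py", 3), ("b.py", 9)}, "pr-2": {("c.py", 1)}}
reported = {"pr-1": {("a.py", 3), ("a.py", 50)}, "pr-2": set()}
scores = backtest(expected, reported)
print(scores)
```

A drop in either number between two configurations is the drift signal; which drop matters more is a calibration call for the senior owner.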
So the right division of labour is concrete, not philosophical. AI handles volume: mechanical correctness, missing tests, style and lint, known security patterns, dead code, comment rot, type-design red flags. Cheap, fast, runs on every push. Senior and principal review handles judgment: architecture fit, product alignment, cross-system implications, calibration of the AI reviewer itself, mentorship through review comments, sign-off and accountability. Expensive, slow, runs on every non-trivial PR. The first line buys the second line back its attention budget. That is the entire point.
How do you do this with the Claude Code subscription you already pay for?
There is no first-party Anthropic-managed PR review on Pro or Max plans. The managed Code Review service is Team and Enterprise only and currently in research preview (Anthropic docs). The good news is the building blocks for a serious DIY reviewer ship in the box: gh pr diff, claude -p headless mode, the official pr-review-toolkit subagents, and anthropics/claude-code-action@v1 for CI.
The duct-tape one-liner is the place to start. It pipes a PR diff into claude -p with structured output and the smallest possible toolset:
gh pr diff 1234 \
  | claude -p "Review this diff for bugs, security issues, and missing tests. Return JSON with findings: [{severity, file, line, message}]." \
      --output-format json --bare \
      --allowedTools "Read,Bash(gh *)"
Three flags carry the work. --bare skips local hooks, MCP, plugins, and CLAUDE.md so the run is deterministic in CI (Claude Code docs, 2025). --output-format json returns a structured object you can pipe to jq and post back via gh api repos/$OWNER/$REPO/pulls/$PR/reviews. --allowedTools is the smallest set you can run with; resist the urge to widen it.
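The post-back step can be sketched in Python instead of jq. Two assumptions are baked in and worth verifying against your CLI version: that the JSON envelope carries the model's reply as a string in a `result` field, and that the reply is the JSON the prompt asked for, possibly wrapped in a Markdown code fence. The function names are hypothetical.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker, built this way to keep the listing readable


def extract_findings(cli_stdout: str) -> list:
    """Pull the findings list out of the CLI's JSON output (envelope shape assumed)."""
    envelope = json.loads(cli_stdout)
    reply = envelope["result"].strip()
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, reply, re.DOTALL)
    if fenced:
        reply = fenced.group(1)
    parsed = json.loads(reply)
    # Accept either {"findings": [...]} or a bare list.
    return parsed["findings"] if isinstance(parsed, dict) else parsed


def build_review(findings: list) -> dict:
    """Shape findings into a request body for the pull-request reviews endpoint."""
    return {
        "event": "COMMENT",  # comment only; approval stays with a human
        "body": f"Automated first-pass review: {len(findings)} finding(s).",
        "comments": [
            {
                "path": f["file"],
                "line": int(f["line"]),
                "side": "RIGHT",
                "body": f"[{f['severity']}] {f['message']}",
            }
            for f in findings
        ],
    }
```

The resulting dict, dumped as JSON, can be sent with gh api --method POST repos/$OWNER/$REPO/pulls/$PR/reviews --input -. GitHub rejects inline comments on lines that are not part of the diff, so a model that reports a wrong line number fails the whole request; falling back to a single summary comment is a reasonable guard.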
The ceiling is composition. The official pr-review-toolkit plugin in anthropics/claude-code ships six specialist subagents under .claude/agents/: code-reviewer, silent-failure-hunter, pr-test-analyzer, type-design-analyzer, comment-analyzer, and code-simplifier. Add the bundled /security-review skill, write one orchestrator that fans out to all of them in parallel, and aggregate findings into a single comment. Subagents return only summaries to the parent context, which is the actual reason the composition works; you avoid drowning the orchestrator in diff-and-codebase tokens. The decision of what each subagent sees, in what order, is context engineering applied at the merge gate: same four surfaces (CLAUDE.md, skills, memory, MCP), different use site.
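The fan-out and aggregation shape can be sketched without committing to how each subagent is invoked. In this sketch the reviewers are plain Python callables standing in for subagents (an assumption; the toolkit's real invocation mechanism is not shown here), and findings that land on the same file and line are merged into one entry.

```python
from concurrent.futures import ThreadPoolExecutor


def run_reviewers(diff: str, reviewers: dict) -> list:
    """Run each reviewer on the diff in parallel and merge their findings.

    `reviewers` maps an agent name to a callable that takes the diff and
    returns a list of {"severity", "file", "line", "message"} dicts.
    """
    with ThreadPoolExecutor(max_workers=len(reviewers) or 1) as pool:
        futures = {name: pool.submit(fn, diff) for name, fn in reviewers.items()}
        results = {name: fut.result() for name, fut in futures.items()}

    merged = {}
    for name, findings in results.items():
        for f in findings:
            key = (f["file"], f["line"])
            # First finding at a location wins; later agents are recorded as agreeing.
            entry = merged.setdefault(key, {**f, "agents": []})
            entry["agents"].append(name)
    return list(merged.values())
```

Keeping the list of agreeing agents is a design choice: a location flagged by several specialists is a cheap confidence signal when the single aggregated comment is ranked.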
For CI, anthropics/claude-code-action@v1 is the official path. Minimal workflow:
# Trigger added as an assumption; the original snippet omits it.
on:
  pull_request:
    types: [opened, synchronize]
permissions:
  contents: read
  pull-requests: write
jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: anthropics/claude-code-action@v1
        with:
          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
          prompt: "Review this PR for correctness, security, and missing tests. Post findings as inline review comments."
          claude_args: |
            --max-turns 5
            --model claude-sonnet-4-6
            --allowedTools "Bash(gh *),Read,Grep"
Honest cost math, against Anthropic’s published rates: a medium PR (3,000 lines touched, roughly 30,000 tokens of context loaded) reviewed by Sonnet runs about $0.20 to $0.60 per pass. Heavy multi-agent runs with Opus on a large PR can exceed $5. The Claude Code subscription stops scaling around 50 to 100 PRs per week, or sooner if CI starts starving humans of tokens; that is when you set ANTHROPIC_API_KEY in CI so the bot uses pay-as-you-go and the team’s interactive Pro or Max tokens stay theirs (tracking Claude Code token usage is the signal).
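The arithmetic behind that range can be sketched as follows. The rates ($3 per million input tokens and $15 per million output tokens for Sonnet) and the 1,500 output tokens per turn are assumptions to check against the current price list, and the model is deliberately pessimistic: it re-sends the full context on every turn and ignores prompt caching.

```python
def review_cost_usd(
    input_tokens: int,
    output_tokens: int,
    turns: int = 1,
    input_per_mtok: float = 3.00,    # assumed Sonnet input rate, USD per million tokens
    output_per_mtok: float = 15.00,  # assumed Sonnet output rate, USD per million tokens
) -> float:
    """Estimate one review pass, re-sending the full context on every turn."""
    per_turn = (input_tokens * input_per_mtok + output_tokens * output_per_mtok) / 1_000_000
    return turns * per_turn


# 30,000 tokens of context and an assumed 1,500 tokens of output per turn.
for turns in (1, 2, 5):
    print(turns, round(review_cost_usd(30_000, 1_500, turns=turns), 4))
```

With these inputs a single turn comes to about $0.11 and five turns to about $0.56, so the quoted $0.20 to $0.60 corresponds to a review that takes roughly two to five turns.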
The hardening checklist is non-negotiable, because Comment-and-Control showed claude-code-action was reachable through attacker-controlled PR text. Five rules:
- Least-privilege permissions: { contents: read, pull-requests: write }, no broader.
- No secrets in the review job that are not strictly required.
- Smallest possible --allowedTools; Bash(gh pr *) and Read are usually enough, never raw Bash without a glob.
- A PreToolUse hook that blocks any Bash command referencing $ANTHROPIC_API_KEY or making outbound HTTP requests.
- Never let the reviewer push commits, period.
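The PreToolUse rule can be sketched as a small hook script. It assumes the hook receives a JSON event on stdin with `tool_name` and `tool_input.command` fields and that exit code 2 blocks the call; check both against the hooks reference for your version. The deny patterns are illustrative and not exhaustive, so this is a backstop behind a tight --allowedTools, not a replacement for it.

```python
import json
import re
import sys

# Illustrative deny-list; a determined injection can evade pattern matching.
BLOCKED = [
    re.compile(r"ANTHROPIC_API_KEY"),
    re.compile(r"\b(curl|wget)\b"),
    re.compile(r"https?://", re.IGNORECASE),
]


def blocked_reason(event: dict):
    """Return a reason string if this Bash call should be blocked, else None."""
    if event.get("tool_name") != "Bash":
        return None
    command = event.get("tool_input", {}).get("command", "")
    for pattern in BLOCKED:
        if pattern.search(command):
            return f"blocked by review hook: matched {pattern.pattern!r}"
    return None


def main(stdin=sys.stdin, stderr=sys.stderr) -> int:
    """Read one hook event from stdin; return 2 to block, 0 to allow."""
    reason = blocked_reason(json.load(stdin))
    if reason:
        print(reason, file=stderr)  # the reason is reported back on stderr
        return 2
    return 0
```

To use it as a hook, end the file with raise SystemExit(main()) and register the script under PreToolUse with a Bash matcher in the project settings. Since --bare skips local hooks, the hook only applies to runs that do not pass that flag.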
For an architectural worked example, Pylon takes the same composition further: parallel specialist agents per focus area, a code-intelligence MCP for whole-repo context, worktree-isolated execution, and a structured risk model (impact x likelihood x confidence x action). Same building blocks, different shape.
When does the DIY path stop being enough?
The DIY loop scales until senior calibration becomes the bottleneck rather than the feature. Three signals flag the threshold, and they are unrelated to whether the AI is good enough.
Throughput. Around 50 to 100 PRs per week the API spend stops being a rounding error, and peak-hour CI runs start hitting per-key concurrency limits. At that point you are paying for an operational layer (queueing, retries, deduplication, status tracking, audit log), not a model.
Compliance. SOC 2 Type II, HIPAA, data residency, customer-managed keys. None of these come for free with a homemade workflow; vendors price them in, and a security review will demand them whether you want to pay or not. CodeRabbit, Qodo Enterprise, and the larger Copilot tiers have these.
Cross-repo context. When a review needs code-intel that exceeds what a local MCP can serve in a CI runner, or when impact analysis spans more than one repository, you have grown out of the local-first pattern. Hosted vendors with persistent indexes have the structural advantage here, even if their FP rates look worse.
The honest answer at that point is that you are buying the operational layer, not the model. Pick on Trust Center first, signal-vs-noise second, integration third; feature lists come last. Anything else is procurement theatre.
Frequently Asked Questions
Is AI PR review ready to replace human review?
No. DORA 2025 already measured 31% of PRs merging unreviewed, and AI review accelerates that trend rather than fixing it (DORA via Faros, 2025). AI cannot judge whether a feature fits the system, calibrate its own findings, take responsibility, or mentor juniors through review comments. Treat it as a first line, not a replacement.
Which AI PR reviewer catches the most bugs?
Greptile reports the highest catch rate in independent comparisons and the highest false-positive rate in the same comparisons. Raw catch rate is the wrong metric on its own; signal-to-noise is what determines whether developers act on the findings. A bot that flags fifteen things per PR teaches the team to dismiss all fifteen, and the real bug ships anyway.
Can I use my Claude Pro or Max subscription for CI PR review?
Technically yes. Practically, set ANTHROPIC_API_KEY in CI instead so the bot uses pay-as-you-go and the team’s interactive subscription tokens stay available for humans (Anthropic support). A medium PR review costs roughly $0.20 to $0.60 on Sonnet via API; the subscription stops scaling at 50 to 100 PRs per week or once CI starts starving humans of tokens.
Are AI PR reviewers a security risk?
Yes. Two confirmed 2025 incidents: Kudelski Security disclosed a CodeRabbit RCE chain that reached write access on roughly 1M repositories (Kudelski, 2025), and the Comment-and-Control prompt-injection family leaked an Anthropic API key from claude-code-action (CVSS 9.4). Treat any reviewer with write scope as a privileged supply-chain dependency.
What does Pylon do differently from CodeRabbit or Greptile?
Pylon is a local desktop client, not a hosted GitHub App. It runs parallel specialist agents per focus area (security, bugs, performance, code-smells, style, architecture, ux), uses a code-intelligence MCP for whole-repo context, runs each review in an isolated git worktree, and produces a structured risk model (impact x likelihood x confidence x action) instead of a flat severity. The trade-off is operational: no central index, no managed compliance posture.
So where does that leave you?
The point of an AI first line is not to replace review. It is to give your seniors back the attention budget for the parts review actually exists for: architecture, calibration, accountability, and mentorship. AI handles the volume in the PR queue that AI itself created; humans handle the judgment that AI structurally cannot.
The concrete next step is small. Try the duct-tape gh pr diff | claude -p one-liner on the next non-trivial PR. Layer in the pr-review-toolkit subagents once the prompt is calibrated. Move to anthropics/claude-code-action@v1 for CI when it has earned its keep on three or four real reviews. If the loop outgrows you (throughput, compliance, cross-repo context), buy on Trust Center first. Treat AI as a reviewable team member (the same way you scaffold any other), not a vendor pitch.
If it was useful, pass it along.