AI breaks the review process by increasing arrival rate while review capacity stays fixed. The obvious fix is to automate review itself. If AI creates the bottleneck, AI reviewing AI code should solve it, right?

A 2024 study of 238 practitioners at a Turkish firm (Beko) found 73.8% of AI-generated review comments were addressed. Google’s ML-based system resolves 52% of reviewer comments at 50% precision, projecting “hundreds of thousands of hours saved annually.” But the Beko study found PR closure time increased by 2 hours 28 minutes because authors spent time addressing bot feedback before human reviewers saw the code. Load transferred, not shed.

The Attention Cost

CI works as a load-shedding protocol: today, automation checks syntax and tests, and humans check design. AI review tools run differently. They add suggestions during review rather than filtering work out beforehand. That placement makes sense while you’re testing a tool’s accuracy, but it changes how humans review.

A controlled study of 29 experts across 50+ hours found that reviewers with AI assistance converge on AI-highlighted locations instead of searching elsewhere. They found more low-severity issues but no more high-severity ones. The tools helped surface trivial problems, but did not help (and may have hindered) the broader code exploration that architectural issues require.

The Recursive Loop

The Beko study also found that a quarter of AI comments were rejected or ignored. Practitioners explained why: “Out-of-scope or irrelevant suggestions could slow down reviews,” “Sometimes the mistakes it thinks it finds are not mistakes at all,” and, critically, “With each fix, a new review is generated…subsequent comments become redundant and unhelpful.”

This becomes more acute when each code change triggers a new AI review. Fix a null check, get new comments about error handling. Address those, get comments about variable naming. The recursive pattern generates useful feedback data for tuning the tool, but treating this configuration as a production workflow creates an endless refinement loop. When AI review becomes a mandatory gate, developers learn to work around it, and false positives erode trust in automation generally.

The Layering Problem

This points to a deeper issue. When AI review attempts comprehensive quality evaluation, human reviewers end up verifying bot comments instead of evaluating architecture. The expensive resource (human judgment) checks the cheap resource (automation output).

AI review should pre-filter for verifiable patterns: SQL injection, missing input validation, the cross-file duplication that jumped 48% in AI-generated code, obvious edge cases. Not “check for security issues,” which is too broad and produces high false-positive rates, but specific high-confidence patterns that don’t need human interpretation: “flag unparameterized SQL queries” or “detect hardcoded credentials.”
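
As a minimal sketch of what those narrow, high-confidence checks might look like, assuming simple regex detectors over a diff (the rule names, patterns, and `scan_added_lines` helper are illustrative, not a production ruleset):

```python
import re

# Illustrative high-confidence patterns; real rules would be tuned per codebase.
RULES = {
    # String-built SQL passed to an execute call, e.g. execute("..." + user_input)
    "unparameterized-sql": re.compile(
        r"""execute\(\s*(f["']|["'].*["']\s*(%|\+))""", re.IGNORECASE
    ),
    # Credential-looking assignments with a literal value, e.g. PASSWORD = "hunter2"
    "hardcoded-credential": re.compile(
        r"""(password|secret|api_key|token)\s*=\s*["'][^"']+["']""", re.IGNORECASE
    ),
}

def scan_added_lines(diff_lines):
    """Return (rule, line) pairs for added lines that match a high-confidence rule."""
    findings = []
    for line in diff_lines:
        if line.startswith("+++") or not line.startswith("+"):  # only inspect added code
            continue
        for rule, pattern in RULES.items():
            if pattern.search(line):
                findings.append((rule, line[1:].strip()))
    return findings

if __name__ == "__main__":
    diff = [
        '+    cursor.execute("SELECT * FROM users WHERE id = " + user_id)',
        '+    API_KEY = "sk-live-1234"',
        "+    log.info('updated user')",
    ]
    for rule, code in scan_added_lines(diff):
        print(f"{rule}: {code}")
```

Rules this specific can be verified without human interpretation, which is what makes them safe to run ahead of the reviewer.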

Google’s ML system targets 50% precision, meaning it’s wrong half the time. The Beko study’s 26% rejection rate confirms this isn’t an outlier. This precision level creates a specific problem. A meta-analysis of 74 studies on automation bias found that users following erroneous automated advice made 26% more errors than they did without automation. Systems at 70-90% reliability trigger appropriate skepticism, while higher reliability creates false confidence. The 73.8% acceptance rate sits in the skepticism zone, where teams should be questioning suggestions. But the 2h 28m overhead comes from authors addressing bot feedback before human review, suggesting teams treat it as mandatory rather than advisory.

This creates a design constraint that runs counter to normal engineering instincts, though the reliability field has documented this for decades. Making AI review more accurate might degrade outcomes by increasing blind acceptance. The mitigation is making uncertainty visible: use confidence scores instead of binary judgments, optional suggestions instead of mandatory gates, and minimal visual styling instead of prominent notifications. Design for skepticism, not trust.
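
A minimal sketch of what “making uncertainty visible” could look like at the comment level; the noise-floor threshold and output format are illustrative assumptions, not any vendor’s API:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    rule: str
    message: str
    confidence: float   # 0.0-1.0, as reported by the review tool

# Threshold is an illustrative assumption, not a calibrated value.
NOISE_FLOOR = 0.4

def render(finding: Finding) -> str | None:
    """Format a bot comment so its uncertainty stays visible to the reviewer."""
    if finding.confidence < NOISE_FLOOR:
        return None   # below the noise floor: don't post at all
    # Confidence shown explicitly, marked optional, no blocking language,
    # no "error" styling that would invite blind acceptance.
    return (f"(optional, {finding.confidence:.0%} confidence) "
            f"{finding.rule}: {finding.message}")

print(render(Finding("missing-null-check", "user may be None here", 0.72)))
print(render(Finding("naming", "consider renaming tmp", 0.31)))   # suppressed -> None
```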

Pre-Filter Architecture

Distributed systems use pre-processing layers to filter before expensive operations. A load balancer health-checks nodes and routes traffic. It doesn’t process requests. Mixing layers wastes the expensive resource on work the cheap resource should handle.

AI review works the same way. Run it in CI before human review. Block on high-confidence issues like SQL injection or hardcoded credentials, post low-confidence findings as optional suggestions, and flag the areas where human review is essential, like architecture. The language also matters. If reviewers see “Security scan passed” or “3 optional items,” they can focus on what remains. If bot suggestions are mixed in with human comments, reviewers end up reviewing both the code and the bot’s suggestions.
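
A minimal sketch of that placement as a CI step, reusing the hypothetical `scan_added_lines` detector from the earlier sketch; the exit-code convention, git invocation, and summary format are assumptions, not any particular tool’s interface:

```python
import subprocess
import sys

# Hypothetical import of the detector sketched earlier.
from prefilter import scan_added_lines

# Only rules trusted enough to block the merge; lower-confidence rules
# (not shown here) would surface as optional suggestions instead.
BLOCKING_RULES = {"unparameterized-sql", "hardcoded-credential"}

def main() -> int:
    # Diff of this branch against the merge target; only added lines matter here.
    diff = subprocess.run(
        ["git", "diff", "--unified=0", "origin/main...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()

    findings = scan_added_lines(diff)
    blocking = [f for f in findings if f[0] in BLOCKING_RULES]
    optional = [f for f in findings if f[0] not in BLOCKING_RULES]

    # Summary the human reviewer sees before starting: what is already handled,
    # what is optional, and where their judgment is still required.
    print(f"Pre-filter: {len(blocking)} blocking, {len(optional)} optional findings.")
    print("Human review still needed for: architecture, domain logic, API design.")
    for rule, line in blocking:
        print(f"  BLOCKING {rule}: {line}")
    for rule, line in optional:
        print(f"  optional {rule}: {line}")

    # Nonzero exit fails the CI gate only for high-confidence issues.
    return 1 if blocking else 0

if __name__ == "__main__":
    sys.exit(main())
```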

The removal test reveals purpose. Remove the tool and observe what happens. If humans then spend more time on security checks and production issues increase, the tool was filtering real problems—right layer. If reviews just move faster without more bugs escaping to production, the tool was adding overhead without adding value—wrong layer.
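
One way a team might instrument the removal test, as a sketch; the window metrics and decision rules are assumptions about what gets tracked, not measurements from the studies above:

```python
from dataclasses import dataclass

@dataclass
class ReviewWindow:
    """Aggregate review metrics for a fixed period (e.g. 4 weeks)."""
    label: str
    median_pr_closure_hours: float
    escaped_defects_per_100_prs: float   # bugs found in production after merge

def removal_test(with_tool: ReviewWindow, without_tool: ReviewWindow) -> str:
    faster_without = (without_tool.median_pr_closure_hours
                      < with_tool.median_pr_closure_hours)
    more_escapes_without = (without_tool.escaped_defects_per_100_prs
                            > with_tool.escaped_defects_per_100_prs)

    if more_escapes_without:
        return "Tool was filtering real problems: right layer."
    if faster_without:
        return "Reviews sped up with no quality loss: the tool was overhead, wrong layer."
    return "No clear signal; extend the measurement window."

# Hypothetical numbers purely for illustration.
print(removal_test(
    with_tool=ReviewWindow("with tool", 30.5, 2.1),
    without_tool=ReviewWindow("without tool", 28.0, 2.0),
))
```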

Testing Configuration as Production Architecture

The pattern repeats across maturity levels. A test of 8 tools found that two couldn’t run at all and that none caught a severe production bug. The Beko study used Qodo PR-Agent with GPT-4, running for six months. Google’s ML review has run for years with custom models tuned to their codebase. Whether testing immature tools or running production systems, teams configure AI review to run on every PR and gather accuracy data, but rarely update the workflow based on what they learn. The 2h 28m closure increase becomes permanent infrastructure.

Testing accuracy requires gathering rejection rates and failure modes to tune precision. That’s essential upfront. Teams need to revisit the configuration frequently early on (weekly or bi-weekly), then on a slower cadence as patterns stabilize (monthly or quarterly). The data should inform where the tool fits in the review flow, not become the workflow itself.

You’re ready to shift from testing to production architecture when acceptance rates stabilize around 70-75%, false positive patterns become clear (broad architecture suggestions vs. specific security checks), and high-confidence patterns emerge (SQL injection, missing validation). You’re stuck in permanent testing if closure times remain elevated at 2+ hours overhead, authors keep addressing bot feedback before human review starts, recursive loops persist (“with each fix, a new review is generated”), and the tool continues commenting on every PR during review.
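
The quantitative parts of those criteria can be checked mechanically. A sketch assuming the team already tracks acceptance rate, closure-time overhead, and bot re-review rounds per PR; the last metric and all thresholds are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class ReviewMetrics:
    acceptance_rate: float          # share of bot comments addressed, e.g. 0.73
    closure_overhead_hours: float   # extra PR closure time attributable to the bot
    rereview_rounds_per_pr: float   # average bot re-review cycles per PR

def ready_for_production(m: ReviewMetrics) -> bool:
    """True when the tool can move from gathering accuracy data to a pre-filter role."""
    stable_acceptance = 0.70 <= m.acceptance_rate <= 0.75
    low_overhead = m.closure_overhead_hours < 2.0
    no_recursive_loop = m.rereview_rounds_per_pr <= 1.0
    return stable_acceptance and low_overhead and no_recursive_loop

# Numbers resembling the Beko figures fail the check: overhead and loops dominate.
print(ready_for_production(ReviewMetrics(0.738, 2.47, 2.3)))   # False
```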

The shift means moving high-confidence checks to pre-merge gates, surfacing low-confidence findings as optional suggestions, and dropping patterns that generate noise. The bottleneck from AI code persists—higher arrival rates, harder reviews, biological ceilings on capacity. AI review as currently deployed doesn’t solve this. It just moves the queue to a different layer.

The question isn’t whether AI review works. It’s how you’re using it: whether you’re still gathering accuracy data or actually using the tool to filter load. The removal test I mentioned above makes this concrete. If reviews move faster without more bugs escaping to production, you’re paying for testing theater, not architectural value.