We’ve spent years fighting flaky tests. Build reliability is a solved problem for most teams, or at least a known problem with known solutions. But what if test reliability is optimizing the wrong layer?
CI isn’t a bug detector, it’s a load-shedding protocol
When teams adopt CI, discussion drops. A study of 685 GitHub projects found that CI substantially reduced review comments on PRs. The code still got the same scrutiny from reviewers, but they stopped commenting on syntax errors, style violations, and test coverage because CI absorbed that verification work.
There is an implicit division of labor these days, where CI handles mechanical verification (rule-based, predictable, automatable at scale) and humans handle judgment calls (design trade-offs, context-dependent decisions, security implications). A green build means syntax is fine, tests pass, style is consistent. So reviewers look elsewhere: architecture, logic, edge cases. (Note: this assumes you have a functional CI setup and aren’t wasting your reviewers.)
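To make the mechanical half of that division concrete, here’s a minimal sketch of a pre-merge script CI might own, assuming a Python project. The specific tools (ruff, pytest) are illustrative placeholders, not a prescription.

```python
#!/usr/bin/env python3
"""Minimal sketch of the 'mechanical verification' half of the division.

Assumes a Python project with ruff and pytest installed; the tool choices
are placeholders for whatever rule-based checks your CI owns.
"""
import subprocess
import sys

# Checks CI can own: rule-based, predictable, automatable at scale.
MECHANICAL_CHECKS = [
    ["ruff", "check", "."],  # style and syntax-level issues
    ["pytest", "-q"],        # behavior pinned down by existing tests
]

def main() -> int:
    for cmd in MECHANICAL_CHECKS:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            # A red build: humans shouldn't spend review time on this.
            return result.returncode
    # A green build means only this much: mechanics verified.
    return 0

if __name__ == "__main__":
    sys.exit(main())
```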
Google’s 70/20/10 testing pyramid encodes the same division. Their testing blog notes that end-to-end-heavy approaches let developers “offload most, if not all, of the testing to others.” The pyramid says developers own quality, automation verifies mechanics, reviewers focus on what machines can’t evaluate. The load gets shed, not eliminated.
Protocols built for one scale break at another
The division assumed certain patterns: human-written code, predictable defect types, manageable PR volume. AI breaks all three.
GitClear’s analysis of 153 million lines shows code churn projected to double, copy-paste up 48%, refactoring down 60%. PR volume is up 23% year-over-year (43.2 million/month in 2025). That’s the scale problem. More load arriving while reviewer capacity hits biological ceilings.
And the defect pattern shifted too. AI rarely makes the syntax mistakes CI can catch. It makes plausible-but-wrong architectural choices that require human context to evaluate: nuanced security issues, content copy-pasted from its training data, and architectural drift accumulating across half-finished sessions.
The load that today’s CI sheds isn’t the load that’s growing, which is why we need linters that tackle these newer defect classes head on (a toy sketch of one follows). The division of labor was reasonably well balanced for human-written code: lower volume, a narrower spread of defects. Run that same division at today’s volumes against AI-generated defect patterns and CI keeps getting better at checking things that are already handled upstream, while humans get overwhelmed checking things automation could theoretically verify but doesn’t. More PRs arriving while each takes longer to review properly means queue depth grows faster than teams can drain it.
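As a toy example of a linter aimed at that growing load rather than syntax, here’s a sketch of a copy-paste detector: it hashes sliding windows of normalized lines and flags blocks that show up in more than one place. The window size, file glob, and `src` directory are assumptions; real duplication detectors are considerably smarter.

```python
"""Sketch of a 'better linter' aimed at AI-era defects rather than syntax.

A toy copy-paste detector: hashes sliding windows of normalized lines and
flags any block that appears verbatim in more than one place.
"""
import hashlib
import pathlib
from collections import defaultdict

WINDOW = 8  # flag any 8-line block that appears more than once

def normalized_lines(path: pathlib.Path) -> list[str]:
    # Strip whitespace so formatting differences don't hide duplication.
    return [line.strip() for line in path.read_text().splitlines() if line.strip()]

def find_duplicates(root: str = "src") -> dict[str, list[tuple[str, int]]]:
    seen: dict[str, list[tuple[str, int]]] = defaultdict(list)
    for path in pathlib.Path(root).rglob("*.py"):
        lines = normalized_lines(path)
        for i in range(len(lines) - WINDOW + 1):
            block = "\n".join(lines[i : i + WINDOW])
            digest = hashlib.sha256(block.encode()).hexdigest()
            seen[digest].append((str(path), i + 1))
    # Keep only blocks seen in two or more locations.
    return {h: locs for h, locs in seen.items() if len(locs) > 1}

if __name__ == "__main__":
    for locations in find_duplicates().values():
        print("duplicated block at:", locations)
```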
The agreement needs updating
When CI becomes unreliable (flaky tests that pass sometimes and fail other times), the division collapses. Teams either recheck everything CI already verified (no load shedding) or assume someone else will catch it (gaps). Flaky tests were the #1 factor in developer happiness at Slack. Not because they waste CI minutes, but because they break trust in what CI checks.
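The trust problem is visible even at toy scale. Here’s a sketch that re-runs a single test a handful of times and reports mixed outcomes; the test id and run count are hypothetical placeholders, and real flake detection happens across CI history, not five local reruns.

```python
"""Sketch of how flaky tests break trust: same input, different answers."""
import subprocess

def outcomes(test_id: str, runs: int = 5) -> list[bool]:
    results = []
    for _ in range(runs):
        proc = subprocess.run(["pytest", "-q", test_id])
        results.append(proc.returncode == 0)
    return results

if __name__ == "__main__":
    # Hypothetical test id; substitute a suspect test from your own suite.
    results = outcomes("tests/test_checkout.py::test_retry")
    if len(set(results)) > 1:
        # Passes sometimes, fails other times: neither side of the division
        # can trust the signal, so both sides end up re-checking.
        print("flaky:", results)
```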
This is where making the division explicit rather than implicit helps. If teams know what automation covers, reviewers know what gaps to fill rather than guessing based on whether the build is green. The implicit version assumes everyone shares the same mental model. But that model is now at risk of going stale in subtle ways because AI changes defect patterns. If your team had no mechanism for stating the division before, now is a good time to build one.
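One lightweight way to write the division down is sketched here; the class name and categories are illustrative assumptions, not a standard. The point is that a checked-in contract stating what automation verifies and what reviewers still own makes gaps visible instead of implied, and can be diffed when it goes stale.

```python
"""Sketch of making the division explicit instead of implicit."""
from dataclasses import dataclass, field

@dataclass
class ReviewContract:
    # What CI actually verifies today.
    automated: set[str] = field(default_factory=lambda: {
        "syntax", "formatting", "unit-test regression", "dependency pinning",
    })
    # What reviewers are explicitly asked to cover.
    manual: set[str] = field(default_factory=lambda: {
        "architecture fit", "security implications", "copy-paste debt",
        "plausible-but-wrong design choices",
    })

    def unclaimed(self, concerns: set[str]) -> set[str]:
        # Anything neither side owns is a gap, not someone else's problem.
        return concerns - self.automated - self.manual

if __name__ == "__main__":
    contract = ReviewContract()
    print(contract.unclaimed({"syntax", "architecture fit", "licensing"}))
    # -> {'licensing'}: a concern nobody has explicitly claimed yet
```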
Also, even reliable CI has a failure mode. Automation bias research shows very reliable automation paradoxically increases complacency. When CI is too consistent, the green build becomes an assumption rather than verified fact. Reviewers stop checking.
Green builds mean “mechanics verified,” not “ready to ship.” Track what you’re checking manually in reviews. If your team hasn’t made this division explicit yet, starting now means you’ll know what to automate when the manual work overwhelms you. When that volume gets high enough to slow reviews down, automate those checks too. But also prune the old ones. If your linters aren’t finding issues anymore, they’re probably tuned for syntax errors AI rarely makes while missing the copy-paste debt and security vulnerabilities it does introduce.
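As a sketch of what pruning could look like, assuming you can export per-rule lint findings from each CI run as JSON records like `{"rule": "E501", "count": 3}` (the report format, directory, and rule names here are made up): rules that stay silent across a whole window become candidates for removal, freeing budget for checks aimed at the defects that are actually arriving.

```python
"""Sketch of pruning checks that no longer earn their keep."""
import json
import pathlib
from collections import Counter

def rule_hit_counts(report_dir: str = "ci-reports") -> Counter:
    # Aggregate how often each lint rule fired across recent CI runs.
    hits: Counter = Counter()
    for report in pathlib.Path(report_dir).glob("*.json"):
        for record in json.loads(report.read_text()):
            hits[record["rule"]] += record["count"]
    return hits

def prune_candidates(enabled_rules: set[str], report_dir: str = "ci-reports") -> set[str]:
    hits = rule_hit_counts(report_dir)
    # Enabled but silent for the whole window: likely tuned for mistakes
    # your contributors (human or AI) no longer make.
    return {rule for rule in enabled_rules if hits[rule] == 0}

if __name__ == "__main__":
    print(prune_candidates({"E501", "F401", "no-copy-paste-debt"}))
```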
The flaky test problem still matters because it wastes compute, but broken trust freezes a bigger feedback loop, so you can’t put your head in the sand for long without repercussions. Besides, teams can’t update what they shed if they don’t trust what automation already checks. Right now most teams implicitly run yesterday’s division (CI checks syntax, humans check design) against today’s AI-generated code. That division needs renegotiation as code generation tools evolve, the same way you’d version a health check protocol when backend semantics change. Otherwise, perfect build reliability means you are failing reliably at the wrong layer.