Large pull requests don’t just take longer to review. They fundamentally don’t work.
A 2006 study of 2,500 code reviews at Cisco found that when reviewers moved faster than 450 lines per hour, 87% of reviews had below-average defect detection. Effectiveness dropped sharply, not gradually: later research confirmed the decline is faster than linear, describing how “effectiveness drops very quickly with the growth of changeset size and remains on a very low level for big changesets.”
I’ve tracked this across teams over the past few months. Reviews under 300 lines tend to get meaningful architectural feedback. Questions about edge cases, suggestions for better abstractions, discussions about tradeoffs. Past 600 lines, comments shift almost entirely to style issues, typos, and obvious bugs. The reviewer isn’t thinking deeply anymore. They’re skimming, pattern-matching, checking boxes. The optimal range is somewhere around 200-400 lines of code per review. Go past that and you’re not doing a slower version of the same review; you’re doing something qualitatively different and less effective.
The biological ceiling
The cliff exists because of how human cognition works, not because of tooling or process choices. Code review requires holding multiple mental models at once. The existing system architecture, the intended change, potential edge cases, security implications, performance considerations. Each file or function you’re reviewing adds to that cognitive load. When you exceed working memory capacity, comprehension breaks down.
You can see this degradation in Microsoft’s research analyzing 1.5 million review comments across five projects. About one-third of code review comments weren’t useful to the author. More tellingly, the more files in a changeset, the lower the proportion of useful feedback. Reviewers under cognitive strain leave more comments, but fewer that actually matter. Eye-tracking studies can even detect when developers hit cognitive overload with 86% accuracy by measuring behavioral changes while they engage with code—the mental load is measurable and has real limits.
What’s striking is how consistently teams rediscover these limits. LinearB’s 2025 analysis of 6.1 million pull requests across 3,000 teams found that elite teams average under 219 lines of code per PR, with the 75th percentile under 98 lines. That’s remarkably consistent with the Cisco study’s 200-400 LOC optimal range from nearly two decades earlier, despite completely different tooling, workflows, and team cultures. Teams performing at the highest level aren’t just following a best practice someone wrote down. They’re discovering the same cognitive boundaries.
The constraint shows up everywhere once you see it. Some teams enforce PR size limits through CI checks. Others define strict ownership boundaries that partition review responsibility and limit the scope each reviewer needs to hold in their head (the mobile team reviews mobile changes, the API team reviews API changes). A few teams I’ve worked with treat large PRs as code walkthroughs rather than traditional reviews, using synchronous “jam” sessions where the author explains the changes while reviewers ask questions. None of these approaches eliminates the constraint. They just work around it, either by reducing the size or by changing the review mode to match cognitive capacity.
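To make the CI-check idea concrete, here’s a minimal sketch of what such a gate could look like. The 400-line budget, the origin/main base branch, and the script itself are illustrative assumptions, not any particular tool’s built-in feature.

```python
#!/usr/bin/env python3
"""Fail the build when a PR's diff exceeds a line budget (illustrative sketch)."""
import subprocess
import sys

MAX_LINES = 400  # roughly the upper edge of the 200-400 LOC range; adjust to taste

# --numstat prints "added<TAB>deleted<TAB>path" per file ("-" for binary files).
# Assumes the base branch has been fetched as origin/main in the CI environment.
diff = subprocess.run(
    ["git", "diff", "--numstat", "origin/main...HEAD"],
    capture_output=True, text=True, check=True,
).stdout

changed = 0
for line in diff.splitlines():
    added, deleted, _path = line.split("\t", 2)
    if added != "-":  # skip binary files, which report "-" instead of counts
        changed += int(added) + int(deleted)

if changed > MAX_LINES:
    print(f"PR touches {changed} lines; the review budget is {MAX_LINES}. Consider splitting it.")
    sys.exit(1)
print(f"PR touches {changed} lines; within the {MAX_LINES}-line budget.")
```

In practice you’d want an escape hatch (a label or an override) for unavoidable large diffs like lockfiles or generated code, so the check nudges behavior instead of blocking it.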
Time has a ceiling too
The Cisco study also found that review effectiveness plummets after 90 minutes of code reviewing, with 60 minutes as the optimal session length (which still feels too long for my liking). Whatever the size of the PR may be, reviewers “simply wear out and stop finding additional defects” past that point. This feels similar to the attention limits that affect deep work more broadly. The specific number varies by person and context, but everyone hits diminishing returns eventually.
You can’t review past capacity
If you’ve designed backend services, this pattern looks familiar. You can’t fix a capacity-constrained service by telling it to “work harder.” A server maxed at 100% CPU can’t process requests faster just because the load balancer is sending more traffic. It either drops requests, queues them (increasing latency), or crashes.
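As a toy illustration (the capacity, the queue limit, and the handler are all invented), this is the whole menu of options for a fixed-capacity service: serve, queue, or shed.

```python
# Toy load shedder: a fixed-capacity service either serves, queues, or drops work.
from collections import deque

CAPACITY_PER_TICK = 5   # requests the server can actually process per tick
QUEUE_LIMIT = 10        # how much latency-adding backlog we tolerate before shedding

queue = deque()

def handle_tick(incoming):
    dropped = 0
    for req in incoming:
        if len(queue) < QUEUE_LIMIT:
            queue.append(req)   # queued: served later, at higher latency
        else:
            dropped += 1        # shed: beyond capacity, refusing is the only honest answer
    served = [queue.popleft() for _ in range(min(CAPACITY_PER_TICK, len(queue)))]
    print(f"served {len(served)}, queued {len(queue)}, dropped {dropped}")

handle_tick([f"req-{i}" for i in range(20)])  # 20 arrivals against capacity for 5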
Human cognitive capacity works the same way. When a reviewer is already at the limits of working memory, asking them to review a 1,000-line PR doesn’t get you a slower version of a good review. You get a degraded review where they’re pattern-matching against surface issues instead of reasoning about correctness. And if that code is also arriving in larger batches because AI makes it easier to generate volume, you’re not just increasing load on a constrained system—you’re pushing it past the threshold where it can function effectively.
I’ve seen this play out with database-adjacent migrations. One team split a schema migration into five PRs, each touching a subset of tables and their dependent code. Each PR got deep discussion (rollback strategies, edge cases in data transformations, performance implications, and so on), and each took maybe 10-15 minutes to review. Another team bundled everything into a single 2,000-line PR. It sat for a week. When someone finally reviewed it, they approved with “LGTM, trusting the tests.” The reviewer had hit cognitive overload before even starting.
The research establishes clear thresholds—200-400 lines of code, under 500 LOC per hour, 60-90 minute sessions—but I’m curious about the edges. Does code complexity shift the cliff? One study found that cyclomatic complexity correlates about 90% with LOC across 1.2 million files, suggesting LOC is a reasonable proxy. But 400 lines of configuration might be easier to review than 400 lines of algorithmic code. And what about reviewer expertise? The studies don’t report different thresholds for junior versus senior developers, and the working memory limit is biological, not learned. Expert reviewers might chunk information more efficiently by recognizing a design pattern as a single unit rather than parsing it line by line, but that probably improves review quality within the threshold rather than shifting the cliff itself.
The Cisco study is from 2006, back when code review looked pretty different. Modern IDEs and AI assistants handle mechanical checks—linting, test coverage, basic security scans—which should reduce cognitive load. But the fact that elite teams in 2025 are still converging on similar size limits suggests the underlying cognitive constraints haven’t changed, even if the tools have.
What has changed is the volume of code being written, thanks to AI. If the bottleneck has shifted from writing to reviewing (as Little’s Law predicts), and reviewing has a hard ceiling based on how human brains work, then we’re not facing a process problem. We’re facing an architectural one.
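Here’s a back-of-the-envelope sketch of that shift, with invented numbers. Little’s Law says average work in progress equals arrival rate times average cycle time, so when PRs arrive faster than a cognitively capped review rate can clear them, the backlog and the time-to-review grow together.

```python
# Toy model (all numbers invented): PRs arrive faster than a capped review rate.
review_capacity_per_day = 10   # PRs the team can meaningfully review per day
arrival_rate_per_day = 20      # PRs opened per day once AI doubles code output

backlog = 0
for day in range(1, 11):
    backlog += arrival_rate_per_day                    # new PRs join the queue
    backlog -= min(backlog, review_capacity_per_day)   # reviewers clear what they can
    wait = backlog / review_capacity_per_day           # rough FIFO wait for the newest PR
    print(f"day {day:2}: {backlog} PRs in flight, ~{wait:.1f} days before a new PR is reviewed")
```

The numbers are arbitrary; the shape is the point. A capped reviewer can’t absorb a doubled arrival rate, so the queue, not the keyboard, ends up setting the delivery pace.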
The solutions aren’t about reviewing faster or working harder. They’re about decomposing work to match human cognitive capacity: smaller PRs, clearer ownership boundaries, and review workflows that respect the 200-400 LOC threshold. The same way you’d design a distributed system around the throughput limits of your backend services, you need to design your development process around the cognitive limits of your reviewers. The constraint isn’t going away. The question is whether you architect around it or pretend it doesn’t exist.