Stacked PRs, where you build PR #2 on top of unmerged PR #1, promise to unblock authors who’d otherwise wait for reviews before continuing work. Semgrep adopted stacking and saw a 65% increase in code shipped per engineer. But I’ve also seen teams abandon stacking after a few weeks, finding that rebase cascades and coordination overhead consumed more time than the blocking waits they were trying to avoid. The difference isn’t discipline or tooling alone. It’s whether the underlying conditions support stacking as an architecture.

On one side, a 3-stack auth refactor worked beautifully because it was broken down into database schema, then service layer, then API endpoints. Each PR stayed under 250 lines, reviews came back in hours, and we shipped the entire feature in days. But I’ve also seen a 7-stack change-set turn into a coordination nightmare. Core feedback on PR #2 meant invasively rebasing PRs #3 through #7. By the time we merged the last one, the first three PRs had been rebased four times each, and the team never touched stacked PRs again while I was with them.

Pipeline Parallelism for Code Review

Stacking works like pipeline parallelism in distributed systems. Each PR is a stage in the pipeline, processing a chunk of the overall feature. Reviewers can work on different stages concurrently, one person reviewing the database migration while another reviews the business logic that depends on it. The parallelism applies only to reviews; merges still happen sequentially. PR #2 can’t merge until PR #1 merges, which means the pipeline can still stall if any stage sits unreviewed.
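In Git terms, each stage is just a branch based on the previous unmerged branch. A minimal sketch, using hypothetical branch and file names in a throwaway repo:

```shell
# Build a 3-PR stack in a throwaway repo (hypothetical branch/file names).
repo=$(mktemp -d) && cd "$repo" && git init -q -b main
git config user.email dev@example.com && git config user.name dev
git commit -q --allow-empty -m "initial"      # main: the merge target

git switch -qc schema main                    # stage 1: database schema (PR #1)
echo "CREATE TABLE users (...);" > schema.sql && git add . && git commit -qm "schema"

git switch -qc service schema                 # stage 2: service layer, on top of #1
echo "class UserService: ..." > service.py && git add . && git commit -qm "service"

git switch -qc api service                    # stage 3: API endpoints, on top of #2
echo "GET /users" > routes.txt && git add . && git commit -qm "api"

git log --oneline main..api                   # lists the three stacked commits
```

Each branch would be opened as its own PR with the previous branch as its base, so reviewers see only that stage’s diff.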

For reviewers, stacking reduces cognitive load when each diff stays under 200-400 lines. Instead of evaluating a 1,500-line feature branch, you review five 300-line PRs that each focus on a single architectural layer. The context is narrower, the surface area smaller. But this only works if each PR stands on its own: the database schema PR must be reviewable without understanding the service layer changes coming next. When PRs are truly independent architectural layers, reviewers can evaluate each in isolation. When they’re not, you’re reading three PRs to properly evaluate one, and the cognitive load savings disappear.

For authors, stacking removes the wait. You don’t context-switch to “something else” while your PR sits in review. You keep building on top of unmerged work, maintaining momentum. Similar to how batching groups reviewer work to protect focus, stacking groups author work to maintain flow. Both trade coordination overhead for focused time.

When Stacking Breaks Down

The rebase cascade is where velocity gains evaporate. Review feedback on PR #1 means updating not just one branch but rebasing PR #2 on the updated #1, then #3 on the updated #2, and so on. Each rebase triggers merge conflicts. Each conflict resolution requires re-running CI. A 5-PR stack with feedback on the base can mean spending an afternoon resolving conflicts across the stack and watching CI re-run five times. The time saved not waiting for reviews gets consumed by rebase overhead.
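The cascade can be made concrete in a throwaway repo (hypothetical branch names pr1 through pr3). One subtlety: after pr1’s history is rewritten, a plain `git rebase pr1` from pr2 would replay pr1’s old commits too, so `git rebase --onto` is used to transplant only each branch’s own commits:

```shell
# Throwaway repo with a 3-branch stack pr1 -> pr2 -> pr3 (hypothetical names).
repo=$(mktemp -d) && cd "$repo" && git init -q -b main
git config user.email dev@example.com && git config user.name dev
git commit -q --allow-empty -m "initial"
for b in pr1 pr2 pr3; do
  git switch -qc "$b" && echo "$b" > "$b.txt" && git add . && git commit -qm "$b"
done

# Review feedback on PR #1 rewrites its history:
old_pr1=$(git rev-parse pr1); old_pr2=$(git rev-parse pr2)  # remember old tips
git switch -q pr1 && git commit -q --amend -m "pr1 (review fixes)"

# The cascade: transplant each branch's own commits onto the rewritten parent.
git rebase -q --onto pr1 "$old_pr1" pr2   # move pr2's commits onto new pr1
git rebase -q --onto pr2 "$old_pr2" pr3   # then pr3's onto new pr2, and so on
```

With five branches and real conflicts at each step, each of those rebases is a manual conflict-resolution session followed by a CI run.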

Tooling can automate some of this, but Git wasn’t designed for stacking. The Pragmatic Engineer analysis of 15 million pull requests found that while stacking keeps developers “in the flow,” it requires mastering interactive rebasing as an “almost daily habit.” Without automation or deep Git expertise, the coordination overhead exceeds the velocity gains.
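Some of that automation now ships with Git itself: since Git 2.38, `git rebase --update-refs` rebases an entire stack from its tip and moves the intermediate branch refs along with the commits. A sketch with hypothetical branch names:

```shell
# Throwaway repo: stack pr1 -> pr2 -> pr3 off main (hypothetical names).
repo=$(mktemp -d) && cd "$repo" && git init -q -b main
git config user.email dev@example.com && git config user.name dev
echo base > base.txt && git add . && git commit -qm "initial"
for b in pr1 pr2 pr3; do
  git switch -qc "$b" && echo "$b" > "$b.txt" && git add . && git commit -qm "$b"
done

# main moves ahead (someone else's work merges):
git switch -q main && echo more >> base.txt && git commit -qam "mainline work"

# Git 2.38+: rebase the whole stack from the tip; --update-refs also moves
# the intermediate branch refs (pr1, pr2), replacing the manual cascade.
git switch -q pr3
git rebase -q --update-refs main
```

Setting `git config rebase.updateRefs true` makes this the default, which is close to the “almost daily habit” the analysis describes, minus the manual bookkeeping.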

Stack depth is the leverage point. Friends at Meta and Google tell me they keep stacks shallow. When PRs merge quickly, as in trunk-based development with same-day review expectations, it’s rare to have more than two or three stacked PRs at once; the stack never gets deep enough to create rebase nightmares. But on teams with slower review cycles, stacks grow. A 10-PR stack means 10 rounds of CI, 10 potential rebase operations, and dependencies to track across 10 branches. At that depth, the coordination costs compound faster than the velocity gains. Stack depth is a metric worth monitoring on any team that supports stacked PRs.

Also, squash-merging breaks stacking in a specific way. When you squash and merge the base PR, GitHub creates a new commit on main containing all the changes, but the dependent PR still points to the now-deleted feature branch. The second PR shows as “merged” but its changes disappear since it merged into an orphaned branch, not main. Git doesn’t recognize the relationship between the original commits and the squashed commit, so dependent branches appear diverged even though the code is theoretically merged. The workflow breaks silently. The PR interface looks fine, but the code isn’t in main.
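The failure and the standard manual fix can be reproduced in a throwaway repo (hypothetical branch names). After the base PR is squash-merged, the dependent branch still carries the base’s original commits, so only its own commits should be transplanted onto main:

```shell
# Throwaway repo reproducing the squash-merge failure: pr1 -> pr2 stack.
repo=$(mktemp -d) && cd "$repo" && git init -q -b main
git config user.email dev@example.com && git config user.name dev
echo base > base.txt && git add . && git commit -qm "initial"
git switch -qc pr1 && echo one > one.txt && git add . && git commit -qm "pr1"
git switch -qc pr2 && echo two > two.txt && git add . && git commit -qm "pr2"

# "Squash and merge" PR #1: main gets a brand-new commit containing pr1's
# changes, with no history relationship to pr1's original commit.
git switch -q main
git merge -q --squash pr1 && git commit -qm "pr1 (squashed)"

# pr2 still contains pr1's original commit, so it appears diverged from main.
# Fix: transplant only pr2's own commits (pr1..pr2) onto main.
git rebase -q --onto main pr1 pr2
```

After the `--onto` rebase, pr2 is exactly one commit ahead of main again; the duplicated pr1 commit is dropped because its changes are already in the squash commit.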

Conditional Effectiveness

Stacking works when you have large features that decompose naturally into independent architectural layers, fast merge cycles that keep stacks shallow, and tooling that automates rebasing. It fails when your PRs already merge quickly (the overhead isn’t justified), when your team uses squash-merge workflows (orphaned branches), or when you’re stacking without automation (manual rebasing becomes the bottleneck).

The architecture of your review process (how you organize work, not just how you execute it) determines whether stacking helps or hurts. Like batched review sessions, stacking is conditionally effective. It works under specific constraints and breaks down when those constraints don’t hold.

The tipping point isn’t clear. The numbers I hear are that 2-3 PRs are manageable, 5+ get problematic, and 10+ require dedicated tooling. But those aren’t empirical thresholds. They’re anecdotal convergence. There’s no controlled study comparing review effectiveness at different stack depths, no data on optimal stack size for different team sizes or review speeds.

So when two teams report opposite outcomes from stacking, both can be right. The practice works under specific conditions and fails when those don’t hold. The question isn’t whether stacking is good architecture. It’s whether your review velocity supports shallow stacks, whether your tooling automates the rebase cascades, and whether your features decompose naturally into independent layers. Those constraints tell you if stacking makes sense, not whether you’re disciplined enough to manage the dependencies.