The explosion of AI-generated pull requests isn’t a capacity problem, it’s a specification problem. When developers complain about AI creating massive PRs with tangled changes, the issue isn’t that the AI generated too much code. The issue is that the task itself was poorly scoped.

This pattern extends far beyond LLMs. Small-task constraints improve automation accuracy across many domains, from robotics to continuous integration to industrial process control. When automating any process, decomposition creates the clarity that automation systems need to function reliably.

Why Small Batches Force Better Automation

When DORA research measured software delivery performance, elite teams achieved lead times under one day while low performers needed one to six months for the same work. Batch size made a bigger difference than talent or tools. From my time at AWS, I learned that small batches reduce cycle times, variability, and risk while creating forcing functions that harden automation workflows.

The mechanism is exponential, not linear. Don Reinertsen cites one organization where doubling project duration increased slippage by 16x. Each additional line of code adds more relationships across the system, making integration and deployment progressively harder. Frequent deployment doesn’t just deliver value faster; it creates pressure to automate the deployment process itself because manual steps become obvious bottlenecks.

This is why services like SQS and Kafka are so popular. Their architecture naturally encourages small batch sizes. I’ve also seen teams automate 60-90 minutes of manual work down to a single second with event-driven applications. The key wasn’t the technology itself but decomposing the multi-step process into clear, automatable units. The automation didn’t eliminate complexity, it revealed it. We had to make implicit knowledge explicit, which turned out to be the valuable part.
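
To make that concrete, here’s a rough Go sketch of the shape that decomposition takes. A channel stands in for the actual broker (SQS or Kafka), and the order-processing steps are invented for illustration; the point is that each event is one small, independently automatable unit instead of a 60-90 minute runbook.

```go
package main

import (
	"fmt"
	"log"
)

// Event is one small, self-contained unit of work. The fields are hypothetical;
// a real system carries whatever the step needs and nothing more.
type Event struct {
	OrderID string
	Step    string
}

// handlers maps each step to a small, independently automatable function.
// In production these would be separate consumers on an SQS queue or Kafka
// topic; a buffered channel stands in for the broker to keep the sketch runnable.
var handlers = map[string]func(Event) error{
	"validate": func(e Event) error { fmt.Println("validated", e.OrderID); return nil },
	"charge":   func(e Event) error { fmt.Println("charged", e.OrderID); return nil },
	"notify":   func(e Event) error { fmt.Println("notified", e.OrderID); return nil },
}

func main() {
	queue := make(chan Event, 10)

	// What used to be one long manual runbook becomes explicit, ordered events,
	// each small enough to automate, retry, and reason about on its own.
	for _, step := range []string{"validate", "charge", "notify"} {
		queue <- Event{OrderID: "order-123", Step: step}
	}
	close(queue)

	for e := range queue {
		if err := handlers[e.Step](e); err != nil {
			log.Printf("step %s failed for %s: %v", e.Step, e.OrderID, err)
		}
	}
}
```

Swapping the channel for a real broker doesn’t change the structure; the decomposition was the valuable part.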

The LLM Granularity Gap

Recent research quantifies what I’m hearing from developers daily: LLMs perform dramatically better with smaller, well-defined tasks. The Tree-of-Code framework, tested across 10 LLMs, found that program-level granularity with execution-based reflection achieved an 18 percentage point improvement with fewer turns of back-and-forth.

The performance cliff appears around large token counts. Multiple independent studies confirm that even with perfect information retrieval, LLM performance degrades 13.9% to 85% as context length increases within models’ claimed limits. I’ve found this isn’t really a limitation of any particular model, though. It’s how attention behaves across long sequences in general, and the same dynamic shows up well beyond AI.

Task specification quality compounds these effects significantly. Analysis of 558 incorrect code snippets reveals that logical mistakes and missing conditions increase with vague task requirements. A separate study found that structured prompt patterns (think plan mode and templates) achieved effectiveness scores of 97.33, while unstructured questions required substantially more iterations. The pattern I keep seeing is that vague specifications create semantic drift that no amount of model capability can overcome.
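
For illustration, here’s roughly what I mean by a structured template, sketched in Go. The TaskSpec fields and the example values are my own framing, not something taken from either study.

```go
package main

import (
	"fmt"
	"os"
	"text/template"
)

// TaskSpec is a hypothetical structure for a task prompt: filling it in forces
// the author to state the constraints a vague request would leave to the model.
type TaskSpec struct {
	Goal       string
	Files      []string
	Acceptance []string
	OutOfScope []string
}

var promptTmpl = template.Must(template.New("spec").Parse(`Goal: {{.Goal}}
Files in scope:{{range .Files}} {{.}}{{end}}
Acceptance criteria:{{range .Acceptance}}
- {{.}}{{end}}
Explicitly out of scope:{{range .OutOfScope}}
- {{.}}{{end}}
`))

func main() {
	// Example values only; a real template would come from your team's conventions.
	spec := TaskSpec{
		Goal:       "Add JWT token validation to the /api/login endpoint",
		Files:      []string{"src/auth.go"},
		Acceptance: []string{"invalid tokens get a 401 response", "existing handler tests still pass"},
		OutOfScope: []string{"token issuance", "session refresh"},
	}
	if err := promptTmpl.Execute(os.Stdout, spec); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```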

Anthropic’s agent skills framework codifies this as “progressive disclosure,” where agents load information incrementally as needed rather than dumping everything into context upfront. Even with 200k token windows, practitioners find that agents “get drunk” on excessive context and perform better with focused threads. The limitation isn’t window size, it’s cognitive load. I’ve seen this play out repeatedly at Square, where breaking context into smaller chunks produces measurably better results.

I’ve mentioned a bit too often that GitClear’s analysis of 211 million lines of code from Google, Microsoft, and Meta repositories shows a fourfold increase in code clones coinciding with AI adoption. Those clones are symptoms of developers giving AI poorly scoped tasks, not of AI generating bad code. When the task definition is “add user authentication” instead of “add JWT token validation to the /api/login endpoint with 401 response on invalid tokens from file src/auth.go,” the AI fills the specification gap with assumptions. The resulting code works, sometimes quite well if your codebase is predictable enough. But it doesn’t integrate cleanly because the task lacked the constraints that would force coherent design decisions.
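
For contrast, here’s a minimal sketch of what the narrow version of that spec pins down, following its wording literally. The validateJWT helper is a hypothetical stand-in for whatever JWT library the codebase already uses.

```go
// Hypothetical src/auth.go from the narrow task specification above.
package auth

import (
	"net/http"
	"strings"
)

// validateJWT stands in for real parsing and signature verification with
// whatever JWT library the project already depends on.
func validateJWT(token string) bool {
	return token != "" // placeholder check only
}

// RequireJWT wraps the /api/login handler and returns 401 on invalid tokens,
// which is exactly the scope the specification pins down and nothing more.
func RequireJWT(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		token := strings.TrimPrefix(r.Header.Get("Authorization"), "Bearer ")
		if !validateJWT(token) {
			http.Error(w, "invalid token", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```

The vague “add user authentication” version gives the model no equivalent boundary to respect.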

Trunk-Based Development as Constraint System

Trunk-based development codifies these constraints: short-lived branches (hours to days, not weeks), frequent commits, and a stable trunk. The ideal is ten-second builds with commits every five minutes, which become forcing functions that make unclear specifications immediately obvious.

When empirical studies analyzed nearly 100K commits, they found that refactoring frequency actually decreased after CI adoption, suggesting that automation constraints change how developers decompose work. High-performing teams maintain high commit frequency with small incremental units, while projects with infrequent commits and long-lived broken builds struggle regardless of tool sophistication.

At AWS, we achieved a 30-80% decline in operational incidents, though not through better monitoring alone. In fact, monitoring often increased our pager count at first because we started finding previously hidden issues. What actually helped was using that data as a forcing function. When leadership pushed for smaller, testable changes, the constraints themselves drove quality improvements, at least for teams that leaned into them rather than fought them.

Beyond Software

Robotics research demonstrates that task decomposition enables more than 90% success rates for complex operations. Lean manufacturing (or The Toyota Way) has codified this knowledge for decades: process decomposition with detailed time studies enables automation of value-added steps while eliminating non-value-added waste.

Robotic Process Automation (RPA) initiatives fail at high rates, but the root cause is usually that organizations treat automation as an IT project instead of a business transformation. When you automate a messy process, you get automated mess. The successful RPA deployments I’ve seen started with process decomposition and clarity before writing a single line of automation code. The automation doesn’t fail; the specification does. And no amount of AI capability fixes poorly defined work.

Design Constraints as Forcing Functions

Constraints force clarity. When you must deploy multiple times daily, you can’t leave ambiguous edge cases for “later.” When you must keep branches short-lived, you can’t defer integration decisions. When you must keep commits small, you can’t bundle unrelated changes. Each constraint eliminates degrees of freedom that obscure specification gaps.

AI amplifies this dynamic because it will happily generate code for vague specifications; it just won’t be the code you needed. There will be a large PR at the end, but that large PR isn’t the problem; it’s the symptom. The problem is that the task specification left enough ambiguity that the AI filled the gaps with assumptions and built a solution out of them. It’s still remarkable, but we’ll see better output if we build more clarifying questions into these agents.

The practical implication isn’t “make everything smaller” but rather “use size constraints to reveal specification gaps.” If you can’t break a task into deployable increments (preferably sub-400 LOC), that’s a signal you don’t understand the task well enough yet. If the AI generates sprawling changes, that’s feedback your specification was incomplete. Use the constraint as a diagnostic tool.
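
One way to use the constraint as a diagnostic is a small gate in CI, sketched here in Go. The 400-line budget and the origin/main base branch are assumptions to adjust; the value is in the signal, not the exact number.

```go
// Command diffgate fails when a branch's diff exceeds a changed-line budget,
// turning "this task is too big" into an early, cheap signal.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"regexp"
	"strconv"
)

const budget = 400 // changed lines (insertions + deletions); tune to taste

func main() {
	out, err := exec.Command("git", "diff", "--shortstat", "origin/main...HEAD").Output()
	if err != nil {
		fmt.Fprintln(os.Stderr, "git diff failed:", err)
		os.Exit(2)
	}

	// Shortstat output looks like: " 3 files changed, 120 insertions(+), 45 deletions(-)"
	changed := 0
	for _, m := range regexp.MustCompile(`(\d+) (insertion|deletion)`).FindAllStringSubmatch(string(out), -1) {
		n, _ := strconv.Atoi(m[1])
		changed += n
	}

	if changed > budget {
		fmt.Printf("diff is %d changed lines (budget %d): split the task before anyone, human or AI, reviews it\n", changed, budget)
		os.Exit(1)
	}
	fmt.Printf("diff is %d changed lines: within budget\n", changed)
}
```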

Where This Goes

Current AI coding tools already decompose internally. Plan modes break tasks into steps; agents fork contexts for subtasks. But they’re working with whatever specification you gave them. If that initial spec was vague, all the internal decomposition just propagates ambiguity faster.

The interesting shift isn’t building smarter AI that handles vague specs better. It’s recognizing that constraints reveal what we actually want. When Claude Code forces you to break a feature into sub-400 LOC increments, or when trunk-based development makes you split commits hourly, that friction isn’t overhead. It’s the specification work you’d have to do anyway, just surfaced earlier when it’s cheaper to fix.

I keep seeing teams treat large AI-generated PRs as a success (“look how much it built!”) when it’s actually a code smell for specification debt. The code works, sure. It just doesn’t integrate cleanly because the task lacked constraints that would force coherent decisions. Better task decomposition isn’t about productivity, even though it’s touted as such. It’s about making implicit intent explicit before automation fills gaps with assumptions.

Turns out the bottleneck isn’t AI capability. It’s how clearly we can say what we actually want.