Most engineering teams I’ve worked with treat progressive rollouts as an unqualified good. Roll out to 1% of traffic, then 5%, then 10%: it feels obviously safer than big-bang deployments. But a couple of recent incidents, together with Lorin Hochstein’s analysis of fine-grained rollouts, are making me realize that once again we might be optimizing rollouts for feeling safe rather than being safe.

The Observability Paradox

There are two general strategies for doing a progressive rollout. One strategy is coarse-grained, where you stage your deploys across domains, for example deploying the new functionality to one geographic region at a time. The second strategy is fine-grained, where you define a ramp-up schedule (e.g., 1% of traffic to the new thing, then 5%, then 10%, etc.).
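To make the contrast concrete, here is a minimal sketch of how the two schedules might be expressed as data. The stage scopes, percentages, and bake times are hypothetical illustrations, not taken from any particular deployment system.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    """One step of a progressive rollout."""
    scope: str         # what the stage targets: a region, or a traffic percentage
    bake_minutes: int  # how long to watch metrics before promoting further

# Coarse-grained: stage the deploy across domains (here, geographic regions).
coarse_grained = [
    Stage(scope="eu-west-1", bake_minutes=60),
    Stage(scope="us-east-1", bake_minutes=60),
    Stage(scope="ap-southeast-2", bake_minutes=60),
]

# Fine-grained: ramp a traffic percentage within the same domain.
fine_grained = [
    Stage(scope="1% of traffic", bake_minutes=30),
    Stage(scope="5% of traffic", bake_minutes=30),
    Stage(scope="10% of traffic", bake_minutes=30),
    Stage(scope="50% of traffic", bake_minutes=30),
    Stage(scope="100% of traffic", bake_minutes=0),
]
```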

I’ve seen both approaches, and the difference in incident response is stark. When AWS rolled out changes region by region, problems were obvious: our dashboards would light up like a Christmas tree while other regions stayed green. When we moved toward percentage-based rollouts for the same types of changes to reduce customer impact, problems became much harder to spot.

This was also prevalent with experiments before, but there it was harder to spot the core problem, since experiments carried much less risk and so it wouldn’t get flagged as often. But I’ve certainly seen the counterintuitive insight that fine-grained rollouts can actually increase the total impact of failures:

When you do a fine-grained progressive rollout, if something has gone wrong, then the impact will get smeared out over time, and it will be harder to identify the rollout as the relevant change by looking at a dashboard.

When incident responders can’t quickly identify the cause of a problem, they burn through their cognitive capacity trying to correlate weak signals across multiple systems. The mental overhead of parsing smeared-out failure signals is enormous.

The Detection Window Problem

The real trade-off becomes clear when you consider timing:

Whether fine-grained rollouts are a net win depends on a number of factors whose values are not obvious, including:

  • the probability you detect a problem during the rollout vs. after the rollout
  • how much longer it takes to diagnose the problem if it is not caught during the rollout
  • your cost model for an incident
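As a back-of-the-envelope way to reason about these factors together, here is a minimal sketch of an expected-cost comparison. The linear cost model and every number in the example are made-up assumptions for illustration, not calibrated figures.

```python
def expected_incident_cost(
    p_detect_during_rollout: float,  # probability the problem is caught mid-rollout
    cost_if_caught_during: float,    # incident cost when caught during the rollout
    extra_diagnosis_hours: float,    # additional diagnosis time when caught after
    cost_per_hour_after: float,      # cost model: impact per hour once fully rolled out
) -> float:
    """Rough expected cost of a bad change under a given rollout strategy.

    The three parameters mirror the factors listed above; the linear cost model
    is a deliberate simplification.
    """
    cost_if_caught_after = extra_diagnosis_hours * cost_per_hour_after
    return (
        p_detect_during_rollout * cost_if_caught_during
        + (1 - p_detect_during_rollout) * cost_if_caught_after
    )

# A fine-grained rollout limits blast radius while it runs (cheap if caught early),
# but smeared signals mean a lower detection probability and slower diagnosis later.
fine = expected_incident_cost(0.4, 1_000, 6.0, 5_000)
# A coarse-grained stage hurts more when it fails, but the failure is easier to spot.
coarse = expected_incident_cost(0.9, 10_000, 1.0, 5_000)
print(f"fine-grained: {fine:,.0f}  coarse-grained: {coarse:,.0f}")
```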

I’ve noticed that most teams assume they’ll catch problems during the rollout, but that’s often not the case. Complex distributed systems failures don’t always manifest immediately. A memory leak might take hours to cause visible problems. A race condition might only trigger under specific load patterns. Database connection pool exhaustion might not surface until the next day’s peak traffic.

When these delayed problems do surface, the fine-grained rollout has already completed, and you’re left with a needle-in-haystack debugging exercise.

The Observability Tax

It’s easy to imagine the benefits when problems are caught early; it’s harder to imagine the scenarios where the problem isn’t caught until later, and how much harder things get because of it.

This observation hits on one of the common anti-patterns in incident response: we optimize for the visible failure modes (obvious problems during the rollout) while under-weighting the invisible ones (delayed detection with smeared signals).

The hard part is that you need tooling that can slice metrics by rollout cohort, which many teams don’t have. You need alert thresholds that can detect small percentage increases in error rates, which is surprisingly difficult to tune without drowning in false positives. And, no, anomaly detection doesn’t work… yet.
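For a sense of what slicing by rollout cohort looks like, here is a minimal sketch that compares the error rate of the rollout cohort against the control. The function name, the minimum-traffic guard, and the 20% relative threshold are all hypothetical assumptions; a real alert would need tuning against your own traffic.

```python
def cohort_error_regression(
    control_errors: int, control_requests: int,
    treatment_errors: int, treatment_requests: int,
    min_requests: int = 5_000,           # guard against noisy, tiny cohorts
    max_relative_increase: float = 0.2,  # alert if treatment errors rise >20% vs. control
) -> bool:
    """Return True if the rollout cohort's error rate looks meaningfully worse.

    A 1% cohort produces roughly 1% of the error volume, so absolute thresholds
    on the aggregate dashboard barely move; comparing rates between cohorts is
    what makes the regression visible.
    """
    if treatment_requests < min_requests or control_requests < min_requests:
        return False  # not enough traffic in the cohort to say anything yet
    control_rate = control_errors / control_requests
    treatment_rate = treatment_errors / treatment_requests
    if control_rate == 0:
        return treatment_rate > 0
    return (treatment_rate - control_rate) / control_rate > max_relative_increase
```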

In contrast, coarse-grained rollouts leverage observability infrastructure you already have. Think geographic region dashboards, data center health checks, customer segment metrics: these are table stakes for most systems.

Connection to Systems Thinking

Fine-grained rollouts add another dimension of complexity to an already complex system. You’re not just managing the interaction between services; you’re managing the interaction between different versions of services running simultaneously.

The operational overhead compounds. Instead of “is the payment service healthy?” you’re asking “is the payment service healthy for rollout cohort A vs cohort B vs the control group?” That’s a more complex question that requires more sophisticated tooling and mental models.

I still remember seeing this in practice with AWS Lambda many years ago, and how much it blew my mind back then. I know AWS overall has an excruciatingly high bar for operations, but Lambda takes it to a whole other tier, even to this day.

Safety Theater vs Safety

I suspect many teams adopt fine-grained rollouts because they feel like best practice and seem better than nothing, not because they’ve thought through the specific trade-offs for their system. This is making me wonder what else we might be taking for granted. Some questions I’m going to dig into more:

  • How do we quantify the observability tax of different rollout strategies?
  • What’s the minimum viable observability infrastructure needed to make fine-grained rollouts actually safer?
  • How do you communicate these trade-offs to stakeholders who intuitively prefer “more gradual” rollouts?

It will be interesting to learn which failure modes we are actually trying to prevent, and which rollout strategy best preserves our ability to detect and respond to those failures.