Everyone’s chasing AI productivity numbers, but I keep feeling we’re optimizing for the wrong thing. The industry is fixated on percentage gains (20% faster coding, 40% more throughput) when the real question is whether AI helps us make better technical decisions. Take this post, How Tech Companies Measure the Impact of AI on Software Development by Laura Tacho:

There’s a gap between what leaders need to know and what’s being measured and talked about, and this measurement gap only widens as new tools and capabilities hit the market.

This gap feels familiar. It’s the same disconnect I’ve seen with every productivity initiative that focused on velocity over outcome quality. We measure what’s easy to count rather than what actually matters.

The Theater of Productivity

The current AI metrics landscape reads like productivity theater:

You don’t need totally new metrics to measure AI impact. Instead, focus on what’s always mattered. Is AI helping your organization get better at those things?

This is the right instinct, yet most organizations immediately contradict it by obsessing over AI-specific metrics that ignore fundamental engineering effectiveness. It’s like measuring lines of code again, just with more sophisticated tooling.

Hannah Foxwell captured something important about this productivity obsession in her talk “Making AI Agents work for you”:

I don’t want to be made better and faster. I don’t think there’s ever been a point in my career where I got to the bottom of my to-do list… For most people and most teams, I think maybe we should be focusing on doing the right work, and doing it better.

When I see claims of massive productivity gains, my first question isn’t “how much faster?” but “faster at what?” Are we shipping features that customers actually need? Are we reducing technical debt or creating more of it? Are we making decisions that will hold up under scrutiny?

What Actually Moves the Needle

You need to collect system metrics AND self-reported ones in order to get robust data, by which I mean data that covers dimensions of speed, quality, and maintainability.
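To make that pairing concrete, here’s a minimal sketch of what putting system metrics next to self-reported ones could look like. Every metric name, survey question, and number below is an illustrative assumption of mine, not something from Tacho’s post.

```python
# Hypothetical sketch: pairing system metrics with self-reported survey scores
# so each dimension (speed, quality, maintainability) gets two lenses on it.
# All field names, sources, and numbers are illustrative assumptions.

from statistics import mean

# System-side signals, e.g. pulled from your CI/CD pipeline and issue tracker.
system_metrics = {
    "speed": {"median_pr_cycle_time_hours": 26.0},
    "quality": {"change_failure_rate": 0.12},          # failed deploys / total deploys
    "maintainability": {"avg_review_rounds_per_pr": 2.4},
}

# Self-reported signals from a short recurring survey (1-5 scale).
survey_responses = {
    "speed": [4, 3, 4, 5, 3],           # "I can ship changes without unnecessary delays"
    "quality": [3, 3, 4, 2, 3],         # "I trust what we ship to work in production"
    "maintainability": [2, 3, 2, 3, 2], # "Code I touch is easy to understand and change"
}

for dimension, system_view in system_metrics.items():
    self_reported = mean(survey_responses[dimension])
    print(f"{dimension}: system={system_view}, self-reported avg={self_reported:.1f}/5")
    # Divergence between the two views is often the interesting signal:
    # fast cycle times alongside low maintainability scores, for example.
```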

That framing is getting closer to something useful. But even here, I’ve found that the most meaningful improvements come from things that are hard to measure directly:

  • Better technical discussions during design reviews
  • Fewer critical production issues because someone caught an edge case
  • Architecture decisions that actually make future work easier
  • Code that other engineers can understand and modify confidently

These outcomes matter more than whether someone used AI to write a function faster; otherwise you’re just improving your throughput of bugs.

The Measurement Trap

Track metrics that keep each other in check: Change Failure Rate alongside a measure of speed like PR throughput.

The balanced scorecard approach makes sense in theory, but I’ve seen it become its own form of theater. Teams consistently end up gaming whatever metrics are tracked, so you want to track things you’d still want to improve even if they were gamed.
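Still, if you do go the balanced route, the pairing itself is cheap to compute. A rough sketch of reporting the two together (the data shapes, field names, and numbers are my assumptions, not how Tacho or any particular team actually does it):

```python
# Hypothetical sketch of "metrics that keep each other in check":
# a speed metric (merged PRs per week) is only ever reported next to a
# quality counterweight (Change Failure Rate). Data shapes are assumed.

def change_failure_rate(deployments: list[dict]) -> float:
    """Share of deployments that caused a failure needing remediation."""
    if not deployments:
        return 0.0
    failed = sum(1 for d in deployments if d["caused_incident"])
    return failed / len(deployments)

def pr_throughput_per_week(merged_prs: int, weeks: int) -> float:
    return merged_prs / weeks

deployments = [
    {"caused_incident": False},
    {"caused_incident": True},
    {"caused_incident": False},
    {"caused_incident": False},
]

cfr = change_failure_rate(deployments)
throughput = pr_throughput_per_week(merged_prs=38, weeks=4)

# Always report the pair together: rising throughput with a rising CFR is a
# very different story than rising throughput with a flat CFR.
print(f"PR throughput: {throughput:.1f}/week, Change Failure Rate: {cfr:.0%}")
```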

Microsoft uses the concept of a “bad developer day” (BDD) to assess the impact of AI tools. It’s a real-time look into the toil and friction of devs’ day-to-day work, whereas developer experience surveys provide a lagging indicator.

This feels more honest. Instead of measuring how much code got written, ask whether the day felt productive. Did you solve real problems or just fight with tools? Did you leave work feeling like you moved something meaningful forward?
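If you wanted to track something similar on your own team, a tiny daily pulse would do. The question wording, fields, and aggregation below are my own illustration; I’m not claiming this is how Microsoft actually computes BDD.

```python
# Hypothetical sketch of a daily "was today a bad developer day?" pulse.
# One record per developer per day; wording and fields are illustrative.

from collections import Counter

pulse_responses = [
    {"dev": "a", "bad_day": True,  "reason": "flaky CI"},
    {"dev": "b", "bad_day": False, "reason": None},
    {"dev": "c", "bad_day": True,  "reason": "waiting on reviews"},
    {"dev": "a", "bad_day": True,  "reason": "flaky CI"},
    {"dev": "b", "bad_day": False, "reason": None},
]

bad_days = [r for r in pulse_responses if r["bad_day"]]
bad_day_rate = len(bad_days) / len(pulse_responses)
top_reasons = Counter(r["reason"] for r in bad_days).most_common(3)

# The rate tells you how much friction there is; the reasons tell you where.
print(f"Bad developer day rate: {bad_day_rate:.0%}")
print("Top sources of friction:", top_reasons)
```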

The Questions Worth Asking

Rather than “how much faster are we coding with AI?”, maybe we should ask:

  • Are we catching more design problems before they become production issues?
  • Do code reviews surface better technical objections?
  • Are we making fewer decisions that we later regret?
  • When something breaks, do we understand why faster?

I’m curious whether organizations that focus on decision quality over decision speed end up with different AI adoption patterns. My guess is they’re probably more selective about when and how they use these tools, which might actually lead to better long-term outcomes than trying to AI-optimize everything.