Reliable Verifiers and the Testing Problem

I wrote about how infrastructure manages the ETTO trade-off through tests, checkpoints, automated verification and very soon after found Marc Brooker’s recent post on reliable verifiers:

AI is going to be most effective in problem spaces where there are what the authors call reliable verifiers. Where, in effect, we can do automated hill climbing towards a low-ambiguity solution… Much of the next decade is going to be defined by finding better techniques to build these reliable verifiers where none existed before. If you’re a software engineer, what I’m saying here is testing is going to be the most important thing.

That ETTO infrastructure (tests, automated checks) only works because it creates reliable verifiers that can offload thoroughness from humans to systems. Without them, you’re stuck doing manual verification regardless of how fast agents generate code. The trade-off doesn’t go away; it just shifts from “can I review this carefully enough?” to “do my verifiers actually check what matters?”

The challenge surfaces immediately when you try to build these verifiers. Brooker does well in highlighting the two failure modes the need addressing:

Prevent overfitting. Evaluating against narrow workloads lead to algorithm failures like overfitting, where the solutions either hard-code behaviors or overfit to specific traces. Prevent reward hacking. Reward hacking occurs when solutions exploit evaluator loopholes rather than solving the intended problem.

This is why comprehensive test suites matter more than good test coverage. You need tests that capture the actual problem space, not just the paths you happened to think of. I’ve seen agents confidently claim tests passed when they actually failed. The tests output (in RSpec and Cargo across different codebases) were ambiguous enough that the agent couldn’t tell the difference. A reliable verifier needs unambiguous signals.

Meta’s Diff Risk Score shows what production-scale reliable verification looks like. They trained an AI model to predict whether a code change will cause a severe incident:

By training the model on historical data on diffs that have caused SEVs in the past, we can predict the riskiness of an outgoing diff to cause a SEV. Diffs that are beyond a particular threshold of risk can then be gated.

The model itself is interesting (they built iDiffLlama-13B, which is change-aware and outperforms larger general models), but what makes it reliable is the infrastructure around it like rigorous SEV review processes, manually identified root causes by domain experts, and tunable gating thresholds. Microsoft’s AI code reviewer follows similar patterns: automated checks within existing workflow, with humans retaining control over suggested changes.

These systems work because they’ve solved the verifier problem for their specific domains. But Brooker identifies the harder challenge:

Much of academic systems work already seems bottlenecked on selecting which problems to pursue… It takes more experience, more insight, and more vision to choose problems than to optimize on them. It takes more taste to reject noise, and avoid following dead ends, than to follow the trend.

AI can optimize once you’ve defined the problem and built the verifier. What it can’t do is figure out which problems are worth solving in the first place. In more harsher terms, it doesn’t have taste. Meta knew they needed to predict SEV-causing diffs because they had years of incident data showing that was a bottleneck. The verifier (DRS) automated the optimization, but choosing to build it required understanding their system’s failure modes at an organizational level.

I’m noticing this pattern with agent-assisted development. The infrastructure that work (tests, checkpoints in LLM sessions, risk scoring in Meta & Microsoft’s work) all assume you’ve already figured out what correctness means for your system. Building reliable verifiers is tractable. Knowing what to verify requires the kind of judgment that comes from repeatedly seeing systems fail in production.

Maybe that’s why the ETTO principle feels so structural. You can build mechanisms to manage the your trade-offs, but only after you understand which thoroughness checks actually matter and which ones pose hallucination risks. The automation amplifies your ability to verify, but it doesn’t tell you what questions to ask.