AI tools have changed what “writing code” means. Generation is cheap now. Verification is the constraint. And if your process for verification was “hope the reviewer catches it,” AI just amplified that problem.

Simon Willison frames this clearly in Your Job Is to Deliver Code You Have Proven to Work:

the … engineer, empowered by some class of LLM tool, who deposits giant, untested PRs on their coworkers, or open source maintainers, and expects the “code review” process to handle the rest.

This is rude, a waste of other people’s time, and is honestly a dereliction of duty as a software developer.

Your job is to deliver code you have proven to work.

Anyone can prompt an LLM to produce a thousand-line patch now. The value has moved entirely to proving it works.

Proof Has Two Non-Optional Steps

I like this point-blank framing:

The first is manual testing. If you haven’t seen the code do the right thing yourself, that code doesn’t work. If it does turn out to work, that’s honestly just pure chance.

Manual testing means getting the system into a known state, exercising the change, and demonstrating that it works. Paste terminal commands and their output into the PR. Record a screen capture. Show your reviewers you’ve actually seen this code function correctly; at minimum, confirm that you watched the test pass.
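To make that concrete, here’s a minimal sketch of the kind of evidence you might paste into a PR description. Everything in it is hypothetical: `myapp.dates.parse_dates` stands in for whatever function your change actually touches.

```python
# Hypothetical manual check: exercise the changed function against known
# inputs, eyeball the output, and paste the command plus output into the PR.
from myapp.dates import parse_dates  # hypothetical module this change touches

for raw in ["2024-02-29", "not a date", "1999-12-31"]:
    print(raw, "->", parse_dates(raw))

# $ python manual_check.py
# 2024-02-29 -> 2024-02-29
# not a date -> None
# 1999-12-31 -> 1999-12-31
```

Seeing it behave correctly by hand is only step one, which leads well into his next point: bundle the change with a test that proves it works.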

The second step in proving a change works is automated testing. This is so much easier now that we have LLM tooling, which means there’s no excuse at all for skipping this step.

That test should fail if you revert the implementation. That said, automation lets you scale your testing; it shouldn’t replace basic manual testing. In my observation, teams that skip manual testing because “the automated tests are working fine” almost always regret it. The manual step forces you to actually see the behavior, not just assert that it exists.
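To show what that “fails if you revert” bar looks like, here is a hypothetical regression test pinning the same parse_dates change as above; roll the implementation back and the leap-day case breaks.

```python
# Hypothetical regression tests for the parse_dates change. Reverting the
# implementation makes the leap-day assertion fail, so the test proves the fix.
import datetime

from myapp.dates import parse_dates  # hypothetical module under test


def test_parse_dates_handles_leap_day():
    assert parse_dates("2024-02-29") == datetime.date(2024, 2, 29)


def test_parse_dates_returns_none_for_garbage():
    assert parse_dates("not a date") is None
```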

Make the Agent Prove It First

Coding agents change the economics of verification. Tools like Claude Code can execute code, run tests, take screenshots, and iterate on failures in a tight loop. The verification steps that used to be too expensive or ambiguous to do systematically (“does this UI change work on mobile?” or “does this error message make sense in context?”) become tractable when the agent can execute them repeatedly until they pass.
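Stripped down, that loop is not magic. As a rough sketch (not any particular tool’s internals), it amounts to: propose a change, run the checks, and if they fail, feed the failure output back in and try again.

```python
# Rough sketch of an agent-style verify loop: apply a proposed change, run the
# test suite, and hand any failure output back to the proposer until it's green.
import subprocess
from typing import Callable


def verify_loop(propose_patch: Callable[[str], None], max_attempts: int = 5) -> bool:
    feedback = ""
    for _ in range(max_attempts):
        propose_patch(feedback)  # e.g. an LLM call that edits files (hypothetical)
        result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
        if result.returncode == 0:
            return True  # green tests are the evidence, not the diff itself
        feedback = result.stdout + result.stderr  # failures become the next prompt
    return False
```

The point isn’t the loop itself; it’s that the loop ends on evidence (passing checks) rather than on the model declaring success.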

To master these tools, you also need to learn how to get them to prove their changes work.

This is where treating specifications as compiler targets becomes more important. When you specify “add feature X and prove it works by showing me the test output and a screenshot of the behavior,” the agent isn’t finished until it has provided that evidence.

Agents already extend test suites without much prompting if your project has tests. They reuse patterns from existing tests. Keep your test code organized with patterns you like, and the agent will follow them.
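For example, a consistent table-driven pattern like this (hypothetical names again) is easy for an agent to extend: a new behavior becomes one more row, not a new testing style.

```python
# Hypothetical table-driven test the agent can extend by adding rows.
import datetime

import pytest

from myapp.dates import parse_dates  # hypothetical module under test


@pytest.mark.parametrize(
    ("raw", "expected"),
    [
        ("2024-02-29", datetime.date(2024, 2, 29)),  # leap day
        ("1999-12-31", datetime.date(1999, 12, 31)),
        ("not a date", None),                        # invalid input
    ],
)
def test_parse_dates(raw, expected):
    assert parse_dates(raw) == expected
```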

The Human Provides Accountability

A computer can never be held accountable. That’s your job as the human in the loop.

Almost anyone can prompt an LLM to generate a thousand-line patch and submit it for code review. That’s no longer valuable. What’s valuable is contributing code that is proven to work.

AI amplifies your existing process. If your process is “write code, submit PR, let review catch problems,” you now get more unverified code at lower cost. If your process is “prove the code works before submitting,” you get faster verification through the agent’s execution loop.

You ensure the code has been proven to work before anyone else has to look at it. That’s the human’s job now.

Where the Bottleneck Moves

I’m curious whether this shifts how we think about code review capacity. If verification moves earlier, before the PR, does that reduce the cognitive load cliff reviewers hit? Or does it just shift the bottleneck to “reviewing the proof” instead of “finding the bugs”?

And what does “good taste in testing code” look like when the agent writes most of the tests? Simon Willison mentions this as a differentiating senior skill, but I don’t think it’s exclusive to seniors. I find it’s more about thinking in systems and designing test architectures that agents can extend: choosing the right abstraction level, the right fixtures, the right assertion patterns.

In summary, with the advent of agents, testing isn’t the bottleneck anymore. But that doesn’t mean you get to create a new one for your reviewers by going full beans on code generation without confirming you’ve got working code. Next time you submit a PR, show your homework.