Simon Willison on Showboat and Rodney:
I’m still a little startled at how much of my coding work I get done on my phone now, but I’d estimate that the majority of code I ship to GitHub these days was written for me by coding agents driven via that iPhone app.
Same. As a test, over the past week I’ve merged 12 PRs on quamina-rs through Claude Code for Web, barely ever sitting at a laptop. Some of these changes were nasty: NFA epsilon closure precomputation, arena budget enforcement hardening across 15 files, a correctness fix for JSON type distinction, and performance work closing a 35% gap on field matching. Several PRs touched 8+ files across multiple architectural layers.
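For flavour, here’s roughly what that first item involves. This is a minimal sketch with made-up types, not quamina-rs’s actual code, but it shows the shape of the work: precompute, for every NFA state, the set of states reachable through epsilon transitions alone, so matching never has to chase those edges on the fly.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Hypothetical NFA state id; the real representation differs.
type StateId = usize;

/// For each state, compute the set of states reachable by following
/// only epsilon transitions (a breadth-first walk per start state).
fn precompute_epsilon_closures(
    epsilon_edges: &HashMap<StateId, Vec<StateId>>,
    num_states: usize,
) -> Vec<HashSet<StateId>> {
    (0..num_states)
        .map(|start| {
            let mut closure = HashSet::from([start]);
            let mut queue = VecDeque::from([start]);
            while let Some(state) = queue.pop_front() {
                for &next in epsilon_edges.get(&state).into_iter().flatten() {
                    // Only enqueue states we haven't already reached.
                    if closure.insert(next) {
                        queue.push_back(next);
                    }
                }
            }
            closure
        })
        .collect()
}
```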
I credit my confidence in merging them to CI more than to the model, the prompts, or the agentic loops. I have a few parallel jobs run on every PR: Miri catches undefined behavior, Kani formally verifies encoding invariants, four fuzz targets hammer the rule pattern lifecycle, and standard benchmarks track CPU and memory. This shrinks the list of things I have to check by hand, but I do still check. I tried pushing code without reviewing it and found plenty of sloppy behaviour and correctness bugs that have been “fun” to untangle, so going back to the humble but trusty code review has been incredibly valuable.
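To make the Kani piece concrete, here’s a minimal sketch of a proof harness over a toy encode/decode pair. The invariants and types in quamina-rs are different, but the shape is the same: take a symbolic input, state a property, and let the tool prove it for every possible value rather than a sampled few.

```rust
// A sketch of a Kani proof harness; the encode/decode pair here is
// invented for illustration, not quamina-rs's real encoding.
#[cfg(kani)]
mod verification {
    // Toy encoding: pack a u32 into the upper bits of a u64 with a tag byte.
    fn encode(value: u32) -> u64 {
        ((value as u64) << 8) | 0x5A
    }

    fn decode(encoded: u64) -> u32 {
        (encoded >> 8) as u32
    }

    #[kani::proof]
    fn encoding_round_trips() {
        // kani::any() yields a symbolic value, so this round-trip property
        // is verified for all u32 inputs, not just the ones a test picks.
        let value: u32 = kani::any();
        assert_eq!(decode(encode(value)), value);
    }
}
```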
I do find Showboat interesting since it touches on the verification tax, but I think there’s still an open question of how quickly you can close the loop and how much it costs you in tokens and time. Still, it’s a good peek at where software development is headed.
On a related note, StrongDM’s Software Factory “scenarios” idea definitely hits hard for me:
The word “test” has proven insufficient and ambiguous. A test, stored in the codebase, can be lazily rewritten to match the code. The code could be rewritten to trivially pass the test. We repurposed the word scenario to represent an end-to-end “user story”, often stored outside the codebase (similar to a “holdout” set in model training), which could be intuitively understood and flexibly validated by an LLM.
Their explicit position that “code must not be reviewed by humans” is too extreme to follow in my professional space, but it’s directionally accurate. I don’t think it’s novel, either; Simon labelled this in the past as “vibe engineering”. The idea is sticky because it works and is very relatable. I repeatedly find that AI tools reward practices we already knew were good: testing, CI, version control. The stuff that was engineering hygiene before is increasingly table stakes.
What surprised me was how it compounds. A fuzz target I added in week one caught a real issue in week two that neither the agent nor I would have spotted during review. Each CI investment makes the next agent session a bit cheaper to verify, which means more sessions, which means more reasons to add scaffolding. And the more I run changes through these checks, the fewer tokens I waste going in circles. It’s scaffolding that works for everyone, agents and humans.
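For anyone curious what one of those fuzz targets looks like, here’s a minimal cargo-fuzz sketch of the add → match → delete lifecycle. Matcher, add_pattern, matches_for_event, and delete_pattern are hypothetical stand-ins (stubbed out so the example compiles), not quamina-rs’s actual API.

```rust
// fuzz/fuzz_targets/pattern_lifecycle.rs
#![no_main]
use libfuzzer_sys::fuzz_target;

fuzz_target!(|data: &[u8]| {
    // Treat the fuzz input as an arbitrary candidate pattern and push it
    // through the full lifecycle; panics or broken invariants show up as
    // crashes the fuzzer can minimize and replay.
    if let Ok(pattern) = std::str::from_utf8(data) {
        let mut matcher = Matcher::new();
        if matcher.add_pattern("fuzz-rule", pattern).is_ok() {
            let _ = matcher.matches_for_event(pattern.as_bytes());
            matcher.delete_pattern("fuzz-rule");
        }
    }
});

// Hypothetical stand-in so the sketch is self-contained.
struct Matcher;
impl Matcher {
    fn new() -> Self { Matcher }
    fn add_pattern(&mut self, _name: &str, _pattern: &str) -> Result<(), ()> { Ok(()) }
    fn matches_for_event(&self, _event: &[u8]) -> Vec<String> { Vec::new() }
    fn delete_pattern(&mut self, _name: &str) {}
}
```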