I keep quamina-rs in sync with the Go-based quamina upstream through a semi-automated process. There’s a marker file that tracks the last-synced commit. A weekly CI job queries GitHub for what’s new on the Go side and files an issue when we fall behind. Because I’m limited on time and keen to experiment, I use AI agents for the actual work. Most of the time they handle it fine. They read the Go commit, compare to the Rust code, port or skip. It’s a fairly reliable routine as long as you don’t turn a blind eye.
The Example
Take the Go version’s PR #482. It’s on “precomputed epsilon closures” in the finite automata engine. The agent checked the diff, saw that Go was using a generation counter for its epsilon closure, noted that Rust had a different construct (a SparseSet) for the same purpose, and reported back that it was going to skip because we’d already implemented this.
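For context on what the agent pattern-matched against: a sparse set is the classic trick for tracking "which states are in the current closure" with O(1) clear, which is what a generation counter also buys you. Here's a minimal sketch; the names and API are illustrative, not quamina-rs's actual types.

```rust
/// Minimal sparse-set sketch: membership over a bounded range of state IDs
/// with O(1) insert, contains, and clear. Illustrative only; not the
/// actual quamina-rs implementation.
struct SparseSet {
    dense: Vec<u32>,  // members, packed in insertion order
    sparse: Vec<u32>, // sparse[id] = index of `id` within `dense`, if present
}

impl SparseSet {
    fn new(capacity: usize) -> Self {
        SparseSet { dense: Vec::new(), sparse: vec![0; capacity] }
    }

    fn insert(&mut self, id: u32) {
        if !self.contains(id) {
            self.sparse[id as usize] = self.dense.len() as u32;
            self.dense.push(id);
        }
    }

    fn contains(&self, id: u32) -> bool {
        let i = self.sparse[id as usize] as usize;
        i < self.dense.len() && self.dense[i] == id
    }

    /// O(1): just truncate `dense`; `sparse` never needs zeroing,
    /// which is the whole point of the structure.
    fn clear(&mut self) {
        self.dense.clear();
    }
}
```

Both a sparse set and a generation counter solve the same problem: resetting membership between traversals without an O(n) wipe. That equivalence is exactly the surface similarity the agent latched onto.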
Except PR #482 had nothing to do with which data structure tracks the closure. It was about when you compute it. Go had moved the entire epsilon closure calculation from match time to build time, replacing a depth-first search on every byte per state with a one-time precomputation. My Rust code was still doing that DFS on every single traversal during matching, and the agent read right past it because it matched on the noun (“epsilon closure”) instead of the verb (“precompute”).
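The build-time-vs-match-time distinction is easy to show in miniature. This is a hedged sketch, not quamina's actual code: the `Nfa` shape and function names are invented, but the move is the same — run the DFS once per state when the automaton is built, so matching becomes a table lookup instead of a fresh search per byte.

```rust
type StateId = usize;

/// Hypothetical NFA shape, for illustration only: epsilon[s] lists the
/// states reachable from s by a single epsilon edge.
struct Nfa {
    epsilon: Vec<Vec<StateId>>,
}

/// Build time: run the depth-first search once per state and store the
/// result. Match time then does `&closures[s]` — a slice lookup —
/// instead of repeating this DFS on every traversal.
fn precompute_closures(nfa: &Nfa) -> Vec<Vec<StateId>> {
    (0..nfa.epsilon.len())
        .map(|start| {
            let mut seen = vec![false; nfa.epsilon.len()];
            let mut stack = vec![start];
            let mut closure = Vec::new();
            while let Some(s) = stack.pop() {
                if seen[s] {
                    continue;
                }
                seen[s] = true;
                closure.push(s);
                stack.extend(nfa.epsilon[s].iter().copied());
            }
            closure.sort_unstable();
            closure
        })
        .collect()
}
```

The tradeoff the PR actually made: more work and memory at build time, in exchange for removing a per-byte search from the hot matching path.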
Then I typed six words: “initial assessment may have been superficial.” The agent re-read the PR, found what it had missed, and implemented the precomputation. I was “absolutely right!” and the NFA-heavy benchmarks dropped ~10%. The agent’s first pass had confidently waved it through as not necessary.
I’ve been chewing on that interaction because it wasn’t a fluke. The agent made a specific kind of error that sits in the subtext of many critiques of AI agents today. “Epsilon closure” -> “SparseSet” -> “already handled” is a fast, automatic association: pattern matching on surface features. When folks tested LLMs against semantic-preserving mutations (dead code, renamed variables, misleading comments), models that had previously “solved” those problems failed 78% of the time after the mutations. They read code like you skim a headline. They match on tokens, not on meaning. My agent did the same thing with a PR diff.
It reminds me of Thinking, Fast and Slow by Daniel Kahneman. In Kahneman’s dual process theory, System 1 is the fast, automatic, pattern-matching part of your brain that finishes the sentence before you’ve read it. System 2 is the slow, deliberate, effortful part that catches when the fast answer is wrong. I think the agents are System 1. I was System 2. And those six words were a System 2 override.
But the Kahneman analogy is vocabulary, not the mechanism. When the model produces “skip” with high confidence, its internal state treats that output as resolved. Self-critique asks it to evaluate its own output with the same context that produced it. So it has the same biases, same surface-level pattern match. It’s asking the system that confidently said “already handled” to check whether it should have said “already handled.” An external signal like “superficial” doesn’t ask the model what to look for. It tells the model its confidence was miscalibrated. That shifts the model from confirmation mode to search mode. It re-reads the PR diff with a different prior and looks for what it missed rather than confirming what it found.
This is the same asymmetry Kahneman describes: System 2 doesn’t redo System 1’s work, it catches when System 1’s answer needs redoing. There’s research backing this up: model performance degrades significantly when models are asked to self-critique, while external verification, even minimal external verification, produces significant gains. When I said “superficial,” I wasn’t asking the agent to self-reflect. I was providing an external signal that its confident answer was wrong and that it should look again.
I’ve seen this hold up across 200+ quamina-rs sessions that I’ve archived using a personal tool. Every time I flagged work as superficial, incomplete, or misread, it triggered deeper re-investigation and improved overall accuracy. The agent only pushed back when it was right; otherwise it would eventually figure out where it went wrong.
You can design for this.
The sync infrastructure is one example. The marker file, the CI job, `just upstream` to kick off a session: each removes a decision the agent would otherwise burn context on. Early in the project, sessions would open with “let me figure out what we’re doing.” Now they open with the first unreviewed commit. It sounds trivial, but it’s the difference between wasting the first ten minutes of a meeting on logistics and getting straight to work. You can also preempt shallow reads in the session prompt by saying “read the Go source directly, don’t trust past interpretations.”
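The marker-file logic itself is tiny, which is rather the point. A sketch of the core decision, under the assumption (mine, not the post's) that the real workflow gets the upstream history from git or the GitHub API:

```rust
/// Sketch of the marker-file decision: given the last-synced commit and
/// the upstream history (oldest to newest), return what's unreviewed.
/// Illustrative only; the real workflow fetches history from GitHub.
fn unreviewed<'a>(marker: &str, upstream: &'a [&'a str]) -> &'a [&'a str] {
    match upstream.iter().position(|c| *c == marker) {
        Some(i) => &upstream[i + 1..], // everything after the marker
        None => upstream,              // marker unknown: review everything
    }
}
```

A session then opens with `unreviewed(...)[0]` as its first task instead of spending context rediscovering where the port left off.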
Chunk sizing continues to matter more than I expected. I learned this when I started reviewing Go commits individually rather than in batches. An agent reviewing one commit reads the full diff. An agent reviewing five skims. The epsilon closure miss happened during a batch review. The agent had enough context to pattern-match on the noun but not enough attention budget to catch the verb. Chunking is a System 2 decision about what granularity keeps System 1 reliable.
How you phrase a nudge matters more than you’d expect. “Superficial” reliably triggered deeper analysis. But I’ve also seen agents over-correct. Perhaps they’re increasingly trained to treat any question as a leading question, so asking “is this deep enough?” can trigger a full redo when all you needed was a second look. I’ve had better luck with statements that clarify my intent than with questions.
The type of task changes the equation entirely. When the agent gets profiling data and I say “reduce these allocations,” that’s pattern completion. I can delegate with high confidence, and the results are often good enough to ship. But judgment is different. Should we precompute? What does that do to the build/match tradeoff? The agent generates an answer that sounds exactly as confident as the right answer, which is worse than generating nothing. Looking back at the Go sync, the optimizations the agent correctly skipped had clear structural reasons: Rust already had a better data structure, or the optimization target didn’t exist in Rust. The one it missed was the one requiring architectural judgment. So I let agents run on pattern-matching tasks (allocation reduction, boilerplate generation, test porting) but stay in the room for ambiguous judgment calls like “does this optimization apply to our architecture?”
For what it’s worth, automating System 1 work isn’t new. Code generation, boilerplate scaffolding, linters, and formatters have been taking mechanical pattern-matching off developers’ plates for decades. What’s changed is how much territory agents now cover. Code review, test generation, porting, and refactoring used to require enough judgment that they stayed with humans. Now agents can handle the pattern-matching parts of those tasks too, which pushes more of the human’s remaining value into System 2, more often. Think architectural judgment, but also taste and sarcasm detection. And the boundary keeps moving. The build-time vs match-time question that tripped up my agent today might be routine for a future model, which just means I’ll need to find the next layer of judgment that still requires me. It’s worth reiterating that moving the layers doesn’t move the responsibilities. The human still owns the outcome, but I think the System 1/2 frame is more useful than framings that treat all outputs as equally suspect.
For my sync-with-Go work, the AI assistance works because I’ve deliberately stopped doing System 1 work. I haven’t read most of the Go diffs myself, other than for sanity checks and a few critical files. I’m not faster because I code faster; I’m faster because I don’t read diffs I don’t need to read.
Where I’m less sure is where the boundary sits for other projects, and how fast the goalposts move as the tooling improves. But the next time your agent confidently says “skip,” it might be worth typing six more words and getting your System 2 involved.