Teams iterate on prompts the way we used to debug code before debuggers existed. Change something, run it, see if it’s better. No isolation. No clear signal about which change mattered. Just trial and error until something seems to work.
The Prompt Report left an impression on me with its case study: 47 recorded steps of iteration, in which accidentally duplicating an email in the prompt improved performance. It’d be nice if we knew why that helped, but as far as I can tell, no one does. The engineer was also optimizing for the wrong metric because they lacked domain expertise. Tight iteration, but blind to which signals matter.
This is the pattern I keep seeing. In “AI adoption” efforts, teams rewrite entire prompts and ask “did it get better?” But prompts are interconnected systems where changing one part ripples through the entire interaction. The model interprets later instructions through the lens of earlier ones. Add a constraint in one section, and the interpretation of a previous instruction shifts. When you change multiple things at once, you can’t tell which change caused the improvement or regression.
Even when teams do isolate changes, they often iterate at the wrong level. They tweak words when the structure is broken, switching “analyze” to “examine” when the real problem is using zero-shot prompting for a task that needs few-shot examples to massage the context. Or they restructure from few-shot to chain-of-thought when the issue is ambiguous task intent. It’s like debugging a race condition by renaming variables, or fixing an algorithm bug by reformatting the code. The feedback is scoped, but pointed at the wrong abstraction layer.
Martin Fowler’s work on anchoring AI to reference applications shows what scoped feedback looks like. Initial attempts at drift detection “went a bit overboard with lots of irrelevant comparisons.” Narrowing the feedback loop through scoped examples helped. Changing prompts or models didn’t (at least it didn’t for me in a nearly identical situation). Instead of checking every possible difference, they made the success criteria binary: either the pattern exists or it doesn’t. Once the boundaries get clearer, the noise in your feedback loop drops.
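Here’s a minimal sketch of what that kind of binary check could look like, assuming the reference patterns can be expressed as simple structural rules. The rule names and sample code are hypothetical, not Fowler’s actual checks; the point is the shape of the signal.

```python
import re

# Hypothetical drift rules: each one is a name plus a predicate that returns
# True if the expected pattern exists in the generated code. No scoring,
# no open-ended diffing against the reference app, just pass/fail per rule.
DRIFT_RULES = {
    "uses_repository_layer": lambda src: "class OrderRepository" in src,
    "no_raw_sql_in_handlers": lambda src: not re.search(r"SELECT\s+.*\s+FROM", src, re.IGNORECASE),
}

def check_drift(generated_source: str) -> dict[str, bool]:
    """Return a binary verdict per rule instead of a pile of irrelevant comparisons."""
    return {name: rule(generated_source) for name, rule in DRIFT_RULES.items()}

if __name__ == "__main__":
    sample = "class OrderRepository:\n    def fetch(self, order_id): ...\n"
    print(check_drift(sample))  # {'uses_repository_layer': True, 'no_raw_sql_in_handlers': True}
```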
Simon Willison frames this as designing for problems with “clear success criteria” where trial and error makes sense. The key phrase is “clear success criteria”. You need unambiguous verification. Git history provides that. Test suites provide that. “Did it get better?” doesn’t. This is why reliable verifiers matter. Without clear signals, you’re optimizing blind even if you’re iterating fast. That’s why using AI to write tests gives mixed results, but having AI write code against a well-tested codebase goes much more smoothly.
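As a concrete sketch, assuming a pytest-based project, the verifier can be as blunt as the test suite’s exit code. The `tests_pass` helper below is hypothetical, not from any of the cited sources:

```python
import subprocess

def tests_pass(test_path: str = "tests/") -> bool:
    """Run the existing test suite and return an unambiguous pass/fail signal.

    `pytest -q` exits 0 only when every collected test passes, so the exit
    code is the whole verdict: no judgment call about whether output "got better".
    """
    result = subprocess.run(["pytest", "-q", test_path], capture_output=True, text=True)
    return result.returncode == 0

# Usage: gate each AI-generated change on the verifier before keeping it.
# if tests_pass():
#     keep the change
# else:
#     revert and try the next variant
```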
The Prompt Report’s case study ended with DSPy’s automated prompt optimization outperforming manual iteration. DSPy works because it runs bounded experiments with clear metrics. Change one variable, measure the impact, iterate. Same pattern as A/B testing or feature flags with proper instrumentation.
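This isn’t DSPy’s actual API, just a rough Python sketch of the pattern it automates: hold everything fixed, vary one element of the prompt, and score the variants against the same evaluation set with an unambiguous metric. The `Variant` class, `call_model` parameter, and example templates are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Variant:
    name: str
    prompt_template: str  # identical across variants except for the one element under test

def run_experiment(
    variants: list[Variant],
    eval_set: list[tuple[str, str]],      # (input, expected) pairs
    call_model: Callable[[str], str],     # whatever function calls your model
    metric: Callable[[str, str], bool],   # clear, binary success criterion
) -> dict[str, float]:
    """Score each variant on the same eval set, so the only moving part is the variant."""
    scores = {}
    for v in variants:
        hits = sum(
            metric(call_model(v.prompt_template.format(input=x)), expected)
            for x, expected in eval_set
        )
        scores[v.name] = hits / len(eval_set)
    return scores

# Example: isolate one variable (zero-shot vs. one worked example), keep everything else fixed.
variants = [
    Variant("zero_shot", "Classify the ticket as bug or feature.\nTicket: {input}\nLabel:"),
    Variant("one_shot",  "Classify the ticket as bug or feature.\n"
                         "Ticket: App crashes on login\nLabel: bug\n"
                         "Ticket: {input}\nLabel:"),
]
# scores = run_experiment(variants, eval_set, call_model=my_llm, metric=lambda out, gold: gold in out.lower())
```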
I’ve noticed command compression follows this pattern. Early on you’re verbose because you don’t know which parts matter. You can’t scope feedback when you don’t know the variables. As you observe outputs and refine, you compress to the essential signals. The compression is learned efficiency: the product of running scoped experiments hundreds of times. You’ve calibrated which signals actually matter and which are noise.
Most teams iterate on prompts the way we used to debug before we had tools: changing everything and hoping. The discipline isn’t writing better prompts. It’s treating prompt iteration like debugging: isolate the variable, measure the impact, verify the signal. Scope your feedback or stay blind.