
Count before you fix

The four-step discipline that stops your next LLM prompt rewrite from silently regressing production.


The cheapest diagnostic always beats the cleverest fix.

You ship a probabilistic system. A user files a bug. The reply on the screenshot is obviously wrong, so you rewrite the prompt and deploy.

A week later, three new bugs land in different shapes.

You are now in the worst place an AI engineer can be: patching long tails by hand, one screenshot at a time, with no idea whether your fixes are net positive across the rest of traffic.

I shipped exactly this mistake on a live project last month. A broad prompt rule designed to fix one extraction failure regressed two unrelated cases and dropped the overall pass rate from 74% to 69%. A narrower, precondition-gated version of the same rule recovered to 75%.

This post gives you the four-step discipline that costs about 15 minutes per bug and keeps a prompt fix from silently regressing the rest of your traffic: count, instrument, narrow, evaluate. In that order.

The bug you are looking at might not be a bug

Deterministic systems have edge cases. Probabilistic systems have long tails. Long-tail bugs are normal — your first job is to figure out whether the screenshot in front of you is one of them or whether something is actually broken.

Recent example. An agent had a terminal route that emitted a specific URL to the user. On a live test, the URL was not appearing. I ran the conversation to what I thought was the final turn, saw no URL, and concluded the routing fix had not landed.

It had. I had missed an interstitial confirmation step that inserts one extra turn before any terminal route fires.

When I dumped the live state, the story was clear:

Signal                      Expected    Actual
Terminal signal extracted   true        true
Route conditions met        true        true
Route chosen                complete    complete
Visible URL at this turn    yes         no (confirmation prompt)

Extraction was correct. Routing was correct. The reply at that turn was correct. I was reading the wrong turn.

The rule: inspect downstream state, not surface output. The model's reply is the last thing that happens. Everything that decided it is sitting in a database. If you cannot dump per-turn state on demand, build that before you fix anything else. Fifteen minutes of plumbing pays back across every false-bug investigation for the rest of the project.
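Here is roughly what that plumbing can look like. This is a minimal sketch, not the project's actual code: the `turn_state` table, its columns, and the `dump_turns` helper are all hypothetical, and SQLite stands in for whatever database holds your state.

```python
import json
import sqlite3

# Hypothetical schema: one row per turn, extracted state stored as JSON.
SCHEMA = """
CREATE TABLE IF NOT EXISTS turn_state (
    conversation_id TEXT NOT NULL,
    turn INTEGER NOT NULL,
    state_json TEXT NOT NULL,
    reply TEXT NOT NULL
)
"""


def dump_turns(conn: sqlite3.Connection, conversation_id: str) -> list[dict]:
    """Return every turn of one conversation, oldest first, with decoded state."""
    rows = conn.execute(
        "SELECT turn, state_json, reply FROM turn_state "
        "WHERE conversation_id = ? ORDER BY turn",
        (conversation_id,),
    ).fetchall()
    return [
        {"turn": turn, "state": json.loads(state_json), "reply": reply}
        for turn, state_json, reply in rows
    ]


def demo() -> list[dict]:
    """Build a two-turn conversation in memory and dump it (illustrative data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO turn_state VALUES (?, ?, ?, ?)",
        [
            ("c1", 0, json.dumps({"terminal_signal": False, "route": "collect"}),
             "What do you need?"),
            ("c1", 1, json.dumps({"terminal_signal": True, "route": "complete"}),
             "Please confirm."),
        ],
    )
    return dump_turns(conn, "c1")


if __name__ == "__main__":
    for row in demo():
        print(row["turn"], row["state"], "->", row["reply"])
```

Reading the state next to the reply, turn by turn, is what makes a case like the confirmation step visible: the route is already `complete` while the reply is still the confirmation prompt.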

One bad classification poisons every turn after it

Routed agents have a structural failure mode worth naming.

A routed agent picks a "what kind of conversation is this" branch, then specialises the next reply. An early misclassification does not just hurt the current turn — it biases every reply that follows. The route, once chosen, becomes the dominant context for the model.

The fix belongs in the architecture, not in the prompt. Re-derive the route on every turn from the latest merged state, not from history. A strong signal at turn five (say, an explicit terminal signal from the user) must dominate the weak signal at turn zero that originally chose the route.

If you find yourself writing prompts to remind the model to reconsider, you are papering over a graph that locks in too early.
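A sketch of what re-deriving the route per turn can look like. The field names (`terminal_signal`, `intent`) and route names are assumptions made up for illustration; the point is only that `derive_route` is a pure function of the merged state and never sees the previous route.

```python
def merge_state(state: dict, extracted: dict) -> dict:
    """Merge newly extracted fields into the state; later non-None values win."""
    merged = dict(state)
    merged.update({k: v for k, v in extracted.items() if v is not None})
    return merged


def derive_route(state: dict) -> str:
    """Pick the route from the merged state only (hypothetical routes)."""
    if state.get("terminal_signal"):
        return "complete"  # the strong signal dominates everything else
    if state.get("intent") == "billing":
        return "billing"
    return "general"


def run_turns(extractions: list[dict]) -> list[str]:
    """Replay per-turn extractions and re-derive the route after each one."""
    state: dict = {}
    routes = []
    for extracted in extractions:
        state = merge_state(state, extracted)
        routes.append(derive_route(state))
    return routes
```

With this shape, a conversation that starts on `billing` and later produces a terminal signal moves to `complete` on that turn, without any prompt text asking the model to reconsider.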

A test for the invariant: pick a strong-signal phrase from your domain — one that should decisively flip the route regardless of prior context. Inject it as the final user turn over 20 different transcripts. Check that the resulting route matches the strong signal in every case. If the route varies based on what came before, your routing function is reading history when it should be reading state.
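The invariant test above can be sketched as a small harness. Everything here is illustrative: `route_fn` stands in for your own routing entry point, and the two toy routers exist only to show one implementation that passes and one that locks in too early.

```python
from typing import Callable

Transcript = list[str]  # user turns only, to keep the sketch short


def check_strong_signal(
    route_fn: Callable[[Transcript], str],
    transcripts: list[Transcript],
    strong_phrase: str,
    expected_route: str,
) -> list[int]:
    """Return the indices of transcripts where the strong signal did not win."""
    failures = []
    for i, transcript in enumerate(transcripts):
        # Inject the strong-signal phrase as the final user turn.
        route = route_fn(transcript + [strong_phrase])
        if route != expected_route:
            failures.append(i)
    return failures


def state_router(transcript: Transcript) -> str:
    """Toy router that reads only the latest signal."""
    return "complete" if "cancel my account" in transcript[-1].lower() else "general"


def sticky_router(transcript: Transcript) -> str:
    """Toy router that locks in whatever the first turn suggested."""
    if "invoice" in transcript[0].lower():
        return "billing"
    return state_router(transcript)
```

Run it over your 20 transcripts. An empty failure list means the invariant holds; any index in the list is a transcript whose history overrode the strong signal.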

The fix you are tempted to ship is almost certainly wrong

You have just spent an hour staring at one bad transcript. Your hands want to rewrite a prompt. The rewrite will feel productive. It will not be.

A prompt rewrite to fix one bad transcript has three problems:

  1. You do not know how often the transcript shape appears in real traffic. It might be 0.1% of cases.
  2. You cannot measure the rewrite's effect without an eval set that covers the rest of the distribution.
  3. Probabilistic systems regress in non-obvious places. A rule that fixes case A often breaks cases B through F because you have added context the model now applies everywhere. That is the 74 → 69 story above.

The safe move: the smallest fix that addresses the precondition of the bug, not the bug itself.
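One way to read "precondition-gated" in code: the rule only enters the prompt when a cheap, deterministic check says it applies. This is a sketch under assumptions. The rule text, the keyword check, and the function names are invented for illustration and are not the rule from the project described above.

```python
BASE_PROMPT = "Extract the fields listed below from the user's message."

# Hypothetical rule text; the real rule is not given in this post.
DATE_RULE = "If the message contains a relative date, resolve it against today's date."


def has_relative_date(message: str) -> bool:
    """Cheap, deterministic precondition check (illustrative keyword match)."""
    lowered = message.lower()
    return any(word in lowered for word in ("tomorrow", "yesterday", "next week"))


def build_prompt(message: str) -> str:
    """Append the rule only when its precondition holds for this message."""
    parts = [BASE_PROMPT]
    if has_relative_date(message):
        parts.append(DATE_RULE)
    return "\n".join(parts)
```

Messages that do not meet the precondition get the unchanged base prompt, so the rule cannot leak into cases B through F.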

Ship a log line, not a fix

When you spot a long-tail bug, the highest-leverage move is almost always instrumentation, not a fix. One line of structured logging at the suspect branch, with enough context to make the bug countable later:

[route-diag] quota_exhausted {"tenant_tier":"pro","feature":"bulk_export","remaining":0}

Stable prefix so it is greppable. JSON payload so it is parseable. No behaviour change. No risk to routing.
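A minimal sketch of emitting that line with Python's standard `logging` and `json` modules. The logger name and the helper names `format_diag` and `log_diag` are my own choices for illustration.

```python
import json
import logging

logger = logging.getLogger("route_diag")


def format_diag(event: str, **context) -> str:
    """Build '[route-diag] <event> <compact JSON>' with keys in call order."""
    payload = json.dumps(context, separators=(",", ":"))
    return f"[route-diag] {event} {payload}"


def log_diag(event: str, **context) -> None:
    """Emit the diagnostic line. It only logs and changes no behaviour."""
    logger.info(format_diag(event, **context))


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(message)s")
    log_diag("quota_exhausted", tenant_tier="pro", feature="bulk_export", remaining=0)
```

Keeping the formatting in its own function means the exact line shape can be unit-tested without touching the logging setup.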

Then wait. Two weeks of real traffic will tell you what no amount of staring at one transcript can: how often this happens, in what shapes, on what cohort.

  • 12 hits across a million sessions → defer.
  • 12% of one segment → fix, and now you know the cohort well enough to test the fix.
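Counting is then a few lines over the collected logs. A sketch, assuming the log lines carry the `[route-diag]` prefix followed by an event name and a JSON object; `count_by` and `hit_rate` are hypothetical helper names.

```python
import json
from collections import Counter

PREFIX = "[route-diag] "


def count_by(lines, event: str, key: str) -> Counter:
    """Count occurrences of one diagnostic event, grouped by a payload field."""
    counts: Counter = Counter()
    for line in lines:
        start = line.find(PREFIX)
        if start == -1:
            continue  # not a diagnostic line
        name, _, payload = line[start + len(PREFIX):].strip().partition(" ")
        if name != event:
            continue
        try:
            fields = json.loads(payload)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed lines
        if not isinstance(fields, dict):
            continue
        counts[fields.get(key)] += 1
    return counts


def hit_rate(hits: int, sessions: int) -> float:
    """Fraction of sessions that hit the event; 0.0 when there are no sessions."""
    return hits / sessions if sessions else 0.0
```

Grouping by a cohort field such as `tenant_tier` is what separates "12 hits across a million sessions" from "12% of one segment".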

Threshold I have internalised: do not spend more than 15 minutes designing a fix for a bug you have not counted. If you are drafting prompt language for case one of N where N is unknown, stop. Ship the log. Come back in 14 days.

Why this matters more for AI products than traditional software

In a deterministic system, one bad screenshot is usually one bug. Fix it, write a regression test, move on. The cost of a wrong fix is bounded — the test catches it.

In a probabilistic system, one bad screenshot might be:

  • a real bug,
  • a measurement artefact,
  • a long-tail event that will never recur, or
  • the visible tip of a structural problem affecting half your users.

You cannot tell which from one transcript. The cost of a wrong fix is unbounded — regressions appear silently in distributional shifts you do not have an eval for.

The discipline therefore:

  • Every fix needs an eval.
  • Every fix is gated on counting the cohort first.
  • Instrumentation ships before fixes do.
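"Every fix needs an eval" can start as small as this. A sketch that compares per-case pass/fail results for the current prompt and the candidate fix; the case ids, the result format, and the zero-regression default in `should_ship` are assumptions, not a standard.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of eval cases that passed."""
    if not results:
        raise ValueError("no eval results to score")
    return sum(results.values()) / len(results)


def compare(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Compare two runs over the cases they share, listing per-case changes."""
    shared = sorted(baseline.keys() & candidate.keys())
    return {
        "baseline": pass_rate({c: baseline[c] for c in shared}),
        "candidate": pass_rate({c: candidate[c] for c in shared}),
        "regressed": [c for c in shared if baseline[c] and not candidate[c]],
        "fixed": [c for c in shared if not baseline[c] and candidate[c]],
    }


def should_ship(report: dict, max_regressions: int = 0) -> bool:
    """Gate: the candidate must not score lower or regress more cases than allowed."""
    return (
        report["candidate"] >= report["baseline"]
        and len(report["regressed"]) <= max_regressions
    )
```

The per-case `regressed` list matters as much as the headline rate. A fix can hold the overall number steady while trading one cohort for another, and only the per-case view shows it.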

The prediction: if you inherit an AI product where every bug fix is a prompt rewrite based on one screenshot, you are inheriting a system that will regress on its next deploy. I have not seen one exception.

The real lesson

Three sentences.

When a probabilistic system looks broken, inspect the state, not the output. Do not patch a long tail you have not counted. Build the boring infrastructure that makes you slow on individual bugs and fast on overall quality.

The cheapest diagnostic always beats the cleverest fix.


If you are firefighting one bad screenshot at a time right now, send me the transcript and the prompt rule you are tempted to ship, and I will tell you which of the four steps above you are skipping. [email protected].