Stop fixing AI bugs by rewriting the prompt

Great prompts shape behaviour. They cannot replace permissions, tool boundaries, execution policy, or observability.

You have tuned the system prompt for three weeks and the product still breaks in production. A user hits an edge case, the model invents a fact, an action fires that should have asked for confirmation, and your team is back in the doc adding another paragraph that starts with "Never." In 2023, a lawyer filed a brief in Mata v. Avianca containing six fabricated case citations produced by ChatGPT and was sanctioned by a federal judge, as reported by CBS News. No prompt rewrite would have caught that — there was no verification layer, no tool boundary between draft and submission, and no audit trail. By the end of this post, you will know the four layers a production AI system needs, the three predictions that decide when prompts stop scaling for your team, and the single question to ask before you touch the prompt again.

The Mata v. Avianca pattern is the pattern

The lawyer in Mata v. Avianca did not lose because the prompt was bad. He lost because the system around the model had no checks. The model produced citations. The lawyer pasted them into a filing. The filing went to a court. Nothing in that chain verified that the cases existed. That is a four-layer failure compressed into one workflow:

Prompt layer: told the model to research case law.
Tool layer: no tool to verify citations against a real database.
Execution layer: no approval gate between model output and court submission.
Observability layer: no log that would have flagged "model produced citations, no verification step ran."

Every AI product that breaks in production breaks in one of those four layers. The lawyer's case is the public version of what happens quietly inside companies every week.

Prompts are instructions, not infrastructure

A prompt tells a model how to speak, what to prioritise, what format to use, and what tasks it should attempt. That is real work and worth doing well. A prompt cannot reliably enforce permissions, tool boundaries, approval requirements, retry rules, failure containment, observability, or resource isolation. Those live outside the model.

The anti-pattern: "We told the model not to do risky actions unless the user confirms." That sounds responsible. If your system can actually execute the action, the only thing between a good outcome and a bad one is model compliance — and model compliance is a probability distribution, not a control. The control belongs in the execution layer: the system knows which tools mutate state, the system enforces the confirmation, the system limits which tools the model can call at all. That is how policy becomes behaviour.

The four layers a production AI system needs

If you want an AI product that holds up under real traffic, you need all four working together.

1. Prompt layer. Voice, priorities, response structure, behavioural guidance.

2. Tool layer. What the assistant can actually do, how inputs are validated, what outputs look like. Clear names, tight scopes, validated inputs, predictable output structures, explicit side-effect boundaries.

3. Execution layer. Sequencing, parallelism, retries, timeouts, approval gates. Retry only on retryable failures, stop on auth errors, detect repeated error loops, surface a safe explanation to the user when something fails.

4. Observability layer. Which tools ran, in what order, with what result, under which policy, for how long. This is how you debug reality instead of vibes.

Invest in only the first layer and you keep rediscovering the same three production problems: silent regressions when you change the prompt, ungoverned tool calls, and incidents you cannot reconstruct after the fact.

Prompts cannot fix bad tool design

A great prompt still struggles when the tools themselves are vague. A broad, underspecified tool gives the model too much room. That is not a prompting problem — it is an interface problem.

Some teams see dramatic gains from what they call "prompt improvements" when the actual change was tightening the tool surface around the prompt. Narrowing a tool from update_record(any_field, any_value) to update_record_status(record_id, status_enum) removes whole classes of failure the prompt was trying to talk the model out of. Once the tools are tight, the prompt has much less ambiguity to manage and fewer "Never do X" lines to maintain.

Prompts cannot arbitrate conflicting goals

Real production systems carry competing goals: be fast but accurate, be autonomous but ask before risky actions, be helpful but do not hallucinate certainty, be concise but show enough evidence to be trusted. A prompt can express the tradeoff. It cannot fully arbitrate it alone.

Mature systems encode part of the answer structurally. "Ask before risky actions" becomes an execution-layer rule that intercepts any tool tagged mutates_state. "Do not hallucinate certainty" becomes a tool-layer requirement that citation tools return source URLs the UI renders inline. The prompt still nudges. The system enforces.

Context management is a systems problem, not a wording problem

Long conversations are the clearest case. When performance deteriorates over a 30-turn session, teams rewrite the system prompt. Sometimes that helps a little. If the real issue is context overload, the actual fix is context management: trim noise, preserve recent decisions, compress older exchanges, protect key facts, stop flooding the model with low-value output. No prompt fully compensates for a bad working set. This is a systems problem wearing a wording-problem costume.

Failure behaviour needs code, not prose

A prompt can say "retry carefully" or "avoid loops." That is not failure handling. Production needs explicit rules: retry only on retryable failures, stop on validation or auth errors, detect repeated error loops, short-circuit failing batches, surface a safe user explanation. Outsource these to prompt wording and you build brittle systems that look fine in the demo and page you at 2am.

What prompt engineering is still excellent for

None of this means prompts are overhyped. They are the right tool for tone control, response structure, instruction hierarchy, domain framing, delegation guidance, and user-facing consistency. They work best when they sit inside a well-designed harness.

The framing to internalise: prompt engineering is how you shape model behaviour. System design is how you make that behaviour dependable. A mediocre prompt inside a disciplined system beats a brilliant prompt inside a loose one, every time, in production.

What you will hit if you keep over-investing in prompts

Three predictions for the team that treats every bug as a prompt rewrite:

Your prompt will hit a ceiling around 800 lines and behaviour will start regressing on changes you did not intend. The system prompt becomes a compressed encoding of every past incident. New instructions interact with old ones in ways nobody on the team can hold in their head. Pass rate plateaus. Each new "Never do X" line costs you somewhere else.
Your incident reviews will keep concluding "we will tighten the prompt." Six months in, you will notice the same root cause appearing in three post-mortems with three different prompt patches. None of them held. The cause was always a missing tool boundary or approval gate; the prompt patch was the cheapest visible response.
You will not be able to answer "what did the agent do and why" for last Tuesday. A customer asks. You open the logs. You have the final reply and the user message. Everything that decided the reply — the route, the tools considered, the policy applied — is gone. That is the moment you discover the observability layer was never built.

If two of those sound familiar, the prompt is not the next thing to change.

The one question to ask before you touch the prompt

Take your current AI workflow and ask: which failures would still happen if the prompt were perfect? Those are your systems problems. Solve those, and prompt work starts compounding instead of compensating. Ignore them, and you will keep asking prompts to do jobs they were never meant to do — until a Mata v. Avianca moment makes the bill come due in public.

If you are tuning a system prompt that has crossed 500 lines and still ships regressions, send me the prompt and one recent failure transcript, and I will tell you which three responsibilities should live in your tool, execution, or observability layer instead. [email protected].