Stop making your prompt do the work your code should be doing

If you are building a chat-based AI product, the demo version is easy and the production version is a different job entirely. Users answer the third question before the first, ask side questions in the middle of a flow, change their mind, and expect the system to stay calm. On a recent build, that pressure broke every "just stuff more into the prompt" fix I tried, and things only settled once I moved workflow state, routing, and side effects out of the model entirely. The reusable shape is a single eight-step turn handler, a TurnResult contract the model must return on every turn, and a clear rule about what code owns versus what the prompt owns.

By the end of this post, you will know how to:

Move workflow state out of the prompt and into structured fields the application owns
Force the model to return a turn contract, not just reply prose
Keep routing policy and side effects in code where they can be tested
Predict the next three failure modes your team will hit after you ship this architecture

The problem is state management, not text generation

The first failure mode in these systems is prompt drift. The team keeps stuffing more conversation history, more business rules, more product FAQs, more routing exceptions, and more tone instructions into one prompt to cover edge cases. That works until small prompt changes start producing weird behavioural changes and the assistant becomes a single opaque blob you cannot debug.

That is a reliability problem and a maintenance problem at the same time. If every new edge case makes the prompt longer, the system gets more fragile as it grows.

The first design decision is therefore non-negotiable: the model is not the source of truth for workflow state.

Keep workflow state outside the model

Treat the conversation as a stateful workflow with a transcript attached, not as a long free-form chat. The transcript is useful context. The durable state lives separately as structured fields and stage markers, so each turn can answer:

What do we already know?
What is still missing?
Which stage is currently incomplete?
Has this workflow already reached a terminal outcome?

Once you make that shift, the model has a narrower job. It does not need to infer the whole workflow from raw history every turn — it only needs enough context to handle the current turn well. Ambiguity drops, and the workflow stops being hostage to long-context prompt sprawl.
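To make the split concrete, here is a minimal sketch of what a state record might look like. Every name in it (ConversationState, fields, disposition, and the two helpers) is an illustrative assumption, not the schema from the build described here. The point is only that the transcript and the durable state sit side by side, and that two of the questions above are answered by code rather than by rereading history.

```typescript
// Hypothetical shape: all names below are illustrative assumptions.
type Message = { role: "user" | "assistant"; content: string };

type Disposition = "open" | "escalated" | "completed" | "deferred";

type ConversationState = {
  id: string;
  transcript: Message[];           // context for the model, not the source of truth
  fields: Record<string, unknown>; // durable, structured facts collected so far
  stage: string;                   // marker for the stage currently in progress
  disposition: Disposition;        // stays "open" until a terminal outcome is reached
};

// "What do we already know?" Fields set to null or undefined do not count.
function knownFields(state: ConversationState): string[] {
  return Object.keys(state.fields).filter((key) => state.fields[key] != null);
}

// "Has this workflow already reached a terminal outcome?"
function isTerminal(state: ConversationState): boolean {
  return state.disposition !== "open";
}
```

Because the answers come from plain functions over a plain record, they can be unit tested without a model in the loop.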

Use stages, not vibes

Break the flow into explicit stages. The user never sees them; the system needs them. Each stage defines the set of fields the workflow is trying to collect before moving on, which lets the application compute two things at every turn: the current stage and the remaining missing fields.

That does two jobs. It gives the prompt a stable summary of what matters right now, and it gives the application a deterministic way to decide progress without asking the model to make that call from scratch. The user experience still feels conversational. Underneath, the system is marching through a real state machine.
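A rough sketch of that computation, under the assumption that a stage is just a name plus a list of required fields. The stage names, field names, and the stageSummary helper are all hypothetical; a real workflow would define its own.

```typescript
// Hypothetical stage definitions: stage and field names are illustrative only.
type Stage = { name: string; requiredFields: string[] };

const STAGES: Stage[] = [
  { name: "contact", requiredFields: ["name", "email"] },
  { name: "needs", requiredFields: ["useCase", "timeline"] },
  { name: "budget", requiredFields: ["budgetRange"] },
];

type Progress = { stage: string | null; missing: string[] };

// Returns the first stage that still has uncollected fields,
// or stage: null when every stage is complete.
function computeProgress(
  fields: Record<string, unknown>,
  stages: Stage[] = STAGES
): Progress {
  for (const stage of stages) {
    const missing = stage.requiredFields.filter(
      (field) => fields[field] === undefined || fields[field] === null || fields[field] === ""
    );
    if (missing.length > 0) return { stage: stage.name, missing };
  }
  return { stage: null, missing: [] };
}

// A compact, stable summary the prompt builder can embed in the system prompt.
function stageSummary(progress: Progress): string {
  if (progress.stage === null) return "All stages complete.";
  return `Current stage: ${progress.stage}. Still missing: ${progress.missing.join(", ")}.`;
}
```

Note that a user who volunteers a later-stage field early loses nothing: the field is stored, and the stage that needs it will already be satisfied when the workflow gets there.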

One turn handler for every channel

It is tempting to build one path for messaging, another for internal review, another for testing, and let each evolve separately. That feels fast early and creates drift later. Push everything through one turn handler that owns the same sequence every time:

Load the current conversation.
Rebuild message history from the stored transcript.
Compute the current stage and remaining fields.
Build the system prompt from that state.
Ask the model for structured extraction plus the next reply.
Merge any newly extracted data into state.
Decide whether this turn should continue, answer a side question, or terminate into a route.
Persist the updated transcript and state.

That gives you one operational surface instead of three half-similar ones. The payoff is not elegance — it is consistency. Once messaging and internal review run on the same engine, bugs reproduce in one place and fixes apply everywhere.
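The eight steps above can be sketched as a single function with its dependencies injected. This is a simplified illustration, not the handler from the build: the Deps interface, the one-line prompt, and the policy in step 7 (completion is decided from the fields, never from the model's "complete" suggestion) are all assumptions made for the example.

```typescript
// All names are hypothetical; dependencies are injected so channels can share one handler.
type Message = { role: "user" | "assistant"; content: string };
type TurnResult = {
  extracted: Record<string, unknown>;
  reply: string;
  retrievalHint: string | null;
  terminalIntent: "none" | "escalate" | "complete" | "defer";
};
type Conversation = {
  id: string;
  transcript: Message[];
  fields: Record<string, unknown>;
  disposition: "open" | "escalated" | "completed" | "deferred";
};
type Progress = { stage: string | null; missing: string[] };
type TurnAction = "continue" | "side_question" | "terminate";

type Deps = {
  load(id: string): Promise<Conversation>;
  save(conversation: Conversation): Promise<void>;
  computeProgress(fields: Record<string, unknown>): Progress;
  callModel(systemPrompt: string, history: Message[]): Promise<TurnResult>;
};

async function handleTurn(deps: Deps, conversationId: string, userText: string) {
  // 1. Load the current conversation.
  const conversation = await deps.load(conversationId);
  // 2. Rebuild message history from the stored transcript.
  const history: Message[] = [...conversation.transcript, { role: "user", content: userText }];
  // 3. Compute the current stage and remaining fields.
  const before = deps.computeProgress(conversation.fields);
  // 4. Build the system prompt from that state (reduced to one line here).
  const systemPrompt =
    `Stage: ${before.stage ?? "done"}. Missing: ${before.missing.join(", ") || "nothing"}.`;
  // 5. Ask the model for structured extraction plus the next reply.
  const result = await deps.callModel(systemPrompt, history);
  // 6. Merge any newly extracted data into state.
  const fields = { ...conversation.fields, ...result.extracted };
  // 7. Decide whether to continue, answer a side question, or terminate.
  const after = deps.computeProgress(fields);
  let action: TurnAction = "continue";
  let disposition = conversation.disposition;
  if (result.terminalIntent === "escalate") {
    action = "terminate";
    disposition = "escalated";
  } else if (result.terminalIntent === "defer") {
    action = "terminate";
    disposition = "deferred";
  } else if (after.stage === null) {
    // Assumed policy: completion comes from the fields, not the model's suggestion.
    action = "terminate";
    disposition = "completed";
  } else if (result.retrievalHint !== null) {
    action = "side_question";
  }
  // 8. Persist the updated transcript and state.
  await deps.save({
    ...conversation,
    fields,
    disposition,
    transcript: [...history, { role: "assistant", content: result.reply }],
  });
  return { reply: result.reply, action, progress: after, disposition };
}
```

A messaging webhook, an internal review tool, and a test harness would each call handleTurn with their own Deps, which is what keeps the three paths from drifting apart.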

Make the model return a contract, not just prose

The single most useful choice in the build was forcing the model to return a structured turn result on every turn. Conceptually:

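type TurnResult = {
  // Fields the user revealed this turn, keyed by field name
  extracted: Record<string, unknown>
  // The text to send back to the user
  reply: string
  // Non-null when the user asked a side question that may need retrieval
  retrievalHint: string | null
  // The model's suggestion only; the application makes the final decision
  terminalIntent: "none" | "escalate" | "complete" | "defer"
}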

The exact field names are less important than the contract shape. The model should not only speak. It should also classify the turn in a way the application can act on.

A real workflow needs more than prose. If the user casually reveals a missing field, the system should capture it. If the user asks a side question, the system should know retrieval may be needed. If the user is effectively asking to exit the main flow, the system should know that too. Without that contract, the application ends up scraping its own assistant output or running extra inference steps to recover intent — avoidable complexity.
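A contract is only useful if the application enforces it. Below is a small sketch of runtime validation, assuming the model returns its turn result as a JSON string. The parseTurnResult function and its helpers are hypothetical names, and returning null on failure (so the caller can retry or fall back) is one reasonable policy among several.

```typescript
type TerminalIntent = "none" | "escalate" | "complete" | "defer";
type TurnResult = {
  extracted: Record<string, unknown>;
  reply: string;
  retrievalHint: string | null;
  terminalIntent: TerminalIntent;
};

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function isTerminalIntent(value: unknown): value is TerminalIntent {
  return value === "none" || value === "escalate" || value === "complete" || value === "defer";
}

// Parses raw model output. Returns null instead of throwing so the caller
// can decide whether to retry the model call or send a safe fallback reply.
function parseTurnResult(raw: string): TurnResult | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // the model answered in prose or produced malformed JSON
  }
  if (!isRecord(data)) return null;

  const { extracted, reply, retrievalHint, terminalIntent } = data;
  if (!isRecord(extracted)) return null;
  if (typeof reply !== "string") return null;
  if (!isTerminalIntent(terminalIntent)) return null;

  // retrievalHint must be present: either an explicit null or a string.
  const hint =
    retrievalHint === null ? null : typeof retrievalHint === "string" ? retrievalHint : undefined;
  if (hint === undefined) return null;

  return { extracted, reply, retrievalHint: hint, terminalIntent };
}
```

If your model provider offers a structured-output or JSON-schema mode, that narrows the failure cases, but a check like this at the boundary still protects the rest of the handler from a malformed turn.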

Keep routing and side effects in code

The model can suggest intent. It should not own final routing policy or trigger side effects. That logic lives better in normal code, because routing is business logic: it changes, accumulates exceptions, needs to be readable by humans, and when it breaks you want to debug a function, not reinterpret a prompt paragraph.

In practice the model says something equivalent to "this user wants a human," "this user wants a lower-friction path," or "this user is not ready yet." The application makes the actual decision. If routing behaviour needs to change, you update code and tests without touching the conversational layer. That is a much safer maintenance boundary than burying route selection inside prompt text, and it shrinks the blast radius of model behaviour changes.
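As an illustration of that boundary, here is a hedged sketch of a routing function. The route names and the business-hours rule are invented for the example; the only thing it is meant to show is that the model's intent is one input among several, and the decision is an ordinary function you can test.

```typescript
type TerminalIntent = "none" | "escalate" | "complete" | "defer";

// Hypothetical routes; a real product would define its own.
type Route = "continue" | "human_handoff" | "finalize" | "follow_up_later";

type RoutingInput = {
  intent: TerminalIntent;  // the model's suggestion
  missingFields: string[]; // computed by the application, not the model
  businessHours: boolean;  // an example of a business rule the prompt never sees
};

function decideRoute(input: RoutingInput): Route {
  switch (input.intent) {
    case "escalate":
      // Assumed policy: nobody is staffed outside business hours,
      // so an escalation becomes a scheduled follow-up instead.
      return input.businessHours ? "human_handoff" : "follow_up_later";
    case "defer":
      return "follow_up_later";
    case "complete":
      // The model may believe the flow is done; the application
      // checks the collected fields before agreeing.
      return input.missingFields.length === 0 ? "finalize" : "continue";
    case "none":
      return "continue";
  }
}
```

Changing the after-hours policy is then a one-line code change with a test next to it, and the prompt does not move.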

Treat retrieval as a detour, not the main road

Users do not stay on script. They ask about pricing, options, timelines, and edge cases in the middle of the main flow. Ignore those questions and the assistant feels brittle; let them hijack the workflow and progress stalls.

The compromise is treating retrieval as a detour. The main turn still tries to do its normal job. When the model flags a side question via retrievalHint, the application can fetch grounded information from a trusted source, pass that context into a second bounded model call, answer the question, and continue the workflow from where it was. The assistant stays workflow-led. Retrieval only appears when the turn actually needs it, which keeps latency and complexity bounded.
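One way to sketch the detour, assuming two hypothetical dependencies: a search function over a trusted source and a second, bounded model call that answers from the retrieved snippets. The names, the snippet cap, and the choice to fall back to the main reply when nothing is found are all assumptions for illustration.

```typescript
type Snippet = { source: string; text: string };

type DetourDeps = {
  // Hypothetical lookup against a trusted source (docs, pricing table, FAQ).
  search(query: string, limit: number): Promise<Snippet[]>;
  // Hypothetical second model call, bounded to the supplied context.
  answerWithContext(question: string, context: Snippet[]): Promise<string>;
};

// Returns the reply to send for this turn. When there is no hint,
// the main-flow reply is returned unchanged and no extra call is made.
async function withRetrievalDetour(
  deps: DetourDeps,
  mainReply: string,
  retrievalHint: string | null,
  maxSnippets = 3
): Promise<string> {
  if (retrievalHint === null) return mainReply;

  const snippets = await deps.search(retrievalHint, maxSnippets);
  // Assumed policy: with nothing grounded to say, stay on the main road.
  if (snippets.length === 0) return mainReply;

  const answer = await deps.answerWithContext(retrievalHint, snippets.slice(0, maxSnippets));
  // Answer the side question first, then resume the workflow where it was.
  return `${answer}\n\n${mainReply}`;
}
```

The workflow state is untouched by the detour, so the next turn picks up at the same stage with the same missing fields.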

Build the review surface on day one

The easiest mistake in AI products is waiting to build the review surface. You end up able to see outputs but not the workflow state that produced them, which makes practical questions impossible to answer: What stage was the conversation in? Which fields had already been collected? What route did the system choose? Did the assistant answer the side question and still make progress?

Expose the transcript, the current stage, the collected fields, the remaining fields, and the terminal disposition from the start. That sounds like an operational detail. It is a product quality feature. If you cannot inspect workflow state cleanly, you will struggle to improve the system with confidence.
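A review surface can start as nothing more than a function that assembles those five things into one record. The sketch below assumes hypothetical Conversation and Stage shapes; the field names are illustrative, and a real tool would add filtering and access control on top.

```typescript
type Message = { role: "user" | "assistant"; content: string };
type Conversation = {
  id: string;
  transcript: Message[];
  fields: Record<string, unknown>;
  disposition: "open" | "escalated" | "completed" | "deferred";
};
type Stage = { name: string; requiredFields: string[] };

// Everything a reviewer needs to understand one conversation at a glance.
type ReviewSnapshot = {
  conversationId: string;
  transcript: Message[];
  currentStage: string | null; // null when every stage is complete
  collectedFields: Record<string, unknown>;
  remainingFields: string[];
  disposition: Conversation["disposition"];
};

function buildReviewSnapshot(conversation: Conversation, stages: Stage[]): ReviewSnapshot {
  const isMissing = (field: string) => conversation.fields[field] == null;

  // The current stage is the first one that still has a missing field.
  const current = stages.find((stage) => stage.requiredFields.some(isMissing));

  // Remaining fields across all stages, in stage order.
  const remainingFields = stages
    .map((stage) => stage.requiredFields.filter(isMissing))
    .reduce((all, part) => all.concat(part), [] as string[]);

  const collectedFields: Record<string, unknown> = {};
  for (const key of Object.keys(conversation.fields)) {
    if (!isMissing(key)) collectedFields[key] = conversation.fields[key];
  }

  return {
    conversationId: conversation.id,
    transcript: conversation.transcript,
    currentStage: current ? current.name : null,
    collectedFields,
    remainingFields,
    disposition: conversation.disposition,
  };
}
```

Serving this record as JSON behind an internal endpoint is enough to answer the questions above long before anyone designs a proper UI.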

The anti-pattern in one sentence

Do not make the prompt carry responsibilities that belong in application code. That includes tracking workflow progress, enforcing route policy, deciding whether the conversation is complete, governing side effects, and holding the only durable representation of state. Prompts matter. But when the prompt is simultaneously workflow engine, router, validator, and retrieval coordinator, the system feels smart in demos and slippery in production.

What you will hit next

Three predictions for the team that ships this architecture:

Your TurnResult schema will start growing new fields every sprint. Someone will want a confidence score, then a "user sentiment" enum, then a free-text reason field. Within a quarter the contract is bloated and inconsistently populated. Treat the contract like an API: version it, write tests against it, and reject additions that the application does not actively branch on.
Your stages will calcify before your domain does. The first stage definitions feel obvious, then a real cohort shows up that does not fit any of them. Teams patch by adding optional fields to existing stages instead of redrawing the state machine, and stage transitions silently stop meaning anything. Plan for one stage refactor per quarter and budget the eval work to prove it.
The review surface will become the most-used internal tool in the company. Sales, support, and compliance will all want their own filters and exports. If you built it as an afterthought, you will rebuild it under deadline pressure within six months. Give it a real owner now.

What I would reuse next time

If I were building another chat workflow tomorrow, I would start with this checklist:

Define structured state before writing the main prompt.
Break the workflow into explicit stages with required fields.
Make the model return a structured turn contract, not just reply text.
Keep routing policy outside the model.
Use retrieval as a bounded side path, not the default for every turn.
Reuse one turn handler across channels wherever possible.
Build an internal review surface on day one.

The most useful AI workflows are not the ones with the most impressive prompt. They are the ones where the engineering keeps the prompt from having to do everything.

If you are building a chat workflow where the prompt feels like it is doing too much, send me the system prompt and I will tell you which responsibility belongs in code instead of the prompt.