Your AI gets dumber every turn. Fix the context, not the model.

Your AI worked great in the first three turns and got slower and less precise by turn thirty. Here is the context discipline that fixes it without a model swap.


Your AI demo worked. Then real users had real sessions, and by turn twenty the assistant started missing things it knew at turn three. Approvals got buried. Tool outputs crowded out the goal. Replies got slower and less specific. The model did not get worse — your working context did. Slack named the same problem in its own product when it launched enterprise search: the bottleneck in real work is not "not enough information," it is finding the right information in time to make the next decision.

This post gives you the discipline that keeps long-running AI useful:

  • How to tell a context problem from a model problem in under 10 minutes
  • The "context as budget, not dump" rule and what to drop first
  • The summarisation failure that strips your audit trail and how to catch it
  • Three predictions about what breaks when you implement this
  • The first thing to ship on Monday

A context problem looks like a model problem

The first time long-session quality drops, your team will blame the model. They will swap to a newer version, tweak temperature, rewrite the system prompt. Some of that will help a little. Most of it is misdirection.

The actual cause is structural. Context gets messy when a conversation accumulates too much of the wrong material: repeated instructions, old assumptions that are no longer true, large tool outputs the user has already moved past, buried approvals, attachments that mattered three turns ago, and conflicting fragments from sub-tasks. The model still has "more information." It has less working clarity.

The rule: more context is not better context. More relevant context is.

A 10-minute diagnostic before you change anything: take one long session your users complained about, scroll to the turn where quality dropped, and ask two questions. What is still in the context that the system no longer needs? What does the system need right now that it can no longer see easily? If both answers are non-empty, you have a context problem, not a model problem.

Treat context like a budget

The mindset shift that fixes the rest of the design: context is a budget, not a dump.

Once you see it that way, the design questions stop being "what should we include" and start being "what earns its place." For every item currently in the prompt window you should be able to answer one question: does it deserve to stay in active memory? If it does, decide whether it must remain verbatim, can be summarised, or can be truncated. If it does not, drop it.

The rule: every token in the active context should be paying rent on the next decision.

Teams that skip this end up with assistants that carry a lot of weight and very little clarity. The conversation is "complete" and the next reply is still wrong.
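
A minimal sketch of the budget idea, assuming every item has already been assigned one of the four treatments above. The names `Disposition`, `Item`, and `apply_budget` are hypothetical, and the word-count token estimate is a crude stand-in for a real tokenizer.

```python
from dataclasses import dataclass
from enum import Enum


class Disposition(Enum):
    VERBATIM = "verbatim"    # must remain word for word
    SUMMARISE = "summarise"  # replaced by a shorter operational summary
    TRUNCATE = "truncate"    # only the head is kept
    DROP = "drop"            # removed from the active context


@dataclass
class Item:
    text: str
    disposition: Disposition
    summary: str = ""        # used only when disposition is SUMMARISE


def estimate_tokens(text: str) -> int:
    # Assumption: one token per whitespace-separated word. Swap in a real tokenizer.
    return len(text.split())


def apply_budget(items: list[Item], budget: int, truncate_to: int = 20) -> list[str]:
    """Render each item by its disposition, then admit whatever fits in the budget."""
    rendered = []
    for item in items:
        if item.disposition is Disposition.DROP:
            continue
        if item.disposition is Disposition.SUMMARISE:
            text = item.summary
        elif item.disposition is Disposition.TRUNCATE:
            text = " ".join(item.text.split()[:truncate_to])
        else:
            text = item.text
        rendered.append((item.disposition, text))

    # Stable sort: verbatim items are considered first so cheaper material cannot crowd them out.
    rendered.sort(key=lambda pair: pair[0] is not Disposition.VERBATIM)

    kept, spent = [], 0
    for _, text in rendered:
        cost = estimate_tokens(text)
        if spent + cost <= budget:   # items that do not fit are skipped, not partially included
            kept.append(text)
            spent += cost
    return kept
```

The design choice worth arguing about is the sort: it trades original ordering for a guarantee that verbatim items get first claim on the budget. A production version would probably restore chronological order after selection.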

Preserve decisions, compress chatter

Not all turns are equal. The cheapest split that survives contact with production:

High-value, preserve verbatim: user goals, approved actions, key constraints, the current state summary, important tool outputs that affect future steps, and final decisions from prior steps. These are the things the next reply has to be consistent with.

Low-value, compress or drop: repeated phrasing, intermediate reasoning that has been superseded, large raw outputs that have already been acted on, and turns that added no new information.

The rule: preserve decisions, compress chatter. If you cannot point at a turn and say which bucket it goes in, your context strategy is "keep everything," which is the same as "no strategy."
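
One way to make the two buckets explicit, as a sketch rather than a finished design. It assumes each turn is tagged with a kind when it is written; the kind labels and the names `Turn`, `bucket`, and `compact` are hypothetical. Unknown kinds raise an error on purpose, so "keep everything" cannot become the silent default.

```python
from dataclasses import dataclass

# Hypothetical kind labels that mirror the two lists above.
PRESERVE = {"goal", "approval", "constraint", "state_summary", "decision", "key_tool_output"}
COMPRESS = {"repeat", "superseded_reasoning", "acted_on_output", "no_new_info"}


@dataclass
class Turn:
    index: int
    kind: str
    text: str


def bucket(turn: Turn) -> str:
    """Return 'preserve' or 'compress'; refuse to guess for an unknown kind."""
    if turn.kind in PRESERVE:
        return "preserve"
    if turn.kind in COMPRESS:
        return "compress"
    raise ValueError(f"turn {turn.index} has unclassified kind {turn.kind!r}")


def compact(turns: list[Turn]) -> list[str]:
    """Keep preserve-bucket turns verbatim and collapse each run of compress-bucket turns."""
    out, run = [], 0
    for turn in turns:
        if bucket(turn) == "preserve":
            if run:
                out.append(f"[{run} low-value turn(s) compressed]")
                run = 0
            out.append(turn.text)
        else:
            run += 1
    if run:
        out.append(f"[{run} low-value turn(s) compressed]")
    return out
```

Here the compressed run is reduced to a placeholder. In practice you would put an operational summary of those turns in its place.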

Summaries must preserve operational meaning

Summarisation is the most common fix and the easiest to get wrong. The failure mode is not that the summary is too short — it is that the summary is thematic instead of operational.

Compare:

"They discussed the project status."

versus

"User confirmed the project should remain paused until legal approval arrives. Owner: Priya. Record: PRJ-481."

The first one is a description of a conversation. The second is something the next reply can act on. A good summary keeps who approved what, which record was affected, which source established a fact, and what changed in the workflow. A bad summary keeps the vibe and drops the audit trail.

The rule: summaries are operational, not thematic. If the next action cannot be taken from the summary alone, the summary is wrong.
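
The rule can be turned into a cheap check. This sketch assumes you know, for each summarised decision, which values the summary has to carry; `missing_fields` is a hypothetical helper, and the substring match is deliberately naive.

```python
def missing_fields(summary: str, required: dict[str, str]) -> list[str]:
    """Return the names of required fields whose values do not appear in the summary."""
    lowered = summary.lower()
    return [name for name, value in required.items() if value.lower() not in lowered]


# The two summaries from the comparison above.
required = {"owner": "Priya", "record": "PRJ-481"}
thematic = "They discussed the project status."
operational = (
    "User confirmed the project should remain paused until legal approval arrives. "
    "Owner: Priya. Record: PRJ-481."
)

print(missing_fields(thematic, required))     # names both fields as missing
print(missing_fields(operational, required))  # empty list: nothing missing
```

Run a check like this on every summary that replaces an approval or a decision, and fail the summarisation step when the returned list is non-empty.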

Fresh reads beat stale memory for live state

Memory is good for continuity. It is bad for truth.

If the assistant is reasoning about anything that can change outside the conversation — inbox contents, record status, current assignments, latest activity, workflow progress — it should re-read the source instead of trusting what it "remembers" from earlier in the session. Conversational memory is a cache. Live state is the database.

The rule: re-read live state before any decision that mutates a system or makes a commitment to a user. Trust memory only for continuity.

Reliable assistants need both: memory to stay coherent across turns, fresh reads to stay correct against reality.
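
A sketch of the cache-versus-database split. `SessionState` is a hypothetical wrapper, and `read_live` stands in for whatever call fetches the current value from your system of record.

```python
from typing import Callable, Optional


class SessionState:
    """Remembers values for continuity, but goes back to the source before any commitment."""

    def __init__(self, read_live: Callable[[str], str]):
        self._read_live = read_live            # assumption: fetches current state from the source
        self._remembered: dict[str, str] = {}

    def recall(self, key: str) -> Optional[str]:
        # Continuity only: the last value this session saw, which may be stale.
        return self._remembered.get(key)

    def verify(self, key: str) -> str:
        # Call before mutating a system or committing to a user. Always hits the source.
        value = self._read_live(key)
        self._remembered[key] = value
        return value
```

The point of the two methods is that the call site has to choose. Code that is about to change a record calls `verify`; code that only needs conversational continuity can call `recall`.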

Tool output discipline

The quietest cause of context decay is unbounded tool output. A search returns 40 results and the full JSON goes back into the conversation. A document loader returns 12 pages and every page goes in. By turn ten the model is reasoning over its own exhaust.

What useful systems do instead: structure tool outputs so the relevant fields are obvious, summarise long outputs at insertion time, truncate oversized outputs with a pointer to the full result, and reduce each output to the parts the rest of the workflow can actually use.

The rule: tool outputs enter the context summarised, not raw. This is not about hiding information from the model. It is about keeping the active context usable for the next decision.
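
A minimal sketch of compressing at insertion time, assuming tool results arrive as a list of dictionaries. `compact_tool_output` and the `result_ref` handle are hypothetical; the handle is whatever lets a later step fetch the stored raw output.

```python
import json


def compact_tool_output(
    results: list[dict], fields: list[str], max_items: int, result_ref: str
) -> str:
    """Keep only the named fields of the first max_items results, plus a pointer to the rest."""
    kept = [{f: r[f] for f in fields if f in r} for r in results[:max_items]]
    payload = {"results": kept, "total": len(results)}
    if len(results) > max_items:
        payload["truncated"] = True
        payload["full_result_ref"] = result_ref  # pointer to the raw output, stored elsewhere
    return json.dumps(payload)


# A 40-result search reduced to three rows of two fields each.
raw = [{"id": i, "title": f"Doc {i}", "body": "x" * 1000} for i in range(40)]
print(compact_tool_output(raw, fields=["id", "title"], max_items=3, result_ref="tool-result-17"))
```

The model still learns that 40 results exist and where to find them. It just does not carry all 40 bodies into every later turn.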

Bias toward recency for operational state

In an operational workflow, the latest instruction, the latest verified state, the latest approval, and the latest successful action should outweigh older signals of the same kind. Without that bias, the assistant stays anchored to earlier branches of the conversation after reality has moved on, which is why long sessions feel "off" even when the model is still fluent.

The rule: when two signals conflict, the newer one wins unless you have a documented reason otherwise.
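
A sketch of the rule as a resolver. The `pinned` flag is one assumed way to encode "a documented reason otherwise", for example a standing constraint that later turns must not silently override; `Signal` and `resolve` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    key: str              # what the signal is about, e.g. "PRJ-481.status"
    value: str
    turn: int             # turn at which it was observed
    pinned: bool = False  # the documented exception: outranks newer unpinned signals


def resolve(signals: list[Signal]) -> dict[str, str]:
    """For each key, a pinned signal beats an unpinned one; otherwise the later turn wins."""
    winners: dict[str, Signal] = {}
    for s in signals:
        current = winners.get(s.key)
        if current is None or (s.pinned, s.turn) > (current.pinned, current.turn):
            winners[s.key] = s
    return {key: s.value for key, s in winners.items()}
```

With a status of "open" seen at turn 3 and "closed" at turn 21, `resolve` returns "closed". A constraint pinned at turn 2 survives an unpinned contradiction at turn 9.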

What you will hit when you implement this

Three predictions for the team that takes context discipline seriously:

  1. Your first summarisation pass will strip the approval audit trail, and you will only notice when a customer disputes a change. The summariser will keep the conversational arc and drop "approved by", "record id", and "source URL" because they look like metadata noise. Write a test that asserts every approved-action turn produces a summary containing the actor, the object, and the source. Run it before the summariser ships, not after the dispute.

  2. One stale tool output will outlive the truth and the assistant will defend it. Someone will ask "what is the status of record X" early in the session, the tool will return "open", and twenty turns later the assistant will keep saying "open" after the record has been closed in the source system. The fix is not a smarter model. The fix is treating live state as un-cacheable and re-reading it on every decision that depends on it.

  3. Your context budget will get re-bloated within four weeks of shipping the discipline. A new feature will inline a verbose tool output "just for debugging." A prompt update will paste in three new examples. Cache hit rate and answer quality on long sessions will drift down and nobody will know why. You need a per-turn dashboard for active-context size, summary coverage, and the share of tokens that are tool output, or this regresses silently.

If you are already feeling any of these, you have run the discipline far enough to need the next iteration.
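
The dashboard in the third prediction needs three numbers per turn. A sketch of how they might be computed, with assumptions stated: `ContextItem` and `turn_metrics` are hypothetical, token counts come from your own tokenizer, and "summary coverage" is taken here to mean the share of session turns that are represented in the active context by a summary.

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    kind: str              # hypothetical labels: "user", "assistant", "tool", "summary"
    tokens: int
    source_turns: int = 1  # how many original turns this item stands in for


def turn_metrics(items: list[ContextItem], total_turns: int) -> dict[str, float]:
    """Compute active-context size, tool-output share, and summary coverage for one turn."""
    size = sum(i.tokens for i in items)
    tool = sum(i.tokens for i in items if i.kind == "tool")
    summarised = sum(i.source_turns for i in items if i.kind == "summary")
    return {
        "active_context_tokens": size,
        "tool_output_share": tool / size if size else 0.0,
        "summary_coverage": summarised / total_turns if total_turns else 0.0,
    }
```

Emit these on every turn and alert on the trend rather than on a single value. A tool-output share that creeps up release after release is the re-bloat showing itself early.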

A practical operating model

Six steps, in order, on one long-running workflow:

  1. Preserve recent, high-value turns verbatim.
  2. Summarise older decisions into compact, operational state.
  3. Compress oversized tool outputs at insertion time.
  4. Re-read live state when factual certainty matters.
  5. Keep approvals and mutations prominent in the active context.
  6. Remove noise before it becomes reasoning material.

This is small enough to run on one workflow this week and measure against the same eval set you were going to use for a model swap. Most teams find the context fix beats the model swap on quality and beats it badly on cost.

The real lesson

Three sentences.

Long-running AI does not get less useful because the model forgot how to reason — it gets less useful because the working context stopped being usable. Manage context as an active operating resource, not a passive transcript, and the system stays sharp far longer than the demo did. Do this before you swap models, because a model swap on a polluted context just buys you a more expensive version of the same problem.


If you are running an AI workflow that gets less useful as the session gets longer, send me the longest conversation your assistant handled last week and I will tell you which context the system can no longer see clearly. [email protected].