Cut your LLM latency by 99% and input cost by 71% without changing models
The 3-step playbook for cutting LLM latency and cost without a model swap, rewrite, or new framework — cache stable context, move internal work off the request, and fetch context before the model call.
If your LLM product takes 8+ seconds per turn, the model is not the problem.
Last quarter, one of our conversation paths carried 10,121ms of blocking overhead before the user saw a reply. Another path added 5,673ms every time the user asked a question that needed retrieved context. It looked like a model problem. It was a workflow problem.
Three decisions later, that worst path went from 10,121ms to 136ms (a 98.7% cut) and effective input cost dropped 71%. No model swap. No rewrite. No new framework.
By the end of this post, you will know how to:
- Find the 10x latency wins hiding inside your current stack in under a day
- Cut effective input cost by 50–80% with prompt caching that does not break behaviour
- Stop calling the model twice when one call would have worked
- Predict the next two latency walls you will hit before you hit them
This is the playbook to run on day one.
Who this is for
You are running a production LLM workflow. You have three problems hiding inside one slow response:
- You resend the same instructions every turn.
- You make the user wait for internal work they never see.
- You call the model twice when one call would have worked.
This is not a code walkthrough. It is a playbook your team can run before you spend a week arguing about models, frameworks, or infra.
The short version:
- Cache context that should not change.
- Move internal work out of the user request.
- Fetch context before the model call instead of after it.
The win is not any one trick. The win is forcing the team to decide what the user is actually waiting for.
Start with the wait, not the average
Before you change anything, separate total latency from user-blocking latency.
Total latency is how long the whole system takes to finish. User-blocking latency is how long the user waits before they see the next message. Most teams mix these and optimise the wrong thing.
In our case the slow paths looked like this:
- End-of-conversation turns: 10,121ms of blocking overhead
- Turns that answered a user question: 5,673ms extra p99 latency
- Normal turns: about 1,200ms
p99 matters because it is the bad experience your engaged users actually hit. Average latency will make you feel better than your product feels.
The latency flywheel
Run this loop, in order, on one slow turn:
- Trace one slow turn. Not the average dashboard. One real slow turn, broken into stages.
- Label each stage. Does the user need this before seeing the reply? Yes or no.
- Move or remove one thing. Cache stable context, prefetch missing context, or defer internal work.
- Run the same eval again. Compare p50, p95, p99, token cost, cache reads, and cache writes.
That loop is small enough to run in a day. It stops the team from guessing.
1. Cache what should not change
Every LLM workflow has two kinds of context.
Stable context:
- who the assistant is
- what rules it must follow
- what step of the workflow it is in
- examples of good behaviour
Changing context:
- what the user just said
- what fields you have already collected
- what remains unknown
- retrieved context for this turn
Teams blend these into one large prompt because it is easier at the start. That works until cost and latency show up in the bill.
The rule: make the stable part cacheable. Keep the changing part uncached.
For Anthropic prompt caching, the stable prefix has to stay byte-identical across requests. Move user state, timestamps, or retrieved chunks into that prefix and you break the cache.
This is less about syntax and more about ownership. Your team should be able to point at each part of the prompt and say:
- This is policy. It should be stable.
- This is workflow state. It changes.
- This is retrieved evidence. It changes.
- This is conversation history. It belongs in messages, not the system prompt.
One trap to avoid: do not make the cache key too broad.
When we removed stage-specific examples from the cached part, route accuracy dropped 11 percentage points, from 26/29 correct routes to 23/29. Examples are not just cost. They are behaviour.
The right boundary was not "one prompt for everything." It was stable context per stage and tenant.
Result: 77% cache hit rate, 71% drop in effective input cost.
Do not ask, "Can we cache the prompt?" Ask, "Which parts of this prompt are supposed to be true for the next five minutes?"
2. Move internal work out of the request
A lot of LLM latency is self-inflicted.
Teams put work on the request path because it feels safer. Generate the summary. Sync the CRM. Write the audit record. Update the internal notification. Then reply to the user.
That sounds responsible. The user does not care that your internal notification finished before they saw "Thanks, I have that."
Ask one question before adding work to a turn: does the user need this before they see the reply?
If the answer is no, it does not block the reply.
In our slowest path, the system generated a structured completion summary and synced it to a downstream system before sending the final message. The user never saw the summary.
We split the work:
- Save the turn state.
- Return the user-facing reply.
- Run summary generation and sync in the background.
The queue, the worker, the function name — none of that is the point. The contract is the point. The request path should only include work needed to answer the user now. Everything else needs its own retry, logging, and failure policy.
After this change, terminal-turn blocking overhead dropped from 10,121ms to 136ms. That 136ms was the database write.
Make a list of every step in your request path. Next to each step, write either "user needs this now" or "system needs this eventually." Only the first group belongs in the request.
3. Fetch context before the model call
The slowest LLM workflows often have this shape:
- Call the model.
- Realise the user asked a question.
- Retrieve context.
- Call the model again.
That second model call is expensive because it is serial. The user waits for both.
In our case, question-answering turns carried 5,673ms of extra p99 latency from the second call.
The fix: move retrieval earlier. Detect the question before the model call. Fetch the context. Put the retrieved evidence into the changing part of the prompt. Let one model call answer the question and continue the workflow.
A two-call path became a one-call path.
The fallback still matters. If retrieval fails or returns nothing, fall back to the older path. But fallback, not default.
Result: extra question-answering latency cut from 5,673ms to 0ms.
Look for model calls that exist only because your system learned something too late. If you can know it before the first call, do that.
What you will hit next
Three predictions for the team that runs this playbook:
- Your p99 will get worse before it gets better. Once you cut the obvious blocking work, the long tail will be dominated by retrieval timeouts and one slow tool call you never noticed. You will need per-tool p99 budgets, not just a global one.
- Your prompt cache hit rate will silently rot. Someone will inline a timestamp, a user ID, or a feature flag into the cached prefix. Cache hit rate will drop from 77% to 30% in a week and nobody will know why. You need a dashboard for cache reads and cache writes per route, not just request count.
- Background work will become your next incident class. The moment you move summary generation, CRM sync, and audit writes off the request path, they stop having a user yelling at them when they fail. You need retries, dead-letter queues, and alerting on the background path or you will discover six weeks of silently-failed syncs the hard way.
If you are already feeling the pull of any of these, that is the signal you ran the playbook far enough to need the next one.
The measurement loop
None of this works if your team only logs total request time.
Log one structured line per turn with:
- total time
- prompt build time
- model call time
- retrieval time
- routing or workflow time
- input tokens
- output tokens
- cache read tokens
- cache write tokens
Then compute p50, p95, p99, and max for each stage.
- p50 is what most users feel.
- p95 is what engaged users hit often enough to notice.
- p99 is the support-ticket path.
- max is a weird story, not a roadmap.
Without this map, every latency conversation becomes a taste debate. With it, the next move is obvious.
Run this on Monday
You do not need a platform rewrite. You need 90 minutes:
- Pick 20 real conversations or eval runs.
- Log one line per turn with stage timings and token usage.
- Sort by p99 stage time, not total average time.
- Pick the slowest stage that is not required for the user reply.
- Move it, cache it, or reorder it. Re-run the eval.
One slow path. One stage. One change. Then repeat.
LLM performance is not only about faster inference. It is about deciding what the user is actually waiting for. Once you make that call clearly, the system gets faster, cheaper, and easier to reason about at the same time.
If you are running a slow LLM product right now, send me one trace from your worst turn and I will tell you which of the three problems above is eating you alive. [email protected].