Stop calling your harness a chatbot

The moment your AI system touches tools, memory, approvals, and live state, you are no longer evaluating chat quality. You are evaluating operational design.

Your AI product is called a chatbot. You are trying to make it act like an agent. It keeps breaking, and you cannot tell whether the model is wrong or the system around it is. The distinction is simple once you see it: a chatbot is a conversation, a harness is controlled execution. When Asana introduced AI teammates, TechCrunch reported that the company described them as working alongside human employees, with explicit visibility into goals, workflows, and status, rather than as invisible magic in the background. That framing is the tell: the product category is not "chat" but bounded workflow execution.

By the end of this post, you will know how to:

Tell whether your AI product is a chatbot or a harness from one example interaction
Name the six layers a harness adds around a model that a chatbot does not have
Predict the next three production incidents you will hit if you keep calling a harness a chatbot
Ask the one evaluation question that cuts through agent demos in under a minute

A chatbot talks. A harness coordinates, grounds, and governs.

A chatbot takes an input and returns an output. Its job is conversational: answer a question, draft a reply, explain a concept, summarize information, help a user navigate. That is enough in many cases, and you should not over-engineer past it.

The chatbot model breaks the moment the task requires multiple steps, multiple tools, real-world state, risk controls, reversible actions, or long-running context. At that point you are not building a smarter chatbot. You are building an execution system that happens to use language at the edges.

The rule: if the user's request changes state in another system, you are building a harness whether you called it that or not.

What a harness actually is

A harness is the layer that surrounds the model and makes multi-step execution dependable. It decides what tools the model can use, how those tools are described, which actions run in parallel, which must happen sequentially, how context is carried forward, how retries and failures are handled, when human approval is required, and what gets logged and verified. Without a harness you have text generation with optional tool calling. With one, you have a workflow system.

This is the idea most teams miss: the model is not the product, it is a component inside the product. Two products built on the same foundation model can feel radically different in practice, and the difference is almost never the model — it is the harness around it. The harness adds the six layers raw model output cannot provide on its own: tool orchestration, context management, session grounding, error classification, approval boundaries, and execution policy.

The rule: if you cannot point at where each of those six layers lives in your code, you do not have a harness yet.

Chatbot behavior vs harness behavior, in one exchange

A chatbot says: "I can help you update that."

A harness says: "I found the record. Here is what would change. Approve this update?"

The second experience feels trustworthy because the system is doing more than generating language: it is managing action. It resolved the target, read current state, proposed a diff, and paused at an approval boundary. That is four harness behaviors visible in one reply. If your "agent" cannot produce that shape of response on the request you care about, you have a chatbot with a tool-calling SDK bolted on.
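To make the shape of that reply concrete, here is a minimal sketch in Python of the propose-then-approve pattern. Everything in it is an assumption for illustration: the `RECORDS` store, the `ProposedUpdate` type, and the function names are hypothetical stand-ins for your own system of record and approval flow.

```python
from dataclasses import dataclass


@dataclass
class ProposedUpdate:
    record_id: str
    changes: dict  # field -> (old_value, new_value)
    approved: bool = False


# Hypothetical in-memory store standing in for a real system of record.
RECORDS = {"cust-42": {"email": "old@example.com", "plan": "basic"}}


def propose_update(record_id: str, new_values: dict) -> ProposedUpdate:
    """Resolve the target, read current state, and compute a diff. Writes nothing."""
    current = RECORDS[record_id]  # raises KeyError if the target cannot be resolved
    changes = {
        name: (current.get(name), value)
        for name, value in new_values.items()
        if current.get(name) != value
    }
    return ProposedUpdate(record_id=record_id, changes=changes)


def apply_update(proposal: ProposedUpdate) -> None:
    """Apply a proposal only after it has crossed the approval boundary."""
    if not proposal.approved:
        raise PermissionError("update requires approval before it can run")
    for name, (_old, new) in proposal.changes.items():
        RECORDS[proposal.record_id][name] = new
```

The design choice worth copying is the split: `propose_update` can be called freely because it never writes, and `apply_update` refuses to run until something outside the model has set `approved`.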

Orchestration, context, and session grounding are the hidden work

When a user asks for something moderately complex, the system needs to resolve the target, read current state, search supporting information, decide which tools are needed, execute them in the right order, handle failures cleanly, and return a grounded result. A chatbot can imitate this verbally. A harness can do it. That difference becomes obvious the first time a request touches live systems.
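One slice of that orchestration work, deciding what runs in parallel and what must wait, can be sketched with Python's standard `asyncio`. The three tool functions below are hypothetical stubs; in a real harness they would call external systems and the dependency order would come from your execution policy.

```python
import asyncio


# Hypothetical tool stubs; real ones would call external systems.
async def read_record(record_id: str) -> dict:
    return {"id": record_id, "status": "open"}


async def search_notes(record_id: str) -> list:
    return [f"note about {record_id}"]


async def compose_result(record: dict, notes: list) -> str:
    return f"{record['id']} is {record['status']} with {len(notes)} note(s)"


async def handle_request(record_id: str) -> str:
    # The two reads do not depend on each other, so they run concurrently.
    record, notes = await asyncio.gather(
        read_record(record_id), search_notes(record_id)
    )
    # The final step depends on both results, so it runs only after they finish.
    return await compose_result(record, notes)
```

Calling `asyncio.run(handle_request("rec-1"))` drives the whole sequence. The point is that the ordering is written down in code, not left for the model to rediscover on every request.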

Context is the second hidden layer. A basic chatbot treats context as previous text in the conversation. A harness treats it as user identity, tenant scope, current task state, prior tool results, approval history, recent decisions, and relevant attachments — and it compresses, summarizes, or trims that context over time so the system stays useful in long sessions. The rule: context is a systems problem, not a prompt problem. If your context lives entirely in the message array, you will hit the wall the first time a session runs past ten turns.
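As a rough illustration of context living outside the message array, the sketch below keeps structured session state and folds old tool results into a running digest. The `SessionContext` class, its field names, and the `max_results` budget are all assumptions, and the 40-character truncation is a crude stand-in for real summarization.

```python
from dataclasses import dataclass, field


@dataclass
class SessionContext:
    user_id: str
    tenant_id: str
    task_state: dict = field(default_factory=dict)
    tool_results: list = field(default_factory=list)  # (tool, result), newest last
    summary: str = ""  # running digest of results that were trimmed
    max_results: int = 5  # assumed budget; tune to your context window

    def add_tool_result(self, tool: str, result: str) -> None:
        self.tool_results.append((tool, result))
        # Fold the oldest results into the digest instead of dropping them silently.
        while len(self.tool_results) > self.max_results:
            old_tool, old_result = self.tool_results.pop(0)
            self.summary += f"[{old_tool}: {old_result[:40]}] "

    def render(self) -> str:
        """Build the text the model sees from structured state."""
        lines = [f"user={self.user_id} tenant={self.tenant_id}"]
        if self.summary:
            lines.append(f"earlier: {self.summary.strip()}")
        lines += [f"{tool} -> {result}" for tool, result in self.tool_results]
        return "\n".join(lines)
```

Because the prompt is rendered from state on each turn, trimming becomes a policy you can test rather than a side effect of hitting the window.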

Session grounding is the third. Conversational demos use generic requests. Production requests are scoped to this user, this account, this tenant, this dataset, this inbox, this project. A harness that is not grounded in session context will still produce fluent output, but it will not be operating reliably inside the right boundary. That is the failure mode that creates the worst class of enterprise incident: confidently wrong, in the wrong scope, on real data.
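A minimal sketch of what grounding can look like at the tool boundary, assuming every record carries the tenant that owns it. The `Session` type, the `PROJECTS` data, and `get_project` are hypothetical; the idea is only that the scope check lives in the harness and not in the prompt.

```python
from dataclasses import dataclass


class ScopeError(Exception):
    """Raised when a tool call targets data outside the session's boundary."""


@dataclass(frozen=True)
class Session:
    user_id: str
    tenant_id: str


# Hypothetical data: each record carries the tenant that owns it.
PROJECTS = {
    "proj-1": {"tenant_id": "acme", "name": "Website"},
    "proj-2": {"tenant_id": "globex", "name": "Billing"},
}


def get_project(session: Session, project_id: str) -> dict:
    """Fetch a project only if it belongs to the session's tenant."""
    project = PROJECTS.get(project_id)
    # Treat "missing" and "wrong tenant" the same so IDs do not leak across tenants.
    if project is None or project["tenant_id"] != session.tenant_id:
        raise ScopeError(f"{project_id} is not available in this session")
    return project
```

With this in place, a model that confidently asks for `proj-2` from an `acme` session gets a `ScopeError`, not another tenant's data.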

Retries are not error handling

A naive stack says: "if something fails, try again." A harness asks harder questions. Is this failure retryable, or is it a validation error that should stop immediately? Is this the same error repeating in a loop? Has the batch failed enough times that execution should halt? Should the user see a safe explanation or a technical one?

These are execution questions, not language questions. The rule: every failure mode needs a class, a policy, and a user-visible contract — not a retry count. A harness without that classification will eventually loop on a non-retryable error in front of a paying customer, and the postmortem will not be about the model.
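Here is one way the "class, policy, user-visible contract" rule might look in Python. The three error classes, the mapping from exception types, and the attempt limits are assumptions chosen for illustration; a real classifier would inspect status codes or error payloads from your own tools.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorClass(Enum):
    TRANSIENT = "transient"
    VALIDATION = "validation"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class Policy:
    max_attempts: int
    user_message: str  # the safe, user-visible contract


POLICIES = {
    ErrorClass.TRANSIENT: Policy(3, "The service is not responding. Please try again later."),
    ErrorClass.VALIDATION: Policy(1, "That request is invalid and was not sent."),
    ErrorClass.UNKNOWN: Policy(1, "Something went wrong. The request was stopped."),
}


def classify(exc: Exception) -> ErrorClass:
    # Assumed mapping; a real one would inspect status codes or error payloads.
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return ErrorClass.TRANSIENT
    if isinstance(exc, ValueError):
        return ErrorClass.VALIDATION
    return ErrorClass.UNKNOWN


def run_with_policy(action):
    """Run action() and return (ok, value_or_user_message)."""
    attempt = 0
    while True:
        attempt += 1
        try:
            return True, action()
        except Exception as exc:
            policy = POLICIES[classify(exc)]
            # Stop as soon as this error's class says no more attempts are allowed.
            if attempt >= policy.max_attempts:
                return False, policy.user_message
```

A validation error stops after one attempt, a timeout gets up to three, and in both cases the user sees the policy's message, never the raw exception.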

What buyers should actually be evaluating

When you evaluate an AI system for real business use, the questions "how smart is the model?" and "how good are the answers?" are necessary but not sufficient. Ask the harness questions as well: How are tools routed? How are risky actions gated? How is context managed over long sessions? How are errors classified and handled? How are outputs grounded in live state? What happens when the system is unsure?

Those six questions tell you whether you are looking at a chatbot with extra buttons or a harness that is ready to do work. If the vendor answers any of them in language rather than in mechanism, the reliability is not there yet.

What you will hit if you keep calling your harness a chatbot

Three predictions for the team that builds a harness around a model without admitting that is what they are doing:

Your first six tools will work. The seventh will reveal you have no tool routing policy, and you will discover this in a customer incident. Up to six tools, the model picks correctly often enough that it feels solved. At seven to ten, accuracy collapses and you cannot tell why because you never wrote down which tool wins in which state. The fix is a routing function, not a longer system prompt.
Your first long session will silently lose state, and you will blame the model. Without explicit context management — compression, summarization, scoped retrieval — your assistant will hit the context window mid-session and start fabricating fields it had thirty turns ago. The model is doing exactly what it was asked to do with the context you gave it. The harness is what is missing.
Your first irreversible action will go out without an approval boundary, and the customer will see it before you do. Every harness eventually grows an action that should have paused for confirmation. If approval is not a first-class concept in your execution policy, it will be retrofitted under incident pressure, badly, and the audit trail will not exist yet.
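For the first prediction, "a routing function, not a longer system prompt" can be as small as a lookup table that writes down which tools are allowed in which state. The intents, states, and tool names below are hypothetical; the sketch only shows the shape.

```python
class RoutingError(Exception):
    """Raised when no route is defined for an intent/state combination."""


# Hypothetical tool names and states, for illustration only.
ROUTES = {
    ("lookup", "no_record_selected"): ["search_records"],
    ("lookup", "record_selected"): ["read_record"],
    ("update", "no_record_selected"): ["search_records"],
    ("update", "record_selected"): ["read_record", "propose_update"],
}


def allowed_tools(intent: str, state: str) -> list:
    """Return the tools the model may call for this intent in this state."""
    try:
        return ROUTES[(intent, state)]
    except KeyError:
        # An unmapped combination is a routing gap, so fail loudly rather than guess.
        raise RoutingError(f"no route for intent={intent!r} state={state!r}") from None
```

The model still chooses among the returned tools, but it chooses from a short list you wrote down, and a gap in the table shows up as a `RoutingError` in testing and not as a customer incident.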

If you are already feeling the pull of any of these, you are further along than your product description admits.

The clarifying question

The next time someone demos an "agent," ask one question: what harness exists around the model? If the answer is vague — "we use function calling," "the model decides," "we prompt it carefully" — the reliability is vague too. If the answer names the six layers (tool orchestration, context management, session grounding, error classification, approval boundaries, execution policy) and points at where each one lives, you are looking at a system that can do work.

Calling a harness a chatbot undersells the engineering. Calling a chatbot a harness oversells the reliability. The clean framing: a chatbot is conversation. A harness is conversation plus controlled execution. The next wave of practical AI value lives in the second category, and only the teams who design for it honestly will ship it.

If you are not sure whether your AI product is a chatbot or a harness, send me one example interaction (the user message, the system's reply, and what happened in your backend between them) and I will tell you which one you are running and what the next step toward harness behavior is.