Don't build a second agent

What I learned from adding a thin CLI over a shared AI runtime, and why giving an agent access to that CLI turns debugging into real healthchecks.


The most useful AI CLI is often not a second agent. It is a thin, truthful path into the real runtime.

I recently worked on a pattern that taught me more than I expected.

We wanted a CLI for a production AI system.

At first, that sounds simple.

Add a command. Send a prompt. Stream a response. Move on.

But the real design choice was not about terminal UX.

It was about whether the CLI should become its own local agent.

I think that is where a lot of teams go wrong.

The tempting move is to make the CLI smart.

Give it its own local tool stack.

Let it run its own agent loop.

Make it feel self-contained.

That sounds flexible.

It also creates a second system to debug.

We chose the opposite path.

We kept the CLI thin and remote-only. It authenticated, selected an environment, sent requests to the existing server-side runtime, and streamed the response back. It did not become its own agent.
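The whole client can stay close to this shape. A minimal sketch, assuming a hypothetical HTTP runtime endpoint (`/v1/runs`) and bearer-token auth; all names here are illustrative, not the real API:

```python
import json
import urllib.request


def build_run_request(base_url: str, token: str, prompt: str) -> urllib.request.Request:
    """Build the single request the CLI sends. Execution, tools, and
    policy all stay on the server; the client only authenticates and asks."""
    return urllib.request.Request(
        f"{base_url}/v1/runs",
        data=json.dumps({"prompt": prompt}).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def stream_run(req: urllib.request.Request):
    """Yield newline-delimited JSON events streamed back from the
    shared runtime. No agent loop, no local tool registry."""
    with urllib.request.urlopen(req) as resp:
        for line in resp:
            if line.strip():
                yield json.loads(line)
```

Everything interesting happens server-side; the CLI is a request builder and a stream reader, nothing more.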

That decision ended up being the most important part of the work.

Why the CLI should not become a second runtime

The system already had a real runtime.

The app used it.

The server already handled authentication, request context, permissions, tenant scope, and tool execution.

So the question was simple:

Should the CLI be a second agent, or should it be a thin client over the first one?

I did not want a second agent.

I did not want a second tool registry.

I did not want a second place for behavior to drift.

I wanted one real runtime and one more way to reach it.

That sounds like a small architectural preference.

It is actually an operational decision.

Drift makes every failure harder to explain

The moment a CLI becomes its own local agent, every debugging session gets harder.

Now when behavior differs, you have too many possible explanations.

Is the problem in the app runtime?

Is it in the CLI runtime?

Is it a permission mismatch?

Is it a tool registration problem?

Is the request shape different?

Is policy being applied differently for terminal runs?

Your debugging tool becomes another source of drift.

That is the opposite of what a good operator surface should do.

I wanted the CLI to tell the truth about the real system.

That meant it had to stay boring.

It had to use the same authenticated path, the same execution runtime, the same tools, and the same policies.

Thin clients reveal real contract gaps

That choice paid off almost immediately.

The first lesson was about authentication.

We supported a browser-assisted login flow so an internal operator could reuse an authenticated app session to sign the CLI in. But we treated that path as a security boundary, not a shortcut.

The bridge was hidden behind a frontend flag.

The bridge page called a server entitlement check before exporting any session material.

The server gate failed closed and only allowed verified internal operator access.

That mattered because operator tooling is powerful by definition. Convenience without control is just a future incident.
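The fail-closed gate can be sketched in a few lines. The operator-check callable here is hypothetical; the point is the shape: any missing signal or internal error denies.

```python
def can_export_session(user, is_internal_operator) -> bool:
    """Allow session export only when every check passes. Any missing
    user, False result, or raised error denies: the gate fails closed."""
    try:
        if user is None:
            return False
        return bool(is_internal_operator(user))
    except Exception:
        # An entitlement-service error means "no", never "probably fine".
        return False
```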

The next lesson was smaller but practical.

We stored auth state separately per environment.

That removed a whole class of confusing context bugs.

Once the CLI kept the right saved session for each target, it became much easier to run real tests without wondering whether the terminal had inherited the wrong auth or tenant context.
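Per-environment auth state is not clever code, just a naming discipline. A sketch, assuming a simple one-JSON-file-per-environment layout; the paths and field names are illustrative:

```python
import json
from pathlib import Path


def auth_path(env: str, root: Path) -> Path:
    """Each environment gets its own file, so switching targets can
    never silently reuse another environment's session or tenant."""
    return root / f"auth.{env}.json"


def save_session(env: str, session: dict, root: Path) -> None:
    root.mkdir(parents=True, exist_ok=True)
    auth_path(env, root).write_text(json.dumps(session))


def load_session(env: str, root: Path):
    """Return the saved session for this environment, or None."""
    p = auth_path(env, root)
    return json.loads(p.read_text()) if p.exists() else None
```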

Then came the more important lesson.

The CLI seemed to work.

It could authenticate.

It could submit a request.

It could stream a response.

But successful transport did not prove tool parity.

That was the real trap.

A run can succeed without doing the real work you think it is doing.

In our case, the problem turned out to be contract drift at the CLI edge. The CLI was still sending old tool aliases, while the shared runtime expected canonical tool names and filtered them literally.

So the request completed. The response streamed. But the real tool path was not actually being invoked.

That was a very useful failure.

It proved something I now believe strongly:

A successful response is not the same as a successful capability.

To prove parity, we needed live runs that emitted real tool-step activity through the shared runtime.

We fixed the issue where it belonged.

We normalized the stale aliases at the CLI edge before sending the request.

We did not loosen the server runtime.

We did not broaden policy to hide the mismatch.

We did not patch around the symptom in the most central part of the system.
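The fix itself is small. A sketch of edge normalization, with a hypothetical alias table standing in for the real one: the client maps its stale names to the canonical ones the runtime filters on, and the server contract stays strict.

```python
# Hypothetical alias table: old client-side names -> canonical runtime names.
ALIASES = {
    "search_docs": "docs.search",
    "read_file": "files.read",
}


def normalize_tools(requested: list) -> list:
    """Rewrite stale aliases to canonical tool names before the request
    leaves the CLI. Unknown names pass through untouched; the server's
    literal filtering is never loosened to accommodate the client."""
    return [ALIASES.get(name, name) for name in requested]
```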

That mattered to me.

It kept the architecture clean.

It also made the contract more honest.

Once that was in place, live CLI runs showed real streamed tool activity through the shared runtime, not just text output that looked plausible.
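A parity check on a streamed run can encode exactly that distinction. The event field names below are assumptions, not the real schema; the shape of the test is the point: a run only counts if it contains tool-step activity, not just plausible text.

```python
def saw_tool_activity(events: list) -> bool:
    """True only if at least one streamed event is a tool step.
    Streamed text alone proves transport, not capability."""
    return any(e.get("type") == "tool_step" for e in events)
```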

That was the moment the CLI became more than a convenience.

It became a healthcheck surface.

Shared runtimes turn debugging into healthchecks

This is the part I think is most underrated.

A thin CLI over the same runtime as the app changes what debugging means.

You are no longer testing a fake local setup.

You are no longer validating a second agent that only exists for developer convenience.

You are testing the real path.

That means you can verify:

  1. Real authentication.
  2. Real permissions.
  3. Real tool execution.
  4. Real reads.
  5. Real writes.
  6. Real persistence.

That is much closer to operations than to demo-style debugging.

It gets even better when you give an agent access to that CLI.

Now the agent is not debugging against a toy environment.

It is not using a mocked local harness.

It is using the same authenticated, production-like path that the application uses.

That turns debugging into real healthchecks.

Instead of asking, "Does my local setup look okay?"

You can ask, "Did the real system authenticate correctly, invoke the real tool, touch the real data, and leave the real state behind?"

That is a much more valuable question.

This also exposed something important.

In one live check, a write operation appeared to succeed, but a follow-up read did not clearly confirm the expected persisted state.

That is exactly the kind of gap a thin CLI can surface.

It does not just tell you that a request completed.

It helps you test whether the full system behaved the way you thought it did.
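That kind of check is easy to make explicit. A sketch of a write/read-back healthcheck, where the two callables stand in for real CLI commands against the shared runtime:

```python
def check_persistence(write_record, read_record, key: str, value: str) -> bool:
    """Don't trust the write's success response. Read the state back
    through the same runtime and compare: only the read-back proves
    the system persisted what it claimed to."""
    write_record(key, value)           # may "succeed" without persisting
    return read_record(key) == value   # the follow-up read is the check
```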

The design rule I would reuse

If I were building this pattern again, I would keep the same rules.

Keep the CLI thin.

Let the server own execution.

Let the shared runtime own tools and policy.

Treat auth and operator access as real security boundaries.

Use live runs to prove capability, not just transport.

And if agents are going to help debug the system, give them access to the same truthful path humans use.

I think this is one of the better ways to make AI systems more operable without making them more confusing.

You do not need a clever local agent.

You need a reliable path into the real one.

The real lesson

The biggest lesson for me was not that CLIs are useful.

It was that a thin CLI can become a serious operational surface.

When it shares the real runtime, it can tell you whether the system actually works across auth, tools, data access, writes, and persistence.

And when an agent can use that same CLI, it stops doing fake debugging.

It starts running real healthchecks.

That is the pattern I would reuse.

Not a second agent.

A truthful path into the first one.