Don't build a second agent

What I learned from adding a thin CLI over a shared AI runtime, and why giving an agent access to that CLI turns debugging into real healthchecks.


The most useful AI CLI is often not a second agent. It is a thin, truthful path into the real runtime.

I recently worked on a pattern that taught me more than I expected.

We wanted a CLI for a production AI system.

At first, that sounds simple.

Add a command. Send a prompt. Stream a response. Move on.

But the real design choice was not about terminal UX.

It was about whether the CLI should become its own local agent.

I think that is where a lot of teams go wrong.

The tempting move is to make the CLI smart.

Give it its own local tool stack.

Let it run its own agent loop.

Make it feel self-contained.

That sounds flexible.

It also creates a second system to debug.

We chose the opposite path.

We kept the CLI thin and remote-only. It authenticated, selected an environment, sent requests to the existing server-side runtime, and streamed the response back. It did not become its own agent.
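The whole client can stay close to this shape. A minimal sketch, assuming a hypothetical HTTP runtime endpoint (`/v1/runs`) and bearer-token auth; all names here are illustrative, not the real API:

```python
import json
import urllib.request


def build_run_request(base_url: str, token: str, prompt: str) -> urllib.request.Request:
    """Build the single request the CLI sends. Execution, tools, and
    policy all stay on the server; the client only authenticates and asks."""
    return urllib.request.Request(
        f"{base_url}/v1/runs",
        data=json.dumps({"prompt": prompt}).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def stream_run(req: urllib.request.Request):
    """Yield newline-delimited JSON events streamed back from the
    shared runtime. No agent loop, no local tool registry."""
    with urllib.request.urlopen(req) as resp:
        for line in resp:
            if line.strip():
                yield json.loads(line)
```

Everything interesting happens server-side; the CLI is a request builder and a stream reader, nothing more.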

That decision ended up being the most important part of the work.

Why the CLI should not become a second runtime

The system already had a real runtime.

The app used it.

The server already handled authentication, request context, permissions, tenant scope, and tool execution.

So the question was simple:

Should the CLI be a second agent, or should it be a thin client over the first one?

I did not want a second agent.

I did not want a second tool registry.

I did not want a second place for behavior to drift.

I wanted one real runtime and one more way to reach it.

That sounds like a small architectural preference.

It is actually an operational decision.

Drift makes every failure harder to explain

The moment a CLI becomes its own local agent, every debugging session gets harder.

Now when behavior differs, you have too many possible explanations.

Is the problem in the app runtime?

Is it in the CLI runtime?

Is it a permission mismatch?

Is it a tool registration problem?

Is the request shape different?

Is policy being applied differently for terminal runs?

Your debugging tool becomes another source of drift.

That is the opposite of what a good operator surface should do.

I wanted the CLI to tell the truth about the real system.

That meant it had to stay boring.

It had to use the same authenticated path, the same execution runtime, the same tools, and the same policies.

Thin clients reveal real contract gaps

That choice paid off almost immediately.

The first lesson was about authentication.

We supported a browser-assisted login flow so an internal operator could reuse an authenticated app session to sign the CLI in. But we treated that path as a security boundary, not a shortcut.

The bridge was hidden behind a frontend flag.

The bridge page called a server entitlement check before exporting any session material.

The server gate failed closed and only allowed verified internal operator access.

That mattered because operator tooling is powerful by definition. Convenience without control is just a future incident.
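The fail-closed gate can be sketched in a few lines. The operator-check callable here is hypothetical; the point is the shape: any missing signal or internal error denies.

```python
def can_export_session(user, is_internal_operator) -> bool:
    """Allow session export only when every check passes. Any missing
    user, False result, or raised error denies: the gate fails closed."""
    try:
        if user is None:
            return False
        return bool(is_internal_operator(user))
    except Exception:
        # An entitlement-service error means "no", never "probably fine".
        return False
```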

The next lesson was smaller but practical.

We stored auth state separately per environment.

That removed a whole class of confusing context bugs.

Once the CLI kept the right saved session for each target, it became much easier to run real tests without wondering whether the terminal had inherited the wrong auth or tenant context.
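Per-environment auth state is not clever code, just a naming discipline. A sketch, assuming a simple one-JSON-file-per-environment layout; the paths and field names are illustrative:

```python
import json
from pathlib import Path


def auth_path(env: str, root: Path) -> Path:
    """Each environment gets its own file, so switching targets can
    never silently reuse another environment's session or tenant."""
    return root / f"auth.{env}.json"


def save_session(env: str, session: dict, root: Path) -> None:
    root.mkdir(parents=True, exist_ok=True)
    auth_path(env, root).write_text(json.dumps(session))


def load_session(env: str, root: Path):
    """Return the saved session for this environment, or None."""
    p = auth_path(env, root)
    return json.loads(p.read_text()) if p.exists() else None
```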

Then came the more important lesson.

The CLI seemed to work.

It could authenticate.

It could submit a request.

It could stream a response.

But successful transport did not prove tool parity.

That was the real trap.

A run can succeed without doing the real work you think it is doing.

In our case, the problem turned out to be contract drift at the CLI edge. The CLI was still sending old tool aliases, while the shared runtime expected canonical tool names and filtered them literally.

So the request completed. The response streamed. But the real tool path was not actually being invoked.

That was a very useful failure.

It proved something I now believe strongly:

A successful response is not the same as a successful capability.

To prove parity, we needed live runs that emitted real tool-step activity through the shared runtime.

We fixed the issue where it belonged.

We normalized the stale aliases at the CLI edge before sending the request.

We did not loosen the server runtime.

We did not broaden policy to hide the mismatch.

We did not patch around the symptom in the most central part of the system.
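The fix itself is small. A sketch of edge normalization, with a hypothetical alias table standing in for the real one: the client maps its stale names to the canonical ones the runtime filters on, and the server contract stays strict.

```python
# Hypothetical alias table: old client-side names -> canonical runtime names.
ALIASES = {
    "search_docs": "docs.search",
    "read_file": "files.read",
}


def normalize_tools(requested: list) -> list:
    """Rewrite stale aliases to canonical tool names before the request
    leaves the CLI. Unknown names pass through untouched; the server's
    literal filtering is never loosened to accommodate the client."""
    return [ALIASES.get(name, name) for name in requested]
```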

That mattered to me.

It kept the architecture clean.

It also made the contract more honest.

Once that was in place, live CLI runs showed real streamed tool activity through the shared runtime, not just text output that looked plausible.
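A parity check on a streamed run can encode exactly that distinction. The event field names below are assumptions, not the real schema; the shape of the test is the point: a run only counts if it contains tool-step activity, not just plausible text.

```python
def saw_tool_activity(events: list) -> bool:
    """True only if at least one streamed event is a tool step.
    Streamed text alone proves transport, not capability."""
    return any(e.get("type") == "tool_step" for e in events)
```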

That was the moment the CLI became more than a convenience.

It became a healthcheck surface.

Shared runtimes turn debugging into healthchecks

This is the part I think is most underrated.

A thin CLI over the same runtime as the app changes what debugging means.

You are no longer testing a fake local setup.

You are no longer validating a second agent that only exists for developer convenience.

You are testing the real path.

That means you can verify:

  1. Real authentication.
  2. Real permissions.
  3. Real tool execution.
  4. Real reads.
  5. Real writes.
  6. Real persistence.

That is much closer to operations than to demo-style debugging.

It gets even better when you give an agent access to that CLI.

Now the agent is not debugging against a toy environment.

It is not using a mocked local harness.

It is using the same authenticated, production-like path that the application uses.

That turns debugging into real healthchecks.

Instead of asking, "Does my local setup look okay?"

You can ask, "Did the real system authenticate correctly, invoke the real tool, touch the real data, and leave the real state behind?"

That is a much more valuable question.

This also exposed something important.

In one live check, a write operation appeared to succeed, but a follow-up read did not clearly confirm the expected persisted state.

That is exactly the kind of gap a thin CLI can surface.

It does not just tell you that a request completed.

It helps you test whether the full system behaved the way you thought it did.
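That kind of check is easy to make explicit. A sketch of a write/read-back healthcheck, where the two callables stand in for real CLI commands against the shared runtime:

```python
def check_persistence(write_record, read_record, key: str, value: str) -> bool:
    """Don't trust the write's success response. Read the state back
    through the same runtime and compare: only the read-back proves
    the system persisted what it claimed to."""
    write_record(key, value)           # may "succeed" without persisting
    return read_record(key) == value   # the follow-up read is the check
```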

The design rule I would reuse

If I were building this pattern again, I would keep the same rules.

Keep the CLI thin.

Let the server own execution.

Let the shared runtime own tools and policy.

Treat auth and operator access as real security boundaries.

Use live runs to prove capability, not just transport.

And if agents are going to help debug the system, give them access to the same truthful path humans use.

I think this is one of the better ways to make AI systems more operable without making them more confusing.

You do not need a clever local agent.

You need a reliable path into the real one.

The real lesson

The biggest lesson for me was not that CLIs are useful.

It was that a thin CLI can become a serious operational surface.

When it shares the real runtime, it can tell you whether the system actually works across auth, tools, data access, writes, and persistence.

And when an agent can use that same CLI, it stops doing fake debugging.

It starts running real healthchecks.

That is the pattern I would reuse.

Not a second agent.

A truthful path into the first one.