Building a CLI for AI agents. The hard part was the contract.

A case study in designing deterministic tool contracts for AI agents: machine-readable outputs, bounded waits, and failure modes a caller can recover from.


The interesting part of this project was not the HTTP wrapper. It was defining a subprocess contract an agent could not easily misunderstand.

This started as findymail-cli, an unofficial CLI for the Findymail API.

At first glance, it looked like a small project.

Wrap the API. Add a few commands. Print some JSON. Ship it.

That was not the hard part.

The durable lesson was how to define a subprocess contract that a headless, timeout-bound caller could actually rely on.

That is what changes when the caller is not a person at a terminal.

Why a normal CLI becomes ambiguous under automation

Most CLIs are designed for a human operator.

That usually works fine.

A person can notice a prompt, read around noisy output, and rerun a command if the first result is unclear.

An agent does not get that luxury.

If a command hangs on interactive stdin, mixes prose into stdout, quietly chooses one input source over another, or polls forever, the tool is not just annoying. It is unreliable infrastructure.

That is the core problem.

In this kind of agent-facing CLI, the hard part was not wrapping an API.

It was defining a contract that remains reliable when the caller is headless, timeout-bound, and occasionally wrong.

In other words, the design target changes.

The goal is no longer just “make this pleasant in a terminal.”

The goal is “make this safe to call from automation.”

That same idea extends beyond CLIs.

This repo is small, but the same pressure shows up anywhere a process boundary sits inside automation.

At a process boundary, tool use starts to look like protocol design. Inputs have to be unambiguous. Outputs have to be machine-readable. Failures have to be classifiable. Waits have to be budgeted.

Where ambiguity becomes failure

Under automation, small CLI quirks turn into system failures.

With auth configured, for example:

$ findymail verify --stdin
{"ok":false,"error":{"type":"usage","message":"Refusing to read interactive stdin; pipe data or use --json/--input"}}

That is the kind of failure I want from an automation boundary.

Immediate, machine-readable, and recoverable.

In this repo, the failure modes were not dramatic.

They were ordinary:

  • Commander wanted to print human-oriented errors to stderr and exit on its own terms
  • a POST command could have accepted both a flag and piped input unless I made that illegal
  • --stdin could have blocked on a TTY
  • intellimatch search --wait could have kept polling until the caller's budget was gone

None of those feel severe when a person is sitting at the keyboard.

Under automation, each is a contract bug.

The cost is not only inconvenience.

The cost is broken parsing, wasted time budgets, unclear retry behavior, and a tool boundary another system cannot trust.

That is why structured failures matter.

They do not only report what went wrong. They give the caller enough signal to apply a recovery policy without scraping text.

A caller can map the current error classes to a policy like this:

  • usage (e.g. invalid JSON, missing flag) → exit code 2; no retry; fix the tool call
  • config (e.g. missing API key) → exit code 2; no retry; repair the environment
  • timeout (e.g. a waited job exceeded its budget) → exit code 1; maybe retry with a larger budget, or fall back
  • network (e.g. connection reset, DNS issue) → exit code 1; usually retry with backoff
  • api (e.g. 429, 5xx, malformed upstream response) → exit code 1; retry depends; inspect the status and apply policy
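That policy can be encoded directly. Here is a minimal caller-side sketch in TypeScript; the error class names match the list above, but the function and its action types are hypothetical, not part of the repo:

```typescript
// Hypothetical caller-side recovery policy for the failure classes above.
type FailureClass = "usage" | "config" | "timeout" | "network" | "api";

type Action =
  | { kind: "fix-call" }                 // repair the tool invocation
  | { kind: "fix-env" }                  // repair the environment (e.g. set the API key)
  | { kind: "retry"; backoffMs: number } // try again after a delay
  | { kind: "inspect" };                 // examine upstream status before deciding

function recoveryAction(cls: FailureClass, attempt: number): Action {
  switch (cls) {
    case "usage":
      return { kind: "fix-call" };
    case "config":
      return { kind: "fix-env" };
    case "timeout":
      // Retry once with a larger budget, then treat it as a caller problem.
      return attempt === 0 ? { kind: "retry", backoffMs: 0 } : { kind: "fix-call" };
    case "network":
      // Exponential backoff for transient network trouble.
      return { kind: "retry", backoffMs: 500 * 2 ** attempt };
    case "api":
      return { kind: "inspect" };
  }
}
```

The point of the sketch is that the mapping is total: every failure class resolves to exactly one action without scraping error text.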

That is one of the biggest differences between a human-facing tool and an automation-facing one.

The job is not only to fail clearly.

The job is to fail in a way another system can recover from without guessing.

Choosing where to be strict

This project looked small on paper, but the real design work was deciding where not to be permissive.

The first instinct was to be more forgiving.

Accept input from either flags or pipes. Let Commander print its normal errors. Keep polling until a job finishes. All of that feels reasonable from a terminal.

In practice, each helpful default turned into ambiguity or wasted time a caller would have to guess through. That was the point where the project stopped feeling like a wrapper and started feeling like a contract.

The most useful choices were the strict ones.

No interactive stdin

src/core/input.ts makes --stdin pipe-only.

If stdin is attached to a TTY, the CLI fails fast instead of waiting around for input that may never arrive.

That is the right tradeoff for automation.

A clean failure is recoverable.

A silent hang is not.
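The repo's real guard lives in src/core/input.ts; as a rough sketch, the decision reduces to a pure check on whether stdin is a TTY (names here are illustrative):

```typescript
// Illustrative sketch of a pipe-only stdin guard. Returns a structured
// result instead of blocking; the caller prints the error to stderr and
// exits with the usage exit code.
type StdinCheck =
  | { ok: true }
  | { ok: false; error: { type: "usage"; message: string } };

function checkStdinSource(isTTY: boolean | undefined): StdinCheck {
  if (isTTY) {
    // Interactive terminal: fail fast rather than wait for input
    // that may never arrive.
    return {
      ok: false,
      error: {
        type: "usage",
        message: "Refusing to read interactive stdin; pipe data or use --json/--input",
      },
    };
  }
  return { ok: true }; // piped or redirected stdin is fine
}

// Call site sketch: checkStdinSource(process.stdin.isTTY)
```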

Exactly one input source

That same input layer enforces a one-of rule.

Every POST-style command must receive exactly one of:

  • --json
  • --input
  • --stdin

That removes an entire class of ambiguity.

There are no hidden precedence rules between flags and pipes, no accidental dual-input cases, and no guessing which source the CLI trusted.
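A sketch of that one-of rule, assuming hypothetical flag names that mirror the CLI options (the repo's actual enforcement is in its input layer):

```typescript
// Illustrative one-of rule: a POST-style command must receive exactly one
// input source, or it fails as a usage error before any network work.
interface InputFlags {
  json?: string;   // --json '<payload>'
  input?: string;  // --input file.json
  stdin?: boolean; // --stdin (piped)
}

function selectInputSource(flags: InputFlags): "json" | "input" | "stdin" {
  const provided = (["json", "input", "stdin"] as const).filter(
    (k) => flags[k] !== undefined && flags[k] !== false,
  );
  if (provided.length !== 1) {
    throw new Error(
      `usage: provide exactly one of --json, --input, --stdin (got ${provided.length})`,
    );
  }
  return provided[0];
}
```

Because the rule is "exactly one", there is nothing to document about precedence: two sources is as much an error as zero.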

Success on stdout, errors on stderr

src/core/output.ts is intentionally boring.

On data-returning success paths, commands print a single JSON payload to stdout.

On failure paths, they print structured JSON to stderr.

That split matters because it keeps normal command execution from mixing prose into data-returning stdout paths, and keeps failures machine-readable on stderr.

It also means an agent never has to scrape surrounding prose to figure out what actually happened.
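The whole split can be sketched as two formatters, one per stream (illustrative names; the repo's version is src/core/output.ts):

```typescript
// Illustrative output split: one JSON document per stream, nothing else.
function formatSuccess(payload: unknown): string {
  return JSON.stringify(payload) + "\n";
}

function formatError(type: string, message: string): string {
  return JSON.stringify({ ok: false, error: { type, message } }) + "\n";
}

// Wiring sketch:
//   process.stdout.write(formatSuccess(data));          // success paths
//   process.stderr.write(formatError("usage", msg));    // failure paths
```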

Async work gets a time budget

Findymail's Intellimatch flow made the same point in a different form.

For a human, “start a job and keep checking until it finishes” can feel acceptable.

For an agent, indefinite waiting is a bug.

So src/commands/intellimatch.ts and src/core/polling.ts make waiting explicit.

  • polling only happens with --wait
  • --poll-interval must be valid
  • --max-wait must be valid
  • the total wait is bounded
  • terminal failure states become structured errors

One detail matters more than it first appears to.

The waited status checks track the remaining wall-clock budget and pass it into each status request's timeout.

That does not bound the initial /api/intellimatch/search call. FINDYMAIL_TIMEOUT_MS still governs that.

But it does keep the polling phase inside one predictable envelope, which is the part most likely to waste an agent's budget.
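The shape of that bounded loop, as a sketch (function names and signatures are illustrative, not the repo's actual exports):

```typescript
// Illustrative budgeted polling: track the remaining wall-clock budget and
// pass it into each status request's timeout, so the polling phase can never
// exceed maxWaitMs in total.
async function pollUntilDone<T>(
  checkStatus: (timeoutMs: number) => Promise<T | null>, // null = not terminal yet
  maxWaitMs: number,
  pollIntervalMs: number,
): Promise<T> {
  const deadline = Date.now() + maxWaitMs;
  for (;;) {
    const remaining = deadline - Date.now();
    if (remaining <= 0) throw new Error("timeout: waited job exceeded budget");
    // Each status request may use at most the remaining budget.
    const result = await checkStatus(remaining);
    if (result !== null) return result;
    const sleep = Math.min(pollIntervalMs, deadline - Date.now());
    if (sleep <= 0) throw new Error("timeout: waited job exceeded budget");
    await new Promise((resolve) => setTimeout(resolve, sleep));
  }
}
```

Shrinking the per-request timeout to the remaining budget is the detail that matters: without it, one slow status request could blow past --max-wait on its own.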

Framework defaults needed to be overridden

Another important part of the struggle was that mature CLI libraries are usually optimized for humans.

In src/cli.ts, I had to override Commander's default behavior so the CLI could own its output and error format instead of letting the library print text to stderr and exit on its own terms.

That is a small code change with a large contract impact.
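The pattern, roughly: make the library throw instead of exiting, silence its prose, and convert the thrown error into the CLI's own structured usage error. With Commander specifically that means `exitOverride()` and `configureOutput()` (check the commander docs for exact signatures); the helper below is a hypothetical sketch, not the repo's code:

```typescript
// Sketch: convert a CLI library's thrown parse error into this CLI's
// structured usage error. With Commander, something like
//   program.exitOverride().configureOutput({ writeErr: () => {} });
// makes parse failures throw instead of printing and exiting.
function toStructuredUsageError(err: Error): string {
  return JSON.stringify({ ok: false, error: { type: "usage", message: err.message } }) + "\n";
}

// Wiring sketch:
// try {
//   program.parse(process.argv);
// } catch (err) {
//   process.stderr.write(toStructuredUsageError(err as Error));
//   process.exitCode = 2;
// }
```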

What the repo actually guarantees

Once those constraints were explicit, the project stopped feeling like “a wrapper around an API” and started behaving like a protocol boundary.

What this repo stabilizes is the subprocess behavior around the API, not the upstream response schema itself.

The most important artifact in the repo is not a single command.

It is the contract written down in docs/cli-contract.md and enforced by the test suite:

  1. No interactive prompts.
  2. Success payloads go to stdout as JSON.
  3. Errors go to stderr as structured JSON.
  4. Auth comes from FINDYMAIL_API_KEY.
  5. POST-style commands accept exactly one of --json, --input, or --stdin.
  6. --stdin rejects interactive terminal input.
  7. Intellimatch polling is opt-in and bounded.
  8. Usage and config failures exit differently from remote/API failures.

The contract also includes the operational knobs an automated caller actually needs: FINDYMAIL_API_KEY for auth, FINDYMAIL_BASE_URL for testability, FINDYMAIL_TIMEOUT_MS for per-request bounds, exit code 2 for local usage and config failures, and exit code 1 for remote, API, network, and timeout failures.

Successful payloads still mirror upstream JSON.

The determinism here is in the process contract around that payload.

That is the real transformation.

The CLI became deterministic enough that another process could safely reason about it.

There was one abstraction that helped a lot here.

src/commands/shared.ts centralizes the behavior every new command should inherit:

  • JSON input options
  • runtime config loading
  • HTTP request execution
  • success output handling
  • command examples in help text
  • Commander setup that avoids noisy default behavior

That abstraction mattered because it standardized guarantees, not just because it saved lines of code.
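To make the idea concrete, here is a hypothetical shape such a factory could take; none of these names come from src/commands/shared.ts, and the real version also wires up input options and Commander setup:

```typescript
// Hypothetical shared-command factory: every POST-style command is a spec
// plus an injected request function, and success is always one JSON line.
interface CommandSpec {
  name: string;     // CLI command name
  path: string;     // API path to POST to
  examples: string[]; // real invocations shown in --help
}

function makePostCommand(
  spec: CommandSpec,
  request: (path: string, body: unknown) => Promise<unknown>,
): (payload: unknown) => Promise<string> {
  // Returns the single JSON line the command prints to stdout on success.
  return async (payload) => JSON.stringify(await request(spec.path, payload)) + "\n";
}
```

When every command is built this way, a new endpoint cannot accidentally opt out of the contract: it inherits the output shape by construction.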

Cursor's cli-for-agents skill was useful as a review checklist, but the contract decisions came from the repo's actual failure modes and tests.

The strongest proof that this transformation held was the test suite.

tests/e2e/contract.test.ts verifies things like:

  • missing API keys fail as structured config errors
  • successful requests emit machine-readable JSON on stdout
  • invalid input produces structured usage errors
  • empty --json and empty --input stay deterministic
  • top-level help remains discoverable

tests/e2e/intellimatch.test.ts then locks down the async path:

  • waited searches poll in the expected sequence
  • failed jobs become structured API errors
  • jobs that never reach a terminal state time out cleanly
  • missing hashes are treated as upstream contract failures
  • invalid polling arguments fail before network work begins

tests/unit/core.test.ts covers lower-level boundaries too, including rejecting interactive stdin and enforcing polling deadlines.
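The common shape of those e2e checks is "spawn the CLI as a subprocess, assert on both streams and the exit code." A generic sketch of that harness (the helper name and the commented assertions are illustrative, not copied from the test files):

```typescript
import { spawnSync } from "node:child_process";

// Illustrative e2e harness: run a command as a subprocess and capture the
// three things the contract is about: exit code, stdout, stderr.
function runProcess(
  cmd: string,
  args: string[],
  env: Record<string, string | undefined>,
) {
  const res = spawnSync(cmd, args, { env, encoding: "utf8" });
  return { code: res.status, stdout: res.stdout, stderr: res.stderr };
}

// Assertion shape, paraphrased from the contract tests:
// const { code, stderr } = runProcess("findymail", ["verify", "--json", "{}"],
//   { ...process.env, FINDYMAIL_API_KEY: "" });
// expect(code).toBe(2);
// expect(JSON.parse(stderr).error.type).toBe("config");
```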

For an agent-facing CLI, that is what the product looks like.

The process contract is the behavior.

A checklist I would reuse

If I were building another CLI for agents tomorrow, I would start here:

  1. Make every important input expressible non-interactively.
  2. Keep successful machine output on stdout only.
  3. Keep structured errors on stderr only.
  4. Define exit code semantics early.
  5. Accept stdin where it helps, but reject interactive hangs.
  6. Avoid ambiguous combinations of input modes.
  7. Put a budget around every async wait.
  8. Keep command shapes predictable.
  9. Put real examples in --help.
  10. Write end-to-end tests for the contract, not just helpers.

That is the reusable part of this project.

The command set matters, but the contract matters more.

Some of the best decisions here were about what not to add.

I kept successful output as a single JSON payload instead of wrapping everything in another envelope.

I kept the exit code model simple.

I made waiting opt-in instead of automatic.

I used environment-variable auth because that was enough for the scope.

And I did not add retries, streaming output, or a more elaborate abstraction layer just to make the project look more complete.

That was not minimalism for its own sake.

It was scope control.

What I would add next

If I were taking this beyond the current repo scope, I would add a few things before treating it as dependable infrastructure inside a larger agent loop.

The highest-value next steps are:

  • response schema validation to catch upstream drift earlier
  • bounded retry policy, including explicit 429 handling
  • idempotency or safe replay semantics for mutating requests after timeouts
  • broader deadline and cancellation propagation across the whole request lifecycle
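As a sketch of what that retry policy could look like (this is future work; nothing here exists in the repo, and the function and thresholds are assumptions):

```typescript
// Hypothetical bounded retry policy with explicit 429 handling.
interface RetryDecision {
  retry: boolean;
  delayMs: number;
}

function decideRetry(
  status: number,
  attempt: number,
  maxAttempts: number,
  retryAfterSec?: number, // parsed Retry-After header, if the server sent one
): RetryDecision {
  if (attempt >= maxAttempts) return { retry: false, delayMs: 0 };
  if (status === 429) {
    // Honor the server's Retry-After when present; otherwise wait a second.
    return { retry: true, delayMs: (retryAfterSec ?? 1) * 1000 };
  }
  if (status >= 500) {
    // Transient server errors: exponential backoff.
    return { retry: true, delayMs: 250 * 2 ** attempt };
  }
  // Other 4xx responses are caller errors; retrying will not help.
  return { retry: false, delayMs: 0 };
}
```

Note that this slots into the existing contract rather than replacing it: retries stay bounded, and a non-retryable status still surfaces as a structured API error.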

None of those are necessary to prove the idea in this repo.

All of them matter once the tool becomes part of a larger automation system.

The broader lesson is not only about making the happy path clean.

It is about making drift, retries, partial failures, and budget exhaustion survivable.