
Building a CLI for AI agents. The hard part was the contract.

Why your agent-facing CLI keeps hanging, mis-parsing, and burning time budgets — and the eight-clause subprocess contract that fixes it.


The interesting part was not the HTTP wrapper. It was the subprocess contract.

You are wrapping an API so an agent can call it. The plan looks small. Add a few commands. Print some JSON. Ship it.

Then the agent hangs on a TTY prompt you never see, parses prose out of stdout, retries a 4xx forever, and burns its wall-clock budget polling a job that will never finish.

I hit every one of those failure modes building findymail-cli, an unofficial CLI for the Findymail API. This post gives you the eight-clause contract that came out of that work, the failure-class table an agent can actually act on, and the test suites in the repo that prove the contract holds.

By the end you will know:

  • Which CLI quirks are silent contract bugs under automation
  • Where to be strict so the caller never has to guess
  • How to classify failures so retries and recoveries are deterministic
  • The next three problems you will hit once the contract is in place

Why a normal CLI becomes ambiguous under automation

Most CLIs are designed for a human at a terminal. A person can notice a prompt, read around noisy output, and rerun the command. An agent cannot.

If your command hangs on interactive stdin, mixes prose into stdout, quietly picks one input source over another, or polls forever, the tool is not merely annoying. It is unreliable infrastructure inside an automation loop.

The design target changes. The goal stops being "make this pleasant in a terminal" and becomes "make this safe to call from automation." At that boundary, tool use starts to look like protocol design. Inputs have to be unambiguous, outputs machine-readable, failures classifiable, waits budgeted.

Where ambiguity becomes failure

Under automation, ordinary CLI quirks turn into system failures. In this repo the candidates were:

  • Commander wanted to print human-oriented errors to stderr and exit on its own terms
  • a POST command could have accepted both a flag and piped input
  • --stdin could have blocked on a TTY
  • intellimatch search --wait could have kept polling until the caller's budget was gone

None of those feel severe with a person at the keyboard. Each one is a contract bug under automation — broken parsing, wasted budgets, unclear retry behaviour, a tool boundary another system cannot trust.

With auth configured, the kind of failure you want looks like this:

$ findymail verify --stdin
{"ok":false,"error":{"type":"usage","message":"Refusing to read interactive stdin; pipe data or use --json/--input"}}

Immediate, machine-readable, recoverable. The caller can map error classes to a policy:

| Failure class | Example | Exit code | Retry? | Caller action |
|---|---|---|---|---|
| usage | invalid JSON, missing flag | 2 | No | fix the tool call |
| config | missing API key | 2 | No | repair the environment |
| timeout | waited job exceeded budget | 1 | Maybe | retry with a larger budget or fall back |
| network | connection reset, DNS issue | 1 | Usually | retry with backoff |
| api | 429, 5xx, malformed upstream | 1 | Depends | inspect status and apply policy |

The job is not only to fail clearly. It is to fail in a way another system can recover from without scraping text.
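On the caller's side, that policy can live in a lookup keyed by the error type, so nothing depends on reading the message. The sketch below is one way to do it, not code from the repo: FAILURE_POLICY and classifyStderr are hypothetical names, and it assumes every failure uses the same envelope as the example above, with error.type set to one of the five class names.

```typescript
// Hypothetical caller-side helper. Only the classes, exit codes, and retry
// guidance come from the failure-class table; the names are illustrative.
type FailureClass = "usage" | "config" | "timeout" | "network" | "api";
type RetryAdvice = "no" | "maybe" | "usually" | "depends";

interface FailurePolicy {
  exitCode: 1 | 2;
  retry: RetryAdvice;
  callerAction: string;
}

const FAILURE_POLICY: Record<FailureClass, FailurePolicy> = {
  usage: { exitCode: 2, retry: "no", callerAction: "fix the tool call" },
  config: { exitCode: 2, retry: "no", callerAction: "repair the environment" },
  timeout: { exitCode: 1, retry: "maybe", callerAction: "retry with a larger budget or fall back" },
  network: { exitCode: 1, retry: "usually", callerAction: "retry with backoff" },
  api: { exitCode: 1, retry: "depends", callerAction: "inspect status and apply policy" },
};

// Read one stderr line and return its failure class.
// Returns undefined when the line is not a recognised error envelope.
function classifyStderr(line: string): FailureClass | undefined {
  let parsed: unknown;
  try {
    parsed = JSON.parse(line);
  } catch {
    return undefined;
  }
  if (typeof parsed !== "object" || parsed === null) return undefined;
  const error = (parsed as { error?: unknown }).error;
  if (typeof error !== "object" || error === null) return undefined;
  const type = (error as { type?: unknown }).type;
  if (typeof type !== "string") return undefined;
  // Own-property check so names like "toString" are not mistaken for classes.
  return Object.prototype.hasOwnProperty.call(FAILURE_POLICY, type)
    ? (type as FailureClass)
    : undefined;
}
```

An unrecognised line returns undefined rather than a guess, which leaves the caller free to treat it as a contract violation instead of a retryable failure.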

Choose where to be strict

The real design work was deciding where not to be permissive. Each helpful default — accept input from flags or pipes, let the framework print its own errors, poll until done — turns into ambiguity or wasted time the caller has to guess through.

No interactive stdin. src/core/input.ts makes --stdin pipe-only. If stdin is attached to a TTY, the CLI fails fast. A clean failure is recoverable. A silent hang is not.
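A minimal sketch of that guard, assuming a Node runtime where process.stdin.isTTY is true only when stdin is attached to a terminal. UsageError and assertPipedStdin are hypothetical names; the actual code in src/core/input.ts is not shown in this post and may look different.

```typescript
// Hypothetical error type carrying the failure class for the error envelope.
class UsageError extends Error {
  readonly type = "usage";
}

// Structural type so the sketch does not depend on Node typings.
// In a Node CLI you would pass process.stdin here.
interface StdinLike {
  isTTY?: boolean;
}

// Fail fast instead of blocking on a terminal nobody is typing into.
function assertPipedStdin(stdin: StdinLike): void {
  if (stdin.isTTY) {
    throw new UsageError(
      "Refusing to read interactive stdin; pipe data or use --json/--input"
    );
  }
}
```

The check has to run before the first read from stdin, because once the read starts there is nothing left to fail fast on.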

Exactly one input source. Every POST-style command must receive exactly one of --json, --input, or --stdin. No hidden precedence rules, no accidental dual-input cases, no guessing which source the CLI trusted.
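One way to enforce the rule is to collect every source the caller supplied and reject anything other than exactly one. This is a sketch under assumptions: InputFlags, InputSource, and resolveInputSource are hypothetical names, and it treats an empty --json value as "provided" so that the emptiness is handled by later validation rather than by silent fallback.

```typescript
// Hypothetical shape of the parsed flags for a POST-style command.
interface InputFlags {
  json?: string;   // value of --json
  input?: string;  // file path given to --input
  stdin?: boolean; // true when --stdin was passed
}

type InputSource =
  | { kind: "json"; value: string }
  | { kind: "input"; path: string }
  | { kind: "stdin" };

function resolveInputSource(flags: InputFlags): InputSource {
  const sources: InputSource[] = [];
  if (flags.json !== undefined) sources.push({ kind: "json", value: flags.json });
  if (flags.input !== undefined) sources.push({ kind: "input", path: flags.input });
  if (flags.stdin) sources.push({ kind: "stdin" });

  // No precedence rules: zero sources and multiple sources are both errors.
  const [only, ...rest] = sources;
  if (only === undefined || rest.length > 0) {
    throw new Error(
      `Expected exactly one of --json, --input, or --stdin; got ${sources.length}`
    );
  }
  return only;
}
```

Returning a tagged union means the rest of the command never has to re-inspect the flags to find out where its input came from.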

Success on stdout, errors on stderr. src/core/output.ts prints a single JSON payload to stdout on success and structured JSON to stderr on failure. The agent never has to scrape prose to figure out what happened.
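The split can be sketched as two small functions. The error envelope below copies the shape of the example shown earlier ({"ok":false,"error":{...}}); everything else, including the names Writer, emitSuccess, and emitError, is an assumption rather than the contents of src/core/output.ts.

```typescript
// Structural writer type; process.stdout and process.stderr both satisfy it.
interface Writer {
  write(chunk: string): unknown;
}

interface ErrorEnvelope {
  ok: false;
  error: { type: string; message: string };
}

// Success: exactly one JSON document on stdout, nothing else.
function emitSuccess(stdout: Writer, payload: unknown): void {
  // JSON.stringify(undefined) is not valid JSON output, so fall back to null.
  stdout.write(JSON.stringify(payload ?? null) + "\n");
}

// Failure: one structured JSON line on stderr, stdout stays empty.
function emitError(stderr: Writer, type: string, message: string): void {
  const envelope: ErrorEnvelope = { ok: false, error: { type, message } };
  stderr.write(JSON.stringify(envelope) + "\n");
}
```

Passing the stream in as a parameter keeps both functions testable without spawning a process.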

Async work gets a time budget. For a human, "start a job and keep checking" is fine. For an agent, indefinite waiting is a bug. src/commands/intellimatch.ts and src/core/polling.ts make waiting explicit: --wait is opt-in, --poll-interval and --max-wait must be valid, the total wait is bounded, and the remaining wall-clock budget is passed into each status request timeout. Terminal failure states become structured errors.
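The budgeted wait can be sketched as a loop around a single deadline. This is an illustration of the technique, not the code in src/core/polling.ts: the job states, PollError, pollUntilDone, and the injectable now and sleep parameters are all hypothetical, and fetchStatus stands in for whatever status request the CLI makes.

```typescript
type JobState = "pending" | "done" | "failed"; // hypothetical job states

interface PollOptions {
  pollIntervalMs: number; // from --poll-interval
  maxWaitMs: number;      // from --max-wait
  // Status request; it must give up on its own after timeoutMs.
  fetchStatus: (timeoutMs: number) => Promise<JobState>;
  now?: () => number;                    // injectable clock for tests
  sleep?: (ms: number) => Promise<void>; // injectable delay for tests
}

class PollError extends Error {
  constructor(readonly type: "usage" | "timeout" | "api", message: string) {
    super(message);
  }
}

async function pollUntilDone(opts: PollOptions): Promise<void> {
  const { pollIntervalMs, maxWaitMs, fetchStatus } = opts;
  const now = opts.now ?? Date.now;
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));

  // Validate arguments before any network work begins.
  if (!Number.isFinite(pollIntervalMs) || pollIntervalMs <= 0) {
    throw new PollError("usage", "--poll-interval must be a positive number");
  }
  if (!Number.isFinite(maxWaitMs) || maxWaitMs <= 0) {
    throw new PollError("usage", "--max-wait must be a positive number");
  }

  const deadline = now() + maxWaitMs;
  for (;;) {
    const remaining = deadline - now();
    if (remaining <= 0) {
      throw new PollError("timeout", `Job did not finish within ${maxWaitMs} ms`);
    }
    // The remaining budget doubles as the timeout for this status request.
    const state = await fetchStatus(remaining);
    if (state === "done") return;
    if (state === "failed") throw new PollError("api", "Job reached a failed state");
    // Never sleep past the deadline.
    await sleep(Math.min(pollIntervalMs, Math.max(0, deadline - now())));
  }
}
```

Computing the deadline once and deriving everything from it is the design choice that matters: a slow status request shortens the next one instead of extending the total wait.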

Override framework defaults. Mature CLI libraries optimise for humans. In src/cli.ts I had to override Commander's default behaviour so the CLI owned its output and error format instead of letting the library print text to stderr and exit on its own terms. Small code change, large contract impact.
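Commander exposes two hooks for this: exitOverride(), which makes it throw an error object instead of calling process.exit, and configureOutput(), which lets the program supply its own write functions. What remains is translating the thrown error into the contract's envelope. The sketch below covers only that translation and is an assumption about how it could be done, not a copy of src/cli.ts; CommanderErrorLike is declared structurally so the example runs without the dependency, and translateCommanderError is a hypothetical name.

```typescript
// Shape of the error Commander throws when exitOverride() is enabled.
interface CommanderErrorLike {
  code: string;     // for example "commander.unknownOption"
  exitCode: number; // what Commander would have exited with
  message: string;
}

interface Outcome {
  exitCode: number;
  stderrLine?: string; // structured JSON to print on stderr, if any
}

function translateCommanderError(err: CommanderErrorLike): Outcome {
  // Help and version output are successful exits, not failures.
  if (err.exitCode === 0) return { exitCode: 0 };

  // Every other parse failure is the caller's mistake: usage class, exit 2.
  const envelope = { ok: false, error: { type: "usage", message: err.message } };
  return { exitCode: 2, stderrLine: JSON.stringify(envelope) };
}
```

In a real program this would sit in the catch block around the parse call, with Commander's own stderr writer silenced so the envelope is the only thing the caller sees.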

What the repo actually guarantees

Once the constraints were explicit, the project stopped feeling like "a wrapper around an API" and started behaving like a protocol boundary. The most important artifact is not a single command — it is the contract written down in docs/cli-contract.md and enforced by the test suite:

  1. No interactive prompts.
  2. Success payloads go to stdout as JSON.
  3. Errors go to stderr as structured JSON.
  4. Auth comes from FINDYMAIL_API_KEY.
  5. POST-style commands accept exactly one of --json, --input, or --stdin.
  6. --stdin rejects interactive terminal input.
  7. Intellimatch polling is opt-in and bounded.
  8. Usage and config failures exit differently from remote and API failures.

Successful payloads still mirror upstream JSON. The determinism is in the process contract around that payload.

The proof that the contract holds is in the tests. tests/e2e/contract.test.ts verifies that missing API keys fail as structured config errors, successful requests emit machine-readable JSON on stdout, invalid input produces structured usage errors, and empty --json and --input values are handled deterministically. tests/e2e/intellimatch.test.ts locks down the async path: waited searches poll in the expected sequence, failed jobs become structured API errors, jobs that never reach a terminal state time out cleanly, and invalid polling arguments fail before any network work begins.

For an agent-facing CLI, that test suite is what the product looks like. The process contract is the behaviour.

What you will hit next (the oracle section)

Three predictions for the team that applies this contract discipline:

  1. Upstream schema drift will be your next silent regression. The contract stabilises the subprocess. It does not stabilise the API response shape. The day Findymail (or your equivalent vendor) renames a field, your agent will get a syntactically valid JSON payload that is semantically wrong, and nothing in the contract will catch it. You need response schema validation at the boundary before you treat this as dependable infrastructure.

  2. You will need a retry policy with idempotency rules, not a retry flag. The failure-class table tells the caller whether to retry. It does not tell it whether the request was safe to retry. A timeout on a mutating call is the worst case — you do not know if the side effect landed. Before you turn on retries for api or timeout classes, you need explicit 429 handling, bounded backoff, and idempotency keys on every mutating request.

  3. Deadline propagation will become the next leak. Right now --max-wait bounds the polling phase. FINDYMAIL_TIMEOUT_MS bounds individual requests. Nothing bounds the whole lifecycle of a command from the agent's perspective. The first time an agent cancels a turn and the CLI keeps running, you will wish you had cancellation propagation everywhere.
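For the first prediction, validation at the boundary can be as small as a type guard that runs before the payload reaches stdout. The response shape here is invented for illustration: VerifyResponse and its email and verified fields are hypothetical and do not describe the real Findymail API.

```typescript
// Hypothetical response shape, for illustration only.
interface VerifyResponse {
  email: string;
  verified: boolean;
}

function isVerifyResponse(value: unknown): value is VerifyResponse {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return typeof record["email"] === "string" && typeof record["verified"] === "boolean";
}

// Turn a renamed or missing field into a loud failure instead of a
// syntactically valid payload that is semantically wrong.
function parseVerifyResponse(body: string): VerifyResponse {
  const parsed: unknown = JSON.parse(body);
  if (!isVerifyResponse(parsed)) {
    throw new Error("Upstream response did not match the expected shape");
  }
  return parsed;
}
```

A mismatch would most naturally surface as an api-class error, since the table already lists a malformed upstream response under that class.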

If you are already feeling any of these, that is the signal the contract is doing its job — the failures left are the structural ones, not the surface ones.
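For the second prediction, the retry decision and the backoff delay can be folded into one pure function, which makes the policy easy to test. This is a sketch under stated assumptions, not a recommendation of specific numbers: the attempt limit, base delay, and cap are arbitrary, and nextRetryDelayMs and the safeToRepeat flag (true for read-only calls or calls carrying an idempotency key) are hypothetical.

```typescript
type FailureClass = "usage" | "config" | "timeout" | "network" | "api";

interface FailedAttempt {
  failure: FailureClass;
  status?: number;       // HTTP status when failure is "api"
  attempt: number;       // 1-based count of attempts already made
  safeToRepeat: boolean; // read-only call, or sent with an idempotency key
}

// Illustrative limits, not values from the repo.
const MAX_ATTEMPTS = 4;
const BASE_DELAY_MS = 500;
const MAX_DELAY_MS = 8000;

// Returns the delay before the next attempt, or undefined for "do not retry".
function nextRetryDelayMs(a: FailedAttempt): number | undefined {
  if (a.attempt >= MAX_ATTEMPTS) return undefined;
  if (a.failure === "usage" || a.failure === "config") return undefined;

  if (a.failure === "api") {
    const rateLimited = a.status === 429;
    const serverError = a.status !== undefined && a.status >= 500;
    if (!rateLimited && !serverError) return undefined;
    // A 429 was rejected before any side effect; a 5xx may not have been.
    if (serverError && !a.safeToRepeat) return undefined;
  } else if (!a.safeToRepeat) {
    // timeout or network: the side effect may already have landed.
    return undefined;
  }

  // Exponential backoff with a hard cap.
  return Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** (a.attempt - 1));
}
```

A production version would likely add jitter and honour a Retry-After header on 429 responses; both are left out here to keep the decision logic visible.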

A checklist you can reuse

If you are building another CLI for agents tomorrow, start here:

  1. Make every important input expressible non-interactively.
  2. Keep successful machine output on stdout only.
  3. Keep structured errors on stderr only.
  4. Define exit-code semantics early.
  5. Reject interactive stdin hangs.
  6. Forbid ambiguous combinations of input modes.
  7. Put a budget around every async wait.
  8. Write end-to-end tests for the contract, not just the helpers.

The command set matters. The contract matters more.


If you are wrapping an API as a tool for an agent right now, send me one command line from your CLI and the JSON it emits on failure, and I will tell you which clause of the eight-clause contract above it breaks.