Building a CLI for AI agents. The hard part was the contract.
Why your agent-facing CLI keeps hanging, mis-parsing, and burning time budgets — and the eight-clause subprocess contract that fixes it.
The interesting part was not the HTTP wrapper. It was the subprocess contract.
You are wrapping an API so an agent can call it. The plan looks small. Add a few commands. Print some JSON. Ship it.
Then the agent hangs on a TTY prompt you never see, parses prose out of stdout, retries a 4xx forever, and burns its wall-clock budget polling a job that will never finish.
I hit every one of those failure modes building findymail-cli, an unofficial CLI for the Findymail API. This post gives you the eight-clause contract that came out of that work, the failure-class table an agent can actually act on, and the test suites in the repo that prove the contract holds.
By the end you will know:
- Which CLI quirks are silent contract bugs under automation
- Where to be strict so the caller never has to guess
- How to classify failures so retries and recoveries are deterministic
- The next three problems you will hit once the contract is in place
Why a normal CLI becomes ambiguous under automation
Most CLIs are designed for a human at a terminal. A person can notice a prompt, read around noisy output, and rerun the command. An agent cannot.
If your command hangs on interactive stdin, mixes prose into stdout, quietly picks one input source over another, or polls forever, the tool is not merely annoying. It is unreliable infrastructure inside an automation loop.
The design target changes. The goal stops being "make this pleasant in a terminal" and becomes "make this safe to call from automation." At that boundary, tool use starts to look like protocol design. Inputs have to be unambiguous, outputs machine-readable, failures classifiable, waits budgeted.
Where ambiguity becomes failure
Under automation, ordinary CLI quirks turn into system failures. In this repo the candidates were:
- Commander wanted to print human-oriented errors to stderr and exit on its own terms
- a POST command could have accepted both a flag and piped input
- --stdin could have blocked on a TTY
- intellimatch search --wait could have kept polling until the caller's budget was gone
None of those feel severe with a person at the keyboard. Each one is a contract bug under automation — broken parsing, wasted budgets, unclear retry behaviour, a tool boundary another system cannot trust.
With auth configured, the kind of failure you want looks like this:
$ findymail verify --stdin
{"ok":false,"error":{"type":"usage","message":"Refusing to read interactive stdin; pipe data or use --json/--input"}}
Immediate, machine-readable, recoverable. The caller can map error classes to a policy:
| Failure class | Example | Exit code | Retry? | Caller action |
|---|---|---|---|---|
| usage | invalid JSON, missing flag | 2 | No | fix the tool call |
| config | missing API key | 2 | No | repair the environment |
| timeout | waited job exceeded budget | 1 | Maybe | retry with a larger budget or fall back |
| network | connection reset, DNS issue | 1 | Usually | retry with backoff |
| api | 429, 5xx, malformed upstream | 1 | Depends | inspect status and apply policy |
The job is not only to fail clearly. It is to fail in a way another system can recover from without scraping text.
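A caller-side policy for the table above could be sketched as follows. This is a minimal sketch, not code from findymail-cli: it assumes the error envelope shape shown in the example failure, and the optional status field, the NextStep names, and the parseEnvelope and decide helpers are hypothetical.

```typescript
// Hypothetical caller-side policy; names below are illustrative, not from the repo.
type FailureClass = "usage" | "config" | "timeout" | "network" | "api";

interface ErrorDetail {
  type: FailureClass;
  message: string;
  status?: number; // assumed: an HTTP status attached to "api" failures
}

interface ErrorEnvelope {
  ok: false;
  error: ErrorDetail;
}

type NextStep = "fix-call" | "fix-env" | "retry-larger-budget" | "retry-backoff" | "give-up";

const KNOWN_CLASSES: readonly string[] = ["usage", "config", "timeout", "network", "api"];

// Parse what the CLI wrote to stderr; anything that is not a valid envelope yields null.
function parseEnvelope(stderr: string): ErrorEnvelope | null {
  let value: unknown;
  try {
    value = JSON.parse(stderr);
  } catch {
    return null;
  }
  if (typeof value !== "object" || value === null) return null;
  const root = value as Record<string, unknown>;
  const rawError = root["error"];
  if (root["ok"] !== false || typeof rawError !== "object" || rawError === null) return null;
  const fields = rawError as Record<string, unknown>;
  const type = fields["type"];
  const message = fields["message"];
  const status = fields["status"];
  if (typeof type !== "string" || KNOWN_CLASSES.indexOf(type) === -1) return null;
  if (typeof message !== "string") return null;
  const detail: ErrorDetail = { type: type as FailureClass, message };
  if (typeof status === "number") detail.status = status;
  return { ok: false, error: detail };
}

// Map a failure class to the caller action from the table.
function decide(envelope: ErrorEnvelope): NextStep {
  switch (envelope.error.type) {
    case "usage":
      return "fix-call";
    case "config":
      return "fix-env";
    case "timeout":
      return "retry-larger-budget";
    case "network":
      return "retry-backoff";
    case "api": {
      // "Depends": in this sketch only 429 and 5xx are treated as retryable.
      const status = envelope.error.status;
      const retryable = status === 429 || (status !== undefined && status >= 500);
      return retryable ? "retry-backoff" : "give-up";
    }
  }
}
```

The point of the switch is that the caller branches on a closed set of classes and never inspects the message text.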
Choose where to be strict
The real design work was deciding where not to be permissive. Each helpful default — accept input from flags or pipes, let the framework print its own errors, poll until done — turns into ambiguity or wasted time the caller has to guess through.
No interactive stdin. src/core/input.ts makes --stdin pipe-only. If stdin is attached to a TTY, the CLI fails fast. A clean failure is recoverable. A silent hang is not.
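A pipe-only guard of this kind can be sketched in a few lines. This is an illustration under assumptions, not the contents of src/core/input.ts: checkStdin and its result type are hypothetical, and the isTTY argument stands in for Node's process.stdin.isTTY.

```typescript
// Hypothetical guard; in Node the argument would be `process.stdin.isTTY`,
// which is `true` on a terminal and `undefined` when stdin is a pipe or file.
type StdinCheck =
  | { ok: true }
  | { ok: false; error: { type: "usage"; message: string } };

function checkStdin(isTTY: boolean | undefined): StdinCheck {
  if (isTTY === true) {
    // Fail fast instead of blocking on a read that no agent will ever satisfy.
    return {
      ok: false,
      error: {
        type: "usage",
        message: "Refusing to read interactive stdin; pipe data or use --json/--input",
      },
    };
  }
  return { ok: true };
}
```

The check runs before any read is attempted, so the failure costs the caller milliseconds rather than its whole time budget.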
Exactly one input source. Every POST-style command must receive exactly one of --json, --input, or --stdin. No hidden precedence rules, no accidental dual-input cases, no guessing which source the CLI trusted.
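The exactly-one rule can be enforced by counting the sources that were supplied and rejecting any count other than one. The sketch below assumes a hypothetical InputFlags shape for the parsed options; it also assumes an empty --json string counts as supplied, so it is validated later rather than silently ignored.

```typescript
// Hypothetical option shape; the real parsed options in the repo may differ.
interface InputFlags {
  json?: string;
  input?: string;
  stdin?: boolean;
}

type InputSource =
  | { kind: "json"; text: string }
  | { kind: "input"; path: string }
  | { kind: "stdin" };

type Selection =
  | { ok: true; source: InputSource }
  | { ok: false; error: { type: "usage"; message: string } };

function selectInputSource(flags: InputFlags): Selection {
  const supplied: InputSource[] = [];
  // `!== undefined` means an empty string still counts as a supplied source.
  if (flags.json !== undefined) supplied.push({ kind: "json", text: flags.json });
  if (flags.input !== undefined) supplied.push({ kind: "input", path: flags.input });
  if (flags.stdin === true) supplied.push({ kind: "stdin" });

  const only = supplied[0];
  if (supplied.length !== 1 || only === undefined) {
    return {
      ok: false,
      error: {
        type: "usage",
        message: `Expected exactly one of --json, --input, or --stdin; got ${supplied.length}`,
      },
    };
  }
  return { ok: true, source: only };
}
```

Because there is no precedence order, there is nothing for the caller to remember: zero sources and two sources fail the same way.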
Success on stdout, errors on stderr. src/core/output.ts prints a single JSON payload to stdout on success and structured JSON to stderr on failure. The agent never has to scrape prose to figure out what happened.
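One way to keep that split honest is to compute the stream, the text, and the exit code in a single pure function and write the result in exactly one place. The sketch below is an assumption about structure, not the code in src/core/output.ts; the Outcome and Emission types and the render function are hypothetical, and exit code 0 for success is assumed. The non-zero codes follow the failure-class table.

```typescript
type FailureClass = "usage" | "config" | "timeout" | "network" | "api";

type Outcome =
  | { ok: true; payload: unknown }
  | { ok: false; error: { type: FailureClass; message: string } };

interface Emission {
  stream: "stdout" | "stderr";
  text: string;
  exitCode: number;
}

function render(outcome: Outcome): Emission {
  if (outcome.ok) {
    // Success: one JSON document on stdout, nothing else.
    return { stream: "stdout", text: JSON.stringify(outcome.payload), exitCode: 0 };
  }
  // Usage and config failures are fixable by the caller and exit 2; the rest exit 1.
  const callerFixable = outcome.error.type === "usage" || outcome.error.type === "config";
  return {
    stream: "stderr",
    text: JSON.stringify({ ok: false, error: outcome.error }),
    exitCode: callerFixable ? 2 : 1,
  };
}
```

A single entry point would then write emission.text to the named stream and set the exit code, so no command handler prints on its own.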
Async work gets a time budget. For a human, "start a job and keep checking" is fine. For an agent, indefinite waiting is a bug. src/commands/intellimatch.ts and src/core/polling.ts make waiting explicit: --wait is opt-in, --poll-interval and --max-wait must be valid, the total wait is bounded, and the remaining wall-clock budget is passed into each status request timeout. Terminal failure states become structured errors.
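The budgeting logic described above can be sketched as a loop around one fixed deadline. This is not the code in src/core/polling.ts: the dependency-injected clock, the status names, the millisecond units, and the option names are all assumptions made so the sketch stays self-contained.

```typescript
type JobStatus = "pending" | "completed" | "failed"; // hypothetical state names

interface PollDeps {
  now: () => number; // milliseconds, e.g. Date.now
  sleep: (ms: number) => Promise<void>;
  fetchStatus: (timeoutMs: number) => Promise<JobStatus>; // one status request
}

interface PollOptions {
  pollIntervalMs: number;
  maxWaitMs: number;
  requestTimeoutMs: number;
}

type PollFailure = "usage" | "timeout" | "api";

type PollResult =
  | { ok: true }
  | { ok: false; error: { type: PollFailure; message: string } };

function fail(type: PollFailure, message: string): PollResult {
  return { ok: false, error: { type, message } };
}

async function waitForJob(deps: PollDeps, opts: PollOptions): Promise<PollResult> {
  // Validate before any network work; the negated comparisons also reject NaN.
  if (!(opts.pollIntervalMs > 0) || !(opts.maxWaitMs > 0)) {
    return fail("usage", "--poll-interval and --max-wait must be positive numbers");
  }
  const deadline = deps.now() + opts.maxWaitMs;
  for (;;) {
    const remaining = deadline - deps.now();
    if (remaining <= 0) {
      return fail("timeout", `Job did not finish within ${opts.maxWaitMs} ms`);
    }
    // A single status request may never outlive the overall budget.
    const status = await deps.fetchStatus(Math.min(opts.requestTimeoutMs, remaining));
    if (status === "completed") return { ok: true };
    if (status === "failed") return fail("api", "Job reached a failed terminal state");
    // Sleep for the interval, but never past the deadline.
    await deps.sleep(Math.min(opts.pollIntervalMs, Math.max(0, deadline - deps.now())));
  }
}
```

Injecting now and sleep is a design choice for testability: a fake clock lets a test assert the exact sequence of request timeouts without waiting in real time.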
Override framework defaults. Mature CLI libraries optimise for humans. In src/cli.ts I had to override Commander's default behaviour so the CLI owned its output and error format instead of letting the library print text to stderr and exit on its own terms. Small code change, large contract impact.
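I am not reproducing src/cli.ts here, but the general shape of such an override can be sketched. Commander offers exitOverride(), which makes it throw a CommanderError instead of exiting, and configureOutput(), which lets you redirect or silence what it writes. The mapping function below is hypothetical and works on a structural subset of that error so the sketch has no dependency; the wiring is shown only in comments.

```typescript
// Structural subset of Commander's CommanderError (code, exitCode, message).
interface CommanderLikeError {
  code: string; // e.g. "commander.unknownOption"
  exitCode: number;
  message: string; // e.g. "error: unknown option '--bogus'"
}

interface ErrorEmission {
  text: string; // JSON to write to stderr
  exitCode: number;
}

// Returns null when there is nothing to report: help and version are thrown with exit code 0.
function mapCommanderError(err: CommanderLikeError): ErrorEmission | null {
  if (err.exitCode === 0) return null;
  // Drop the human-oriented "error: " prefix before wrapping the message.
  const message = err.message.replace(/^error:\s*/, "");
  return {
    text: JSON.stringify({ ok: false, error: { type: "usage", message } }),
    exitCode: 2,
  };
}

// Possible wiring, assuming Commander v7 or later (not executed in this sketch):
//   program.exitOverride();                          // throw instead of calling process.exit
//   program.configureOutput({ writeErr: () => {} }); // suppress Commander's own stderr text
//   try { await program.parseAsync(process.argv); }
//   catch (err) { /* pass a CommanderError to mapCommanderError, write text, set the exit code */ }
```

Treating every non-zero parser error as a usage failure is itself an assumption; it matches the failure-class table, where usage errors exit with code 2.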
What the repo actually guarantees
Once the constraints were explicit, the project stopped feeling like "a wrapper around an API" and started behaving like a protocol boundary. The most important artifact is not a single command — it is the contract written down in docs/cli-contract.md and enforced by the test suite:
- No interactive prompts.
- Success payloads go to stdout as JSON.
- Errors go to stderr as structured JSON.
- Auth comes from FINDYMAIL_API_KEY.
- POST-style commands accept exactly one of --json, --input, or --stdin.
- --stdin rejects interactive terminal input.
- Intellimatch polling is opt-in and bounded.
- Usage and config failures exit differently from remote and API failures.
Successful payloads still mirror upstream JSON. The determinism is in the process contract around that payload.
The proof the contract holds is in the tests. tests/e2e/contract.test.ts verifies that missing API keys fail as structured config errors, successful requests emit machine-readable JSON on stdout, invalid input produces structured usage errors, and empty --json and --input stay deterministic. tests/e2e/intellimatch.test.ts locks down the async path: waited searches poll in the expected sequence, failed jobs become structured API errors, jobs that never reach a terminal state time out cleanly, and invalid polling arguments fail before any network work begins.
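The checks those suites make can be thought of as invariants over a finished process. The sketch below is not taken from the repo's tests: ProcessResult and contractViolations are hypothetical, and a real end-to-end test would first spawn the CLI and capture its streams. It only encodes rules stated in this post: JSON on stdout for success, a structured error on stderr for failure, and exit codes that match the failure class.

```typescript
interface ProcessResult {
  stdout: string;
  stderr: string;
  exitCode: number;
}

type Parsed = { ok: true; value: unknown } | { ok: false };

function parseJson(text: string): Parsed {
  try {
    return { ok: true, value: JSON.parse(text) };
  } catch {
    return { ok: false };
  }
}

// Returns a list of contract violations; an empty list means the run conforms.
function contractViolations(result: ProcessResult): string[] {
  if (result.exitCode === 0) {
    return parseJson(result.stdout).ok ? [] : ["success stdout is not valid JSON"];
  }
  const parsed = parseJson(result.stderr);
  if (!parsed.ok) return ["failure stderr is not valid JSON"];
  const root = parsed.value;
  if (typeof root !== "object" || root === null) return ["failure envelope is not an object"];
  const error = (root as Record<string, unknown>)["error"];
  if (typeof error !== "object" || error === null) return ["failure envelope has no error object"];
  const type = (error as Record<string, unknown>)["type"];
  if (typeof type !== "string") return ["error.type is not a string"];
  // Usage and config failures exit 2; remote and API failures exit 1.
  const expectedExit = type === "usage" || type === "config" ? 2 : 1;
  if (result.exitCode !== expectedExit) {
    return [`exit code ${result.exitCode} does not match class "${type}" (expected ${expectedExit})`];
  }
  return [];
}
```

Running a checker like this over every command in a test matrix catches the regression where one handler starts printing prose again.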
For an agent-facing CLI, that test suite is what the product looks like. The process contract is the behaviour.
What you will hit next (the oracle section)
Three predictions for the team that applies this contract discipline:
- Upstream schema drift will be your next silent regression. The contract stabilises the subprocess. It does not stabilise the API response shape. The day Findymail (or your equivalent vendor) renames a field, your agent will get a syntactically valid JSON payload that is semantically wrong, and nothing in the contract will catch it. You need response schema validation at the boundary before you treat this as dependable infrastructure.
- You will need a retry policy with idempotency rules, not a retry flag. The failure-class table tells the caller whether to retry. It does not tell it whether the request was safe to retry. A timeout on a mutating call is the worst case: you do not know if the side effect landed. Before you turn on retries for api or timeout classes, you need explicit 429 handling, bounded backoff, and idempotency keys on every mutating request.
- Deadline propagation will become the next leak. Right now --max-wait bounds the polling phase. FINDYMAIL_TIMEOUT_MS bounds individual requests. Nothing bounds the whole lifecycle of a command from the agent's perspective. The first time an agent cancels a turn and the CLI keeps running, you will wish you had cancellation propagation everywhere.
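For the first prediction, boundary validation can start as a hand-written guard that turns a drifted payload into a structured api error. The response shape below is entirely hypothetical, since this post does not show any real Findymail response fields; the technique, not the field names, is what the sketch illustrates.

```typescript
// HYPOTHETICAL response shape, used only to illustrate the guard.
interface VerifyResponse {
  email: string;
  verified: boolean;
}

type Validated =
  | { ok: true; value: VerifyResponse }
  | { ok: false; error: { type: "api"; message: string } };

function shapeError(detail: string): Validated {
  return {
    ok: false,
    error: { type: "api", message: `Unexpected upstream response shape: ${detail}` },
  };
}

function validateVerifyResponse(body: unknown): Validated {
  if (typeof body !== "object" || body === null) return shapeError("expected a JSON object");
  const fields = body as Record<string, unknown>;
  const email = fields["email"];
  const verified = fields["verified"];
  if (typeof email !== "string") return shapeError('missing or non-string "email"');
  if (typeof verified !== "boolean") return shapeError('missing or non-boolean "verified"');
  return { ok: true, value: { email, verified } };
}
```

A renamed field then surfaces as a classified failure instead of a valid-looking payload with a hole in it. The CLI could still print the original body on success so the output keeps mirroring upstream JSON.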
If you are already feeling any of these, that is the signal the contract is doing its job — the failures left are the structural ones, not the surface ones.
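For the retry prediction, the policy can be written as a pure function that either returns a delay or refuses to retry. Everything here is a hypothetical sketch: the RetryContext fields, the policy numbers, and the assumption that the upstream API would honour an idempotency key at all. Jitter is left out to keep the example deterministic.

```typescript
interface RetryContext {
  failure: "timeout" | "network" | "api";
  status?: number; // HTTP status, when the failure class is "api"
  mutating: boolean; // true for requests with side effects
  idempotencyKey?: string; // hypothetical: only useful if the upstream API honours it
  attempt: number; // attempts made so far, starting at 1
}

interface RetryPolicy {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
}

// Returns the delay before the next attempt, or null when a retry is not allowed.
function nextRetryDelayMs(ctx: RetryContext, policy: RetryPolicy): number | null {
  if (ctx.attempt >= policy.maxAttempts) return null;
  // Without an idempotency key there is no way to know whether the side effect landed.
  if (ctx.mutating && ctx.idempotencyKey === undefined) return null;
  if (ctx.failure === "api") {
    const retryable = ctx.status === 429 || (ctx.status !== undefined && ctx.status >= 500);
    if (!retryable) return null;
  }
  // Exponential backoff, capped so the wait stays bounded.
  return Math.min(policy.maxDelayMs, policy.baseDelayMs * Math.pow(2, ctx.attempt - 1));
}
```

Returning null rather than throwing keeps the decision inspectable: the caller can log why a retry was refused and fall back deliberately.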
A checklist you can reuse
If you are building another CLI for agents tomorrow, start here:
- Make every important input expressible non-interactively.
- Keep successful machine output on
stdoutonly. - Keep structured errors on
stderronly. - Define exit-code semantics early.
- Reject interactive
stdinhangs. - Forbid ambiguous combinations of input modes.
- Put a budget around every async wait.
- Write end-to-end tests for the contract, not just the helpers.
The command set matters. The contract matters more.
If you are wrapping an API as a tool for an agent right now, send me one command line from your CLI and the JSON it emits on failure, and I will tell you which clause of the eight-point contract above it breaks.