Why CLI Tools, Not Raw API Calls

Part 2 of 9 — the deterministic tool belt, and why it is the load-bearing wall of the whole design.

The tempting shortcut

An obvious objection arrives early: modern agents can already make HTTP requests. Why not just hand the agent the API documentation and a credential and let it call the endpoints directly? No tools to build, no scripts to maintain.

You can do this. It works in a demo and fails in production, for reasons that are worth understanding because they explain the entire architecture.

What goes wrong with raw API access

Non-determinism at the worst layer. An agent improvising HTTP calls will, occasionally, get a header wrong, paginate incorrectly, or misread which field is the order total. When the thing being improvised is a write — a refund, a cancellation, a message to a real customer — "occasionally wrong" is unacceptable. You want the model's creativity in deciding what to do, not in how to format the request.
No safety rail lives in the API. The eBay API will happily issue a $10,000 refund if asked correctly. "Never refund more than $100 without confirmation" is your policy, not eBay's. It has to live somewhere deterministic.
Credentials get exposed. Raw access means the API token sits in the agent's context or its shell history. When the call is wrapped in a tool, the tool reads the secret from a secure store and the secret never enters the conversation.
It is unreviewable. "The agent made 40 HTTP calls" is not an audit trail anyone can read. "The agent ran ebay-refund --order 4471 --amount 18.00" is.
It is unrepeatable. If the same task produces different request shapes each run, you cannot test it, diff it, or trust it. Tools turn a fuzzy capability into a fixed contract.
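
To make the audit-trail point concrete, here is a minimal sketch of how a tool might record each invocation as one readable line. The function name `audit_line` and the record fields are assumptions made for illustration; nothing here is prescribed by eBay or by any agent framework.

```python
import json
import shlex
from datetime import datetime, timezone
from typing import List, Optional


def audit_line(argv: List[str], exit_code: int, now: Optional[datetime] = None) -> str:
    """Return one JSON line describing a single tool invocation (hypothetical format)."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    return json.dumps(
        {
            "time": stamp,
            # shlex.join quotes arguments, so the command can be pasted back into a shell.
            "command": shlex.join(argv),
            "exit_code": exit_code,
        }
    )


if __name__ == "__main__":
    print(audit_line(["ebay-refund", "--order", "4471", "--amount", "18.00"], 0))
```

Because the command is a fixed shape, two runs of the same task produce log lines that can be compared directly, which is hard to do with a stream of improvised HTTP requests.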

The alternative: small tools as a contract

Instead, you wrap each meaningful action in a small command-line program with a stable name, documented arguments, and predictable output:

ebay-list-orders --status awaiting-shipment --json
ebay-get-message --id 88231
ebay-reply-message --id 88231 --body-file reply.txt
ebay-refund --order 4471 --amount 18.00 --reason item-not-received

Now the division of labour is clean. The agent decides that order 4471 should be refunded, and for how much. The tool decides how to call eBay correctly, checks that the amount is within policy, requires a reason code, refuses to run if a required flag is missing, and prints exactly what it did. The agent reasons; the tool executes deterministically.
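
As a rough sketch of what the tool's side of that contract could look like, the following Python outlines a hypothetical `ebay-refund`. The $100 limit, the list of reason codes, the exit code 3 for a policy refusal, and the `send_refund` placeholder are all assumptions for illustration; the real eBay call is deliberately left unimplemented.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of an `ebay-refund` tool: parse, check policy, then act."""
import argparse
import json
import sys
from decimal import Decimal, InvalidOperation

# Assumed policy values, for illustration only.
MAX_REFUND = Decimal("100.00")
REASON_CODES = ("item-not-received", "item-damaged", "buyer-cancelled")
EXIT_POLICY_REFUSED = 3  # assumed convention; argparse itself exits with 2 on usage errors


def parse_amount(text: str) -> Decimal:
    """argparse type: accept only a positive, finite decimal such as 18.00."""
    try:
        amount = Decimal(text)
    except InvalidOperation:
        raise argparse.ArgumentTypeError(f"not a decimal amount: {text!r}") from None
    if not amount.is_finite() or amount <= 0:
        raise argparse.ArgumentTypeError(f"amount must be positive: {text!r}")
    return amount


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="ebay-refund",
        description="WRITE tool: issues a refund on one order.",
    )
    parser.add_argument("--order", required=True, help="order id")
    parser.add_argument("--amount", required=True, type=parse_amount, help="refund amount")
    parser.add_argument("--reason", required=True, choices=REASON_CODES, help="reason code")
    parser.add_argument("--dry-run", action="store_true", help="print the plan, change nothing")
    return parser


def send_refund(order: str, amount: Decimal, reason: str) -> None:
    """Placeholder: the real, reviewed eBay API call would live here."""
    raise NotImplementedError("wire up the real refund call here")


def main(argv=None) -> int:
    args = build_parser().parse_args(argv)  # exits with status 2 if a required flag is missing
    result = {
        "order": args.order,
        "amount": str(args.amount),
        "reason": args.reason,
        "dry_run": args.dry_run,
    }
    if args.amount > MAX_REFUND:
        # The hard limit lives in the tool, so the agent cannot talk its way past it.
        result.update(ok=False, error=f"amount exceeds policy limit of {MAX_REFUND}")
        print(json.dumps(result), file=sys.stderr)
        return EXIT_POLICY_REFUSED
    if not args.dry_run:
        send_refund(args.order, args.amount, args.reason)
    result["ok"] = True
    print(json.dumps(result))  # report exactly what was done, or would be done
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

In this sketch the agent supplies only three values and a flag. Everything else, including the refusal of a $500 refund, is decided by code that can be read, reviewed, and tested.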

This is the oldest good idea in computing wearing new clothes. It is the Unix philosophy — small programs that do one thing well and compose through text — applied to an agent instead of a shell pipeline. The agent is the shell, except it can read a manual and make judgment calls.

The properties a good operator tool has

Property	Why it matters to an agent
Single purpose	One verb, one object. `ebay-refund`, not `ebay-manage-order` with a mode flag. Easy for the agent to choose correctly.
Self-documenting	`--help` explains what it does, its arguments, and its side effects, in plain language the agent can read.
Structured output	A `--json` mode so the agent parses results reliably instead of scraping prose.
Read/write honesty	The name and help text make clear whether the tool only reads or also changes the world. Read tools are safe to run freely; write tools are not.
Dry-run for writes	`--dry-run` prints what would happen without doing it — lets the agent (and you) preview before committing.
Policy enforcement	Hard limits live in the tool, not in a hope that the agent behaves. The tool refuses out-of-policy requests with a clear error.
Loud, structured errors	On failure it exits non-zero and prints why, so the agent can react instead of assuming success.
Idempotency where possible	Running "acknowledge order 4471" twice should not send two messages. Tools that can be safely retried are tools an agent can use confidently.
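
Two of these properties, idempotency and loud structured errors, can be sketched together. The example below assumes a local JSON file as the record of acknowledged orders and takes the actual sending function as a parameter; the names `acknowledge_order`, `run`, and the ledger format are hypothetical.

```python
import json
import sys
from pathlib import Path
from typing import Callable


def acknowledge_order(order_id: str, ledger: Path, send: Callable[[str], None]) -> dict:
    """Send an acknowledgement at most once per order, using a local ledger file."""
    done = set(json.loads(ledger.read_text())) if ledger.exists() else set()
    if order_id in done:
        # Safe retry: report success without sending a second message.
        return {"ok": True, "order": order_id, "sent": False, "note": "already acknowledged"}
    send(order_id)
    done.add(order_id)
    ledger.write_text(json.dumps(sorted(done)))
    return {"ok": True, "order": order_id, "sent": True}


def run(order_id: str, ledger: Path, send: Callable[[str], None]) -> int:
    """Entry point: JSON on stdout and 0 on success, JSON on stderr and 1 on failure."""
    try:
        result = acknowledge_order(order_id, ledger, send)
    except Exception as exc:  # any failure is reported, never swallowed
        error = {"ok": False, "order": order_id, "error": str(exc)}
        print(json.dumps(error), file=sys.stderr)
        return 1
    print(json.dumps(result))
    return 0
```

This is a sketch, not a complete solution: the ledger is not safe under concurrent runs, and a crash between sending and writing the ledger could still cause a duplicate on retry. A production tool would more likely lean on an idempotency mechanism in the remote service, if one is available.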

Why CLI specifically, and not, say, MCP?

A fair question, since the agent ecosystem has richer integration points than shelling out. CLI tools earn their place for three reasons:

Universality. Virtually every coding agent (Claude Code, Codex, Pi, and the rest) can run a shell command. Build your tool belt as CLIs and it is portable across agents and even usable by a human directly. Build it as a vendor-specific plugin and you are locked in.
Inspectability. A command and its output are human-readable and log themselves naturally. You can run any tool by hand to check it.
Testability. A CLI is just a program. You unit-test it, you run it in CI, you version it. It has a life independent of any agent.

Build note. None of this forbids also exposing the tools through MCP or a function-calling interface later — a CLI and an MCP server can share the same underlying library. The point is that the contract should be expressible as plain commands first. If you can drive your operator from a terminal, any agent can drive it too.
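
One way to keep that option open is to put the logic in a plain function and make the CLI a thin wrapper around it. The sketch below assumes a hypothetical `ebay-list-orders` whose orders are passed in as plain dictionaries; a real tool would fetch them from the API, and the function names are illustrative.

```python
import argparse
import json
import sys
from typing import Iterable, List, Optional, TextIO


def list_orders(orders: Iterable[dict], status: str) -> List[dict]:
    """Core logic: no argv and no printing, so any front end can call it."""
    return [order for order in orders if order.get("status") == status]


def cli(argv: List[str], orders: Iterable[dict], out: Optional[TextIO] = None) -> int:
    """Thin CLI wrapper: parse flags, call the core function, format the result."""
    out = out or sys.stdout
    parser = argparse.ArgumentParser(
        prog="ebay-list-orders",
        description="READ tool: lists orders with the given status.",
    )
    parser.add_argument("--status", required=True, help="order status to filter on")
    parser.add_argument("--json", action="store_true", help="emit machine-readable JSON")
    args = parser.parse_args(argv)
    found = list_orders(orders, args.status)
    if args.json:
        out.write(json.dumps(found) + "\n")
    else:
        for order in found:
            out.write(f"{order['id']}\t{order['status']}\n")
    return 0
```

Because `list_orders` knows nothing about the command line, a unit test can call it directly, and a later MCP or function-calling front end could reuse it without touching the CLI.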

The mental model to carry forward

The tool belt is the fence. Everything the agent can do to the outside world, it does through a tool you wrote. If a capability isn't in the belt, the agent doesn't have it. Adding a tool is a deliberate act of granting power.

This is what makes the whole pattern safe enough to trust. The agent's autonomy is bounded not by its good intentions but by the finite, reviewable set of levers in front of it. In Part 3 we look at the most important property those levers carry: whose identity they act under.