AI agent cost control: how to stop your bill from exploding

2026-05-27 · 11 min read · By Real Problem AI

The week the Anthropic invoice arrives is the week most agent teams discover their cost model is fiction. A product priced at $29/month suddenly costs $84 per active user to run. A single trapped agent loop, undetected for 48 hours, can multiply that by ten.

This essay is the playbook we use to keep agent inference inside its lane. Four failure modes, four guardrails, three dashboards.

How agents actually overspend

1. Tool-call loops

The most common failure mode. The agent calls a tool, gets a result it cannot quite use, calls the tool again with a small variation, and so on. Each iteration adds 2-8K tokens of context, and every later request resends all of it, so cumulative input tokens grow roughly quadratically with the number of iterations. We have seen a customer-support agent burn $40 on a single ticket because it kept re-fetching the same conversation history.
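To make the quadratic growth concrete, here is a back-of-envelope sketch. It assumes every request resends the full context and each iteration appends a fixed number of tokens. The 4,000-token starting context and the 5,000-token increment are illustrative figures (the increment sits inside the 2-8K range above), not measurements.

```python
def cumulative_input_tokens(base_context: int, added_per_loop: int, loops: int) -> int:
    """Total input tokens billed across `loops` requests when every request
    resends the whole context and each iteration appends `added_per_loop` tokens."""
    total = 0
    context = base_context
    for _ in range(loops):
        total += context           # the full context is billed on every request
        context += added_per_loop  # the tool result is carried into the next request
    return total


# Illustrative numbers only: 4K starting context, 5K added per iteration.
for loops in (5, 10, 20, 40):
    print(loops, cumulative_input_tokens(4_000, 5_000, loops))
```

Under these assumptions, 10 iterations bill 265,000 input tokens and 20 iterations bill 1,030,000: doubling the loop length roughly quadruples the spend.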

2. Silent context bloat

Agents accumulate context across turns. A well-designed agent prunes; most do not. By turn 30 the system prompt, tool definitions, prior tool results and conversation history can be 80K tokens. Every subsequent request is billed for all of those input tokens again.
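One simple way to prune, sketched under assumptions: messages are plain dicts with "role" and "content" keys, tool results carry the role "tool", and anything older than the last few messages can be replaced by a short stub. The function name, the stub text and the default of six recent messages are all hypothetical; a real agent might summarise old results instead of dropping them.

```python
def prune_context(messages: list[dict], keep_recent: int = 6,
                  stub: str = "[tool result elided]") -> list[dict]:
    """Return a copy of `messages` in which tool results older than the
    last `keep_recent` messages are replaced by a short stub."""
    cutoff = max(len(messages) - keep_recent, 0)
    pruned = []
    for index, message in enumerate(messages):
        if index < cutoff and message["role"] == "tool":
            # Old tool output: keep the slot so the turn order stays intact,
            # but stop paying for the full payload on every request.
            pruned.append({**message, "content": stub})
        else:
            pruned.append(message)
    return pruned
```

The system prompt, user turns and assistant turns are left alone here; only stale tool payloads shrink, since those are usually the bulk of the bloat.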

3. Wrong-model defaults

Teams ship with the strongest model as the default ("we'll optimise later") and never optimise. Routing 90% of tasks to Sonnet when 70% would succeed on Haiku means paying 4-5x more than necessary on every task the cheaper model could have handled, and that overspend is baked into your architecture.

4. Misclassified retries

Network blips, tool failures and tokenization issues all look like "transient errors" to retry logic. Without idempotency tracking, a single user request can spawn six full LLM calls before anyone notices.
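A minimal sketch of retry logic that separates the two cases and tracks idempotency. The exception classes, the `call_once` helper and the in-memory result store are hypothetical names for illustration; how you map real provider errors onto "transient" versus "permanent" is the part that needs care, and a production version would keep results in a shared store.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (timeouts, 5xx)."""


class PermanentError(Exception):
    """Hypothetical marker for errors a retry cannot fix (bad input, oversized prompt)."""


_results: dict[str, str] = {}  # idempotency key -> completed result


def call_once(key: str, fn, max_attempts: int = 3, base_delay: float = 0.5) -> str:
    """Run `fn` at most `max_attempts` times, and never again once it succeeded."""
    if key in _results:
        return _results[key]  # same request seen before: no new LLM call
    for attempt in range(1, max_attempts + 1):
        try:
            result = fn()
        except TransientError:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
            continue
        # PermanentError is deliberately not caught: it propagates after one call.
        _results[key] = result
        return result
    raise ValueError("max_attempts must be at least 1")
```

The hard attempt limit is what bounds the damage: with `max_attempts=3`, one user request can never turn into six full calls.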

The four guardrails

Guardrail 1: per-request budget

Every request enters with a hard token budget. Track input + output + retried tokens. When the request crosses 80% of budget, downgrade the model. When it crosses 100%, terminate and return a degraded but real answer.

Sane defaults: 8K tokens for a sync chat reply, 40K for a research task, 120K for a structured agent run. Tune from your actual P95.
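The budget logic is small enough to sketch in full. This version assumes the caller records the input and output token counts from every model response, retries included; the class name and the action strings are hypothetical. It treats reaching a threshold as crossing it.

```python
from dataclasses import dataclass


@dataclass
class RequestBudget:
    limit_tokens: int
    used_tokens: int = 0

    def record(self, input_tokens: int, output_tokens: int) -> None:
        """Count every call against the budget, retried calls included."""
        self.used_tokens += input_tokens + output_tokens

    def next_action(self) -> str:
        """Decide what the agent loop may do before its next model call."""
        ratio = self.used_tokens / self.limit_tokens
        if ratio >= 1.0:
            return "terminate"  # stop and return the best answer so far
        if ratio >= 0.8:
            return "downgrade"  # finish the request on a cheaper model
        return "continue"


budget = RequestBudget(limit_tokens=8_000)  # the sync chat default above
budget.record(input_tokens=3_000, output_tokens=500)
print(budget.next_action())  # continue
```

Checking `next_action()` before each step, rather than after the request finishes, is what makes the budget a guardrail instead of a report.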

Guardrail 2: per-user, per-day cap

Set a hard daily cost cap per user. For consumer products, $2-3 covers 95% of real usage. The remaining 5% are either power users (charge them) or runaway loops (block them). Either way, the cap turns a $1,400 surprise into a $200 surprise.
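A sketch of the cap itself, assuming you can attach a dollar cost to each completed request. The `DailyCostCap` class is hypothetical and keeps spend in process memory; with more than one server you would back it with a shared store. The $2.50 default is simply the midpoint of the range above.

```python
from collections import defaultdict
from datetime import date


class DailyCostCap:
    def __init__(self, cap_usd: float = 2.50):
        self.cap_usd = cap_usd
        # (user_id, day) -> dollars spent; the day in the key resets the cap daily.
        self._spend: dict[tuple[str, date], float] = defaultdict(float)

    def allow(self, user_id: str, today: date) -> bool:
        """True while the user is still under today's cap."""
        return self._spend[(user_id, today)] < self.cap_usd

    def record(self, user_id: str, today: date, cost_usd: float) -> None:
        """Add the cost of a finished request to the user's daily total."""
        self._spend[(user_id, today)] += cost_usd
```

Call `allow` before starting a request and `record` after it. A user can overshoot the cap by at most one request, which the per-request budget already bounds.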

Guardrail 3: loop detector

Maintain a fingerprint of recent tool calls (tool name + arg hash). If the same fingerprint appears 4 times in 10 calls, you are in a loop. Break the conversation and ask the user to clarify, rather than burning tokens hoping the model figures it out.
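The detector described above fits in a few lines. This sketch assumes tool arguments are JSON-serialisable dicts; the `LoopDetector` class name is hypothetical, and the window of 10 and threshold of 4 come from the rule above.

```python
import hashlib
import json
from collections import deque


class LoopDetector:
    def __init__(self, window: int = 10, threshold: int = 4):
        self.threshold = threshold
        self.recent: deque[str] = deque(maxlen=window)  # last `window` fingerprints

    def record(self, tool_name: str, args: dict) -> bool:
        """Record a tool call; return True if it looks like a loop."""
        # Sort keys so that argument order does not change the fingerprint.
        canonical = json.dumps(args, sort_keys=True, default=str)
        fingerprint = tool_name + ":" + hashlib.sha256(canonical.encode()).hexdigest()
        self.recent.append(fingerprint)
        return self.recent.count(fingerprint) >= self.threshold


detector = LoopDetector()
for _ in range(4):
    looping = detector.record("fetch_history", {"ticket_id": 812})
print(looping)  # True on the fourth identical call
```

An exact hash only catches identical calls. To catch the "small variation" loops described earlier, you would normalise the arguments before hashing (for example by dropping pagination or timestamp fields), which is a judgment call per tool.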

Guardrail 4: model router

Route by task class, not by default. A classifier (small, cheap, fine-tuned) picks the cheapest viable model. Run it as a pre-filter to every agent step. The router itself costs almost nothing; the savings downstream are real.
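The routing layer can be sketched as a lookup table behind a classifier. Everything named here is a placeholder: the task classes, the model identifiers "small-model" and "large-model", and above all `keyword_classifier`, which is a crude stand-in for the fine-tuned classifier described above so that the example runs on its own.

```python
from typing import Callable

# Task class -> cheapest model assumed to handle it (hypothetical names).
ROUTES = {
    "lookup": "small-model",
    "summarise": "small-model",
    "plan": "large-model",
}
DEFAULT_MODEL = "large-model"  # unknown classes fall back to the strong model


def keyword_classifier(task: str) -> str:
    """Placeholder for the small fine-tuned classifier."""
    text = task.lower()
    if "summar" in text:
        return "summarise"
    if text.startswith(("what is", "find", "look up")):
        return "lookup"
    return "plan"


def route(task: str, classify: Callable[[str], str] = keyword_classifier) -> str:
    """Pick a model for one agent step."""
    return ROUTES.get(classify(task), DEFAULT_MODEL)


print(route("Summarize this ticket thread"))  # small-model
```

Falling back to the strong model on unknown classes trades some cost for safety; logging those fallbacks tells you which task classes to add next.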

Three dashboards worth checking weekly

Cost per active user, by cohort. If this number trends up while you ship features, you are accumulating debt. If it trends up without new features, you have a leak.

Top 10 most expensive requests, by user. Look at the actual transcripts. Nine times in ten, one of them is a loop. Patch the underlying logic, not the user.

Token-to-action ratio. How many tokens did you spend per useful customer action (message sent, ticket resolved, lead qualified)? This number alone tells you whether your unit economics work.
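The ratio is a single division, but the zero case matters, so here is a sketch. The function name is hypothetical, and what counts as a "useful action" is whatever your product defines; the figures in the example are made up.

```python
def tokens_per_action(total_tokens: int, useful_actions: int) -> float:
    """Tokens spent per useful customer action over some period."""
    if useful_actions == 0:
        # Spend with nothing to show for it: report it as infinitely bad
        # rather than raising ZeroDivisionError in a dashboard job.
        return float("inf")
    return total_tokens / useful_actions


# Illustrative week: 1.2M tokens spent, 400 tickets resolved.
print(tokens_per_action(1_200_000, 400))  # 3000.0
```

Multiply the result by your blended price per token and compare it with the revenue one action brings in.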

The tooling gap

The tools that exist today (Helicone, LangSmith, OpenRouter) are tracing-first, not budget-first. They show you what happened. They do not stop the next runaway loop. The opportunity we listed as AI12 is a true cost firewall: drop-in middleware, per-user budgets, auto-downgrade router, forensic playback.

Until that exists as a single product, the playbook above is what you implement yourself in roughly three engineering days. Worth every hour.
