// Glossary · technical

Inference Cost

Also: token cost · API cost · model serving cost

The per-token cost of running an AI model in production, tracked as dollars per million input tokens and dollars per million output tokens, defining the unit economics of any AI workflow.

Inference cost is what you pay every time the model produces an answer. The bill comes in two parts. Input tokens are what you send to the model: the user message, the system prompt, the retrieved context, the conversation history. Output tokens are what the model generates back. Both meter separately, and output tokens cost more than input tokens at roughly every major provider. Claude 3.5 Sonnet runs $3 per million input tokens and $15 per million output tokens. GPT-4o runs $2.50 per million input and $10 per million output. Smaller models like Claude Haiku and GPT-4o mini run 10 to 30 times cheaper. The price gap between frontier and small models drives most production architecture decisions.

Production economics depend on getting inference cost right per workflow. A support deflection bot handling 8,000 tickets a month with 2,000 input tokens and 500 output tokens per interaction costs roughly $250 a month on GPT-4o mini, or $2,500 a month on GPT-4o. The output gap matters at scale. The AI Support Department standard architecture routes routine tickets to small models and escalates only complex queries to frontier models, which keeps the blended cost per ticket under $0.05 while preserving quality on the cases that matter. Funded teams ignoring this pattern run six-figure monthly model bills for workloads that should cost a fraction.

Reducing inference cost has three levers. Pick a smaller model for the workload, which usually requires evaluation infrastructure to confirm quality holds. Compress prompts and retrieved context with techniques like LLMLingua or simple token budgeting. Cache repeated context with Anthropic prompt caching or OpenAI prompt caching, which knocks 50 to 90% off repeated context tokens at fractional cost. Self-hosting a local LLM on your own GPUs converts variable per-token cost into fixed monthly compute, which pays off above roughly 50 to 200 million tokens a month depending on model size. Below that volume, paying per token is cheaper than running your own infrastructure.

// Examples
  • A support deflection system runs 8,000 tickets a month at $0.04 per ticket on GPT-4o mini, totaling $320 a month against a previous quote of $3,200 on GPT-4o.
  • A research copilot caches a 40,000-token reference document across user queries using prompt caching, cutting input cost by 87% on repeat lookups.
  • A high-volume internal copilot crosses 80 million monthly tokens and moves to self-hosted Llama 3.1 70B on rented A100s, fixing cost at $1,800 a month against $5,500 in API fees.
// Common questions
How do I forecast inference cost before launch?
Take expected daily volume, multiply by average tokens per interaction across input and output, multiply by the model price. Add 20% for retries and tool calls. A 5,000-interaction-per-day workload at 3,000 tokens per interaction on Claude 3.5 Sonnet runs about $300 a day or $9,000 a month before optimization.
Why are output tokens more expensive than input tokens?
Generation is sequential and serialized while input processing is parallel. Each output token requires a full forward pass through the model with the full attention computation over all prior tokens, which makes generation the bottleneck at serving time. Pricing reflects the compute imbalance directly.
When does prompt caching pay off?
When the same large block of context appears in many requests. A 30,000-token reference document used across 500 queries an hour pays for itself almost immediately because the cached tokens cost a fraction of fresh input tokens. Caching does nothing for unique-per-request content.
When does self-hosting beat per-token API pricing?
Roughly above 50 to 200 million tokens a month depending on model size and GPU rental rates. Below that, paying per token is cheaper than running your own infrastructure. Above it, fixed compute cost beats variable API fees and gives you data privacy as a side effect. The crossover point moves with model price changes.
// Related terms
// Ready to ship?

EOI runs fractional AI departments for funded teams under 50. Sales, Content, Ops, Support. Live in 14 days on a monthly retainer.