Question 1

How do I forecast inference cost before launch?

Accepted Answer

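Take expected daily volume, multiply by average input and output tokens per interaction, and price each at its own rate, since output tokens cost more than input tokens. Add 20% for retries and tool calls. A 5,000-interaction-per-day workload at 3,000 tokens per interaction is 15 million tokens a day. At Claude 3.5 Sonnet list prices of $3 per million input tokens and $15 per million output tokens, that comes to $45 to $225 a day depending on the input/output split, or $54 to $270 with the 20% buffer. Budgeting about $300 a day, or $9,000 a month, is therefore a conservative ceiling before optimization.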
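
The arithmetic above can be sketched in a few lines. This is a rough estimator, not a billing tool: the function name, the 2,000/1,000 input/output split, and the $3 and $15 per-million-token prices are assumptions for illustration, so substitute your own measurements and your provider's current price sheet.

```python
def forecast_daily_cost(
    interactions_per_day: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
    overhead: float = 0.20,
) -> float:
    """Estimated daily spend in dollars, including a retry/tool-call buffer."""
    input_cost = interactions_per_day * input_tokens * input_price_per_mtok / 1_000_000
    output_cost = interactions_per_day * output_tokens * output_price_per_mtok / 1_000_000
    return (input_cost + output_cost) * (1 + overhead)


# Assumed split of the 3,000 tokens per interaction: 2,000 input, 1,000 output.
# Assumed prices: $3 per million input tokens, $15 per million output tokens.
daily = forecast_daily_cost(5_000, 2_000, 1_000, 3.00, 15.00)
monthly = daily * 30

print(f"daily: ${daily:,.2f}  monthly: ${monthly:,.2f}")
```

With this particular split the estimate is about $126 a day. A more output-heavy split pushes it higher, which is why it is worth measuring the input/output ratio on real traffic before committing to a budget.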

Question 2

Why are output tokens more expensive than input tokens?

Accepted Answer

Generation is sequential, while input processing is parallel. The whole prompt is processed in one parallel pass, but each output token needs its own forward pass that attends over all prior tokens (in practice, over their cached keys and values), so tokens are produced one at a time. That makes generation the bottleneck at serving time, and output pricing generally reflects that imbalance.

Question 3

When does prompt caching pay off?

Accepted Answer

When the same large block of context appears at the start of many requests. A 30,000-token reference document used across 500 queries an hour pays for itself almost immediately because cached tokens cost a fraction of fresh input tokens, and at that request rate the cache rarely has time to expire between hits. Caching does nothing for content that is unique to each request.
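
A quick way to check whether caching pays off is to compare the cost of sending the shared block fresh every time against one cache write plus discounted reads. The sketch below is a simplified model under stated assumptions: the function name is hypothetical, the 1.25x write and 0.10x read multipliers and the $3 per-million-token input price are illustrative values to replace with your provider's actual rates, and it assumes the cache stays warm for every request after the first.

```python
def caching_savings(
    block_tokens: int,
    requests: int,
    input_price_per_mtok: float,
    write_multiplier: float = 1.25,  # assumed surcharge for the first (cache-writing) request
    read_multiplier: float = 0.10,   # assumed discount rate for cache hits
) -> float:
    """Dollars saved on the shared block versus sending it uncached every time."""
    base = block_tokens * input_price_per_mtok / 1_000_000  # one uncached pass over the block
    uncached = requests * base
    cached = base * write_multiplier + (requests - 1) * base * read_multiplier
    return uncached - cached


# The 30,000-token document from the answer, over one hour at 500 queries.
hourly_savings = caching_savings(30_000, 500, 3.00)
print(f"saved per hour: ${hourly_savings:.2f}")

# A single request loses money (write surcharge, no reads); the second one recovers it.
print(caching_savings(30_000, 1, 3.00) < 0, caching_savings(30_000, 2, 3.00) > 0)
```

Under these assumed multipliers the break-even arrives on the second request, which matches the "almost immediately" claim. If requests are sparse enough that the cache expires between them, every request pays the write surcharge and the savings disappear.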

Question 4

When does self-hosting beat per-token API pricing?

Accepted Answer

Roughly above 50 to 200 million tokens a month, depending on model size and GPU rental rates. Below that, paying per token is cheaper than running your own infrastructure. Above it, fixed compute cost beats variable API fees, and you get data privacy as a side effect. The crossover point moves whenever API prices or GPU rental rates change.
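
The crossover can be estimated by dividing the fixed monthly cost of self-hosting by the API price per token. The sketch below is a back-of-the-envelope calculation with hypothetical inputs: one GPU rented at $2.00 an hour for 730 hours a month, and a blended API price of $10 per million tokens. It ignores engineering time, idle capacity, and whether a single GPU can actually serve the volume, all of which push the real break-even higher.

```python
def breakeven_tokens_per_month(fixed_monthly_cost: float, api_price_per_mtok: float) -> float:
    """Monthly token volume at which API spend equals the fixed self-hosting cost."""
    return fixed_monthly_cost / api_price_per_mtok * 1_000_000


# Hypothetical inputs, not quotes from any vendor.
gpu_hourly_rate = 2.00        # assumed rental rate in dollars per hour
hours_per_month = 730         # average hours in a month
blended_api_price = 10.00     # assumed dollars per million tokens, input and output mixed

fixed_cost = gpu_hourly_rate * hours_per_month
breakeven = breakeven_tokens_per_month(fixed_cost, blended_api_price)
print(f"break-even: {breakeven / 1_000_000:.0f}M tokens per month")
```

With these inputs the break-even is 146 million tokens a month, inside the range quoted above. Halving the API price doubles the break-even volume, which is why the crossover shifts every time prices change.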