The pricing page says $15 per million input tokens. You do the math, budget $200/month, and launch. Three months later, your AI feature is costing $3,400/month and you have no idea why.

This is not hypothetical. It happened to a startup I know well, and the causes were entirely predictable in retrospect. Here is the actual breakdown of LLM production costs that the pricing pages don’t cover.

The Token Math Everyone Gets Wrong

Most engineers estimate token costs by calculating their average prompt length and multiplying by expected requests. This is wrong in at least three ways.

System prompts run on every request. If your system prompt is 800 tokens and you’re doing 100,000 requests/month, that’s 80 million tokens before a user types a single character. At $3/million (GPT-4o input), that’s $240/month just for the system prompt.

Chat history compounds. In a conversational interface, you send the entire conversation history with each message, so input tokens grow roughly linearly with turn number - and total tokens across the conversation grow quadratically. By turn 20 you are sending many times what turn 1 cost. Your average tokens per request is not the length of a single message - it's the weighted average across the full conversation lifecycle.
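A quick sketch makes the compounding concrete. The numbers below are illustrative assumptions, not measurements: ~150 tokens per user or assistant message plus an 800-token system prompt. With these figures, turn 20 sends about 7x what turn 1 did; drop the system prompt and the ratio is far steeper.

```python
# Sketch: why chat history makes per-request input tokens grow with turn count.
# Assumed figures (illustrative only): 800-token system prompt, ~150 tokens
# per user or assistant message.
SYSTEM_PROMPT = 800
TOKENS_PER_MESSAGE = 150

def input_tokens_at_turn(turn: int) -> int:
    """Tokens sent on turn `turn` (1-indexed): system prompt + full history."""
    # History at turn n = (n - 1) completed user/assistant pairs + the new user message.
    history_messages = 2 * (turn - 1) + 1
    return SYSTEM_PROMPT + history_messages * TOKENS_PER_MESSAGE

total = sum(input_tokens_at_turn(t) for t in range(1, 21))
print(input_tokens_at_turn(1))   # 950 tokens on turn 1
print(input_tokens_at_turn(20))  # 6650 tokens on turn 20 (~7x turn 1)
print(total)                     # 76000 tokens for the whole 20-turn conversation
```

The exact ratio depends on your message lengths, but the shape is always the same: linear growth per turn, quadratic growth per conversation.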

Retries are invisible in dashboards. If you implement retry logic (and you should), a request that fails and gets retried bills you for both attempts. A 5% failure rate with one retry means roughly 5% of your input tokens are billed twice - a surcharge that never appears as a line item.
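The retry surcharge is a one-line multiplier worth building into your cost model. Figures below are the assumed 5% failure rate with one retry per failure:

```python
# Effective input-token cost multiplier from retries.
# Assumed figures: 5% of requests fail and are retried once, and a retried
# request re-sends its full prompt.
failure_rate = 0.05
retries_per_failure = 1

multiplier = 1 + failure_rate * retries_per_failure
print(multiplier)  # 1.05 → ~5% extra input tokens billed
```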

The Full Cost Stack

Here is the real cost breakdown for a production LLM application handling 100,000 conversations/month (average 10 turns each):

| Cost Component | Rough Monthly Cost |
| --- | --- |
| Input tokens (prompts + history) | $800 |
| Output tokens (completions) | $400 |
| System prompt tokens | $240 |
| Retries and failures | $80 |
| Embedding API calls | $60 |
| Vector database (Pinecone/pgvector) | $100 |
| Stream processing / queue | $50 |
| Observability (LangSmith, Helicone) | $50 |
| Engineering time for prompt tuning | $2,000+ |
| Realistic total | $3,780+ |

The last line is the one teams consistently underestimate. Prompt engineering is not a one-time cost. Every model update potentially breaks your prompts. Every new use case requires prompt iteration. A mid-level engineer spending 20% of their time on LLM-related issues costs you $2,000-4,000/month in labor even before salary overhead.

Model Selection Matters More Than You Think

The instinct to use the best model for everything is expensive. Claude Opus 4.6 and GPT-4o are genuinely better than smaller models, but 80% of your production requests probably don’t need them.

| Task | Recommended Model | Cost vs GPT-4o |
| --- | --- | --- |
| Classification, routing | GPT-4o mini / Haiku | 10-20x cheaper |
| Simple extraction | GPT-4o mini | 10x cheaper |
| Code generation | GPT-4o or Sonnet | 2-5x cheaper |
| Complex reasoning | GPT-4o / Opus / Gemini Ultra | baseline |
| Summarization | Sonnet / GPT-4o mini | 3-10x cheaper |

Running a cascade - use a cheap model first, escalate to expensive only when needed - can cut your LLM bill by 60-70% with minimal quality loss. The implementation takes about a week and pays for itself in month two.
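A cascade can be sketched in a few lines. Everything here is a placeholder, not a real SDK: the `Model` wrapper, the model names, and the idea that a call returns a self-reported confidence score (in practice you would use a verifier model, logprobs, or a task-specific heuristic).

```python
# Sketch of a model cascade: try a cheap model first, escalate only when its
# answer fails a confidence check. Model names, the `call` signature, and the
# confidence scores are illustrative placeholders, not a real provider API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Model:
    name: str
    cost_per_mtok: float                       # $/million input tokens (illustrative)
    call: Callable[[str], tuple[str, float]]   # returns (answer, confidence)

def cascade(prompt: str, models: list[Model], threshold: float = 0.8) -> str:
    """Walk models from cheapest to most expensive; return the first answer
    whose confidence clears the threshold."""
    for model in models[:-1]:
        answer, confidence = model.call(prompt)
        if confidence >= threshold:
            return answer
    # Fall through to the most capable (and most expensive) model.
    answer, _ = models[-1].call(prompt)
    return answer

# Toy models standing in for real API clients:
cheap = Model("mini", 0.15, lambda p: ("cheap answer", 0.9 if "easy" in p else 0.3))
big = Model("flagship", 2.50, lambda p: ("expensive answer", 0.99))

print(cascade("easy question", [cheap, big]))  # cheap answer
print(cascade("hard question", [cheap, big]))  # expensive answer
```

The design choice that matters is the confidence signal: a cascade only saves money if the cheap model can reliably tell you when it is out of its depth.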

Latency Is a Cost Too

A GPT-4o streaming response typically starts returning tokens in 800ms-2s; Claude Sonnet is often faster. Latency is also a server cost, not just a UX problem: if each request ties up a connection for 3 seconds, you can serve far fewer concurrent users per instance before you need to scale.

For a web application that needs to feel responsive, the LLM call is almost always your bottleneck. Techniques that help:

  • Streaming: Return tokens as they generate. Users perceive lower latency even if total time is the same.
  • Caching: Semantic caching with a vector similarity check before hitting the LLM. GPTCache and similar tools report 30-50% cache hit rates for conversational apps.
  • Speculative execution: For multi-step pipelines, start step 2 with the expected output of step 1 before step 1 completes. If the guess is right 70% of the time, 70% of requests get the parallel speedup; the rest fall back and rerun step 2.

Context Window Economics

Longer context windows are useful but expensive. Feeding a 50-page document into every request to answer a simple question is wasteful. Better architectures:

RAG (Retrieval-Augmented Generation): Embed documents, store in a vector database, retrieve only the relevant 5-10 chunks at query time. Reduces input tokens by 5-20x for document-heavy applications.
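The back-of-envelope math behind that savings claim, using illustrative figures (~500 tokens per page, 8 retrieved chunks of ~400 tokens):

```python
# Token savings from RAG vs stuffing the full document into every prompt.
# All figures are illustrative assumptions, not measurements.
full_document = 50 * 500      # 25,000 tokens per request
rag_context = 8 * 400         # 3,200 tokens per request
print(full_document / rag_context)  # ~7.8x fewer input tokens per request
```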

Structured summarization: For long conversations, periodically summarize earlier turns and compress them. Store the summary, discard the raw history. Memory usage and token cost both improve.
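A rolling compressor for that pattern might look like the sketch below. The `summarize` function is a placeholder for a cheap-model LLM call, and the ~4-characters-per-token heuristic is a rough assumption:

```python
# Sketch of rolling history compression: once a conversation exceeds a token
# budget, fold the oldest turns into a single summary message.
def rough_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # crude ~4 chars/token heuristic

def summarize(messages: list[str]) -> str:
    # Placeholder: in production, a cheap-model LLM call would produce this.
    return f"[summary of {len(messages)} earlier messages]"

def compress_history(messages: list[str], budget: int, keep_recent: int = 4) -> list[str]:
    """If history exceeds `budget` tokens, replace everything but the last
    `keep_recent` messages with one summary message."""
    if sum(rough_tokens(m) for m in messages) <= budget or len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [summarize(old)] + recent

history = [f"message {i}: " + "x" * 200 for i in range(10)]
compressed = compress_history(history, budget=300)
print(len(compressed))  # 5: one summary + the 4 most recent messages
```

Keeping the most recent turns verbatim matters: summaries are lossy, and the model needs the immediate context intact to respond coherently.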

Tool calling: Instead of fetching data and stuffing it into the prompt, let the model request exactly what it needs. You only pay for the tokens the model actually uses.

Observability Is Not Optional

You cannot optimize what you cannot measure. The minimum viable observability stack for an LLM application:

  • Track tokens per request (input and output separately)
  • Track latency (time-to-first-token and total)
  • Track model selection distribution
  • Track cache hit rates
  • Log prompt/completion pairs (everything in staging; sample 1-5% in production to control cost)
  • Alert on sudden cost spikes
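The per-request cost tracking and spike alerting from the list above can be sketched in a few lines. The prices, budget, and alerting hook here are all illustrative assumptions:

```python
# Minimal sketch of request-level cost tracking with a daily budget check.
# Prices per million tokens are illustrative placeholders, not real rates.
from collections import defaultdict

PRICES = {"mini": (0.15, 0.60), "flagship": (2.50, 10.00)}  # (input, output) $/Mtok

class CostTracker:
    def __init__(self, daily_budget: float):
        self.daily_budget = daily_budget
        self.daily_spend: dict[str, float] = defaultdict(float)

    def record(self, day: str, model: str, input_tokens: int, output_tokens: int) -> float:
        in_price, out_price = PRICES[model]
        cost = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
        self.daily_spend[day] += cost
        return cost

    def over_budget(self, day: str) -> bool:
        # In production, wire this to an alerting channel (Slack, PagerDuty, etc.).
        return self.daily_spend[day] > self.daily_budget

tracker = CostTracker(daily_budget=50.0)
tracker.record("2025-01-15", "flagship", 1_200_000, 300_000)
print(round(tracker.daily_spend["2025-01-15"], 2))  # 6.0
print(tracker.over_budget("2025-01-15"))            # False
```

In practice you would record tokens from the provider's usage metadata on each response rather than estimating them yourself.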

Helicone and LangSmith both add meaningful insight for $50-100/month. The ROI of catching a prompt regression or a cost spike early is 10x their cost.

Bottom Line

The real cost of running LLMs in production is 3-5x the raw token price. System prompts, conversation history, retries, vector storage, observability, and engineering time all add up faster than you expect.

The biggest lever is model selection - most requests don’t need your most expensive model. Build a cascade, implement semantic caching, optimize your system prompts, and measure everything. Teams that do this cut their effective LLM cost by 50-70% without user-visible quality degradation.