The cheapest optimization most teams skip is not routing to smaller models or trimming the context window. It is not calling the model at all. When ten users ask “how do I reset my password?” your app pays to generate that answer ten times. Every one of those generations after the first is pure waste.
Caching LLM responses is not a new idea, but most implementations are either too naive (exact-match string hashing that misses 90% of cacheable requests) or too aggressive (semantic matching that returns wrong answers). The gap between those two extremes is where production systems live, and it is worth understanding the mechanics before you build.
The Three Caching Layers
There are three distinct caching opportunities in a typical LLM app:
Exact-match caching hashes the full prompt string and stores the response. Trivially correct, but near-zero hit rate in practice. Users phrase the same question differently every time.
Semantic caching embeds the query, finds cached queries that are semantically similar above a threshold, and returns the stored response without calling the model. Hit rates of 20-50% for support-style workloads. This is the main subject of this post.
Prompt caching (API-level) is different from both of the above. Providers like Anthropic and OpenAI cache the KV attention state of your prompt prefix on their servers and bill cache-hit input tokens at a steep discount (90% off on Anthropic; the exact discount varies by provider and model). You still call the model - it just runs faster and cheaper on the suffix. This is a provider feature, not something you build.
All three layers are compatible and stack together. A well-designed system uses all three.
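For the exact-match layer, light normalization before hashing recovers trivially different phrasings with no risk of a wrong hit. The following is a minimal sketch, assuming an in-memory dict stands in for Redis; `normalize_query`, `exact_key`, and `ExactMatchCache` are illustrative names, not part of any library.

```python
import hashlib
import re
import unicodedata
from typing import Optional


def normalize_query(query: str) -> str:
    """Fold case, Unicode form, and whitespace so trivial variants share a key."""
    text = unicodedata.normalize("NFKC", query).casefold()
    text = re.sub(r"\s+", " ", text).strip()
    # Drop trailing sentence punctuation only; inner punctuation can carry meaning.
    return text.rstrip("?!. ")


def exact_key(query: str, namespace: str = "global") -> str:
    """Stable SHA-256 key (Python's built-in hash() differs between processes)."""
    digest = hashlib.sha256(normalize_query(query).encode("utf-8")).hexdigest()
    return f"exact:{namespace}:{digest}"


class ExactMatchCache:
    """In-memory stand-in for a Redis hash; hypothetical, for illustration only."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def get(self, query: str, namespace: str = "global") -> Optional[str]:
        return self._store.get(exact_key(query, namespace))

    def set(self, query: str, response: str, namespace: str = "global") -> None:
        self._store[exact_key(query, namespace)] = response
```

Normalization here is deliberately conservative: it never merges two queries that differ in wording, only in casing, spacing, or trailing punctuation.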
Prompt Caching - The Easy Win You Might Already Have
If you use Claude or GPT-4o and your system prompt runs to a thousand tokens or more (providers only cache prefixes above a minimum length, typically around 1,024 tokens), you are probably leaving money on the table.
Prompt caching works at the token level. When a request arrives, the provider checks whether the prompt prefix matches a recently cached state. If it does, the model picks up computation from that checkpoint rather than reprocessing the prefix from scratch.
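For Anthropic’s models, tokens read from the cache cost 90% less than standard input pricing (writing a prefix to the cache costs slightly more than a normal input token, which the estimate below ignores). A 2,000-token system prompt cached across 50,000 requests per day saves roughly: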
Standard: 50,000 requests x 2,000 tokens x $3.00/1M = $300/day
Cached: 50,000 requests x 2,000 tokens x $0.30/1M = $30/day
Savings: $270/day = $8,100/month
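To rerun that arithmetic with your own traffic, the calculation can be sketched as a small function. The `hit_rate` parameter is an addition for illustration: it models requests that miss the cache and pay the standard price. The cache-write premium is still ignored, and the prices are the ones from the example above, not a current price list.

```python
def daily_input_cost(requests_per_day: int, prefix_tokens: int,
                     price_per_million: float) -> float:
    """Cost of processing the prompt prefix for one day at a given token price."""
    return requests_per_day * prefix_tokens * price_per_million / 1_000_000


def prompt_cache_savings(requests_per_day: int, prefix_tokens: int,
                         standard_price: float, cached_price: float,
                         hit_rate: float = 1.0, days: int = 30) -> dict:
    """Compare uncached vs cached prefix cost; misses pay the standard price."""
    standard = daily_input_cost(requests_per_day, prefix_tokens, standard_price)
    cached_hits = daily_input_cost(requests_per_day, prefix_tokens, cached_price)
    with_cache = hit_rate * cached_hits + (1.0 - hit_rate) * standard
    daily_savings = standard - with_cache
    return {
        "standard_per_day": standard,
        "cached_per_day": with_cache,
        "savings_per_day": daily_savings,
        "savings_per_period": daily_savings * days,
    }


# Numbers from the example above: 50,000 requests/day, 2,000-token prefix.
example = prompt_cache_savings(50_000, 2_000, standard_price=3.00, cached_price=0.30)
```

Lowering `hit_rate` to 0.8 shows how quickly the savings shrink when traffic is too sparse to keep the prefix warm.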
What qualifies for caching
The prefix must be identical across requests. Any change in the prefix breaks the cache. Common patterns that cache well:
- System prompts (instructions, persona, rules)
- Few-shot examples placed before the user turn
- Long reference documents prepended to every request (e.g. a full spec sheet)
- RAG context that is the same within a session
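Conversation history is harder to cache: every turn extends the prompt, so only the turns before your last breakpoint can be reused, and any edit, truncation, or reordering of earlier turns invalidates the cached prefix.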
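The practical consequence is an ordering rule: stable content first, per-request content last. A minimal sketch of that rule follows, producing the same block shape as an Anthropic `system` list with a `cache_control` marker. The helper names `build_system_blocks` and `build_messages` are hypothetical, and nothing here calls the API.

```python
from typing import Optional


def build_system_blocks(instructions: str,
                        reference_doc: Optional[str] = None) -> list:
    """Return system blocks with the breakpoint on the last stable block.

    Everything up to and including the marked block forms the cached prefix,
    so nothing request-specific (timestamps, user names) belongs in here.
    """
    blocks = [{"type": "text", "text": instructions}]
    if reference_doc:
        blocks.append({"type": "text", "text": reference_doc})
    blocks[-1]["cache_control"] = {"type": "ephemeral"}
    return blocks


def build_messages(user_message: str, per_request_context: str = "") -> list:
    """Put anything that varies per request after the cached prefix."""
    if per_request_context:
        content = f"{per_request_context}\n\n{user_message}"
    else:
        content = user_message
    return [{"role": "user", "content": content}]
```

A common way to break the cache by accident is interpolating the current date or the user's name into the system prompt; with this split, such values go through `per_request_context` instead.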
To use Anthropic’s prompt caching, mark cache breakpoints explicitly:
import anthropic

client = anthropic.Anthropic()
user_message = "How do I reset my password?"  # example input

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a customer support agent for Acme Corp...\n[2000 tokens of instructions]",
            "cache_control": {"type": "ephemeral"},  # Mark this prefix for caching
        }
    ],
    messages=[{"role": "user", "content": user_message}],
)

# Check cache usage in response
print(response.usage.cache_read_input_tokens)      # Tokens served from cache
print(response.usage.cache_creation_input_tokens)  # Tokens written to cache (first miss)
The cache TTL is 5 minutes on Anthropic, refreshed on each hit. For high-traffic apps this means near-100% cache hit rate on stable prefixes.
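Whether you actually reach that hit rate is worth verifying rather than assuming. A minimal sketch of a running counter fed from the usage fields shown above follows. It assumes `input_tokens` counts only the uncached part of the prompt, so the three fields sum to the total input; `PromptCacheStats` is an illustrative name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PromptCacheStats:
    """Accumulates token counts from response.usage across requests."""
    read: int = 0       # tokens served from cache
    written: int = 0    # tokens written to cache on a miss
    uncached: int = 0   # tokens processed at the standard rate

    def record(self, input_tokens: Optional[int],
               cache_read_input_tokens: Optional[int] = 0,
               cache_creation_input_tokens: Optional[int] = 0) -> None:
        # Fields may be missing (None) on some responses; treat them as zero.
        self.uncached += input_tokens or 0
        self.read += cache_read_input_tokens or 0
        self.written += cache_creation_input_tokens or 0

    def hit_ratio(self) -> float:
        """Share of all input tokens that were served from cache."""
        total = self.read + self.written + self.uncached
        return self.read / total if total else 0.0
```

A ratio that stays low on a stable prefix usually points to a prefix that changes between requests or to traffic gaps longer than the TTL.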
Semantic Caching - The Bigger Opportunity
Prompt caching saves money on the token processing side. Semantic caching saves money by skipping the model call entirely.
The naive implementation looks obvious:
from sentence_transformers import SentenceTransformer
import numpy as np


class NaiveSemanticCache:
    def __init__(self, threshold=0.95):
        self.model = SentenceTransformer("all-MiniLM-L6-v2")
        self.threshold = threshold
        self.entries = []  # (embedding, query, response)

    def get(self, query: str):
        q_emb = self.model.encode(query)
        for emb, _, response in self.entries:
            similarity = np.dot(q_emb, emb) / (np.linalg.norm(q_emb) * np.linalg.norm(emb))
            if similarity > self.threshold:
                return response
        return None

    def set(self, query: str, response: str):
        emb = self.model.encode(query)
        self.entries.append((emb, query, response))
This works in a demo and breaks in production in four specific ways.
Where the naive cache breaks
Wrong-hit at threshold edges. At 0.95 cosine similarity, “how do I cancel my subscription?” and “how do I pause my subscription?” often score above threshold. Same semantic neighborhood, different intent, wrong answer returned. Users see the wrong information and your support tickets go up.
Stale responses. You return a cached response from 3 days ago. In the meantime, your pricing changed. The cache has no concept of recency or staleness - it will serve the old answer indefinitely.
No context awareness. “What is my current balance?” from user A gets cached. User B sends the same query and gets user A’s balance. Obvious example, but the general pattern is: anything that depends on who is asking, what session they are in, or what data they have cannot be cached across users or sessions without namespacing.
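Linear scan does not scale. Scanning all cached embeddings for similarity is O(n) per lookup, on top of the cost of embedding the query. In a Python loop the scan alone reaches tens to hundreds of milliseconds once the cache holds tens of thousands of entries, and it grows with every write, eroding the latency advantage the cache was supposed to deliver.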
Building It Right
A production semantic cache needs four things the naive version lacks: proper similarity thresholds (tuned per use case), TTL-based expiry, cache key namespacing, and a vector index instead of linear scan.
Threshold tuning
There is no universal threshold. Start here:
| Use case | Recommended threshold | Reason |
|---|---|---|
| FAQ / customer support | 0.97-0.98 | High precision needed, slight phrasing differences matter less |
| Product search queries | 0.92-0.95 | “red running shoes” and “running shoes in red” should hit |
| Code generation prompts | 0.99+ | Rarely cacheable, slight differences in spec matter a lot |
| Document summarization | 0.98-0.99 | Same document, same summary - but verify by doc hash too |
Lower threshold = more hits = more wrong answers. Higher threshold = fewer wrong answers = closer to exact match with little benefit. Pick the threshold where wrong-hit rate drops below 1% and measure it in production.
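One way to pick that threshold from data rather than intuition is to label a sample of query pairs and sweep candidate values. The sketch below assumes you already have `(similarity, is_correct)` pairs, where `is_correct` records whether serving the cached answer would have been right; here the wrong-hit rate is measured as wrong hits divided by all hits. The function names are illustrative.

```python
from typing import Iterable, List, Optional, Tuple

Sample = Tuple[float, bool]  # (cosine similarity, cached answer was correct)


def evaluate_threshold(samples: List[Sample], threshold: float) -> Tuple[float, float]:
    """Return (hit_rate, wrong_hit_rate) for one threshold."""
    hits = [ok for sim, ok in samples if sim >= threshold]
    hit_rate = len(hits) / len(samples) if samples else 0.0
    wrong_hit_rate = hits.count(False) / len(hits) if hits else 0.0
    return hit_rate, wrong_hit_rate


def pick_threshold(samples: List[Sample], candidates: Iterable[float],
                   max_wrong_hit_rate: float = 0.01) -> Optional[float]:
    """Lowest candidate with at least one hit and a wrong-hit rate under budget."""
    for threshold in sorted(candidates):
        hit_rate, wrong = evaluate_threshold(samples, threshold)
        if hit_rate > 0 and wrong < max_wrong_hit_rate:
            return threshold
    return None  # no candidate is safe enough; do not cache this workload


labeled = [(0.99, True), (0.98, True), (0.96, True),
           (0.95, False), (0.93, True), (0.90, False)]
chosen = pick_threshold(labeled, [0.92, 0.94, 0.96, 0.98])
```

Six samples are far too few for a real decision; the point is the shape of the sweep. With a 1% budget you need several hundred labeled hits per threshold before the estimate means much.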
TTL and cache invalidation
Every cached entry needs an expiry. For most support-style content, 24-48 hours is reasonable. For anything that references pricing, policies, or inventory, tie TTL to the underlying data change:
import hashlib
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Optional

import numpy as np
import redis
from sentence_transformers import SentenceTransformer


@dataclass
class CacheEntry:
    query: str
    response: str
    embedding: list
    created_at: float
    ttl_seconds: int
    namespace: str
    metadata: dict = field(default_factory=dict)


class SemanticCache:
    def __init__(self, redis_client: redis.Redis, threshold=0.97, default_ttl=86400):
        self.redis = redis_client
        self.encoder = SentenceTransformer("all-MiniLM-L6-v2")
        self.threshold = threshold
        self.default_ttl = default_ttl

    def _cache_key(self, namespace: str) -> str:
        return f"semcache:{namespace}"

    def get(self, query: str, namespace: str = "global") -> Optional[str]:
        q_emb = self.encoder.encode(query)
        now = time.time()
        # Pull all entries for this namespace (still a linear scan; see the next section)
        raw_entries = self.redis.hgetall(self._cache_key(namespace))
        best_score = 0.0
        best_response = None
        stale_keys = []
        for key, raw in raw_entries.items():
            entry = CacheEntry(**json.loads(raw))
            # Skip expired entries and remember them for deletion
            if now - entry.created_at > entry.ttl_seconds:
                stale_keys.append(key)
                continue
            emb = np.array(entry.embedding)
            score = float(np.dot(q_emb, emb) / (np.linalg.norm(q_emb) * np.linalg.norm(emb)))
            if score > best_score:
                best_score = score
                best_response = entry.response
        # Delete expired entries inline; move this to a background job if it adds latency
        if stale_keys:
            self.redis.hdel(self._cache_key(namespace), *stale_keys)
        if best_score >= self.threshold:
            return best_response
        return None

    def set(self, query: str, response: str, namespace: str = "global",
            ttl: Optional[int] = None, metadata: Optional[dict] = None):
        emb = self.encoder.encode(query).tolist()
        entry = CacheEntry(
            query=query,
            response=response,
            embedding=emb,
            created_at=time.time(),
            ttl_seconds=ttl or self.default_ttl,
            namespace=namespace,
            metadata=metadata or {},
        )
        # Built-in hash() is randomized per process, so use a stable digest as the field name
        cache_id = hashlib.sha256(query.encode("utf-8")).hexdigest()
        self.redis.hset(self._cache_key(namespace), cache_id, json.dumps(asdict(entry)))

    def invalidate_namespace(self, namespace: str):
        self.redis.delete(self._cache_key(namespace))
The namespace parameter is the key to safety. User-specific queries use the user ID as namespace. Session-specific queries use the session ID. Shared knowledge base queries use a shared namespace. This prevents cross-user cache pollution.
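A small helper can make the namespace choice explicit and also tie invalidation to data changes, as suggested earlier for pricing and policy content. The sketch below is an assumption-laden illustration: the `scope` values and the `data_version` suffix are hypothetical conventions, not part of the cache class above. Bumping the version when the underlying data changes makes old entries unreachable, and their TTL cleans them up later.

```python
from typing import Optional


def cache_namespace(scope: str, *, user_id: Optional[str] = None,
                    session_id: Optional[str] = None,
                    data_version: Optional[str] = None) -> str:
    """Build a namespace string for shared, per-user, or per-session entries."""
    if scope == "shared":
        parts = ["shared"]
    elif scope == "user":
        if not user_id:
            raise ValueError("user scope requires user_id")
        parts = ["user", user_id]
    elif scope == "session":
        if not (user_id and session_id):
            raise ValueError("session scope requires user_id and session_id")
        parts = ["user", user_id, "session", session_id]
    else:
        raise ValueError(f"unknown scope: {scope!r}")
    if data_version:
        # Entries written under an older version are never looked up again.
        parts.append(f"v{data_version}")
    return ":".join(parts)
```

Raising on a missing ID is deliberate: silently falling back to a shared namespace is exactly how one user's answer ends up in front of another.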
Scaling the lookup with a vector index
For anything above a few thousand cache entries, replace the hash-scan with a vector index. pgvector on Postgres is the pragmatic choice if you already run Postgres:
-- Requires the pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE semantic_cache (
    id BIGSERIAL PRIMARY KEY,
    namespace TEXT NOT NULL,
    query TEXT NOT NULL,
    response TEXT NOT NULL,
    embedding VECTOR(384),  -- all-MiniLM-L6-v2 produces 384-dimensional vectors
    created_at TIMESTAMPTZ DEFAULT NOW(),
    expires_at TIMESTAMPTZ NOT NULL,
    metadata JSONB DEFAULT '{}'
);

CREATE INDEX ON semantic_cache USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);
CREATE INDEX ON semantic_cache (namespace, expires_at);
Then the lookup becomes a single indexed query:
SELECT response, 1 - (embedding <=> $1) AS similarity
FROM semantic_cache
WHERE namespace = $2
AND expires_at > NOW()
ORDER BY embedding <=> $1
LIMIT 1;
-- Filter similarity >= threshold in application code
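With the index in place, lookups stay in the low-millisecond range as the table grows to millions of entries, provided `lists` (and the query-time `ivfflat.probes` setting) are tuned for the table size. IVFFlat is an approximate index, so an occasional near match can be missed; for a cache, that only costs a fallback model call.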
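The "filter in application code" step from the query above can be sketched as follows. This assumes a DB-API style connection such as psycopg, which uses `%s` placeholders instead of `$1`, and passes the embedding as a pgvector text literal. The `lookup` and `to_vector_literal` helpers are illustrative names.

```python
from typing import Any, Optional, Sequence

LOOKUP_SQL = """
SELECT response, 1 - (embedding <=> %s::vector) AS similarity
FROM semantic_cache
WHERE namespace = %s
  AND expires_at > NOW()
ORDER BY embedding <=> %s::vector
LIMIT 1
"""


def to_vector_literal(embedding: Sequence[float]) -> str:
    """Format floats as pgvector's text input form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(f"{x:.6f}" for x in embedding) + "]"


def lookup(conn: Any, embedding: Sequence[float], namespace: str,
           threshold: float = 0.97) -> Optional[str]:
    """Return the nearest cached response, or None if it is below threshold."""
    vec = to_vector_literal(embedding)
    with conn.cursor() as cur:
        cur.execute(LOOKUP_SQL, (vec, namespace, vec))
        row = cur.fetchone()
    if row is None:
        return None  # namespace is empty or everything in it has expired
    response, similarity = row
    return response if similarity >= threshold else None
```

Keeping the threshold out of the SQL makes it easy to log near misses (for example, similarity between 0.90 and the threshold), which is useful input for threshold tuning.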
The Complete Cache Layer
A production LLM app with all three caching layers looks like this:
User Query
    |
    v
[Exact Match Cache] -- Redis hash on normalized query string
    |-- HIT: return response (0ms, $0)
    |
    |-- MISS:
    v
[Semantic Cache] -- pgvector or Redis with vector index
    |-- HIT: return response (5-20ms, $0)
    |
    |-- MISS:
    v
[LLM API Call] -- with prompt caching enabled for stable prefix
    |-- CACHE HIT on prefix: 90% cheaper input tokens, ~30% faster
    |-- COLD: full token cost
    |
    v
[Store in semantic + exact cache] -- async, non-blocking
    |
    v
Response to User
The exact-match cache handles repeated identical queries (rare but free to check). The semantic cache handles rephrased questions (the main value). The prompt cache operates inside the API call itself and is always active at zero implementation cost once you mark your cache breakpoints.
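The control flow in the diagram can be sketched as a single function with the layers injected as callables. This is a simplified illustration: `answer_with_cache` and its parameters are hypothetical names, and the write-back runs synchronously here, whereas the diagram calls for doing it asynchronously.

```python
from typing import Callable, Optional, Tuple

Getter = Callable[[str], Optional[str]]


def answer_with_cache(query: str,
                      exact_get: Getter,
                      semantic_get: Getter,
                      call_llm: Callable[[str], str],
                      store: Callable[[str, str], None]) -> Tuple[str, str]:
    """Return (response, source), where source is 'exact', 'semantic', or 'llm'."""
    cached = exact_get(query)
    if cached is not None:
        return cached, "exact"

    cached = semantic_get(query)
    if cached is not None:
        # Semantic hits are not copied into the exact cache: a wrong hit
        # would otherwise be promoted to a permanent "exact" answer.
        return cached, "semantic"

    response = call_llm(query)  # prompt caching applies inside this call
    store(query, response)      # write to both caches; async in production
    return response, "llm"
```

Returning the source alongside the response makes it cheap to log which layer served each request, which is the raw data for the hit-rate metrics.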
Where NOT to Cache
Semantic caching is wrong for a category of queries that looks tempting but causes real harm:
Personalized responses. “What should I buy next?” or “Summarize my account activity” depend on user state. Even with user namespacing, if the user’s state changed since the cache was written, you serve stale personalized data. Namespace these by user AND session, with short TTLs (a few minutes at most), or do not cache them.
Code generation and debugging. “Fix this bug in my code” queries are superficially similar across users but the actual code differs. At 0.99 threshold you would need to include the full code in the cache key, at which point you are back to exact-match. Skip semantic caching for code tasks.
Real-time or stateful queries. “Is the database replication lag above threshold?” or “What is the stock price of X?” must not be cached, or only for seconds. Most LLM apps do not query for real-time data, but if yours does, exclude those query patterns from caching entirely.
Queries with ambiguous scope. “Tell me about the latest update” is a trap. Cached from last week, it returns information about last week’s update, silently. Either expand the query at ingestion time to include temporal anchors or set TTLs measured in hours.
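These exclusions can be enforced with a gate that runs before any cache lookup. The sketch below uses a handful of regular expressions as a crude stand-in; the patterns and category names are assumptions chosen to match the examples above, and a real deployment would more likely route on the intent label its app already computes.

```python
import re
from typing import Optional

# Checked in order; the first matching category wins. Patterns are illustrative.
_NO_CACHE_PATTERNS = [
    ("code", re.compile(r"```|\b(fix this bug|traceback|stack trace|debug)\b", re.I)),
    ("personalized", re.compile(
        r"\bmy\s+(?:\w+\s+)?(account|balance|orders?|activity|invoices?)\b"
        r"|\bwhat should i buy\b", re.I)),
    ("realtime", re.compile(
        r"\b(stock price|replication lag|right now|real[- ]time)\b", re.I)),
    ("ambiguous_time", re.compile(
        r"\b(latest|newest|today|yesterday|this week|most recent)\b", re.I)),
]


def cache_exclusion_reason(query: str) -> Optional[str]:
    """Return the category that rules out shared caching, or None if cacheable."""
    for reason, pattern in _NO_CACHE_PATTERNS:
        if pattern.search(query):
            return reason
    return None
```

A query that returns a reason skips the semantic cache entirely and goes straight to the model; logging the reason shows how much traffic each exclusion rule removes from the cacheable pool.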
Measuring Cache Performance
Track these metrics in production:
| Metric | Target | What it tells you |
|---|---|---|
| Semantic cache hit rate | 20-50% for support, 5-15% for code | Below 10% on support workloads means threshold is too high |
| Wrong-hit rate | < 1% | Sample cache hits and manually score correctness monthly |
| P95 cache lookup latency | < 20ms | If higher, your vector index needs tuning |
| Cost per request (cached vs uncached) | Track separately | Quantifies actual savings |
| Cache entry age distribution | Most hits on entries < 24h | Old hits are higher staleness risk |
Wrong-hit rate is the hardest to measure automatically. The best proxy is downstream: if semantic cache hit rate goes up but user satisfaction or task completion drops, wrong hits are happening. Set up a periodic sampling pipeline that pulls recent cache hits and scores them with a cheap model.
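The sampling step can be sketched as below. It assumes each logged cache hit is a `(query, matched_query, response)` tuple and that `judge` is a callable you supply, for example a wrapper around a cheap model that returns `True` when the response actually answers the query. Both the tuple shape and the function name are illustrative.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

CacheHit = Tuple[str, str, str]  # (incoming query, matched cached query, response)


def estimate_wrong_hit_rate(hits: Sequence[CacheHit],
                            judge: Callable[[str, str, str], bool],
                            sample_size: int = 200,
                            seed: Optional[int] = None
                            ) -> Tuple[float, List[CacheHit]]:
    """Score a random sample of hits; return (wrong-hit rate, wrong examples)."""
    if not hits:
        return 0.0, []
    rng = random.Random(seed)
    sample = rng.sample(list(hits), min(sample_size, len(hits)))
    wrong = [hit for hit in sample if not judge(*hit)]
    return len(wrong) / len(sample), wrong
```

Returning the wrong examples alongside the rate matters more than the number itself: pairs like "cancel" versus "pause" tell you whether to raise the threshold or exclude a query pattern.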
The Honest Assessment
Prompt caching (API-level) is a no-brainer: near-zero implementation cost and typically 10-40% off the total bill for any app with a long system prompt, depending on how much of each request is stable prefix. Do it first.
Semantic caching has genuine leverage for support, FAQ, and knowledge base workloads - 20-50% cost reduction with acceptable correctness if you tune the threshold carefully. It is harder to implement correctly than it looks, mostly because of the wrong-hit and staleness problems.
What does not work: applying semantic caching uniformly across all query types without namespace isolation, setting the threshold by intuition without measuring wrong-hit rate, or caching without TTLs and hoping stale data never surfaces.
The teams I have seen get the most from caching all share one trait: they measure wrong-hit rate explicitly, not just hit rate. Chasing hit rate without tracking correctness is how you build a fast system that silently gives users wrong answers.
Build the prompt caching layer this week. Build semantic caching for your FAQ and support flows. Skip it for everything personalized or real-time. The 40% cost reduction in the title is achievable, but only if the cached responses are actually correct.