An FAQ bot I built for a SaaS company was burning $4,200/month on Claude API calls. After implementing semantic caching, that dropped to $1,800. The code took two days. The ROI was immediate.
The core idea: users ask the same question in different ways. "What's your refund policy?" and "how do I get my money back?" are semantically identical. Exact-match string caching misses the second. Semantic caching catches both.
Exact-match vs semantic caching
Exact-match caching stores the raw query string as a cache key. If two requests are byte-for-byte identical, you return the cached response. Works great for programmatic callers with templated queries. Fails for any natural language interface where users paraphrase.
Semantic caching embeds the query into a vector, searches for similar cached queries, and returns a cached response if similarity exceeds a threshold. This is the one you want for chatbots, support agents, and search interfaces.
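To make the limitation of exact-match keys concrete, here is a minimal sketch of an exact-match cache. The helper names (`exact_key`, `exact_get`, `exact_put`) and the placeholder answer are illustrative assumptions, and the whitespace/case normalization is an optional extra on top of strict byte-for-byte matching.

```python
import hashlib

# In-memory stand-in for a real key-value store
exact_cache: dict[str, str] = {}


def exact_key(query: str) -> str:
    """Normalize case and whitespace, then hash the result into a cache key."""
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def exact_put(query: str, response: str) -> None:
    exact_cache[exact_key(query)] = response


def exact_get(query: str):
    """Return the cached response, or None if no entry has the same key."""
    return exact_cache.get(exact_key(query))


exact_put("What's your refund policy?", "placeholder refund answer")

# Same wording with different case and spacing still maps to the same key
same_wording = exact_get("what's  your REFUND policy?")

# A paraphrase hashes to a different key, so the lookup misses
paraphrase = exact_get("how do I get my money back?")
```

Normalization rescues trivial variations, but no amount of string cleanup maps the paraphrase to the same key. That gap is what the embedding lookup fills.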
When semantic caching applies:
- FAQ and support bots where users phrase the same question differently
- Search interfaces with synonymous queries
- Structured data extraction tasks (same underlying data, slightly rephrased request)
When it doesn't:
- Personalized responses (user account data, history-dependent answers)
- Time-sensitive queries ("what's the stock price now?")
- Creative tasks where variation is the point
- Anything with user-specific context that changes the correct answer
GPTCache: the fast path
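GPTCache is a Python library that handles embedding, similarity search, and storage for you. It ships drop-in adapters for OpenAI clients; it does not intercept calls made through the Anthropic SDK, so for Claude you wrap your own call with its generic `get`/`put` API. Setup is fast: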
pip install gptcache anthropic
import anthropic
from gptcache import cache
from gptcache.adapter.api import get, init_similar_cache, put
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

# Initialize GPTCache with local SQLite + FAISS (good for dev/testing)
onnx = Onnx()
init_similar_cache(
    cache_obj=cache,
    embedding=onnx,
    data_manager=get_data_manager(
        CacheBase("sqlite"),
        # The vector dimension must match the embedding model's output size
        VectorBase("faiss", dimension=onnx.dimension),
    ),
    evaluation=SearchDistanceEvaluation(),
)

client = anthropic.Anthropic()

def ask(question: str) -> str:
    # Check GPTCache for a similar cached question before hitting the API
    cached = get(question)
    if cached is not None:
        return cached
    response = client.messages.create(
        model="claude-3-5-haiku-latest",
        max_tokens=500,
        messages=[{"role": "user", "content": question}],
    )
    answer = response.content[0].text
    # Store the answer so later paraphrases can reuse it
    put(question, answer)
    return answer

# First call: hits the API
print(ask("What is your return policy?"))
# Second call with a paraphrase: cache hit (no API call) if it clears the threshold
print(ask("How do I get a refund?"))
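GPTCache's similarity threshold is configurable: pass `Config(similarity_threshold=...)` as the `config` argument to `init_similar_cache`. The default is 0.8 in recent releases, but check your installed version. A lower threshold means more aggressive caching (more hits, more risk of returning slightly wrong answers).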
For production, swap the local SQLite/FAISS backend for Redis:
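from gptcache.embedding import Onnx
from gptcache.manager import VectorBase

onnx = Onnx()
vector_store = VectorBase(
    "redis",
    host="your-redis-host",
    port=6379,
    password="your-password",
    # Must match the embedding model's output size, not an arbitrary number
    dimension=onnx.dimension,
)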
Upstash offers managed Redis with a free tier and pay-per-request pricing, a good fit for semantic cache workloads with bursty traffic.
Prefer self-hosting? A Hostinger KVM 1 VPS (~₹300/month) running Redis directly gives you unlimited requests with no per-call pricing.
Building a semantic cache from scratch
If you want full control, skip GPTCache and build it yourself. This is more code but gives you explicit control over the similarity threshold, TTL, and cache invalidation.
import hashlib
import json

import numpy as np
import redis
from anthropic import Anthropic
from openai import OpenAI

client = Anthropic()
embedding_client = OpenAI()  # reads OPENAI_API_KEY from the environment


class SemanticCache:
    def __init__(
        self,
        redis_url: str = "redis://localhost:6379",
        similarity_threshold: float = 0.92,
        ttl_seconds: int = 86_400,  # 24 hours
        embedding_model: str = "text-embedding-3-small",
    ):
        self.r = redis.from_url(redis_url)
        self.threshold = similarity_threshold
        self.ttl = ttl_seconds
        self.embedding_model = embedding_model
        self.cache_prefix = "semcache:"

    def _embed(self, text: str) -> np.ndarray:
        """Get embedding for a query. Using OpenAI here for embeddings;
        swap with any embedding API or local model."""
        response = embedding_client.embeddings.create(
            model=self.embedding_model,
            input=text,
        )
        return np.array(response.data[0].embedding, dtype=np.float32)

    def _cosine_similarity(self, a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def get(self, query: str) -> str | None:
        """Check cache. Returns cached response if similar query found, else None."""
        query_embedding = self._embed(query)

        # scan_iter uses SCAN under the hood, so it iterates incrementally
        # instead of blocking Redis the way KEYS would
        best_similarity = 0.0
        best_response = None
        for key in self.r.scan_iter(f"{self.cache_prefix}*"):
            entry = self.r.get(key)
            if not entry:  # key expired between SCAN and GET
                continue
            data = json.loads(entry)
            cached_embedding = np.array(data["embedding"], dtype=np.float32)
            similarity = self._cosine_similarity(query_embedding, cached_embedding)
            if similarity > best_similarity:
                best_similarity = similarity
                best_response = data["response"]

        if best_response is not None and best_similarity >= self.threshold:
            print(f"Cache hit (similarity={best_similarity:.3f})")
            return best_response
        return None

    def set(self, query: str, response: str) -> None:
        """Store query-response pair in cache."""
        embedding = self._embed(query)
        # Use a hash of the query as the key
        key = f"{self.cache_prefix}{hashlib.md5(query.encode()).hexdigest()}"
        data = {
            "query": query,
            "response": response,
            "embedding": embedding.tolist(),
        }
        self.r.setex(key, self.ttl, json.dumps(data))

    def ask(self, query: str, system_prompt: str = "") -> str:
        """Check cache first, hit API if miss.

        Entries are keyed by the query alone, so use a separate cache (or
        prefix) per system prompt if the prompt changes the correct answer.
        """
        cached = self.get(query)
        if cached is not None:
            return cached

        # Cache miss: call the API
        messages = [{"role": "user", "content": query}]
        kwargs = {"model": "claude-3-5-haiku-latest", "max_tokens": 500, "messages": messages}
        if system_prompt:
            kwargs["system"] = system_prompt
        response = client.messages.create(**kwargs)
        answer = response.content[0].text
        self.set(query, answer)
        return answer


# Usage
cache = SemanticCache(similarity_threshold=0.92, ttl_seconds=86_400)
answer1 = cache.ask("What is your return policy?")
answer2 = cache.ask("How do I request a refund?")  # Cache hit if similar enough
The 0.92 threshold is a good starting point. Below 0.88 you'll get false positives: cached answers returned for questions that are similar but not equivalent. Above 0.95 you'll miss obvious paraphrases. These cutoffs are specific to the embedding model, so re-check them if you switch models. Tune the threshold by logging similarity scores for a week and examining borderline cases.
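One way to work through those logs is to hand-label a sample of lookups and count what each candidate threshold would have done. The sketch below assumes you have collected `(similarity, is_equivalent)` pairs; the sample values are made up for illustration and `sweep_thresholds` is a hypothetical helper, not part of any library.

```python
def sweep_thresholds(samples, thresholds):
    """Count hits, false hits, and missed paraphrases at each threshold.

    samples: iterable of (similarity, is_equivalent) pairs, where
    is_equivalent is a manual label saying the cached answer was correct.
    """
    samples = list(samples)
    rows = []
    for t in thresholds:
        hits = sum(1 for s, _ in samples if s >= t)
        false_hits = sum(1 for s, ok in samples if s >= t and not ok)
        missed = sum(1 for s, ok in samples if s < t and ok)
        rows.append({"threshold": t, "hits": hits,
                     "false_hits": false_hits, "missed": missed})
    return rows


# Hypothetical labeled log entries, for illustration only
labeled = [(0.97, True), (0.93, True), (0.90, False), (0.89, True), (0.86, False)]

report = sweep_thresholds(labeled, [0.88, 0.92, 0.95])
for row in report:
    print(row)
```

Pick the lowest threshold whose false-hit count you can tolerate; for a support bot, a wrong cached answer usually costs more than a missed hit.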
Using Redis Vector Search (RediSearch) for scale
The implementation above scans all cache entries linearly — it's O(n) and won't scale past a few thousand entries. For production, use RediSearch's vector similarity search:
import numpy as np
import redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query

r = redis.Redis(host="localhost", port=6379)

# Create vector index (run once at startup)
schema = [
    TextField("query"),
    TextField("response"),
    VectorField(
        "embedding",
        "HNSW",
        # DIM must match the embedding model (1536 for text-embedding-3-small)
        {"TYPE": "FLOAT32", "DIM": 1536, "DISTANCE_METRIC": "COSINE"},
    ),
]
try:
    r.ft("semcache_idx").create_index(
        schema,
        definition=IndexDefinition(prefix=["semcache:"], index_type=IndexType.HASH),
    )
except redis.ResponseError as exc:
    # Restarting the app re-runs this; ignore only the "already exists" case
    if "Index already exists" not in str(exc):
        raise


def vector_search(query_embedding: np.ndarray, top_k: int = 1):
    """Find most similar cached query using HNSW index.

    The returned score is cosine distance (0 = identical), so compare
    1 - score against your similarity threshold.
    """
    query_bytes = query_embedding.astype(np.float32).tobytes()
    q = (
        Query(f"*=>[KNN {top_k} @embedding $vec AS score]")
        .sort_by("score")
        .return_fields("query", "response", "score")
        .dialect(2)
    )
    results = r.ft("semcache_idx").search(q, query_params={"vec": query_bytes})
    return results.docs
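HNSW is an approximate nearest-neighbor index: search cost grows roughly logarithmically with the number of entries, and it handles millions of them. This is what you want for high-traffic production deployments. Note that the index only covers Redis hashes under the `semcache:` prefix whose `embedding` field holds raw FLOAT32 bytes, so the JSON strings written by the from-scratch class are not indexed as-is.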
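The index definition reads Redis hashes under the `semcache:` prefix, and a COSINE index reports distance rather than similarity. A minimal sketch of the two missing pieces follows: writing entries in that layout and applying the threshold to a search result. `store_entry`, `best_hit`, and `to_float32_bytes` are hypothetical helpers; `r` is assumed to be a redis-py client and `docs` the list returned by a KNN search that returns `response` and `score` fields.

```python
import hashlib
import struct
from typing import Optional, Sequence


def to_float32_bytes(vector: Sequence[float]) -> bytes:
    """Pack floats as little-endian FLOAT32, matching the index's TYPE."""
    return struct.pack(f"<{len(vector)}f", *vector)


def store_entry(r, query: str, response: str, embedding: Sequence[float],
                ttl_seconds: int = 86_400) -> str:
    """Write one cache entry as a Redis hash under the indexed prefix."""
    key = "semcache:" + hashlib.md5(query.encode()).hexdigest()
    r.hset(key, mapping={
        "query": query,
        "response": response,
        "embedding": to_float32_bytes(embedding),
    })
    r.expire(key, ttl_seconds)  # hashes need an explicit EXPIRE for a TTL
    return key


def best_hit(docs, threshold: float = 0.92) -> Optional[str]:
    """Return the nearest cached response if it clears the similarity threshold."""
    if not docs:
        return None
    nearest = docs[0]  # results are sorted by ascending distance
    similarity = 1.0 - float(nearest.score)  # cosine distance -> similarity
    return nearest.response if similarity >= threshold else None
```

With these in place, a lookup is `best_hit(vector_search(embedding))`, and a miss ends with `store_entry(r, query, answer, embedding)`.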
Real cache hit rates and cost savings
Cache hit rates vary by use case:
| Use case | Expected cache hit rate |
|---|---|
| FAQ bot (support) | 30-60% |
| E-commerce search | 20-40% |
| General chat | 5-15% |
| Structured extraction | 40-70% |
| Creative writing | <5% |
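Cost savings math (input tokens only; cache hits also skip the output tokens, so real savings are higher): if you're running 10,000 queries/day on Claude 3.5 Haiku at $0.80/million input tokens (average 500 input tokens per query = $0.0004 per query), and you achieve a 40% cache hit rate: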
- Without cache: 10,000 × $0.0004 = $4.00/day → $1,460/year
- With 40% cache hits: 6,000 × $0.0004 = $2.40/day → $876/year
- Annual savings: $584
At Sonnet pricing ($3/million input tokens), those numbers multiply by 3.75x: $2,190/year in savings from a two-day implementation.
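The arithmetic above is easy to rerun with your own traffic and pricing. This is a small sketch that reproduces the input-token figures; `annual_input_cost` is a hypothetical helper, and it ignores output tokens, embedding calls, and cache hosting.

```python
def annual_input_cost(queries_per_day: float, tokens_per_query: float,
                      price_per_million: float, hit_rate: float = 0.0) -> float:
    """Yearly input-token spend, billing only the queries that miss the cache."""
    billable_queries = queries_per_day * (1.0 - hit_rate)
    cost_per_query = tokens_per_query * price_per_million / 1_000_000
    return billable_queries * cost_per_query * 365


# Haiku example from the text: 10,000 queries/day, 500 tokens, $0.80/million
haiku_without = annual_input_cost(10_000, 500, 0.80)
haiku_with = annual_input_cost(10_000, 500, 0.80, hit_rate=0.40)
haiku_savings = haiku_without - haiku_with

# Same traffic at Sonnet input pricing ($3/million)
sonnet_savings = (annual_input_cost(10_000, 500, 3.00)
                  - annual_input_cost(10_000, 500, 3.00, hit_rate=0.40))

print(f"Haiku savings:  ${haiku_savings:,.2f}/year")
print(f"Sonnet savings: ${sonnet_savings:,.2f}/year")
```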
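Cache infrastructure costs matter too. A Redis instance on Upstash or AWS ElastiCache costs $15-50/month depending on size. Compare that with your own numbers: the Haiku example above saves about $49/month on input tokens, so the cache clearly pays for itself only at higher traffic, with longer prompts and responses, or on a pricier model.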
For the embedding API calls: text-embedding-3-small costs $0.02 per million tokens, so embedding a 500-token query costs about $0.00001. That's negligible against API savings. You can also use a local embedding model (all-MiniLM-L6-v2 via sentence-transformers) to eliminate that cost entirely.
TTL strategy
Don't cache indefinitely. Your underlying information changes, and stale cached answers are worse than no cache.
- FAQ/policy content: 24-48 hour TTL. Refresh when you update your docs.
- Product information: 1-4 hour TTL if prices/availability change frequently
- General knowledge queries: 7-day TTL
- Time-sensitive queries: Don't cache at all
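The TTL tiers above can live in one lookup table so the policy stays in a single place. This is a sketch under assumptions: the category names are invented for illustration, each TTL is one point picked from the ranges listed, and `r` is assumed to be a redis-py client.

```python
from typing import Optional

HOUR = 3600

# Category names are hypothetical; TTLs are single values from the ranges above
TTL_SECONDS = {
    "faq_policy": 24 * HOUR,
    "product_info": 1 * HOUR,
    "general_knowledge": 7 * 24 * HOUR,
    "time_sensitive": None,  # never cache
}


def ttl_for(category: str) -> Optional[int]:
    """Return the TTL in seconds, or None when the response must not be cached.

    Unknown categories also return None, which errs on the side of not caching.
    """
    return TTL_SECONDS.get(category)


def store_with_ttl(r, key: str, payload: str, category: str) -> bool:
    """Write the entry with its category's TTL; skip uncacheable categories."""
    ttl = ttl_for(category)
    if ttl is None:
        return False
    r.setex(key, ttl, payload)
    return True
```

Treating unknown categories as uncacheable is a deliberate default: a missed cache hit costs one API call, while a stale answer costs user trust.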
When you update your knowledge base or system prompt significantly, flush the cache. A simple cache version prefix in your Redis keys makes this trivial:
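import hashlib

CACHE_VERSION = "v3"
# Use this prefix for both writes and lookups so old versions stop matching
CACHE_PREFIX = f"semcache:{CACHE_VERSION}:"
key = f"{CACHE_PREFIX}{hashlib.md5(query.encode()).hexdigest()}"
# Bump CACHE_VERSION to invalidate all entries; old keys then expire via their TTL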
What NOT to cache
Some queries look cacheable but aren't:
- "What's my order status?" — answer depends on user identity
- "Summarize the document I uploaded" — answer depends on the attached file
- "Is this medication safe for me?" — requires personalization; returning a generic answer is dangerous
- "What's happening in the news today?" — time-sensitive
- Any query where the correct answer requires the user's account data
Add a query classifier that routes personalized queries past the cache entirely. A simple keyword list works as a first pass:
SKIP_CACHE_PATTERNS = [
    "my order", "my account", "my subscription", "my usage",
    "today", "right now", "current", "latest",
    "i uploaded", "the file i", "the document i",
]

def should_skip_cache(query: str) -> bool:
    q = query.lower()
    return any(pattern in q for pattern in SKIP_CACHE_PATTERNS)
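Here is one way the classifier could sit in front of the cache. This is a sketch: `answer_query` is a hypothetical wrapper, and `cache_get`, `cache_set`, and `call_llm` are stand-ins for whatever lookup, store, and API functions you already have.

```python
from typing import Callable, Optional

SKIP_CACHE_PATTERNS = [
    "my order", "my account", "my subscription", "my usage",
    "today", "right now", "current", "latest",
    "i uploaded", "the file i", "the document i",
]


def should_skip_cache(query: str) -> bool:
    q = query.lower()
    return any(pattern in q for pattern in SKIP_CACHE_PATTERNS)


def answer_query(
    query: str,
    cache_get: Callable[[str], Optional[str]],
    cache_set: Callable[[str, str], None],
    call_llm: Callable[[str], str],
) -> str:
    """Route personalized or time-sensitive queries straight to the model."""
    if should_skip_cache(query):
        # Neither read nor write: the answer is only valid for this user/moment
        return call_llm(query)
    cached = cache_get(query)
    if cached is not None:
        return cached
    response = call_llm(query)
    cache_set(query, response)
    return response
```

Skipping the write as well as the read matters: if a personalized answer were stored, a later generic query could match it and leak one user's data to another.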
Combine this with the token counting guide to build a complete cost optimization layer: cache hits eliminate API calls entirely, and for cache misses, token budgeting ensures you're not spending more than necessary per call.
For larger-scale cost optimization across your full agent stack, the agent cost optimization guide covers model routing, batching, and other strategies that compound with caching. The build vector store guide goes deeper on FAISS, Chroma, and Pinecone if you want more control over the vector search layer.
Semantic caching won't work for every app. But for any LLM feature with a FAQ or support use case, it's one of the highest-ROI optimizations available.



