The $9,000 Problem
You launch a chatbot powered by one of the popular LLMs like Gemini, Claude or GPT-4. It’s amazing and your users love it. Then you check your API bill at the end of the month: $15,000.
Looking into the logs, you discover that users are asking the same question in lots of different ways.
“How do I reset my password?”
“I forgot my password”
“Can’t log in, need password help”
“Reset my password please”
“I need to change my password”
Your LLM treats each of these as a completely different request. You’re paying for the same answer 5 times. Multiply that by thousands of users and hundreds of common questions, and suddenly you understand why your bill is so high.
Traditional caching won’t help, because these queries don’t match exactly. You need semantic caching.
What is Semantic Caching?
Semantic caching uses vector embeddings to match queries by their meaning, not their exact text.
Traditional cache versus semantic cache
With traditional caching, we match query strings and only return the cached value on an exact match.
Query: “What’s the weather?” -> Cached
Query: “How’s the weather?” -> Cache MISS
Query: “Tell me about the weather” -> Cache MISS
Hit rate: ~10-15% for typical chatbots
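To see why the hit rate stays so low, here is a minimal sketch of exact-match caching with a plain Python dictionary (the query and answer strings are illustrative):

```python
# Exact-match caching: the cache key is the raw query string.
cache = {}

def lookup(query):
    return cache.get(query)  # returns None unless the string matches exactly

cache["What's the weather?"] = "Mild and rainy today."

print(lookup("What's the weather?"))   # hit
print(lookup("How's the weather?"))    # None -- a paraphrase misses
```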
With semantic caching, we create an embedding of the query and match on meaning.
Query: “What’s the weather?” -> Cached
Query: “How’s the weather?” -> Cache HIT
Query: “Tell me about the weather” -> Cache HIT
Hit rate: ~40-70% for typical chatbots
How It Works
- Convert the query to a vector: “What’s the weather?” -> [0.123, -0.456, 0.789, …]
- Store the vector in a vector database: Redis or Valkey with the Search module
- Search for similar vectors: when a new query comes in, find stored vectors with a cosine similarity of at least 0.85
- Return the cached response: if a match is found, return it instantly; otherwise, call the LLM and cache the result
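The whole flow can be sketched end to end without a database, using an in-memory list in place of the vector store (call_llm is a placeholder here; the real implementation with Valkey follows later in this post):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
cache = []  # list of (embedding, response) pairs

def call_llm(query):
    # Placeholder for a real LLM call (see "Integrate with Your LLM" below)
    return f"LLM answer for: {query}"

def ask(query, threshold=0.85):
    vec = model.encode(query, normalize_embeddings=True)
    for cached_vec, response in cache:
        # Vectors are normalized, so the dot product is the cosine similarity
        if float(np.dot(vec, cached_vec)) >= threshold:
            return response              # cache hit
    response = call_llm(query)           # cache miss
    cache.append((vec, response))
    return response
```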
Why do you need this?
Cost savings
Here is a real-world example from our testing: a customer support chatbot handling 10,000 queries per day.
| Scenario | Daily Cost | Monthly Cost | Annual Cost |
|---|---|---|---|
| Claude Sonnet (no cache) | $41.00 | $1,230 | $14,760 |
| Claude Sonnet (60% hit rate) | $16.40 | $492 | $5,904 |
| Savings | $24.60 | $738 | $8,856 |
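As a rough sketch, the savings scale linearly with the hit rate, since only cache misses pay for an LLM call (the per-query cost below is derived from the table above):

```python
def daily_cost(queries_per_day, cost_per_query, hit_rate):
    """Only cache misses trigger a paid LLM call; hits are essentially free."""
    return queries_per_day * cost_per_query * (1 - hit_rate)

base = daily_cost(10_000, 0.0041, 0.0)    # ~$41.00/day, no cache
cached = daily_cost(10_000, 0.0041, 0.6)  # ~$16.40/day at a 60% hit rate

print(f"Monthly savings: ${(base - cached) * 30:,.2f}")  # ~$738
```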
Speed Improvements
In our testing, a Gemini API call can take around 7 seconds, whereas a cache hit takes a total of 27 ms: 23 ms for the embedding, 2 ms for the ValkeySearch lookup and 1 ms to fetch the stored response. That is roughly a 250x speed-up! Users get instant responses to common questions instead of waiting several seconds.
Consistent quality
Returning the same answer to semantically similar questions means a more consistent user experience, and because matching is done on meaning rather than wording, it can even work across languages (given a multilingual embedding model).
Building a Semantic Cache
What do you need?
- Vector Database: Redis or Valkey with the corresponding Search module
- Embedding Model: sentence-transformers (local, free)
- Python: 3.8+
That’s it! Just a few open source components.
Installation
```bash
# Install dependencies
pip install valkey numpy sentence-transformers

# Start Valkey bundle container (or use Redis)
docker run -p 16379:6379 --name my-valkey-bundle -d valkey/valkey-bundle
```
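Before going further, it is worth checking that the container is reachable (this assumes the port mapping from the docker command above):

```python
from valkey import Valkey

client = Valkey(host="localhost", port=16379, decode_responses=True)
print(client.ping())  # True if the Valkey bundle is up and reachable
```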
Example implementation
Step 1: Create the Vector Index
```python
from valkey import Valkey
from valkey.commands.search.field import VectorField
from valkey.commands.search.indexDefinition import IndexDefinition, IndexType

VECTOR_DIM = 384          # all-MiniLM-L6-v2 produces 384-dimensional vectors
CACHE_PREFIX = "cache:"   # all cached documents share this key prefix
INDEX_NAME = "cache_idx"

client = Valkey(host="localhost", port=16379, decode_responses=True)

# Define schema with vector field
schema = (
    # Vector field stored in JSON at $.embedding
    VectorField(
        "$.embedding",
        "FLAT",  # or "HNSW" for larger datasets
        {
            "TYPE": "FLOAT32",
            "DIM": VECTOR_DIM,
            "DISTANCE_METRIC": "COSINE",
        },
        as_name="embedding",
    ),
)

# Create index on JSON documents
definition = IndexDefinition(
    prefix=[CACHE_PREFIX],
    index_type=IndexType.JSON,  # Use JSON instead of HASH
)

client.ft(INDEX_NAME).create_index(
    fields=schema,
    definition=definition,
)
```
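One practical note: create_index raises an error if the index already exists, so in practice you may want to guard the call, for example:

```python
from valkey.exceptions import ResponseError

try:
    client.ft(INDEX_NAME).create_index(fields=schema, definition=definition)
except ResponseError:
    # Index already exists -- safe to continue and reuse it
    pass
```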
Step 2: Generate Embeddings
```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

def generate_embedding(text: str) -> np.ndarray:
    """Generate normalized embedding vector."""
    return model.encode(
        text,
        convert_to_numpy=True,
        normalize_embeddings=True
    )
```
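A quick sanity check: because the vectors are normalized, cosine similarity is just a dot product, and paraphrases should score well above unrelated queries.

```python
a = generate_embedding("What's the weather?")
b = generate_embedding("How's the weather?")
c = generate_embedding("How do I reset my password?")

print(float(a @ b))  # high -- paraphrases land close together
print(float(a @ c))  # much lower -- unrelated queries are far apart
```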
Step 3: Cache Management
```python
import json
import hashlib
import time

from valkey.commands.search.query import Query

def cache_response(query: str, response: str):
    """Store query-response pair with embedding."""
    embedding = generate_embedding(query)
    cache_key = f"{CACHE_PREFIX}{hashlib.md5(query.encode()).hexdigest()}"

    doc = {
        "query": query,
        "response": response,
        "embedding": embedding.tolist(),
        "timestamp": time.time(),
    }

    client.execute_command("JSON.SET", cache_key, "$", json.dumps(doc))
    client.expire(cache_key, 3600)  # 1 hour TTL

def get_cached_response(query: str, threshold: float = 0.85):
    """Search for semantically similar cached query."""
    embedding = generate_embedding(query)

    # KNN search for similar vectors
    query_obj = (
        Query("*=>[KNN 1 @embedding $vec AS score]")
        .return_fields("query", "response", "score")
        .dialect(2)
    )

    results = client.ft(INDEX_NAME).search(
        query_obj,
        {"vec": embedding.tobytes()}
    )

    if results.docs:
        doc = results.docs[0]
        similarity = 1 - float(doc.score)  # Convert cosine distance to similarity
        if similarity >= threshold:
            print(f"Cache HIT! (similarity: {similarity:.1%})")
            return doc.response

    print("Cache miss")
    return None
```
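A minimal round trip, assuming the index from Step 1 has been created (the example strings are illustrative):

```python
cache_response("What's the weather in London?", "Mild and rainy today.")

print(get_cached_response("How's the weather in London today?"))  # likely a HIT
print(get_cached_response("Explain quantum entanglement"))        # miss -> None
```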
Step 4: Integrate with Your LLM
```python
import google.generativeai as genai

# Assumes genai.configure(api_key=...) has been called during setup
def chat(query: str) -> str:
    """Chat with caching layer."""
    # Check cache first
    cached = get_cached_response(query)
    if cached:
        return cached

    # Cache miss - call LLM
    model = genai.GenerativeModel('gemini-pro')
    response = model.generate_content(query)

    # Cache for future queries
    cache_response(query, response.text)

    return response.text
```
Real-world results
Demo
I built a demo using the Google Gemini API and tested it with a variety of queries. Here is an example of semantic caching in action; the first question is always a cache MISS.
```text
==============================================================
Query: Predict the weather in London for 2026, Feb 3
==============================================================
Cache MISS (similarity: 0.823 < 0.85)
Cache miss - calling Gemini API...
API call completed in 6870ms
Tokens: 1,303 (16 in / 589 out)
Cost: $0.000891
Cached as JSON: 'Predict the weather in London for 2026, Feb 3...'
```
A question with the same meaning produces a cache HIT.
```text
==============================================================
Query: Tell me about the weather in London for 2026 Feb 3rd
==============================================================
Cache HIT (similarity: 0.911, total: 25.3ms)
├─ Embedding: 21.7ms | Search: 1.5ms | Fetch: 0.9ms
└─ Matched: 'Predict the weather in London for 2026, Feb 3...'
```
The API call took almost 7 seconds, while the cached answer took only about 25 ms, 22 ms of which was spent generating the embedding.
From this testing we can estimate the returns of implementing a semantic cache:
- Cache hit rate: 60% and thus 60% cost savings
- Speed improvement: 250x faster (27ms vs 6800ms)
You can extrapolate these results to your expected query volume (e.g. 10,000 queries per day) to estimate your total savings and the ROI of the semantic cache. The speed improvement will also significantly improve your user experience!
Configuration: important levers
1. Similarity Threshold
The magic number that determines when queries are “similar enough”:
```python
cache = SemanticCache(similarity_threshold=0.85)
```
Guidelines
- 0.95+: very strict – near-identical queries only
- 0.85-0.90: recommended – catches paraphrases, good balance
- 0.75-0.85: moderate – more cache hits, some false positives
- <0.75: too lenient – risk of wrong answers
Trade-off
Higher = fewer hits but more accurate. Lower = more hits but potential mismatches.
2. Time-to-Live (TTL)
How long to cache responses for. This follows the standard rule of matching the TTL to how frequently your data changes.
```python
cache = SemanticCache(ttl_seconds=3600)  # 1 hour
```
Guidelines
- 5 minutes: real-time data (weather, stocks, news)
- 1 hour: recommended for general queries
- 24 hours: stable content, documentation
- 7 days: historical data
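One simple way to apply these guidelines is to pick the TTL per query category at caching time (the category names here are illustrative):

```python
TTL_BY_CATEGORY = {
    "weather": 300,          # 5 minutes -- real-time data
    "general": 3600,         # 1 hour -- sensible default
    "documentation": 86400,  # 24 hours -- stable content
    "historical": 604800,    # 7 days
}

def ttl_for(category):
    return TTL_BY_CATEGORY.get(category, 3600)

# e.g. client.expire(cache_key, ttl_for("weather"))
```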
3. Embedding Model
Different models offer different trade-offs:

| Model | Dimensions | Speed | Quality | Best For |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Fast ✓ | Good | Production |
| all-mpnet-base-v2 | 768 | Medium | Better | Higher quality needs |
| OpenAI text-embedding-3 | 1536 | API call | Best | Maximum quality |
For most applications, all-MiniLM-L6-v2 is perfect: fast, good quality, runs locally.
Storage options: HASH versus JSON
You can store cached data in two ways: as a HASH or as a JSON document.
HASH Storage (Simple)
```python
# Store as HASH with binary vector blob
cache_data = {
    "query": query,
    "response": response,
    "embedding": vector.tobytes(),  # Binary
    "metadata": json.dumps({"category": "weather"})
}

client.hset(cache_key, mapping=cache_data)
```
Pros: Simple, widely compatible
Cons: Limited querying, vectors as opaque blobs
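If you do go with HASH storage, note that the index definition differs from Step 1: the vector field is addressed by its field name rather than a JSON path, and the index type is HASH. A rough sketch:

```python
from valkey.commands.search.field import VectorField, TextField
from valkey.commands.search.indexDefinition import IndexDefinition, IndexType

hash_schema = (
    TextField("query"),
    TextField("response"),
    VectorField(
        "embedding",  # plain field name, not a JSON path
        "FLAT",
        {"TYPE": "FLOAT32", "DIM": 384, "DISTANCE_METRIC": "COSINE"},
    ),
)

client.ft("cache_hash_idx").create_index(
    fields=hash_schema,
    definition=IndexDefinition(prefix=["cache:"], index_type=IndexType.HASH),
)
```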
JSON Storage (Recommended)
```python
# Store as JSON document with native vector array
cache_doc = {
    "query": query,
    "response": response,
    "embedding": vector.tolist(),  # Native array
    "metadata": {
        "category": "weather",
        "tags": ["forecast", "current"]
    }
}

client.execute_command("JSON.SET", cache_key, "$", json.dumps(cache_doc))
```
Pros: Native vectors, flexible queries, easy debugging
Cons: Requires the ValkeyJSON or RedisJSON module
Use JSON storage in production because of its flexibility and the speed advantage it offers in this scenario.
Use cases: when to use semantic caching
Perfect fit (60-80% hit rates)
Customer Support Chatbots
- Users ask the same questions many different ways
- “How do I reset my password?” = “I forgot my password” = “Can’t log in”
- High volume, repetitive queries
FAQ Systems
- Limited topic domains
- Same questions repeated constantly
- Documentation queries
Code Assistants
- “How do I sort a list in Python?” variations
- Common programming questions
- Tutorial-style queries
Not ideal (<30% hit rates)
Unique Creative Content
- Story generation
- Custom art descriptions
- Personalized content
- Every query is different
Highly Personalized Responses
- User-specific context required
- Cannot share cached responses
- Privacy concerns
Real-time Dynamic Data
- Stock prices changing second-by-second
- Live sports scores
- Breaking news
- Use very short TTLs if caching at all
Common pitfalls and how to avoid them
1. Threshold too low
If the threshold is too low, the cache can return the wrong answer, as in the example below. Keep the similarity threshold at 0.80 or higher.
Query: “Python programming tutorial”
Matches: “Python snake care guide” (similarity: 0.76)
2. Vectors not normalized
If the embeddings are not normalized, similarity scores can come out negative or greater than 1.0. Always use normalize_embeddings=True:
Cache MISS (similarity: -0.023 < 0.85) # Should be ~0.95!
```python
embedding = model.encode(
    text,
    normalize_embeddings=True  # Critical!
)
```
3. TTL too long
Setting the time-to-live (TTL) too high can lead to stale cached data and thus wrong answers. Match the TTL to the volatility of your data.
Query: “Who is the current president?”
Response: “Joe Biden”
4. Not monitoring hit rates
If you don’t monitor your hit rates, the cache’s effectiveness is unknown and any fine-tuning is guesswork. Log every cache hit and miss, track the metrics and set alerts.
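A minimal sketch of what that can look like, wrapping the chat() function from Step 4 with simple counters (in production you would export these to your metrics system):

```python
from collections import Counter

metrics = Counter()

def chat_with_metrics(query):
    cached = get_cached_response(query)
    if cached is not None:
        metrics["hits"] += 1
        return cached
    metrics["misses"] += 1
    # chat() re-checks the cache (a harmless miss), calls the LLM and caches the result
    return chat(query)

def hit_rate():
    total = metrics["hits"] + metrics["misses"]
    return metrics["hits"] / total if total else 0.0
```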
Production Checklist
Before deploying to production
- Vectors are normalized (normalize_embeddings=True)
- Similarity threshold tuned (test with real queries)
- TTL set appropriately (match to data freshness needs)
- Monitoring in place (hit rates, latency, costs)
- Error handling (fallback to LLM if cache fails)
- Cache warming (pre-populate common queries; see the sketch after this list)
- Privacy considered (separate caches per user/tenant if needed)
- Metadata rich (category, tags for filtering/invalidation)
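For the cache-warming item, a rough sketch is simply to run your most common questions through the cached chat() function at deploy time (the query list here is illustrative):

```python
COMMON_QUERIES = [
    "How do I reset my password?",
    "What are your opening hours?",
    "How do I cancel my subscription?",
]

def warm_cache():
    for query in COMMON_QUERIES:
        # A miss calls the LLM and caches the answer; a hit is effectively a no-op
        chat(query)
```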
Real-World Impact
Let’s recap with a realistic scenario: a chatbot that receives 10,000 queries per day, uses Claude Sonnet ($3 per 1M input tokens) and averages 200 tokens in and 200 tokens out per query.
Without semantic caching:
- Cost: ~$1,230/month
- Avg response time: 1.8 seconds
- Users wait for every response
With semantic caching (60% hit rate):
- Cost: ~$492/month ($738 saved)
- Avg response time: ~750ms (combining hits and misses)
- Users get instant answers for common questions
- Infrastructure cost: $50/month
This gives you net savings of $688/month ($8,256/year), along with happier users, faster support and a better overall experience.
Conclusion
Caching has long been the answer to speeding up responses and saving costs by avoiding regenerating the same results or repeating the same API calls. Semantic caching applies that idea to LLMs: instead of treating every query as unique, you recognize that users ask the same questions in countless ways, and you only pay for the answer once.
As we’ve seen, the savings in time and money are worth it:
- 60% cache hit rate (conservative)
- 250x faster responses for cached queries
- $8,000+ annual savings at moderate scale
- 1-2 days to implement
If you’re building with LLMs, semantic caching isn’t optional; it’s essential for production applications.
Have questions? Drop them in the comments or reach out.
Found this helpful? Share it with others who might benefit from semantic caching!
Further Reading
- ValkeySearch Documentation
- Sentence Transformers Models
- Redis Vector Similarity
- HNSW Algorithm Paper