You launch a chatbot powered by one of the popular LLMs like Gemini, Claude or GPT-4. It’s amazing and your users love it. Then you check your API bill at the end of the month: $15,000.
Looking into the logs, you discover that users are asking the same question in lots of different ways.
“How do I reset my password?”
“I forgot my password”
“Can’t log in, need password help”
“Reset my password please”
“I need to change my password”
Your LLM treats each of these as a completely different request. You’re paying for the same answer 5 times. Multiply that by thousands of users and hundreds of common questions, and suddenly you understand why your bill is so high.
Traditional caching won’t help here: these queries don’t match exactly. You need semantic caching.
Semantic caching uses vector embeddings to match queries by their meaning, not their exact text.
With traditional caching, we match strings and return the cached value on a match.
Query: “What’s the weather?” -> Cached
Query: “How’s the weather?” -> Cache MISS
Query: “Tell me about the weather” -> Cache MISS
Hit rate: ~10-15% for typical chatbots
With semantic caching, we create an embedding of the query and match on meaning.
Query: “What’s the weather?” -> Cached
Query: “How’s the weather?” -> Cache HIT
Query: “Tell me about the weather” -> Cache HIT
Hit rate: ~40-70% for typical chatbots
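Under the hood, “matching on meaning” is a cosine-similarity comparison between embedding vectors. A minimal sketch with NumPy, using toy 4-dimensional vectors in place of real sentence embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 means identical direction, ~0.0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for real sentence embeddings
cached_q    = np.array([0.9, 0.1, 0.3, 0.2])     # "What's the weather?"
paraphrase  = np.array([0.85, 0.15, 0.25, 0.2])  # "How's the weather?" -> similar direction
unrelated_q = np.array([0.0, 0.9, 0.0, 0.1])     # different topic -> different direction

print(f"paraphrase: {cosine_similarity(cached_q, paraphrase):.3f}")   # high -> cache HIT
print(f"unrelated:  {cosine_similarity(cached_q, unrelated_q):.3f}")  # low  -> cache MISS
```

Real embedding models map paraphrases to nearby directions in a few hundred dimensions; the comparison itself is exactly this.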
A real-world example from our testing: a customer support chatbot handling 10,000 queries per day.
| Scenario | Daily Cost | Monthly Cost | Annual Cost |
| --- | --- | --- | --- |
| Claude Sonnet (no cache) | $41.00 | $1,230 | $14,760 |
| Claude Sonnet (60% hit rate) | $16.40 | $492 | $5,904 |
| Savings | $24.60 | $738 | $8,856 |
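The cached rows follow from simple arithmetic: only cache misses reach the LLM API, so cost scales with the miss rate. Using the table’s $41.00/day baseline:

```python
daily_cost_no_cache = 41.00  # Claude Sonnet, no cache (from the table)
hit_rate = 0.60              # fraction of queries served from cache

# Only cache misses pay for an API call
daily_cost_with_cache = daily_cost_no_cache * (1 - hit_rate)
monthly_savings = (daily_cost_no_cache - daily_cost_with_cache) * 30

print(f"${daily_cost_with_cache:.2f}/day with cache")
print(f"${monthly_savings:.2f}/month saved")
```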
In our testing, an API call to Gemini can take around 7 seconds, whereas a cache hit takes a total of 27 ms: 23 ms for the embedding, 2 ms for the valkey-search vector lookup, and 1 ms to fetch the stored response. That’s roughly a 250x speed-up: users get near-instant responses to common questions instead of waiting multiple seconds.
Getting the same answer for semantically similar questions means a better user experience, as well as language-independent answers.
That’s it! Just a few open source components.
```shell
# Install dependencies
pip install valkey numpy sentence-transformers

# Start Valkey bundle container (or use Redis)
docker run -p 16379:6379 --name my-valkey-bundle -d valkey/valkey-bundle
```
```python
from valkey import Valkey
from valkey.commands.search.field import VectorField
from valkey.commands.search.indexDefinition import IndexDefinition, IndexType

client = Valkey(host="localhost", port=16379, decode_responses=True)

VECTOR_DIM = 384         # output size of all-MiniLM-L6-v2
CACHE_PREFIX = "cache:"
INDEX_NAME = "cache_idx"

# Define schema with vector field
schema = (
    # Vector field stored in JSON at $.embedding
    VectorField(
        "$.embedding",
        "FLAT",  # or "HNSW" for larger datasets
        {
            "TYPE": "FLOAT32",
            "DIM": VECTOR_DIM,
            "DISTANCE_METRIC": "COSINE",
        },
        as_name="embedding",
    ),
)

# Create index on JSON documents
definition = IndexDefinition(
    prefix=[CACHE_PREFIX],
    index_type=IndexType.JSON,  # Use JSON instead of HASH
)

client.ft(INDEX_NAME).create_index(
    fields=schema,
    definition=definition,
)
```
```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

def generate_embedding(text: str) -> np.ndarray:
    """Generate a normalized embedding vector."""
    return model.encode(
        text,
        convert_to_numpy=True,
        normalize_embeddings=True
    )
```
```python
import json
import hashlib
import time

from valkey.commands.search.query import Query

def cache_response(query: str, response: str):
    """Store a query-response pair together with its embedding."""
    embedding = generate_embedding(query)
    cache_key = f"cache:{hashlib.md5(query.encode()).hexdigest()}"

    doc = {
        "query": query,
        "response": response,
        "embedding": embedding.tolist(),
        "timestamp": time.time()
    }

    client.execute_command("JSON.SET", cache_key, "$", json.dumps(doc))
    client.expire(cache_key, 3600)  # 1 hour TTL

def get_cached_response(query: str, threshold: float = 0.85):
    """Search for a semantically similar cached query."""
    embedding = generate_embedding(query)

    # KNN search for similar vectors
    query_obj = (
        Query("*=>[KNN 1 @embedding $vec AS score]")
        .return_fields("query", "response", "score")
        .dialect(2)
    )

    results = client.ft("cache_idx").search(
        query_obj,
        {"vec": embedding.astype(np.float32).tobytes()}  # must match the index's FLOAT32 type
    )

    if results.docs:
        doc = results.docs[0]
        similarity = 1 - float(doc.score)  # Convert cosine distance to similarity

        if similarity >= threshold:
            print(f"Cache HIT! (similarity: {similarity:.1%})")
            return doc.response

    print("✗ Cache miss")
    return None
```
```python
import google.generativeai as genai

def chat(query: str) -> str:
    """Chat with a caching layer in front of the LLM."""
    # Check cache first
    cached = get_cached_response(query)
    if cached:
        return cached

    # Cache miss - call the LLM
    model = genai.GenerativeModel('gemini-pro')
    response = model.generate_content(query)

    # Cache for future queries
    cache_response(query, response.text)

    return response.text
```
I built a demo using the Google Gemini API and tested it with various queries. Here is an example of semantic caching in action. The first question is always going to be a cache MISS.
==============================================================
Query: Predict the weather in London for 2026, Feb 3
==============================================================
Cache MISS (similarity: 0.823 < 0.85)
Cache miss – calling Gemini API…
API call completed in 6870ms
Tokens: 1,303 (16 in / 589 out)
Cost: $0.000891
Cached as JSON: ‘Predict the weather in London for 2026, Feb 3…’
A question with the same meaning produces a cache HIT.
==============================================================
Query: Tell me about the weather in London for 2026 Feb 3rd
==============================================================
Cache HIT (similarity: 0.911, total: 25.3ms)
├─ Embedding: 21.7ms | Search: 1.5ms | Fetch: 0.9ms
└─ Matched: ‘Predict the weather in London for 2026, Feb 3…’
We can see a significant API call time of almost 7 seconds, while the cached answer takes only 25 ms, 22 of which are spent generating the embedding.
From this testing we can estimate the returns of implementing a semantic cache.
You can extrapolate these results based on the expected number of queries per day e.g. 10,000 to find your total savings and work out your ROI for the semantic cache. Additionally, the speed saving is going to significantly improve your user experience!
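A small helper makes that extrapolation concrete. It plugs in the figures measured above (6,870 ms per API call, 25.3 ms per cache hit, $0.000891 per Gemini call); treat them as rough, workload-specific inputs rather than universal constants:

```python
def estimate_savings(queries_per_day: int, hit_rate: float,
                     cost_per_llm_call: float, llm_ms: float, cache_ms: float):
    """Extrapolate daily cost savings and average latency from measured figures."""
    hits = queries_per_day * hit_rate
    daily_savings = hits * cost_per_llm_call              # every hit avoids one API call
    avg_latency_ms = hit_rate * cache_ms + (1 - hit_rate) * llm_ms
    return daily_savings, avg_latency_ms

# Figures measured in the demo above
saved, avg_ms = estimate_savings(
    queries_per_day=10_000, hit_rate=0.60,
    cost_per_llm_call=0.000891, llm_ms=6870, cache_ms=25.3,
)
print(f"${saved:.2f} saved per day, {avg_ms:.0f} ms average latency")
```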
The magic number that determines when queries are “similar enough”:
```python
cache = SemanticCache(similarity_threshold=0.85)
```
Higher = fewer hits but more accurate. Lower = more hits but potential mismatches.
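A toy sweep over a handful of hypothetical similarity scores shows the trade-off directly:

```python
# Hypothetical similarity scores from recent lookups
similarities = [0.97, 0.91, 0.88, 0.84, 0.76]

for threshold in (0.80, 0.85, 0.90):
    hits = sum(s >= threshold for s in similarities)
    print(f"threshold {threshold:.2f}: {hits}/{len(similarities)} would be cache hits")
```

Raising the threshold from 0.80 to 0.90 turns two of the borderline matches into misses; which setting is right depends on how costly a wrong answer is for your application.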
How long to cache responses. This follows the standard “how frequently does my data change” rule.
```python
SemanticCache(ttl_seconds=3600)  # 1 hour
```
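As a rule of thumb, you can key the TTL to how quickly the data behind the answers changes. The categories and values below are illustrative assumptions, not prescriptions:

```python
# Illustrative TTLs (seconds), keyed to data volatility
TTL_BY_VOLATILITY = {
    "static":   7 * 24 * 3600,  # product docs, FAQs: a week
    "daily":    24 * 3600,      # pricing, schedules: a day
    "hourly":   3600,           # news summaries: an hour
    "realtime": 60,             # stock quotes, live scores: a minute
}
```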
Different models offer different trade-offs:
| Model | Dimensions | Speed | Quality | Best For |
| --- | --- | --- | --- | --- |
| all-MiniLM-L6-v2 | 384 | Fast ✓ | Good | Production |
| all-mpnet-base-v2 | 768 | Medium | Better | Higher quality needs |
| OpenAI text-embedding-3 | 1536 | API call | Best | Maximum quality |
For most applications, all-MiniLM-L6-v2 is perfect: fast, good quality, runs locally.
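One practical note when switching models: the DIM parameter of your vector index must match the embedding model’s output size, or KNN searches will break. A small lookup using the dimensions from the table above (the helper name is ours):

```python
# Embedding sizes from the table above
MODEL_DIMS = {
    "all-MiniLM-L6-v2": 384,
    "all-mpnet-base-v2": 768,
    "text-embedding-3-small": 1536,
}

def index_dim_for(model_name: str) -> int:
    """The index's DIM parameter must equal the model's embedding size."""
    return MODEL_DIMS[model_name]

print(index_dim_for("all-MiniLM-L6-v2"))  # 384
```

Changing models also means re-embedding (or simply flushing) existing cache entries, since vectors from different models are not comparable.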
You can store cached data in one of two ways: as a HASH or as a JSON document.
```python
# Store as HASH with binary vector blob
cache_data = {
    "query": query,
    "response": response,
    "embedding": vector.tobytes(),  # Binary blob
    "metadata": json.dumps({"category": "weather"})
}

client.hset(cache_key, mapping=cache_data)
```
Pros: Simple, widely compatible
Cons: Limited querying, vectors as opaque blobs
```python
# Store as JSON document with native vector array
cache_doc = {
    "query": query,
    "response": response,
    "embedding": vector.tolist(),  # Native array
    "metadata": {
        "category": "weather",
        "tags": ["forecast", "current"]
    }
}

client.execute_command("JSON.SET", cache_key, "$", json.dumps(cache_doc))
```
Pros: Native vectors, flexible queries, easy debugging
Cons: Requires the ValkeyJSON (or RedisJSON) module
Use JSON storage in production because of its flexibility and speed advantage in this scenario.
Good fits:
- Customer Support Chatbots
- FAQ Systems
- Code Assistants

Poor fits:
- Unique Creative Content
- Highly Personalized Responses
- Real-time Dynamic Data
If the threshold is too low, the cache can return the wrong answer. Keep the similarity threshold 80% or higher.
Query: “Python programming tutorial”
Matches: “Python snake care guide” (similarity: 0.76)
If similarity scores come back negative or greater than 1.0, the embeddings were not normalized. Always use normalize_embeddings=True.
Cache MISS (similarity: -0.023 < 0.85) # Should be ~0.95!
```python
embedding = model.encode(
    text,
    normalize_embeddings=True  # Critical!
)
```
Setting the time-to-live (TTL) too high can lead to stale cached data and thus wrong answers. Match the TTL to the volatility of your data.
Query: “Who is the current president?”
Response: “Joe Biden”
If you don’t monitor your hit rate, the cache’s effectiveness is unknown and any fine-tuning is guesswork. So log every cache hit and miss, track metrics, and set alerts.
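A tracker can be as simple as a sliding window over recent lookups (a minimal sketch; the class name is ours):

```python
from collections import deque

class CacheMetrics:
    """Track the cache hit rate over a sliding window of recent lookups."""

    def __init__(self, window: int = 1000):
        self.events = deque(maxlen=window)  # True = hit, False = miss

    def record(self, hit: bool) -> None:
        self.events.append(hit)

    @property
    def hit_rate(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0

metrics = CacheMetrics()
for hit in [True, True, False, True, False]:
    metrics.record(hit)
print(f"hit rate: {metrics.hit_rate:.0%}")  # hit rate: 60%
```

Call `metrics.record(...)` inside your cache lookup, export `hit_rate` to your monitoring system, and alert when it drops below your expected baseline.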
Do all of this before deploying to production.
Let’s recap with a real scenario: your chatbot receives 50,000 queries per month, uses Claude Sonnet ($3 per 1M input tokens), and an average query uses 200 tokens in and out.
Without semantic caching:
With semantic caching (60% hit rate):
This would give you net savings of $688/month ($8,256/year), plus happier users, faster support, and a better experience.
Caching has long been the answer for speeding up responses and saving costs by not regenerating the same results or repeating the same API calls. Semantic caching applies that principle to LLMs: instead of treating every query as unique, you recognize that users ask the same questions in countless ways, and you pay for the answer only once.
As we’ve seen, the savings in time and money are worth it.
If you’re building with LLMs, semantic caching isn’t optional, it’s essential for production applications.
Have questions? Drop them in the comments or reach out.
Found this helpful? Share it with others who might benefit from semantic caching!