You launch a chatbot powered by one of the popular LLMs like Gemini, Claude or GPT-4. It’s amazing and your users love it. Then you check your API bill at the end of the month: $15,000.
Looking into the logs, you discover that users are asking the same question in lots of different ways.
“How do I reset my password?”
“I forgot my password”
“Can’t log in, need password help”
“Reset my password please”
“I need to change my password”
Your LLM treats each of these as a completely different request. You’re paying for the same answer 5 times. Multiply that by thousands of users and hundreds of common questions, and suddenly you understand why your bill is so high.
Traditional caching won’t help here: these queries don’t match exactly. You need semantic caching.
Semantic caching uses vector embeddings to match queries by their meaning, not their exact text.
With traditional caching, we match strings and return the cached value on a match.
Query: “What’s the weather?” -> Cached
Query: “How’s the weather?” -> Cache MISS
Query: “Tell me about the weather” -> Cache MISS
Hit rate: ~10-15% for typical chatbots
With semantic caching, we create an embedding of the query and match on meaning.
Query: “What’s the weather?” -> Cached
Query: “How’s the weather?” -> Cache HIT
Query: “Tell me about the weather” -> Cache HIT
Hit rate: ~40-70% for typical chatbots
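Under the hood, “matching on meaning” is a cosine-similarity comparison between embedding vectors. A minimal sketch with NumPy, using toy 4-dimensional vectors in place of real sentence embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 means identical direction, ~0.0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for real sentence embeddings
cached_q    = np.array([0.9, 0.1, 0.3, 0.2])     # "What's the weather?"
paraphrase  = np.array([0.85, 0.15, 0.25, 0.2])  # "How's the weather?" -> similar direction
unrelated_q = np.array([0.0, 0.9, 0.0, 0.1])     # different topic -> different direction

print(f"paraphrase: {cosine_similarity(cached_q, paraphrase):.3f}")   # high -> cache HIT
print(f"unrelated:  {cosine_similarity(cached_q, unrelated_q):.3f}")  # low  -> cache MISS
```

Real embedding models map paraphrases to nearby directions in a few hundred dimensions; the comparison itself is exactly this.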
A real-world example from our testing: a customer support chatbot handling 10,000 queries per day.
| Scenario | Daily Cost | Monthly Cost | Annual Cost |
| --- | --- | --- | --- |
| Claude Sonnet (no cache) | $41.00 | $1,230 | $14,760 |
| Claude Sonnet (60% hit rate) | $16.40 | $492 | $5,904 |
| Savings | $24.60 | $738 | $8,856 |
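The cached rows follow from simple arithmetic: only cache misses reach the LLM API, so cost scales with the miss rate. Using the table’s $41.00/day baseline:

```python
daily_cost_no_cache = 41.00  # Claude Sonnet, no cache (from the table)
hit_rate = 0.60              # fraction of queries served from cache

# Only cache misses pay for an API call
daily_cost_with_cache = daily_cost_no_cache * (1 - hit_rate)
monthly_savings = (daily_cost_no_cache - daily_cost_with_cache) * 30

print(f"${daily_cost_with_cache:.2f}/day with cache")
print(f"${monthly_savings:.2f}/month saved")
```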
In our testing, an API call to Gemini can take around 7 seconds, whereas a cache hit takes a total of 27 ms: 23 ms for the embedding, 2 ms for the valkey-search vector lookup, and 1 ms to fetch the stored response. That’s roughly a 250x speed-up: users get near-instant responses to common questions instead of waiting multiple seconds.
Getting the same answer for semantically similar questions means a better user experience, as well as language-independent answers.
That’s it! Just a few open source components.
```shell
# Install dependencies
pip install valkey numpy sentence-transformers

# Start Valkey bundle container (or use Redis)
docker run -p 16379:6379 --name my-valkey-bundle -d valkey/valkey-bundle
```
```python
from valkey import Valkey
from valkey.commands.search.field import VectorField
from valkey.commands.search.indexDefinition import IndexDefinition, IndexType

client = Valkey(host="localhost", port=16379, decode_responses=True)

VECTOR_DIM = 384         # output size of all-MiniLM-L6-v2
CACHE_PREFIX = "cache:"
INDEX_NAME = "cache_idx"

# Define schema with vector field
schema = (
    # Vector field stored in JSON at $.embedding
    VectorField(
        "$.embedding",
        "FLAT",  # or "HNSW" for larger datasets
        {
            "TYPE": "FLOAT32",
            "DIM": VECTOR_DIM,
            "DISTANCE_METRIC": "COSINE",
        },
        as_name="embedding",
    ),
)

# Create index on JSON documents
definition = IndexDefinition(
    prefix=[CACHE_PREFIX],
    index_type=IndexType.JSON,  # Use JSON instead of HASH
)

client.ft(INDEX_NAME).create_index(
    fields=schema,
    definition=definition,
)
```
```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

def generate_embedding(text: str) -> np.ndarray:
    """Generate a normalized embedding vector."""
    return model.encode(
        text,
        convert_to_numpy=True,
        normalize_embeddings=True
    )
```
```python
import json
import hashlib
import time

from valkey.commands.search.query import Query

def cache_response(query: str, response: str):
    """Store a query-response pair together with its embedding."""
    embedding = generate_embedding(query)
    cache_key = f"cache:{hashlib.md5(query.encode()).hexdigest()}"

    doc = {
        "query": query,
        "response": response,
        "embedding": embedding.tolist(),
        "timestamp": time.time()
    }

    client.execute_command("JSON.SET", cache_key, "$", json.dumps(doc))
    client.expire(cache_key, 3600)  # 1 hour TTL

def get_cached_response(query: str, threshold: float = 0.85):
    """Search for a semantically similar cached query."""
    embedding = generate_embedding(query)

    # KNN search for similar vectors
    query_obj = (
        Query("*=>[KNN 1 @embedding $vec AS score]")
        .return_fields("query", "response", "score")
        .dialect(2)
    )

    results = client.ft("cache_idx").search(
        query_obj,
        {"vec": embedding.astype(np.float32).tobytes()}  # must match the index's FLOAT32 type
    )

    if results.docs:
        doc = results.docs[0]
        similarity = 1 - float(doc.score)  # Convert cosine distance to similarity

        if similarity >= threshold:
            print(f"Cache HIT! (similarity: {similarity:.1%})")
            return doc.response

    print("✗ Cache miss")
    return None
```
```python
import google.generativeai as genai

def chat(query: str) -> str:
    """Chat with a caching layer in front of the LLM."""
    # Check cache first
    cached = get_cached_response(query)
    if cached:
        return cached

    # Cache miss - call the LLM
    model = genai.GenerativeModel('gemini-pro')
    response = model.generate_content(query)

    # Cache for future queries
    cache_response(query, response.text)

    return response.text
```
I built a demo using the Google Gemini API and tested it with various queries. Here is an example of semantic caching in action. The first question is always going to be a cache MISS.
==============================================================
Query: Predict the weather in London for 2026, Feb 3
==============================================================
Cache MISS (similarity: 0.823 < 0.85)
Cache miss – calling Gemini API…
API call completed in 6870ms
Tokens: 1,303 (16 in / 589 out)
Cost: $0.000891
Cached as JSON: ‘Predict the weather in London for 2026, Feb 3…’
A question with the same meaning produces a cache HIT.
==============================================================
Query: Tell me about the weather in London for 2026 Feb 3rd
==============================================================
Cache HIT (similarity: 0.911, total: 25.3ms)
├─ Embedding: 21.7ms | Search: 1.5ms | Fetch: 0.9ms
└─ Matched: ‘Predict the weather in London for 2026, Feb 3…’
We can see a significant API call time of almost 7 seconds, while the cached answer takes only 25 ms, 22 of which are spent generating the embedding.
From this testing we can estimate the returns of implementing a semantic cache.
You can extrapolate these results based on the expected number of queries per day e.g. 10,000 to find your total savings and work out your ROI for the semantic cache. Additionally, the speed saving is going to significantly improve your user experience!
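A small helper makes that extrapolation concrete. It plugs in the figures measured above (6,870 ms per API call, 25.3 ms per cache hit, $0.000891 per Gemini call); treat them as rough, workload-specific inputs rather than universal constants:

```python
def estimate_savings(queries_per_day: int, hit_rate: float,
                     cost_per_llm_call: float, llm_ms: float, cache_ms: float):
    """Extrapolate daily cost savings and average latency from measured figures."""
    hits = queries_per_day * hit_rate
    daily_savings = hits * cost_per_llm_call              # every hit avoids one API call
    avg_latency_ms = hit_rate * cache_ms + (1 - hit_rate) * llm_ms
    return daily_savings, avg_latency_ms

# Figures measured in the demo above
saved, avg_ms = estimate_savings(
    queries_per_day=10_000, hit_rate=0.60,
    cost_per_llm_call=0.000891, llm_ms=6870, cache_ms=25.3,
)
print(f"${saved:.2f} saved per day, {avg_ms:.0f} ms average latency")
```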
The magic number that determines when queries are “similar enough”:
```python
cache = SemanticCache(similarity_threshold=0.85)
```
Higher = fewer hits but more accurate. Lower = more hits but potential mismatches.
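A toy sweep over a handful of hypothetical similarity scores shows the trade-off directly:

```python
# Hypothetical similarity scores from recent lookups
similarities = [0.97, 0.91, 0.88, 0.84, 0.76]

for threshold in (0.80, 0.85, 0.90):
    hits = sum(s >= threshold for s in similarities)
    print(f"threshold {threshold:.2f}: {hits}/{len(similarities)} would be cache hits")
```

Raising the threshold from 0.80 to 0.90 turns two of the borderline matches into misses; which setting is right depends on how costly a wrong answer is for your application.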
How long to cache responses. This follows the standard “how frequently does my data change” rule.
```python
SemanticCache(ttl_seconds=3600)  # 1 hour
```
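As a rule of thumb, you can key the TTL to how quickly the data behind the answers changes. The categories and values below are illustrative assumptions, not prescriptions:

```python
# Illustrative TTLs (seconds), keyed to data volatility
TTL_BY_VOLATILITY = {
    "static":   7 * 24 * 3600,  # product docs, FAQs: a week
    "daily":    24 * 3600,      # pricing, schedules: a day
    "hourly":   3600,           # news summaries: an hour
    "realtime": 60,             # stock quotes, live scores: a minute
}
```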
Different models offer different trade-offs:
| Model | Dimensions | Speed | Quality | Best For |
| --- | --- | --- | --- | --- |
| all-MiniLM-L6-v2 | 384 | Fast ✓ | Good | Production |
| all-mpnet-base-v2 | 768 | Medium | Better | Higher quality needs |
| OpenAI text-embedding-3 | 1536 | API call | Best | Maximum quality |
For most applications, all-MiniLM-L6-v2 is perfect: fast, good quality, runs locally.
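One practical note when switching models: the DIM parameter of your vector index must match the embedding model’s output size, or KNN searches will break. A small lookup using the dimensions from the table above (the helper name is ours):

```python
# Embedding sizes from the table above
MODEL_DIMS = {
    "all-MiniLM-L6-v2": 384,
    "all-mpnet-base-v2": 768,
    "text-embedding-3-small": 1536,
}

def index_dim_for(model_name: str) -> int:
    """The index's DIM parameter must equal the model's embedding size."""
    return MODEL_DIMS[model_name]

print(index_dim_for("all-MiniLM-L6-v2"))  # 384
```

Changing models also means re-embedding (or simply flushing) existing cache entries, since vectors from different models are not comparable.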
You can store cached data in one of two ways: as a HASH or as a JSON document.
```python
# Store as HASH with binary vector blob
cache_data = {
    "query": query,
    "response": response,
    "embedding": vector.tobytes(),  # Binary blob
    "metadata": json.dumps({"category": "weather"})
}

client.hset(cache_key, mapping=cache_data)
```
Pros: Simple, widely compatible
Cons: Limited querying, vectors as opaque blobs
```python
# Store as JSON document with native vector array
cache_doc = {
    "query": query,
    "response": response,
    "embedding": vector.tolist(),  # Native array
    "metadata": {
        "category": "weather",
        "tags": ["forecast", "current"]
    }
}

client.execute_command("JSON.SET", cache_key, "$", json.dumps(cache_doc))
```
Pros: Native vectors, flexible queries, easy debugging
Cons: Requires the ValkeyJSON (or RedisJSON) module
Use JSON storage in production because of its flexibility and speed advantage in this scenario.
Good fits:
- Customer Support Chatbots
- FAQ Systems
- Code Assistants

Poor fits:
- Unique Creative Content
- Highly Personalized Responses
- Real-time Dynamic Data
If the threshold is too low, the cache can return the wrong answer. Keep the similarity threshold 80% or higher.
Query: “Python programming tutorial”
Matches: “Python snake care guide” (similarity: 0.76)
If similarity scores come back negative or greater than 1.0, the embeddings were not normalized. Always use normalize_embeddings=True.
Cache MISS (similarity: -0.023 < 0.85) # Should be ~0.95!
```python
embedding = model.encode(
    text,
    normalize_embeddings=True  # Critical!
)
```
Setting the time-to-live (TTL) too high can lead to stale cached data and thus wrong answers. Match the TTL to the volatility of your data.
Query: “Who is the current president?”
Response: “Joe Biden”
If you don’t monitor your hit rate, the cache’s effectiveness is unknown and any fine-tuning is guesswork. So log every cache hit and miss, track metrics, and set alerts.
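A tracker can be as simple as a sliding window over recent lookups (a minimal sketch; the class name is ours):

```python
from collections import deque

class CacheMetrics:
    """Track the cache hit rate over a sliding window of recent lookups."""

    def __init__(self, window: int = 1000):
        self.events = deque(maxlen=window)  # True = hit, False = miss

    def record(self, hit: bool) -> None:
        self.events.append(hit)

    @property
    def hit_rate(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0

metrics = CacheMetrics()
for hit in [True, True, False, True, False]:
    metrics.record(hit)
print(f"hit rate: {metrics.hit_rate:.0%}")  # hit rate: 60%
```

Call `metrics.record(...)` inside your cache lookup, export `hit_rate` to your monitoring system, and alert when it drops below your expected baseline.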
Do all of this before deploying to production.
Let’s recap with a real scenario: your chatbot receives 50,000 queries per month, uses Claude Sonnet ($3 per 1M input tokens), and an average query uses 200 tokens in and out.
Without semantic caching:
With semantic caching (60% hit rate):
This would give you net savings of $688/month ($8,256/year), plus happier users, faster support, and a better experience.
Caching has long been the answer for speeding up responses and saving costs by not regenerating the same results or repeating the same API calls. Semantic caching applies that principle to LLMs: instead of treating every query as unique, you recognize that users ask the same questions in countless ways, and you pay for the answer only once.
As we’ve seen, the savings in time and money are worth it.
If you’re building with LLMs, semantic caching isn’t optional, it’s essential for production applications.
Have questions? Drop them in the comments or reach out.
Found this helpful? Share it with others who might benefit from semantic caching!