This post covers the topic of the video in more detail and includes some code samples.

The $9,000 Problem

You launch a chatbot powered by one of the popular LLMs like Gemini, Claude or GPT-4. It’s amazing and your users love it. Then you check your API bill at the end of the month: $15,000.

Looking into the logs, you discover that users are asking the same question in lots of different ways.

“How do I reset my password?”

“I forgot my password”

“Can’t log in, need password help”

“Reset my password please”

“I need to change my password”

Your LLM treats each of these as a completely different request. You’re paying for the same answer 5 times. Multiply that by thousands of users and hundreds of common questions, and suddenly you understand why your bill is so high.

Traditional caching won’t help because these queries don’t match exactly. You need semantic caching.

What is Semantic Caching?

Semantic caching uses vector embeddings to match queries by their meaning, not their exact text.

Traditional cache versus semantic cache

With traditional caching, we match strings and return the cached value on a match.

Query: “What’s the weather?”     -> Cached

Query: “How’s the weather?”      -> Cache MISS 

Query: “Tell me about the weather”   -> Cache MISS 

Hit rate: ~10-15% for typical chatbots

With semantic caching, we create an embedding of the query and match on meaning.

Query: “What’s the weather?”     -> Cached

Query: “How’s the weather?”      -> Cache HIT 

Query: “Tell me about the weather”   -> Cache HIT 

Hit rate: ~40-70% for typical chatbots
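
To make “matching on meaning” concrete, here is a small sketch using the sentence-transformers model recommended later in this post. Paraphrases map to nearby vectors, and the cosine similarity between those vectors is what the cache compares against its threshold.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode two paraphrases and one unrelated question as normalized vectors
a, b, c = model.encode(
    ["What's the weather?", "How's the weather?", "How do I reset my password?"],
    normalize_embeddings=True,
)

# For normalized vectors, cosine similarity is just the dot product.
# The paraphrase pair scores far higher than the unrelated pair,
# which is exactly what the cache exploits.
print(f"paraphrase pair: {np.dot(a, b):.3f}")
print(f"unrelated pair:  {np.dot(a, c):.3f}")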

How It Works

  1. Convert the query to a vector: “What’s the weather?” -> [0.123, -0.456, 0.789, …]
  2. Store the vector in a vector database: Redis/Valkey with Search module
  3. Search for similar vectors: when a query comes in, find vectors with cosine similarity >= 85%
  4. Return cached response: if found, return instantly. Otherwise, call LLM and cache the result.

Why do you need this?

Cost savings

Here is a real-world example from our testing: a customer support chatbot handling 10,000 queries per day.

Scenario                        Daily Cost   Monthly Cost   Annual Cost
Claude Sonnet (no cache)        $41.00       $1,230         $14,760
Claude Sonnet (60% hit rate)    $16.40       $492           $5,904
Savings                         $24.60       $738           $8,856

Speed Improvements

In our testing, an API call to Gemini can take around 7 seconds, whereas a cache hit takes roughly 27 ms in total: about 23 ms to generate the embedding, 2 ms for the valkey-search query and 1-2 ms to fetch the stored response. That’s roughly a 250x speed-up. Users get near-instant responses for common questions instead of waiting multiple seconds.

Consistent quality

Returning the same answer for semantically similar questions means a more consistent user experience, as well as language-independent answers.

Building a Semantic Cache

What do you need?

  1. Vector Database: Redis or Valkey with the corresponding Search module
  2. Embedding Model: sentence-transformers (local, free)
  3. Python: 3.8+

That’s it! Just a few open source components.

Installation
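
A minimal setup, assuming you use the redis-py client (which also works with Valkey) and sentence-transformers for local embeddings. The Docker image shown is just one convenient way to get a server that bundles the Search and JSON modules.

# Python dependencies (redis-py also speaks the Valkey protocol)
pip install redis sentence-transformers numpy

# One way to run a server with the Search and JSON modules
docker run -d -p 6379:6379 redis/redis-stack-server:latest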

Example implementation

Step 1: Create the Vector Index
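
Here is a sketch using redis-py, assuming a JSON index named idx:semantic_cache over keys prefixed cache: and 384-dimensional vectors to match all-MiniLM-L6-v2. Adjust the names, dimensions and distance metric to your own setup.

import redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Index JSON documents stored under the "cache:" prefix
schema = (
    TextField("$.query", as_name="query"),        # original query text
    TextField("$.response", as_name="response"),  # cached LLM answer
    VectorField(
        "$.embedding",                             # vector stored as a JSON array
        "FLAT",                                    # exact search; HNSW scales better for huge caches
        {"TYPE": "FLOAT32", "DIM": 384, "DISTANCE_METRIC": "COSINE"},
        as_name="embedding",
    ),
)

r.ft("idx:semantic_cache").create_index(
    schema,
    definition=IndexDefinition(prefix=["cache:"], index_type=IndexType.JSON),
)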

Step 2: Generate Embeddings
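
A minimal embedding helper, assuming the all-MiniLM-L6-v2 model recommended above. Normalizing the embeddings matters (see pitfall 2 below).

import numpy as np
from sentence_transformers import SentenceTransformer

# Loaded once at startup; the model is small and runs locally on CPU
model = SentenceTransformer("all-MiniLM-L6-v2")

def embed(text: str) -> np.ndarray:
    # normalize_embeddings=True keeps cosine similarity well-behaved
    return model.encode(text, normalize_embeddings=True).astype(np.float32)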

Step 3: Cache Management
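
The lookup and store paths, sketched on top of the client r, the index and the embed() helper from the previous steps. The threshold and TTL here are the values discussed in the Configuration section below, and the key scheme is illustrative.

import time
import uuid
from redis.commands.search.query import Query

SIMILARITY_THRESHOLD = 0.85   # see the similarity threshold guidance below
TTL_SECONDS = 3600            # 1 hour; match this to how fast your data changes

def cache_lookup(query_text: str):
    """Return (response, similarity) on a hit, or (None, None) on a miss."""
    vec = embed(query_text)
    knn = (
        Query("*=>[KNN 1 @embedding $vec AS vector_score]")
        .sort_by("vector_score")
        .return_fields("query", "response", "vector_score")
        .dialect(2)
    )
    result = r.ft("idx:semantic_cache").search(knn, query_params={"vec": vec.tobytes()})
    if result.docs:
        doc = result.docs[0]
        similarity = 1.0 - float(doc.vector_score)  # COSINE returns a distance
        if similarity >= SIMILARITY_THRESHOLD:
            return doc.response, similarity
    return None, None

def cache_store(query_text: str, response_text: str) -> None:
    """Store the query, its embedding and the LLM response as a JSON document."""
    key = f"cache:{uuid.uuid4().hex}"
    r.json().set(key, "$", {
        "query": query_text,
        "response": response_text,
        "embedding": embed(query_text).tolist(),
        "created_at": time.time(),
    })
    r.expire(key, TTL_SECONDS)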

Step 4: Integrate with Your LLM
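
Finally, put the cache in front of the model. The llm_call parameter is a stand-in for whatever client you actually use (Gemini, Claude, GPT and so on).

def ask(query_text: str, llm_call) -> str:
    """Serve a cache hit instantly; otherwise call the LLM and cache the result."""
    cached, similarity = cache_lookup(query_text)
    if cached is not None:
        print(f"Cache HIT (similarity: {similarity:.3f})")
        return cached

    print("Cache MISS - calling the LLM...")
    answer = llm_call(query_text)   # e.g. a function wrapping your Gemini/Claude/GPT client
    cache_store(query_text, answer)
    return answer

# Usage: ask("How do I reset my password?", llm_call=my_llm_function)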

Real-world results

Demo

I built a demo using the Google Gemini API and tested it with a variety of queries. Here is an example of semantic caching in action. Our first question is always going to be a cache MISS.

==============================================================

Query: Predict the weather in London for 2026, Feb 3

==============================================================

Cache MISS (similarity: 0.823 < 0.85)

Cache miss – calling Gemini API…

API call completed in 6870ms

   Tokens: 1,303 (16 in / 589 out)

   Cost: $0.000891

Cached as JSON: ‘Predict the weather in London for 2026, Feb 3…’

A question with the same meaning produces a cache HIT.

==============================================================

Query: Tell me about the weather in London for 2026 Feb 3rd

==============================================================

Cache HIT (similarity: 0.911, total: 25.3ms)

  ├─ Embedding: 21.7ms | Search: 1.5ms | Fetch: 0.9ms

  └─ Matched: ‘Predict the weather in London for 2026, Feb 3…’

We can see a significant API call time of almost 7 seconds, while the cached answer comes back in about 25 ms, roughly 22 ms of which is spent generating the embedding.

From this testing we can estimate the returns of implementing a semantic cache:

  • Cache hit rate: 60% and thus 60% cost savings
  • Speed improvement: 250x faster (27ms vs 6800ms)

You can extrapolate these results to your expected query volume (e.g. 10,000 queries per day) to estimate total savings and work out the ROI of the semantic cache. Additionally, the faster responses will noticeably improve your user experience!

Configuration: important levers

1. Similarity Threshold

The magic number that determines when queries are “similar enough”:

Guidelines
  • 0.95+: very strict – near-identical queries only
  • 0.85-0.90: recommended – catches paraphrases, good balance
  • 0.75-0.85: moderate – more cache hits, some false positives
  • <0.75: too lenient – risk of wrong answers
Trade-off

Higher = fewer hits but more accurate. Lower = more hits but potential mismatches.

2. Time-to-Live (TTL)

How long to cache responses. This follows the standard “how frequently does my data change” rule.

Guidelines
  • 5 minutes: real-time data (weather, stocks, news)
  • 1 hour: recommended for general queries
  • 24 hours: stable content, documentation
  • 7 days: historical data

3. Embedding Model

Different models offer different trade-offs:

Model                      Dimensions   Speed      Quality   Best For
all-MiniLM-L6-v2           384          Fast       Good      Production
all-mpnet-base-v2          768          Medium     Better    Higher quality needs
OpenAI text-embedding-3    1536         API call   Best      Maximum quality

For most applications, all-MiniLM-L6-v2 is perfect: fast, good quality, runs locally.

Storage options: HASH versus JSON

You can store cached data in two ways: using either the HASH or the JSON data type.

HASH Storage (Simple)

Pros: Simple, widely compatible
Cons: Limited querying, vectors as opaque blobs

JSON Storage (Recommended) 

Pros: Native vectors, flexible queries, easy debugging
Cons: Requires the ValkeyJSON or RedisJSON module

Use JSON storage for production because of its flexibility and speed advantage in this scenario.
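
To make the difference concrete, here is a rough sketch of the two shapes, reusing the client and the embed() helper from the implementation above (the keys are illustrative).

vec = embed("What's the weather?")

# HASH: the vector must be packed into an opaque FLOAT32 byte blob
r.hset("demo:hash:1", mapping={
    "query": "What's the weather?",
    "response": "...cached answer...",
    "embedding": vec.tobytes(),
})

# JSON: the vector is a readable array of floats you can inspect with JSON.GET
r.json().set("demo:json:1", "$", {
    "query": "What's the weather?",
    "response": "...cached answer...",
    "embedding": vec.tolist(),
})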

Use cases: when to use semantic caching

Perfect fit (60-80% hit rates)

Customer Support Chatbots

  • Users ask the same questions many different ways
  • “How do I reset my password?” = “I forgot my password” = “Can’t log in”
  • High volume, repetitive queries

FAQ Systems

  • Limited topic domains
  • Same questions repeated constantly
  • Documentation queries

Code Assistants

  • “How do I sort a list in Python?” variations
  • Common programming questions
  • Tutorial-style queries

Not ideal (<30% hit rates)

Unique Creative Content

  • Story generation
  • Custom art descriptions
  • Personalized content
  • Every query is different

Highly Personalized Responses

  • User-specific context required
  • Cannot share cached responses
  • Privacy concerns

Real-time Dynamic Data

  • Stock prices changing second-by-second
  • Live sports scores
  • Breaking news
  • Use very short TTLs if caching at all

Common pitfalls and how to avoid them

1. Threshold too low

If the threshold is too low, the cache can return the wrong answer. Keep the similarity threshold at 0.80 or higher.

Query: “Python programming tutorial”

Matches: “Python snake care guide” (similarity: 0.76)

2. Vectors not normalized

If the embeddings are not normalized, similarity scores can come out negative or greater than 1.0. Always use normalize_embeddings=True.

Cache MISS (similarity: -0.023 < 0.85)  # Should be ~0.95!

3. TTL too long

Setting the time-to-live (TTL) too high leads to stale cached data and thus wrong answers. Match the TTL to the volatility of your data.

Query: “Who is the current president?”

Response: “Joe Biden”

4. Not monitoring hit rates

If you don’t monitor your hit rates, the cache’s effectiveness is unknown and any fine-tuning is guesswork. Log every cache hit and miss, track metrics, and set alerts.
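
A minimal way to do this is to keep two counters next to the cache itself and bump them on every lookup; a sketch, reusing the client r from the implementation above:

def record_lookup(hit: bool) -> None:
    # Count every lookup so the hit rate is always observable
    r.incr("stats:cache:hits" if hit else "stats:cache:misses")

def current_hit_rate() -> float:
    hits = int(r.get("stats:cache:hits") or 0)
    misses = int(r.get("stats:cache:misses") or 0)
    total = hits + misses
    return hits / total if total else 0.0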

Production Checklist

Before deploying to production

  • Vectors are normalized (normalize_embeddings=True)
  • Similarity threshold tuned (test with real queries)
  • TTL set appropriately (match to data freshness needs)
  • Monitoring in place (hit rates, latency, costs)
  • Error handling (fallback to LLM if cache fails)
  • Cache warming (pre-populate common queries)
  • Privacy considered (separate caches per user/tenant if needed)
  • Metadata rich (category, tags for filtering/invalidation)
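
For the cache-warming item, pre-populating can be as simple as running known FAQs through the cache_store() helper from Step 3. The questions and answers below are placeholders.

# Hypothetical warm-up list; replace with your real FAQ content
COMMON_QA = [
    ("How do I reset my password?", "Go to Settings > Account > Reset password ..."),
    ("What are your support hours?", "..."),
]

for question, answer in COMMON_QA:
    cache_store(question, answer)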

Real-World Impact

Let’s recap with a realistic scenario: your chatbot receives 10,000 queries per day, uses Claude Sonnet ($3 per 1M input tokens), and an average query uses around 200 tokens in and out.

Without semantic caching:

  • Cost: ~$1,230/month
  • Avg response time: 1.8 seconds
  • Users wait for every response

With semantic caching (60% hit rate):

  • Cost: ~$492/month ($738 saved)
  • Avg response time: ~750ms (combining hits and misses)
  • Users get instant answers for common questions
  • Infrastructure cost: $50/month

This would give you net savings of $688/month, or $8,256/year, along with happier users, faster support and a better overall experience.
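
If you want to plug in your own numbers, the back-of-the-envelope maths is straightforward. The figures below simply reproduce the scenario above and are estimates, not measurements.

# Rough ROI estimate; every input here is an assumption from the scenario above
queries_per_month = 300_000   # ~10,000 queries per day
cost_per_query = 0.0041       # USD, from the earlier Claude Sonnet estimate
hit_rate = 0.60               # measured cache hit rate
infra_cost = 50.0             # monthly cost of running the cache

baseline = queries_per_month * cost_per_query         # ~$1,230/month
with_cache = baseline * (1 - hit_rate) + infra_cost   # ~$542/month
print(f"Net savings: ${baseline - with_cache:,.0f}/month")   # ~$688/month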

Conclusion

Caching has long been the answer to speeding up responses and saving costs by not regenerating the same results or repeating the same API calls. Semantic caching applies that idea to LLMs: instead of treating every query as unique, you recognize that users ask the same questions in countless ways, and you only pay for the answer once.

As we’ve seen, the savings in time and money are worth it:

  • 60% cache hit rate (conservative)
  • 250x faster responses for cached queries
  • $8,000+ annual savings at moderate scale
  • 1-2 days to implement

If you’re building with LLMs, semantic caching isn’t optional; it’s essential for production applications.

Have questions? Drop them in the comments or reach out.

Found this helpful? Share it with others who might benefit from semantic caching!
