This post covers the topic of the video in more detail and includes some code samples.

The $9,000 Problem

You launch a chatbot powered by one of the popular LLMs like Gemini, Claude or GPT-4. It’s amazing and your users love it. Then you check your API bill at the end of the month: $15,000.

Looking into the logs, you discover that users are asking the same question in lots of different ways.

“How do I reset my password?”

“I forgot my password”

“Can’t log in, need password help”

“Reset my password please”

“I need to change my password”

Your LLM treats each of these as a completely different request. You’re paying for the same answer 5 times. Multiply that by thousands of users and hundreds of common questions, and suddenly you understand why your bill is so high.

Traditional caching won’t help because these queries don’t match exactly. You need semantic caching.

What is Semantic Caching?

Semantic caching uses vector embeddings to match queries by their meaning, not their exact text.

Traditional cache versus semantic cache

With traditional caching, we match strings and return the cached value on a match.

Query: “What’s the weather?”     -> Cached

Query: “How’s the weather?”      -> Cache MISS 

Query: “Tell me about the weather”   -> Cache MISS 

Hit rate: ~10-15% for typical chatbots

With semantic caching, we create an embedding of the query and match on meaning.

Query: “What’s the weather?”     -> Cached

Query: “How’s the weather?”      -> Cache HIT 

Query: “Tell me about the weather”   -> Cache HIT 

Hit rate: ~40-70% for typical chatbots
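
To make “matching on meaning” concrete, here is a small sketch using the sentence-transformers model recommended later in this post. Paraphrases map to nearby vectors, and the cosine similarity between those vectors is what the cache compares against its threshold.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode two paraphrases and one unrelated question as normalized vectors
a, b, c = model.encode(
    ["What's the weather?", "How's the weather?", "How do I reset my password?"],
    normalize_embeddings=True,
)

# For normalized vectors, cosine similarity is just the dot product.
# The paraphrase pair scores far higher than the unrelated pair,
# which is exactly what the cache exploits.
print(f"paraphrase pair: {np.dot(a, b):.3f}")
print(f"unrelated pair:  {np.dot(a, c):.3f}")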

How It Works

  1. Convert the query to a vector: “What’s the weather?” -> [0.123, -0.456, 0.789, …]
  2. Store the vector in a vector database: Redis/Valkey with Search module
  3. Search for similar vectors: when a query comes in, find vectors with cosine similarity >= 85%
  4. Return cached response: if found, return instantly. Otherwise, call LLM and cache the result.

Why do you need this?

Cost savings

Here is a real-world example from our testing: a customer support chatbot handling 10,000 queries per day.

Scenario                        Daily Cost   Monthly Cost   Annual Cost
Claude Sonnet (no cache)        $41.00       $1,230         $14,760
Claude Sonnet (60% hit rate)    $16.40       $492           $5,904
Savings                         $24.60       $738           $8,856

Speed Improvements

In our testing, an API call to Gemini can take around 7 seconds, whereas a cache hit takes roughly 27 ms in total: about 23 ms to generate the embedding, 2 ms for the valkey-search query and 1-2 ms to fetch the stored response. That’s roughly a 250x speed-up. Users get near-instant responses for common questions instead of waiting multiple seconds.

Consistent quality

Returning the same answer for semantically similar questions means a more consistent user experience, as well as language-independent answers.

Building a Semantic Cache

What do you need?

  1. Vector Database: Redis or Valkey with the corresponding Search module
  2. Embedding Model: sentence-transformers (local, free)
  3. Python: 3.8+

That’s it! Just a few open source components.

Installation
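
A minimal setup, assuming you use the redis-py client (which also works with Valkey) and sentence-transformers for local embeddings. The Docker image shown is just one convenient way to get a server that bundles the Search and JSON modules.

# Python dependencies (redis-py also speaks the Valkey protocol)
pip install redis sentence-transformers numpy

# One way to run a server with the Search and JSON modules
docker run -d -p 6379:6379 redis/redis-stack-server:latest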

Example implementation

Step 1: Create the Vector Index
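
Here is a sketch using redis-py, assuming a JSON index named idx:semantic_cache over keys prefixed cache: and 384-dimensional vectors to match all-MiniLM-L6-v2. Adjust the names, dimensions and distance metric to your own setup.

import redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Index JSON documents stored under the "cache:" prefix
schema = (
    TextField("$.query", as_name="query"),        # original query text
    TextField("$.response", as_name="response"),  # cached LLM answer
    VectorField(
        "$.embedding",                             # vector stored as a JSON array
        "FLAT",                                    # exact search; HNSW scales better for huge caches
        {"TYPE": "FLOAT32", "DIM": 384, "DISTANCE_METRIC": "COSINE"},
        as_name="embedding",
    ),
)

r.ft("idx:semantic_cache").create_index(
    schema,
    definition=IndexDefinition(prefix=["cache:"], index_type=IndexType.JSON),
)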

Step 2: Generate Embeddings
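
A minimal embedding helper, assuming the all-MiniLM-L6-v2 model recommended above. Normalizing the embeddings matters (see pitfall 2 below).

import numpy as np
from sentence_transformers import SentenceTransformer

# Loaded once at startup; the model is small and runs locally on CPU
model = SentenceTransformer("all-MiniLM-L6-v2")

def embed(text: str) -> np.ndarray:
    # normalize_embeddings=True keeps cosine similarity well-behaved
    return model.encode(text, normalize_embeddings=True).astype(np.float32)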

Step 3: Cache Management
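
The lookup and store paths, sketched on top of the client r, the index and the embed() helper from the previous steps. The threshold and TTL here are the values discussed in the Configuration section below, and the key scheme is illustrative.

import time
import uuid
from redis.commands.search.query import Query

SIMILARITY_THRESHOLD = 0.85   # see the similarity threshold guidance below
TTL_SECONDS = 3600            # 1 hour; match this to how fast your data changes

def cache_lookup(query_text: str):
    """Return (response, similarity) on a hit, or (None, None) on a miss."""
    vec = embed(query_text)
    knn = (
        Query("*=>[KNN 1 @embedding $vec AS vector_score]")
        .sort_by("vector_score")
        .return_fields("query", "response", "vector_score")
        .dialect(2)
    )
    result = r.ft("idx:semantic_cache").search(knn, query_params={"vec": vec.tobytes()})
    if result.docs:
        doc = result.docs[0]
        similarity = 1.0 - float(doc.vector_score)  # COSINE returns a distance
        if similarity >= SIMILARITY_THRESHOLD:
            return doc.response, similarity
    return None, None

def cache_store(query_text: str, response_text: str) -> None:
    """Store the query, its embedding and the LLM response as a JSON document."""
    key = f"cache:{uuid.uuid4().hex}"
    r.json().set(key, "$", {
        "query": query_text,
        "response": response_text,
        "embedding": embed(query_text).tolist(),
        "created_at": time.time(),
    })
    r.expire(key, TTL_SECONDS)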

Step 4: Integrate with Your LLM
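
Finally, put the cache in front of the model. The llm_call parameter is a stand-in for whatever client you actually use (Gemini, Claude, GPT and so on).

def ask(query_text: str, llm_call) -> str:
    """Serve a cache hit instantly; otherwise call the LLM and cache the result."""
    cached, similarity = cache_lookup(query_text)
    if cached is not None:
        print(f"Cache HIT (similarity: {similarity:.3f})")
        return cached

    print("Cache MISS - calling the LLM...")
    answer = llm_call(query_text)   # e.g. a function wrapping your Gemini/Claude/GPT client
    cache_store(query_text, answer)
    return answer

# Usage: ask("How do I reset my password?", llm_call=my_llm_function)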

Real-world results

Demo

I built a demo using the Google Gemini API and tested it with a variety of queries. Here is an example of semantic caching in action. Our first question is always going to be a cache MISS.

==============================================================

Query: Predict the weather in London for 2026, Feb 3

==============================================================

Cache MISS (similarity: 0.823 < 0.85)

Cache miss – calling Gemini API…

API call completed in 6870ms

   Tokens: 1,303 (16 in / 589 out)

   Cost: $0.000891

Cached as JSON: ‘Predict the weather in London for 2026, Feb 3…’

A question with the same meaning produces a cache HIT.

==============================================================

Query: Tell me about the weather in London for 2026 Feb 3rd

==============================================================

Cache HIT (similarity: 0.911, total: 25.3ms)

  ├─ Embedding: 21.7ms | Search: 1.5ms | Fetch: 0.9ms

  └─ Matched: ‘Predict the weather in London for 2026, Feb 3…’

We can see a significant API call time of almost 7 seconds, while the cached answer comes back in about 25 ms, roughly 22 ms of which is spent generating the embedding.

From this testing we can estimate the returns of implementing a semantic cache:

  • Cache hit rate: 60% and thus 60% cost savings
  • Speed improvement: 250x faster (27ms vs 6800ms)

You can extrapolate these results to your expected query volume (e.g. 10,000 queries per day) to estimate total savings and work out the ROI of the semantic cache. Additionally, the faster responses will noticeably improve your user experience!

Configuration: important levers

1. Similarity Threshold

The magic number that determines when queries are “similar enough”:

Guidelines
  • 0.95+: very strict – near-identical queries only
  • 0.85-0.90: recommended – catches paraphrases, good balance
  • 0.75-0.85: moderate – more cache hits, some false positives
  • <0.75: too lenient – risk of wrong answers
Trade-off

Higher = fewer hits but more accurate. Lower = more hits but potential mismatches.

2. Time-to-Live (TTL)

How long to cache responses. This follows the standard “how frequently does my data change” rule.

Guidelines
  • 5 minutes: real-time data (weather, stocks, news)
  • 1 hour: recommended for general queries
  • 24 hours: stable content, documentation
  • 7 days: historical data

3. Embedding Model

Different models offer different trade-offs:

Model                      Dimensions   Speed      Quality   Best For
all-MiniLM-L6-v2           384          Fast       Good      Production
all-mpnet-base-v2          768          Medium     Better    Higher quality needs
OpenAI text-embedding-3    1536         API call   Best      Maximum quality

For most applications, all-MiniLM-L6-v2 is perfect: fast, good quality, runs locally.

Storage options: HASH versus JSON

You can store cached data in two ways: using either the HASH or the JSON data type.

HASH Storage (Simple)

Pros: Simple, widely compatible
Cons: Limited querying, vectors as opaque blobs

JSON Storage (Recommended) 

Pros: Native vectors, flexible queries, easy debugging
Cons: Requires the ValkeyJSON or RedisJSON module

Use JSON storage for production because of its flexibility and speed advantage in this scenario.
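
To make the difference concrete, here is a rough sketch of the two shapes, reusing the client and the embed() helper from the implementation above (the keys are illustrative).

vec = embed("What's the weather?")

# HASH: the vector must be packed into an opaque FLOAT32 byte blob
r.hset("demo:hash:1", mapping={
    "query": "What's the weather?",
    "response": "...cached answer...",
    "embedding": vec.tobytes(),
})

# JSON: the vector is a readable array of floats you can inspect with JSON.GET
r.json().set("demo:json:1", "$", {
    "query": "What's the weather?",
    "response": "...cached answer...",
    "embedding": vec.tolist(),
})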

Use cases: when to use semantic caching

Perfect fit (60-80% hit rates)

Customer Support Chatbots

  • Users ask the same questions many different ways
  • “How do I reset my password?” = “I forgot my password” = “Can’t log in”
  • High volume, repetitive queries

FAQ Systems

  • Limited topic domains
  • Same questions repeated constantly
  • Documentation queries

Code Assistants

  • “How do I sort a list in Python?” variations
  • Common programming questions
  • Tutorial-style queries

Not ideal (<30% hit rates)

Unique Creative Content

  • Story generation
  • Custom art descriptions
  • Personalized content
  • Every query is different

Highly Personalized Responses

  • User-specific context required
  • Cannot share cached responses
  • Privacy concerns

Real-time Dynamic Data

  • Stock prices changing second-by-second
  • Live sports scores
  • Breaking news
  • Use very short TTLs if caching at all

Common pitfalls and how to avoid them

1. Threshold too low

If the threshold is too low, the cache can return the wrong answer. Keep the similarity threshold at 0.80 or higher.

Query: “Python programming tutorial”

Matches: “Python snake care guide” (similarity: 0.76)

2. Vectors not normalized

If the embeddings are not normalized, similarity scores can come out negative or greater than 1.0. Always use normalize_embeddings=True.

Cache MISS (similarity: -0.023 < 0.85)  # Should be ~0.95!

3. TTL too long

Setting the time-to-live (TTL) too high leads to stale cached data and thus wrong answers. Match the TTL to the volatility of your data.

Query: “Who is the current president?”

Response: “Joe Biden”

4. Not monitoring hit rates

If you don’t monitor your hit rates, the cache’s effectiveness is unknown and any fine-tuning is guesswork. Log every cache hit and miss, track metrics, and set alerts.
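
A minimal way to do this is to keep two counters next to the cache itself and bump them on every lookup; a sketch, reusing the client r from the implementation above:

def record_lookup(hit: bool) -> None:
    # Count every lookup so the hit rate is always observable
    r.incr("stats:cache:hits" if hit else "stats:cache:misses")

def current_hit_rate() -> float:
    hits = int(r.get("stats:cache:hits") or 0)
    misses = int(r.get("stats:cache:misses") or 0)
    total = hits + misses
    return hits / total if total else 0.0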

Production Checklist

Before deploying to production

  • Vectors are normalized (normalize_embeddings=True)
  • Similarity threshold tuned (test with real queries)
  • TTL set appropriately (match to data freshness needs)
  • Monitoring in place (hit rates, latency, costs)
  • Error handling (fallback to LLM if cache fails)
  • Cache warming (pre-populate common queries)
  • Privacy considered (separate caches per user/tenant if needed)
  • Metadata rich (category, tags for filtering/invalidation)
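
For the cache-warming item, pre-populating can be as simple as running known FAQs through the cache_store() helper from Step 3. The questions and answers below are placeholders.

# Hypothetical warm-up list; replace with your real FAQ content
COMMON_QA = [
    ("How do I reset my password?", "Go to Settings > Account > Reset password ..."),
    ("What are your support hours?", "..."),
]

for question, answer in COMMON_QA:
    cache_store(question, answer)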

Real-World Impact

Let’s recap with a realistic scenario: your chatbot receives 10,000 queries per day, uses Claude Sonnet ($3 per 1M input tokens), and an average query uses around 200 tokens in and out.

Without semantic caching:

  • Cost: ~$1,230/month
  • Avg response time: 1.8 seconds
  • Users wait for every response

With semantic caching (60% hit rate):

  • Cost: ~$492/month ($738 saved)
  • Avg response time: ~750ms (combining hits and misses)
  • Users get instant answers for common questions
  • Infrastructure cost: $50/month

This would give you net savings of $688/month, or $8,256/year, along with happier users, faster support and a better overall experience.
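
If you want to plug in your own numbers, the back-of-the-envelope maths is straightforward. The figures below simply reproduce the scenario above and are estimates, not measurements.

# Rough ROI estimate; every input here is an assumption from the scenario above
queries_per_month = 300_000   # ~10,000 queries per day
cost_per_query = 0.0041       # USD, from the earlier Claude Sonnet estimate
hit_rate = 0.60               # measured cache hit rate
infra_cost = 50.0             # monthly cost of running the cache

baseline = queries_per_month * cost_per_query         # ~$1,230/month
with_cache = baseline * (1 - hit_rate) + infra_cost   # ~$542/month
print(f"Net savings: ${baseline - with_cache:,.0f}/month")   # ~$688/month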

Conclusion

Caching has long been the answer to speeding up responses and saving costs by not regenerating the same results or repeating the same API calls. Semantic caching applies that idea to LLMs: instead of treating every query as unique, you recognize that users ask the same questions in countless ways, and you only pay for the answer once.

As we’ve seen, the savings in time and money are worth it:

  • 60% cache hit rate (conservative)
  • 250x faster responses for cached queries
  • $8,000+ annual savings at moderate scale
  • 1-2 days to implement

If you’re building with LLMs, semantic caching isn’t optional; it’s essential for production applications.

Have questions? Drop them in the comments or reach out.

Found this helpful? Share it with others who might benefit from semantic caching!
