There are two hard problems in computer science: cache invalidation, naming things, and off-by-1 errors.

This classic joke, often attributed to Phil Karlton, highlights a very real and persistent challenge for software developers. We’re constantly striving to build faster, more responsive systems, and caching is a fundamental strategy for achieving that.

But while caching offers a significant performance boost, it introduces a complex new problem: how do you ensure the cached data is always fresh and accurate? This challenge is known as cache invalidation, and if it’s not handled correctly, it can lead to stale data being served to users or, in a worst-case scenario, trigger a catastrophic chain reaction called a cache stampede.

In this blog post, we will attempt to tackle these problems in Valkey using the Debezium platform.

What exactly is Cache Invalidation and Cache Stampede?

Before we explore these problems further, we must understand what “cache invalidation” and “cache stampede” are.

One common use case for Valkey is as a database query cache: you store the results of database queries in Valkey to shorten request processing time and reduce the load on your database systems.

But Valkey is a separate system from the database, so how does it know when a cached query result has changed and that it should start serving the new data? This is the problem of Cache Invalidation: when the dataset changes, for example when an UPDATE statement is executed against the database, we need a way to invalidate the data stored in Valkey so that the update is reflected. If the cache is not invalidated, there is a risk of displaying outdated information to users, which can cause confusion or even privacy issues.

Cache Invalidation is often handled by setting an expiration time (TTL) on the cached data. When the data is not present in the cache (a cache miss), either because the entry has expired and been removed from Valkey or because it never existed there in the first place, the application fetches it from the database layer and stores it in the cache for future use.
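The read path described above is often called cache-aside. Here is a minimal sketch of it, assuming an in-memory HashMap standing in for Valkey and a plain function standing in for the database query; the class and method names are made up for illustration, and the clock is passed in so the behavior is easy to follow.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Cache-aside with a per-entry TTL. A HashMap stands in for Valkey (assumption). */
public class CacheAsideSketch {
    private record Entry(String value, long expiresAtMillis) {}

    private final Map<String, Entry> cache = new HashMap<>();
    private final Function<String, String> database; // stand-in for a SQL query
    private final long ttlMillis;
    private int databaseReads = 0;

    public CacheAsideSketch(Function<String, String> database, long ttlMillis) {
        this.database = database;
        this.ttlMillis = ttlMillis;
    }

    /** Returns the cached value, reloading it from the database on a miss or after expiry. */
    public String get(String key, long nowMillis) {
        Entry entry = cache.get(key);
        if (entry != null && nowMillis < entry.expiresAtMillis()) {
            return entry.value(); // cache hit: the database is not consulted
        }
        databaseReads++; // cache miss: entry absent or expired
        String fresh = database.apply(key);
        cache.put(key, new Entry(fresh, nowMillis + ttlMillis));
        return fresh;
    }

    /** Number of times the database stand-in was queried. */
    public int databaseReads() {
        return databaseReads;
    }
}
```

Note the trade-off: if the row changes in the database while the entry is still within its TTL, readers keep getting the old value until the entry expires.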

But if there are too many cache misses at the same time, either because multiple cache entries expire at the same time or because too many sessions request the same expired entries, it will cause a huge spike in database load. In the worst-case scenario, this will lead to performance degradation or crashes because each connection will attempt to update the missing cache entry from the database. This problem is called Cache Stampede.
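To make the failure mode concrete, here is a small simulation, not a benchmark. It assumes a ConcurrentHashMap standing in for Valkey and a counter standing in for the database; the key name and class name are hypothetical. A barrier forces every client to see the miss before any of them repopulates the cache, which is the worst case described above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

public class StampedeSketch {
    /** Simulates `clients` requests for the same expired key; returns the number of database reads. */
    public static int simulate(int clients) {
        Map<String, String> cache = new ConcurrentHashMap<>(); // empty: the entry just expired
        AtomicInteger databaseReads = new AtomicInteger();
        CyclicBarrier allMissed = new CyclicBarrier(clients);

        Runnable request = () -> {
            try {
                boolean miss = cache.get("report:daily") == null;
                allMissed.await(); // every client observes the miss before anyone repopulates
                if (miss) {
                    databaseReads.incrementAndGet(); // the expensive query, once per client
                    cache.put("report:daily", "result");
                }
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
        };

        Thread[] threads = new Thread[clients];
        for (int i = 0; i < clients; i++) {
            threads[i] = new Thread(request);
            threads[i].start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return databaseReads.get();
    }
}
```

With 50 simulated clients the database stand-in is queried 50 times for a single key, where one query would have been enough.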

Tackling the problems with Change Data Capture

Change Data Capture, or CDC, is a design pattern for capturing changes to data, such as those made by INSERT, UPDATE, and DELETE statements in a MySQL database. These changes can then be applied to other data stores like data warehouses and data lakes, enabling real-time data processing and delivering time-sensitive insights.

CDC can also be used for updating caches, which is the use case discussed in this blog post.

Setting up a CDC pipeline from MySQL to Valkey using the Debezium Platform

From the Debezium documentation:

Debezium is a set of distributed services to capture changes in your databases so that your applications can see those changes and respond to them. Debezium records all row-level changes within each database table in a change event stream, and applications simply read these streams to see the change events in the same order in which they occurred.

Debezium is a popular open-source solution for CDC. It supports capturing data from widely used database systems, such as MySQL, PostgreSQL, MongoDB, etc. Debezium provides the debezium-api module, allowing us to easily configure a Debezium connector in a Java project.

For the demo, we will set up a small Java program to stream changes from MySQL to Valkey as a JSON object.

Dependencies

To start with the demo project, we will need to install a few things on the system:

  • OpenJDK: for this blog post, I’m using JDK version 17.
  • Apache Maven: for managing the Java project dependencies.
  • Docker: for deploying the MySQL and Valkey instances.

For this demo, we will need dependencies for Debezium, MySQL, and Valkey, all managed with Maven. Add them to your application’s POM, where ${version.debezium} is either the version of the Debezium Platform you’re using or a Maven property whose value contains that version string. I used 3.3.0.Alpha2, the latest available at the time of writing.
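A dependency block along the following lines should cover it. The three io.debezium artifacts are the ones the Debezium Engine documentation lists for embedding a connector. The Valkey client is an assumption on my part (valkey-java, a Jedis-style client), and the ${version.valkey-java} property name is hypothetical; substitute whichever client and version your project uses.

```xml
<!-- Goes inside <dependencies> in pom.xml -->
<dependency>
    <groupId>io.debezium</groupId>
    <artifactId>debezium-api</artifactId>
    <version>${version.debezium}</version>
</dependency>
<dependency>
    <groupId>io.debezium</groupId>
    <artifactId>debezium-embedded</artifactId>
    <version>${version.debezium}</version>
</dependency>
<dependency>
    <groupId>io.debezium</groupId>
    <artifactId>debezium-connector-mysql</artifactId>
    <version>${version.debezium}</version>
</dependency>
<!-- Assumption: valkey-java as the Valkey client; the version property name is hypothetical -->
<dependency>
    <groupId>io.valkey</groupId>
    <artifactId>valkey-java</artifactId>
    <version>${version.valkey-java}</version>
</dependency>
```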

In the code

Defining the connection to MySQL

We will begin by defining the configuration for the MySQL connector, which connects to an instance running on localhost:3306 with the user ‘mysqluser’.

When the connector runs, it reads information from the source and periodically records “offsets” that define how much of that information it has processed. If the process is restarted, it can continue from the last recorded offset instead of starting over, which keeps duplicate messages to a minimum. Duplicates can affect data integrity if not handled carefully.

The Debezium MySQL connector reads the server’s binary logs, which include all data changes and schema changes made to the databases. Since all changes to data are structured in terms of the owning table’s schema at the time the change was recorded, the connector needs to track all of the schema changes so that it can properly decode the change events. The connector records the schema information so that, should the connector be restarted and resume reading from the last recorded offset, it knows exactly what the database schemas looked like at that offset.

In this demo, we will store both the offset information and the database schema history as local files on the system, at /tmp/offsets.dat for the offset, and /tmp/schemahistory.dat for the schema history.

Lastly, for a CDC engine to sync data between different database systems automatically and accurately, it needs to know some metadata about the data it is syncing: the data type of each column or field, how large a column should be, and so on. Each change event can therefore carry a schema that describes the structure of the data and reveals whether it has changed recently, so that the data can be replicated as accurately as possible. But when streaming changes to a non-RDBMS data store, we do not need those schemas, so they can be omitted from the event, which gives smaller messages and faster processing.
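Put together, the settings described in this section could look roughly like the sketch below. It uses only java.util.Properties so it compiles without Debezium on the classpath. The property keys are the ones I know from the Debezium Engine and MySQL connector documentation; the engine name, topic prefix, server ID, and flush interval are placeholder values I chose, not values from the demo project.

```java
import java.util.Properties;

public class ConnectorConfigSketch {
    /** Builds engine properties for a MySQL source on localhost:3306 (sketch, placeholder values). */
    public static Properties mysqlConnectorProperties(String password) {
        Properties props = new Properties();
        props.setProperty("name", "valkey-cdc-engine"); // hypothetical engine name
        props.setProperty("connector.class", "io.debezium.connector.mysql.MySqlConnector");

        // Source database connection
        props.setProperty("database.hostname", "localhost");
        props.setProperty("database.port", "3306");
        props.setProperty("database.user", "mysqluser");
        props.setProperty("database.password", password);
        props.setProperty("database.server.id", "85744"); // placeholder; must be unique in the replication topology
        props.setProperty("topic.prefix", "valkey-cdc");  // hypothetical prefix

        // Offsets: how far into the binlog the engine has processed
        props.setProperty("offset.storage", "org.apache.kafka.connect.storage.FileOffsetBackingStore");
        props.setProperty("offset.storage.file.filename", "/tmp/offsets.dat");
        props.setProperty("offset.flush.interval.ms", "60000"); // assumed interval

        // Schema history: table structures at each point in the binlog
        props.setProperty("schema.history.internal", "io.debezium.storage.file.history.FileSchemaHistory");
        props.setProperty("schema.history.internal.file.filename", "/tmp/schemahistory.dat");

        // Leave the schema out of each JSON event to keep messages small
        props.setProperty("converter.schemas.enable", "false");
        return props;
    }
}
```

The password is a parameter here only to keep it out of the source; the demo project may load it differently.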

Printing the change event to the console

After specifying the configuration, we can create an instance of DebeziumEngine. This object will poll the MySQL server every 10 milliseconds and print to the console each ChangeEvent captured.

The output to the console will resemble this

Writing the change event to Valkey using JSON.SET

While JSON has become a built-in data type in Redis, that is not the case for Valkey (yet). As such, the client library will not provide us with JSON commands. But we can put together a quick implementation by using the ProtocolCommand interface.
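One way this could look, assuming a Jedis-style client whose ProtocolCommand interface has a single getRaw() method returning the command name as bytes. So that the sketch compiles on its own, it declares a local stand-in for that interface; in a real project you would import the client’s interface instead. The enum name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class JsonCommandSketch {
    /** Local stand-in mirroring a Jedis-style ProtocolCommand interface (assumption). */
    public interface ProtocolCommand {
        byte[] getRaw();
    }

    /** Commands provided by the valkey-json module that the client has no built-in methods for. */
    public enum JsonCommand implements ProtocolCommand {
        SET("JSON.SET"),
        GET("JSON.GET"),
        DEL("JSON.DEL");

        private final byte[] raw;

        JsonCommand(String name) {
            this.raw = name.getBytes(StandardCharsets.UTF_8);
        }

        @Override
        public byte[] getRaw() {
            return raw; // the command name exactly as it is sent to the server
        }
    }
}
```

With such a client, the command is typically sent through a generic sendCommand call that takes the command followed by its arguments (for JSON.SET: the key, the path "$", and the JSON document).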

Then we can write the change event to Valkey as a JSON object:

Modifying the change event before writing to Valkey

Looking at the output of the JSON.SET command, we can see that while the change event does appear in Valkey, the data is not very helpful: the key does not tell us what table or key pattern it belongs to, and the value contains unnecessary information (e.g., the application using the key most likely won’t need to know the binlog details).

If we need more advanced processing of the ChangeEvent record before writing it to the cache (for example, transforming the record key into a format like <table-name>:<primary-key-value>, or removing the change event metadata from the record), we can implement the io.debezium.engine.DebeziumEngine.ChangeConsumer<R> interface.

In the following example, we will transform the record so that:

  • The ChangeEvent key is in the format <table-name>:<id>.
  • All metadata is removed from the ChangeEvent value.
  • The key is deleted from the cache if the event we are processing is a delete (op: “d”).
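The three rules above can be sketched independently of Debezium. The example below works on an event that has already been parsed into a Map. It assumes the standard Debezium envelope fields (op, before, after, and source.table) and a primary key column named id; the class and record names are hypothetical, and the ChangeConsumer wiring is left out.

```java
import java.util.Map;
import java.util.Optional;

public class ChangeEventTransformSketch {
    /** Cache operation derived from an event: a present value means "set", empty means "delete". */
    public record CacheAction(String key, Optional<Map<String, Object>> value) {}

    @SuppressWarnings("unchecked")
    public static CacheAction toCacheAction(Map<String, Object> envelope) {
        Map<String, Object> source = (Map<String, Object>) envelope.get("source");
        String table = (String) source.get("table");
        boolean isDelete = "d".equals(envelope.get("op"));

        // A delete event has no "after" image, so the id is taken from "before"
        Map<String, Object> row = (Map<String, Object>) envelope.get(isDelete ? "before" : "after");
        String key = table + ":" + row.get("id");

        // Only the row itself is kept; op, source, and timestamps are dropped
        return isDelete
                ? new CacheAction(key, Optional.empty())
                : new CacheAction(key, Optional.of(row));
    }
}
```

A consumer would then issue JSON.SET for actions that carry a value and a key deletion for those that do not.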

We can then use ValkeyChangeConsumer by passing an instance of it to the notifying API.

The change event is presented much better in Valkey now.

Putting it all together

The source code for the Java program is available on the Percona Lab GitHub account: https://github.com/Percona-Lab/valkey-cdc-debezium

First, we will deploy the MySQL and Valkey instances.

  • For MySQL, we will create the user mysqluser.

  • For Valkey, we will use the valkey-bundle Docker image, which includes the valkey-json, valkey-search, and valkey-ldap modules.

Download the Java program source code and compile it. Before compiling, please remember to update the database password config key.

Running the program

Summary

Integrating Change Data Capture (CDC) with Valkey offers significant benefits for managing the cache invalidation and cache stampede problems. By using the Debezium Engine to stream database changes to Valkey in real time, applications can keep their cached data up to date as the database changes. This reduces the risk of serving stale information. It also reduces the risk of a cache stampede, because entries are refreshed by the pipeline rather than reloaded by many clients at once after they expire.

Further reading
