Caching techinquesPeter Zaitsev
Recently Jay Pipes published great article about lazy connecting and caching which reminded me my post on this matter is well overdue.
Let me start with couple of comments about Jays article. First – caching in files should be used with caution. It may be very efficient especially if number of cached objects is small but if you get too many small objects which need to be cached files can become inefficient, especially if you do not care about putting them into different directories on file systems which can’t handle too many files in the same directory efficiently. The other problem with files is of course being local to the local node which might be inefficient with many web servers. Putting cache on shared storage could work but that is extra complication. There are cases when file cache works pretty well – for example caching full page results in files can be extremely efficient, especially if you can ofload sending cache result to the client by web server (instead of doing it from PHP). The other case when caching in files can be very helpful is caching data long term – ie if you got it from some Web Service. For smaller short lived objects things like memcached (or local shared memory for small installations) can be more efficient.
Lets now get to the main point of this article – I wanted to talk in a bit more details about caching algorithms and for which kind of application each of them can work.
TTL Based Caching This is probably most common approach for cache organization in the Web – you cache the object and say it is going to expire in 60 seconds. This type of caching is even supported by HTTP protocol with its Expires Header. This approach to the caching is very simple and pretty efficient, however it leads to having stale data which is problematic for some applications, also objects may expire before they are really modified so cache can be less efficient than possible. TTL caching is great if you have no way to know when object has changed. For example if you retrieve product information from Amazon Web Services, you have no way to get notification about it being changed so the only thing you can do is check if it was changed every so often.
Cache Invalidate With this apporach you invalidate or remove objects in cache once underlying information is updated. This is how MySQL Query Cache works by removing all queries derived from the table if that table is updated. This technique requires you to know which objects are based on which data so you can invalidate wisely and of course have update notifications. This cache can be more efficient with rare updates but you better to have fine grain control of what has to be invalidated – approach used by MySQL Query Cache is too coarse invalidating a lot of queries which really do not need to be. The other problem with this method is – invalidation can be expensive, especially if you get many objects cached and they are stored on the storage or network rather than local memory. This approach works very well if you cache is structured so each update removes only one object from the cache.
Cache Update This one is quite similar to cache-invalidate approach but with difference what cache content is actually renewed rather than invalidated. This approach is more efficient, especially if renewing is done in background as it greatly reduces number of cache misses, while it also wastes resources by updating what might never be needed again. Another complication for this approach is – cache has to become active and know how to update cache content rather than being simply storage system. You can also mix cache update and cache-invalidate approaches efficiently. For example I worked with one company having trouble with some Full Text Search requests – they just were way to slow, while some of them was very frequent. Until Full Text itself could be optimized we added cache update algorithm which would populate cache with result set for most frequent and complicated queries as new data is loaded and index is rebuilt.
Version based caching This could be thought about as modification of previous ones but I will put it as separate item as it is quite efficient way to organize things. In this case version is associated with each of the data sources used to compose the object. For each of the object data source versions are stored together with cached data. Once you take object out of cache you validate if all versions are still current and use it or throw it away. The good thing about this approach is – you cache operations have constant complexity – you do not have to do big invalidation clean up on updates. Let me come with example how it can work in practice. Lets say you’re running big blogging site which has many blogs published. You may store blog version which is updated on each blog modification and have all cache lookups for this user to check if blog version has chaged. You can become more fine grained and have versions added to blog entries so if comment is added to post B you do not invalidate post A. Of course this method also has its downsides – you have to perform extra version check on each lookup and it might be complicated and caching bugs hard to debug, but it can give very good cache efficiency for some applications.
These are of course only some of the approaches which you can use, and in many cases mixing them may make sense. You may use TTL for some of the objects while invalidation for others.