I worked on the problem recently which showed itself as rather low MySQL load (probably 5% CPU usage and close to zero IO) would spike to have hundreds instances of threads running at the same time, causing intense utilization spike and server very unresponsive for anywhere from half a minute to ten minutes until everything would go back to normal. What was interesting is Same query was taking large portion of slots in PROCESSLIST. I do not just mean query with same fingerprint but literally the same query with same constants.
What we observed was a cache miss storm – situation which can happen with memcache (as in this case) as well as with query cache. If you have the item which is expensive to generate but which has a lot of hits in the cache you can get into situation when many clients at once will have miss in the cache and will attempt to re-create the item pushing server to overload. Now because a lot of requests being proceed in parallel the response time for initial request may take a lot longer than if it is ran all by itself increasing the time it takes server to recover.
What do I mean by expensive query in this case ? This is the query which has too high ratio of requests to be served with 100% misses for portion of time. For example if I have 100 accesses for given cache objects per second and it takes 500ms to populate it, it still will be too expensive, because for these 500 ms it takes to populate the item 50 requests will be started (this is the average case, because of random arrivals the worse case is worse) which takes 25 seconds to deal with (assuming there is just one execution unit). Because we normally have multiple cores and multiple drives it can be less than that but it is enough to cause hiccup for a few seconds which is unacceptable for a lot of modern applications.
How can you deal with this problem ? You should carefully watch frequently accessed cache items as well as cache items which take long to generate in case of cache miss. To find first one for memcached you can use mk-query-digest to analyze which items are requested frequently, it can decode memcached wire traffic. For second you can have instrumentation in your applications or take a look at MySQL Slow queries – which is good enough if you populate each cache item with single query.
Optimize query if you can. This is a good thing to do in any case but it may not be the only part of best solution. You can get some query patterns getting slow over time as data size growths or execution plan changes, you can also have some items becoming hot unexpectedly due to changes to content interest or launch of new features.
Use Smarter Cache Especially with memcache it is you who decide how to populate the cache. There is number of techniques you can use to avoid this problem such as probabilistic invalidation, you can also put the special value in the cache to reflect it is being updated right now so you’re better wait rather than starting populating it. For MySQL Query Cache the solution should have been to make queries wait on first query started to complete. Unfortunately this have not been implemented so far.
Pre-Populate the cache In some cases you can’t change how caching works easily, especially if it is built in in the application however it may be easier enough to identify hot items
and pre-populate them before they expire. So if item expires in 15 minutes you can refresh it every 10 minutes and so you get basically no misses. This works best when there are few hot cache entries which cause the problem.
So if your server has seemingly random spikes of activities check this out – cache miss storm could be one of the possible causes.