Why don’t our new Nagios plugins use caching?

In response to the release of our new MySQL monitoring plugins on Friday, one commenter asked why the new Nagios plugins don’t use caching. It’s worth answering in a post rather than a comment, because there is an important principle that needs to be understood to monitor servers correctly. But first, some history.

When I wrote a set of high-quality Cacti templates for MySQL a few years ago (which are now replaced by the new project), making the Cacti templates use caching was important for two reasons:

  1. Performance. Cacti runs some of its polling processes serially, so if each graph has to reach out to the MySQL server and retrieve a bunch of data, the polling can take too long. I’ve seen cases where a Cacti server that’s graphing too many MySQL servers doesn’t finish the current iteration before the next one starts. Making sure the data isn’t retrieved from MySQL repeatedly cuts a lot of latency out of the process. In fact, I went further than that; I didn’t cache the data retrieved from MySQL, I cached the resulting metrics. This avoids having to do InnoDB parsing repeatedly, too. With this approach, Cacti is capable of monitoring many more servers.
  2. Consistency. Graphs are supposed to be generated at a single instant in time, but as I just said, they actually aren’t. Graphs that show slightly different data, or greatly different data in some cases I observed, are frustrating at best and wrong or misleading at worst. When you’re using these graphs for forensic or diagnostic purposes, this is very bad. Caching the results from the first call and reusing it for all subsequent polls avoids this problem. All of the graphs are generated from one and only one set of data retrieved from MySQL.

It’s worth noting that Cacti doesn’t do this itself, which I consider to be a Cacti design shortcoming. I did not like writing my own cache management code. Handling cached data correctly is one of the two hard things in programming (choosing proper variable names is the other one, of course).

Now, somewhere along the line “you should cache when you are monitoring a system for failures” became a tacit best practice. I don’t know why, but I can speculate that there might have been a misunderstanding — perhaps caching was understood to reduce load on MySQL, rather than reducing load on the Cacti server and speeding up the poller.

This is a very bad idea! I do NOT advocate caching for Nagios checks. Why? Exactly the inverse of the two reasons that graphing should be cached!

  1. Performance doesn’t matter. First, most Nagios checks are run once every five minutes, or once a minute at most. Running SHOW STATUS infrequently doesn’t add load to the server. Most real production installations I see actually have multiple people running innotop or what have you, once a second! Second, if your monitoring is to be at all useful, you should not monitor too many things inside of MySQL. If you take the typical mass-market monitoring program’s silly “alert on 99 different kinds of cache hit ratio thresholds” approach, you’ll soon have sysadmins with /dev/null email filters, which will make your monitoring system utterly useless. However, even if you have multiple checks running once a minute, it’s still a trivial amount of load.
  2. Consistency is just another word for staleness. Great, your monitoring system thinks that the server is fine, based on a three-minute-old file — what good is that? You should be monitoring the server, as of NOW, rather than monitoring the contents of some file as of several minutes ago.

In conclusion, you should cache for historical metrics monitoring, but not for fault detection monitoring. That’s why the new Nagios plugins don’t have any caching functionality.

Share this post

Comment (1)

  • Stuart Reply

    Quite right too – thank you for the post. I’m amazed anyone would even ask the question.

    February 28, 2012 at 1:42 am

Leave a Reply