At Percona Live last week, someone showed me a graph from their Cacti monitoring system, built with the templates that I wrote. It showed the buffer pool pages read, written, and created. He asked me if the graph was okay: shouldn’t there be a lot more pages read than written? It’s a great question.
I’ve blogged before about the danger of trying to interpret ratios. Ratios are not good ways to discover whether systems are healthy. So, why graph them, then?
First, let me say that the graph doesn’t actually show a ratio; it just shows the absolute values of the reads, writes, and creates per second, stacked on top of each other. The person was mentally comparing them and creating a ratio from them. But there’s no ratio on the graph itself.
Regardless of that, some systems ought to have more reads than writes, and vice versa. So if you’re looking at your graph wondering what it should look like, the answer is probably “it should look exactly as it looks!”
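To make the "absolute values, not ratios" point concrete, here is a rough sketch of how per-second values like these can be derived from two samples of cumulative counters. The counter names are MySQL's `Innodb_pages_read`, `Innodb_pages_written`, and `Innodb_pages_created` status variables, chosen for illustration; the sample numbers and the `per_second_rates` helper are made up, and this is not necessarily how the Cacti templates compute their values.

```python
def per_second_rates(earlier, later, elapsed_seconds):
    """Turn two samples of cumulative counters into per-second rates.

    `earlier` and `later` are dicts of counter name -> cumulative value.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")

    rates = {}
    for name, new_value in later.items():
        delta = new_value - earlier[name]
        if delta < 0:
            # A counter that went backwards usually means the server
            # restarted; report no value instead of a bogus spike.
            rates[name] = None
        else:
            rates[name] = delta / elapsed_seconds
    return rates


# Hypothetical samples taken five minutes (300 seconds) apart.
sample_a = {
    "Innodb_pages_read": 1_000_000,
    "Innodb_pages_written": 4_000_000,
    "Innodb_pages_created": 250_000,
}
sample_b = {
    "Innodb_pages_read": 1_003_000,
    "Innodb_pages_written": 4_030_000,
    "Innodb_pages_created": 251_500,
}

rates = per_second_rates(sample_a, sample_b, 300)
for name, rate in rates.items():
    print(f"{name}: {rate} pages/sec")
```

In this made-up sample the server writes ten times as many pages as it reads. Nothing in the numbers says whether that is good or bad; it only describes the workload.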
I’ve gotten a lot of questions over time about how to interpret the Cacti graphs, and this person helped me to understand what the questions were really about. People were asking me “when I look at these graphs, how can I tell if anything is wrong with my system?” But that’s not really the most useful way to approach the graphs.
It really comes down to the difference between discovery and diagnosis. In general, it’s best to use the graphs for diagnosing problems that you already know about, not for trying to discover problems. Your monitoring and alerting system (Nagios?) should be trying to discover whether there is a problem. The graphs are there for quickly showing you what has changed. If the website suddenly starts responding very slowly, for example, then you can look at the graphs and see if any of them have sharp increases or decreases. You can use that information to help you diagnose.
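When you're diagnosing and have many metrics to scan, the "what changed sharply?" question can also be asked in code. The following is only a sketch under assumptions of my own: the `sharp_changes` helper, the window of six samples, and the factor of three are all hypothetical choices, and the data is invented. It compares each sample against the median of the samples just before it, so a metric is judged against its own recent history instead of against some idea of what it "should" be.

```python
from statistics import median


def sharp_changes(samples, window=6, factor=3.0):
    """Return indexes of samples that jumped or dropped sharply.

    A sample is flagged when it is at least `factor` times larger or
    smaller than the median of the `window` samples preceding it.
    """
    flagged = []
    for i in range(window, len(samples)):
        baseline = median(samples[i - window:i])
        value = samples[i]
        if baseline == 0:
            # Any activity after a flat-zero baseline counts as a change.
            if value > 0:
                flagged.append(i)
            continue
        change = value / baseline
        if change >= factor or change <= 1 / factor:
            flagged.append(i)
    return flagged


# Invented pages-written-per-second samples with a burst at positions 7 and 8.
pages_written = [100, 102, 98, 101, 99, 100, 103, 400, 410, 97]
print(sharp_changes(pages_written))
```

The output only tells you where to look. Whether the burst explains the slow website is still a diagnosis you have to make yourself.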
But in general, I wouldn’t spend very much time looking at the graphs from day to day. I’d just check them once in a while — maybe once a week I’d look at the monthly view — to see if there were any sharp changes during the past week; I’d ensure that I know why those changes happened if I see any (maybe I deployed a new release); and I’d want to make sure that the graphs are still working, and haven’t gotten broken due to some problem like privileges or firewall rules.
In an ideal world, I’d like to simply collect everything, and not even define any graphs for the metrics. Then I’d like an easy way to make graphs on an ad-hoc basis. But Cacti is designed to have defined graphs, and that makes it tempting for people to spend a lot of time looking at them.