In this post, I’ll cover the new MySQL monitoring plugins we created for Nagios, and explain their features and intended purpose.
I want to add a little context. What problem were we trying to solve with these plugins? Why yet another set of MySQL monitoring plugins?
The typical problem with Nagios monitoring (and indeed with most other monitoring systems) is the alert spam. I’ve spoken to a lot of users with monitoring systems over the years, and everyone says that they get a lot of false alarms, and that they have email filters set up for alerts. Recently I spoke to one person who gets about thirty thousand alerts per week (!!!) in his email. I’m sure you can see that this makes the monitoring system much less effective. The usual cause of such noisy monitoring is when you alert on things that don’t really indicate problems.
To avoid this problem as much as possible, I suggest the following:
- Alert only on things that reliably indicate a real problem. Do not alert on a threshold that merely indicates something is abnormal; abnormal things happen constantly. It’s useful to spend time investigating abnormalities, but in general, you should try to make sure that doesn’t creep into your fault-detection alerts.
- Alert only on things that are actionable. If you get an alert and you can’t do anything about it, that’s a certain recipe for a /dev/null email filter.
- Avoid ratios and thresholds as much as possible, because they are rarely actionable or reliable indicators of problems, and it’s difficult or impossible to find the right threshold for all situations. This suggestion is a consequence of the previous two, but it’s worth calling out separately in my opinion. I’ll highlight a couple of threshold-based alerts later in this blog post, and explain the need to be careful with them.
Although we’re not finished with the plugins, we have a usable starting point, and we’ll build on it as time passes. The plugins we created are based on our analysis of several hundred customer downtime incidents, and their causes and preventions. You can read more about that research in our white papers, here and here.
If you read those white papers, you’ll see a pretty extensive list of suggestions for what to monitor in MySQL. Those are only suggestions; any given deployment might want to monitor only some of them, as appropriate, and might need to include others as well. Here’s how you can use our new monitoring plugins to check for some of the conditions that caused downtime for our customers:
- Failed or full LVM snapshots. LVM backup strategies can fail to delete snapshots after finishing, sometimes due to the backup script failing before completion. Additionally, LVM snapshots that don’t have enough copy-on-write space can fail before the backup completes. The pmp-check-lvm-snapshots plugin will alert when a snapshot is failed or overly full.
- Too many deadlocks. Deadlocks (and lock wait timeouts, covered in another plugin) are frequent causes of serious errors in applications because the applications often don’t handle deadlocks correctly, and in many cases also don’t log the resulting errors. The best place to fix these problems is in the application, but the DBA really needs to be monitoring it as well, because an application error can creep in easily and be deployed, and you will not find it anywhere but in production. The pmp-check-mysql-deadlocks plugin can help you detect such problems, in conjunction with Percona Toolkit’s pt-deadlock-logger tool.
- Open but deleted filehandles. It’s quite easy for a bad logrotate script (such as the one that ships by default with many server installations) or a system administration mistake to delete one of MySQL’s log or data files from disk. The database server will continue to function, however, because the file isn’t really deleted. The MySQL server keeps an open filehandle to the file, and it won’t be deleted until the server restarts, so the problem remains undetected while it becomes more and more severe. This can cause two serious types of problems: the disk can fill up with no apparent cause or solution, or you can restart the server and lose your data permanently. The pmp-check-mysql-deleted-files plugin will alert when files have been deleted but remain open via a filehandle.
- Wrong filesystem privileges. It’s very easy for a file to creep into MySQL’s data directory with wrong permissions, without causing problems until a later date. For example, if a database isn’t owned by MySQL, you won’t have any problems until you try to create or drop a table, or the database itself. Likewise, if a system administrator leaves a temporary file in the database directory and then you try to drop the database, it will fail. These things tend to happen at inconvenient times. The pmp-check-mysql-file-privs plugin checks whether MySQL owns every file and directory within its data directory.
- InnoDB is getting stuck, has long-running transactions, or other problems. The pmp-check-mysql-innodb plugin checks for several indicators of severe internal problems within InnoDB, such as contention that crosses a tipping point and cripples the server. Sometimes such conditions are like credit card debt: easy to start, impossible to dig out afterwards. Another very common cause of serious incidents is application misbehavior, such as opening a transaction but never committing it, which creates a lot of bloat or locking in the database server. This plugin can alert on that and a couple other types of common problems, too.
- Missing PID file. A variety of problems can delete MySQL’s PID file, such as a buggy init script that attempts to start the server when it’s already running, or executing the server by hand instead of through the init script. Overly zealous cron jobs to delete old files in /tmp can do the same thing. If the PID file is missing, then many init scripts will behave very badly. Sample problems include starting two instances of MySQL on the same files in the data directory, and refusing to shut down MySQL so that it is killed forcibly (i.e. crashed) instead of gracefully shutting down for a system reboot. The pmp-check-mysql-pidfile plugin verifies that MySQL’s PID file exists.
- Evidence of contention in the processlist. The pmp-check-mysql-processlist plugin checks SHOW FULL PROCESSLIST for patterns that indicate similar problems to those discussed within InnoDB, but it can find evidence of problems in the server as well as the storage engine. For example, a pile-up of connections all trying to open or close tables can indicate LOCK_open contention.
- Replication is delayed or stopped. Our plugins are structured to monitor as few things as possible, but as many as needed. Replication is one thing that needs to be monitored in two dimensions, which shouldn’t be combined: delay (Seconds_behind_master or the delay indicated by the pt-heartbeat tool, if you choose), and actual failure. The pmp-check-mysql-replication-delay plugin checks for excessive delay, and pmp-check-mysql-replication-running checks that replication is functioning correctly.
- Replicas have different data than their master. In addition to delay or failure, replication can fail silently and data can diverge or drift on a replica. The pt-table-checksum tool from Percona Toolkit can generate table checksums, and the pmp-check-pt-table-checksum plugin can alert when those checksums indicate that the replica is not logically identical to its master.
- The server is allocating too much memory. A runaway server process, or poorly optimized queries, can consume too much memory and cause the server to swap until it is unresponsive and has to be power cycled. In addition, other processes can be runaways, too, such as a poorly written script or daemon; it is not unheard of for a monitoring agent or RAID utility software to do this. The pmp-check-unix-memory plugin checks for a single process consuming too much memory and/or for too little free memory in general. This plugin requires careful configuration because it uses a threshold for which there is no good one-size-fits-all value. This plugin, perhaps more than any other, should only be used judiciously.
- The server has been restarted recently. It is surprising how often a server crashes, perhaps even cycling through crashes, and is not detected until sometime later when the problem has grown much more serious. You can use the pmp-check-mysql-status plugin to alert when the server’s Uptime variable is too small.
- The server is approaching max_connections. Overflowing the max_connections setting is a bit like a lack of disk space: it is possible for it to be slowly approached over time, and then to begin causing problems when connections are rejected. Similarly to deadlocks and timeouts, applications often don’t handle or log this error appropriately. The pmp-check-mysql-status plugin can be configured to alert when Threads_connected approaches max_connections too closely, which can give the system administrator time to address the situation before it becomes a problem. This check requires careful configuration, or it too can become a noise generator. However, if your application environment makes it difficult to handle max_connections problems (the use of fixed-size connection pooling, for example, which can sometimes require application downtime to resolve), then it might be worth considering to let you deal with the problem when it’s convenient instead of when it’s urgent.
In addition to the checks in that list, there are several others that can be performed with existing plugins, so there was no need to write new ones. For example, the plugins that ship with Nagios are suitable for checking that the server is alive and responds to a query.
The pmp-check-mysql-status is also very flexible and can be used to trigger an alert on almost anything from SHOW VARIABLES and SHOW STATUS that you want. If you want to monitor ratios and rates, for example, you can certainly do so; it also supports an incremental mode of operation, so you can (for example) get the incremental increase in a variable such as Queries and alert on a queries-per-second threshold. (I don’t suggest doing this; I’m only listing an example of what you can do. The documentation has several other examples.)
In future releases we will include more plugins to monitor more types of problems, but the current plugins already cover some of the most common and severe conditions that we’ve observed to cause actual production downtime. Please download and try out the plugins, and help us improve them. If you are a customer and you need help configuring any of the plugins, we are happy to assist. In addition, please ask our sales staff to include a free monitoring setup in a new support contract. I believe they will be happy to negotiate this with you.