Monitoring MongoDB with Nagios

Monitoring MongoDB with NagiosIn this blog, we’ll discuss monitoring MongoDB with Nagios.

There is a significant amount of talk around graphing MongoDB metrics using things like Prometheus, Data Dog, New Relic, and Ops Manager from MongoDB Inc. However, I haven’t noticed a lot of talk around “What MongoDB alerts should I be setting up?”

While building out Percona’s remote DBA service for MongoDB, I looked at Prometheus’s AlertManager. After reviewing it, I’m not sure it’s quite ready to be used exclusively. We needed to decide quickly if there are better Nagios checks on the market, or did I need to write my own?

In the end, we settled on a hybrid approach. There are some good frameworks, but we need to create or tweak some of the things needed for an “SEV 1-” or “SEV 2-” type issue (which are most important to me). One of the most common problems for operations, Ops, DevOps, DBA teams and most engineering is alert spam. As such I wanted to be very careful to only alert on the things pointing to immediate dangers or current outages. As a result, we have now added to the GitHub for Percona Monitoring Plugins. Since we use Grafana and Prometheus for metrics and graphing, there are no accompanying Catci information templates. In the future, we’ll need to decide how this will change PMP overtime. In the meantime, we wanted to make the tool available now and worry about some of the issues later on.

As part of this push, I want to give you some real world examples of how you might use this tool. There are many options available to you, and Nagios is still a bit green in regards to making those options as user-friendly as our tools are.

There seems to be a huge amount going on here, but let’s break it down into a few categories:

  • Connection options
  • Actions
  • Action options
  • Status options

Hopefully, this takes some of the scariness out of the script above.

Connection options
  • Host / Port Number
    • Pretty simple, this is just the host you want to connect to and what TCP port it is listening on.
  • Username and Password
    • Like with Host/Port, this is some of your normal and typical Mongo connection field options. If you do not set both the username and password, the system will assume auth was disabled.
  • SSL
    • This is mostly around the old SSL support in Mongo clients (which was a boolean). This tool needs updating to support the more modern SSL connection options. Use this as a “deprecated” feature that might not work on newer versions.
  • ReplicaSet
    • Very particular option that is only used for a few checks and verifies that the connection uses a replicaset connection. Using this option lets the tool automatically find a primary node for you, and is helpful to some checks specifically around replication and high availability (HA):
      • check_election
      • check_repl_lag
      • check_cannary_test
      • chech_have_primary
      • check_oplog
Actions and what they mean
  • check_connections
    • This parameter refers to memory usage, but beyond that you need to know if your typical connections suddenly double. This indicates something unexpected happened in the application or database and caused everything to reconnect. It often takes up to 10 minutes for those old connections to go away.
  • check_election
    • This uses the status file options we will cover in a minute, but it checks to see if the primary from the last check differs from the current found primary. If so, it alerts. This check should only have a threshold of one before it alarms (as an alert means an HA event occurred).
  • check_lock_pct
    • MMAP only, this engine has a write lock on the whole collection/database depending on the version. This is a crucial metric to determine if MMAP writes are blocking reads, meaning you need to scale the DB layer in some way.
  • check_repl_lag
    • Checks the replication stream to understand how lagged a given node is the primary. To accomplish this, it uses a fake record in the test DB to cause a write. Without this, a read-only system would look lagged artificially as no new oplog entries get created.
  • check_flushing
    • A common issue with MongoDB is very long flush times, causing a system halt. This is a caused by your disk subsystem not keeping up, and then the DB having to wait on flushing to make sure writes get correctly journaled.
  • check_total_indexes
    • The more indexes you have, the more the planner has to work to determine which index is a good fit. This increases the risk that the recovery of a failure will take a long time. This is due to the way a restore builds indexes and how MongoDB can only make one index at a time.
  • check_balance
    • While MongoDB should keep things in balance across a cluster, many things can happen: jumbo chunks, a disabled balancer being, constantly attempting to move the same chunk but failing, and even adding/removing sharding. This alert is for these cases, as an imbalance means some records might get served faster than others. It is purely based on the chunk count that the MongoDB balancer is also based on, which is not necessarily the same as disk usage.
  • check_queues
    • No matter what engine you have selected, a backlog of sustained reads or writes indicates your DB layer is unable to keep up with demand. It is important in these cases to send an alert if the rate is maintained. You might notice this is also in our Prometheus exporter for graphics as both trending and alerting are necessary to watch in a MongoDB system.
  • check_cannary_test
    • This is a typical query for the database and then used to set critical/warning levels based on the latency of the returned query. While not as accurate as full synthetic transactions, queries through the application are good to measure response time expectations and SLAs.
  • check_have_primary
    • If we had an HA event but failed to get back up quickly, it’s important to know if a new primary is causing writes to error on the system. This check simply determines if the replica set has a primary, which means it can handle reads and writes.
  • check_oplog
    • This check is all about how much oplog history you have. This is much like measuring how much history you have in MySQL blogs. The reason this is important is when recovering from a backup and performing a point in time recovery, you can use the current oplog if the oldest timestamp in the oplog is newer than the backup timestamp. As a result, this is normal three times the backup interval you use to guarantee that you have plenty of time to find the newest recovery and then do the recovery.
  • check_index_ratio
    • This is an older metric that modern MongoDB versions will not find useful, but in the past, it was a good way to understand the percentage of queries not handled by an index.
  • check_connect
    • A very basic check to ensure it can connect (and optionally login) to MongoDB and verify the server is working.
Status File options

These options rarely need to be changed but are present in case you want to store the status on an SHM mount point to avoid actual disk writes.

  • statusfile
    • This is where a copy of the current rs.status, serverStatus and other command data is stored
  • backup-statusfile
    • Like status_file, but status_file is moved here when a new check is done. These two objects can then be compared to find the delta between two checkpoints.
  • max-stale
    • This is the amount of age for which an old file is still valid. Deltas older then this aren’t allowed and exist to protect the system from will assumption when a statusfile is hours or days old.

If you have any questions on how to use these parameters, feel free to let us know. In the code, there is also a defaults dictionary for most of these options so that in many cases setting warning and critical level are not needed.

Share this post