I’ve been working with Percona XtraDB Cluster quite a bit recently, and have been investigating various warnings. I came across this one today:
[Warning] WSREP: last inactive check more than PT1.5S ago (PT1.51811S), skipping check
This warning is related to the evs.inactive_check_period option. This option controls the poll period for the group communication response time. If a node is delayed, it is added to a delay list and it can lead to the cluster evicting the node.
While some troubleshooting tips seem to associate the warning with VMWare snapshots, this isn’t the case here, as we see the warning on a physical machine.
I checked for backups or desynced nodes, and this also wasn’t the case. The warning was not accompanied by any errors or other information, so there was nothing critical happening.
In the troubleshooting link above, Galera developers said:
This can be seen on bare metal as well — with poorly configured mysqld, O/S, or simply being overloaded. All it means is that this thread could not get CPU time for 7.1 seconds. You can imagine that access to resources in virtual machines is even harder (especially I/O) than on bare metal, so you will see this in virtual machines more often.
This is not a Galera specific issue (it just reports being stuck, other mysqld threads are equally stuck) so there is no configuration options for that. You simply must make sure that your system and mysqld are properly configured, that there is enough RAM (buffer pool not over provisioned), that there is swap, that there are proper I/O drivers installed on guest and so on.
Basically, Galera runs in virtual machines as well as the virtual machines approximates bare metal.
It could also be an indication of unstable network or just higher average network latency than expected by the default configuration. In addition to checking network, do check I/O, swap and memory when you do see this warning.
Our graphs and counters otherwise look healthy. If this is the case, this is most likely nothing to worry about.
If all of this looks good, ensure the servers are all talking to the same NTP server, have the same time zone and the times and dates are in sync. While this warning could be a sign of an overloaded system, if everything else looks good this warning isn’t something to worry about.
The warning comes from
evs_proto.cpp in the Galera code:
if (last_inactive_check_ + inactive_check_period_*3 < now)
log_warn << "last inactive check more than " << inactive_check_period_*3
<< " ago (" << (now - last_inactive_check_)
<< "), skipping check";
last_inactive_check_ = now;
Since the default for
inactive_check_period is one second according to the Galera documentation, if it is now later than three seconds after the last check, it skips the rest of the above routine and adds the node to the delay list and does some other logic. The reason it does this is that it doesn’t want to rely on stale counters before making decisions. The message is really just letting you know that.
In Percona XtraDB Cluster, this setting defaults to 0.5s. This warning simply could be that your
inactive_check_period is too low, and the delay is not high enough to add the node to the delay list. So you could consider increasing
evs.inactive_check_period to resolve the warnings. (Apparently in Galera, it may also now be 0.5s but documentation is stale.)
To find a sane value my colleague David Bennett came up with this command line, which gives you an idea of when your check warnings are happening:
$ cat mysqld.log | grep 'last inactive check more than' | perl -ne 'm/(PT(.*)S)/; print $1."n"' | sort -n | uniq -c
Therefore, in this case, it may be a good idea to set
inactive_check_period at 1 or 1.5 to make the warnings go away.
Each node in the cluster keeps its own local copy of how it sees the topology of the entire cluster.
check_inactive is a node event that is triggered every
inactive_check_period seconds to help the node update its view of the whole cluster, and ensure it is accurate. Service messages can be broadcast to the cluster informing nodes of changes to the topology. For example, if a cluster node is going down it will broadcast a service message telling each node in the cluster to remove it. The action is queued but the actual view of the cluster is updated with
check_inactive. This is why it adds nodes to its local copy of inactive, suspect and delayed nodes.
If a node thinks it might be looking at stale data, it doesn’t make these decisions and waits until the next time for a fresh queue. Unfortunately, if
inactive_check_period is too low, it will keep giving you warnings.