Actively monitoring replication connectivity with MySQL’s heartbeat

December 29, 2011

Author

Miguel Angel Nieto

MySQL


Until MySQL 5.5, the only variable used to identify a network connectivity problem between master and slave was slave-net-timeout. This variable specifies the number of seconds the slave waits for more binary log events from the master before aborting the connection and establishing it again. With a default value of 3600 seconds, it has historically been a badly configured variable: stalled connections or high latency peaks went undetected for a long time, or were not detected at all. Conversely, if the variable was set to a low value, say 30 seconds, and the master had no events to send, the slave would reset the connection after 30 seconds even if the connection was healthy.

Therefore, before this new heartbeat feature, we had no way to check the connection status between the servers. We needed an active master/slave connection check. And here is where replication’s heartbeat can help us.

This feature was introduced in 5.5 as another parameter to the CHANGE MASTER TO command. Once it is enabled, the master sends “beat” packets (of 106 bytes) every X seconds whenever it is idle (no events to send to the slave), where X is a value you define in seconds.

Now, let’s say that slave-net-timeout=30. If the master is idle, without events to send, it will start to send those beats. Therefore, the connection reset won’t be triggered after those 30 seconds, because now the slave knows that the connection is still alive.
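The interaction between the two settings can be sketched as a tiny model. This is only an illustration of the reasoning above, not how the server implements it: the function name and its parameters are hypothetical, and it assumes a healthy network where every heartbeat the master sends actually arrives.

```python
def connection_reset_expected(slave_net_timeout, heartbeat_period, master_idle_seconds):
    """Return True if the slave would give up waiting and reconnect.

    Simplified model (an assumption, not server code): while the master is
    idle, the only traffic the slave can receive is a heartbeat.
    A heartbeat_period of 0 means heartbeats are disabled.
    """
    if heartbeat_period > 0:
        # The slave never waits longer than one heartbeat period in silence.
        longest_silence = min(heartbeat_period, master_idle_seconds)
    else:
        # Without heartbeats, the whole idle period is silent.
        longest_silence = master_idle_seconds
    return longest_silence > slave_net_timeout


# Idle master, no heartbeats, slave-net-timeout=30: the healthy connection is reset.
print(connection_reset_expected(30, 0, 120))
# Same situation with a 1-second heartbeat: no reset.
print(connection_reset_expected(30, 1, 120))
```

The model also shows why the heartbeat period should stay below slave-net-timeout: with a period of 60 and a timeout of 30, the reset would still fire.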

How can I configure replication’s heartbeat?

It is very easy to set up, with negligible overhead:

mysql_slave > STOP SLAVE;
mysql_slave > CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD=1;
mysql_slave > START SLAVE;

MASTER_HEARTBEAT_PERIOD is expressed in seconds, in the range 0 to 4294967, with a resolution of milliseconds, so fractional values such as 0.5 are accepted. A value of 0 disables heartbeats.
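To make the unit and range concrete, here is a minimal sketch of a helper that checks a period before building the statement. The helper names are hypothetical, and rounding to three decimals is only this sketch's way of expressing millisecond resolution; it does not claim to mirror how the server treats extra digits.

```python
MAX_HEARTBEAT_PERIOD = 4294967  # upper bound quoted above, in seconds


def normalize_heartbeat_period(seconds):
    """Validate a heartbeat period in seconds and round it to milliseconds."""
    if not 0 <= seconds <= MAX_HEARTBEAT_PERIOD:
        raise ValueError("heartbeat period must be between 0 and %d seconds"
                         % MAX_HEARTBEAT_PERIOD)
    return round(seconds, 3)


def change_master_statement(seconds):
    """Build the CHANGE MASTER TO statement for the given period."""
    period = normalize_heartbeat_period(seconds)
    return "CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD=%.3f;" % period


print(change_master_statement(1))    # one beat per second
print(change_master_statement(0.5))  # one beat every 500 milliseconds
```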

It is interesting to note that a 5.5 slave with replication’s heartbeat enabled can connect to a 5.1 master without breaking replication. Of course, the heartbeat will not work in this case, because the master doesn’t know what a beat is or how to send it 🙂

What status variables do I have?

Two: the configured heartbeat period and the number of beats received.

mysql_slave > SHOW STATUS LIKE '%heartbeat%';
+---------------------------+-------+
| Variable_name             | Value |
+---------------------------+-------+
| Slave_heartbeat_period    | 1.000 |
| Slave_received_heartbeats | 1476  |
+---------------------------+-------+

How can we check if the connection is down?

– If the master’s binary log position is greater than the one on the slave, but the slave is not receiving those new events, then the connection is down.
– If the master is idle but we see the number of received heartbeats increasing, then the connection is not down.
– If the master is idle and we don’t see the heartbeats increasing, then the connection is down.
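The three rules above can be sketched as a small classifier over two samples taken at least one heartbeat period apart. This is a sketch under stated assumptions: the Sample type and its field names are hypothetical, and the positions are assumed to belong to the same binary log file (log rotation is ignored). The values themselves could come from the Position column of SHOW MASTER STATUS, the Read_Master_Log_Pos field of SHOW SLAVE STATUS, and the Slave_received_heartbeats status variable.

```python
from collections import namedtuple

# Hypothetical container for one observation of the master and the slave.
Sample = namedtuple("Sample", "master_pos slave_read_pos received_heartbeats")


def connection_state(previous, current):
    """Classify the replication link as "up" or "down" from two samples."""
    receiving_events = current.slave_read_pos > previous.slave_read_pos
    receiving_beats = current.received_heartbeats > previous.received_heartbeats
    master_ahead = current.master_pos > current.slave_read_pos

    # Rule 1: the master has new events but the slave is not getting them.
    if master_ahead and not receiving_events:
        return "down"
    # Rule 2: events or heartbeats are arriving, so the link is alive.
    if receiving_events or receiving_beats:
        return "up"
    # Rule 3: the master is idle and no heartbeats are arriving.
    return "down"


before = Sample(master_pos=1000, slave_read_pos=1000, received_heartbeats=10)
after = Sample(master_pos=1000, slave_read_pos=1000, received_heartbeats=15)
print(connection_state(before, after))  # idle master, heartbeats increasing
```

Taking the two samples further apart than the heartbeat period matters: otherwise an idle but healthy link could be reported as down simply because no beat was due yet.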


6 Comments


Gerad Coles

14 years ago

Hi Miguel,

Very useful article, but I’m confused: is the MASTER_HEARTBEAT_PERIOD in milliseconds or seconds? One part of the article states that you can configure the heartbeat in seconds, and the example shows the value ‘1’, which would be very aggressive if it were in milliseconds, yet you state that “MASTER_HEARTBEAT_PERIOD is a value in seconds in the range between 0 to 4294967 with resolution in milliseconds”.

Just want to make sure we get this right.

Also, what is the significance of ‘Slave_received_heartbeats’? In our 5.5 environment it seems to be a static number (such as 147, as you see above) — is this the number of missed beats? It doesn’t seem to increment at all.

Thanks!

Leonardo Cassan

14 years ago

Hi Miguel,

In my case it doesn’t work out of the box. I performed some tests simulating a network failure, and what I observed is that before the first heartbeat is considered failed, the TCP connection itself must fail. For that to happen, the system first waits for the TCP keepalive time (/proc/sys/net/ipv4/tcp_keepalive_time) and then sends /proc/sys/net/ipv4/tcp_keepalive_probes probes, one every /proc/sys/net/ipv4/tcp_keepalive_intvl seconds. Since the default for the first TCP keepalive is 2 hours (7200s), this means that even if you set MASTER_HEARTBEAT_PERIOD=1, when the network connection fails the system won’t restart the I/O thread for at least 2 hours, which is misleading. I checked this by simulating the network failure on the master and on the slave, and the behaviour is the same.
Just my contribution for all the other users that might experience an issue like this.

Cheers,

Leonardo

Flavian

12 years ago

This worked out really well in Percona 5.6 with Slave_last_heartbeat implemented. It also really helps in cross-DC replication, to automatically restart replication if the slave doesn’t get any heartbeat reply. It saved me from network blips and from the slave stopping without any error.

Suraj

12 years ago

Hi Miguel,

Very Good Article,

I set up the master-master replication as described in your tutorial, but when one server goes down the application does not connect to the secondary, which means failover is not working in my setup. I have one web application on 192.168.1.68, and I am using two database servers for it: Server 1 (192.168.1.126) and Server 2 (192.168.1.54). The web application primarily writes its data to 192.168.1.126, but if MySQL on Server 1 goes down it should automatically connect to Server 2 (192.168.1.54) without the web application user noticing. Please provide me with a solution. Waiting for your reply. Thanks, Suraj