September 1, 2014

Actively monitoring replication connectivity with MySQL’s heartbeat

Until MySQL 5.5 the only variable used to identify a network connectivity problem between Master and Slave was slave-net-timeout. This variable specifies the number of seconds to wait for more Binary Logs events from the master before abort the connection and establish it again. With a default value of 3600 this has been a historically bad configured variable and stalled connections or high latency peaks were not detected for a long period of time or not detected at all. We needed an active master/slave connection check. And here is where replication’s heartbeat can help us.

This feature was introduce in 5.5 as another parameter to the CHANGE MASTER TO command. After you enable it, the MASTER starts to send “beat” packages (of 106 bytes) to the SLAVE every X seconds where X is a value you can define. If the network link goes down or the latency goes up for more than the time threshold, then the SLAVE IO thread will disconnect and try to connect again. This means we now measure the connection time or latency, not the time without binary log events. We’re actively checking the communication.

How can I configure replication’s heartbeat?

Is very easy to setup with negligible overhead:

mysql_slave > STOP SLAVE;
mysql_slave > CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD=1;
mysql_slave > START SLAVE;

MASTER_HEATBEAT_PERIOD is a value in seconds in the range between 0 to 4294967 with resolution in milliseconds. After the loss of a beat the SLAVE IO Thread will disconnect and try to connect again. Here is the SHOW SLAVE STATUS output after an error:

mysql_slave > show slave status\G
[...]
Slave_IO_Running: Connecting
Slave_SQL_Running: Yes
[...]
Last_IO_Errno: 2003
Last_IO_Error: error reconnecting to master 'rsandbox@127.0.0.1:19972' - retry-time: 60 retries: 86400
[...]

Is interesting to note that having a 5.5 slave with replication’s heartbeat enabled and connected to a 5.1 master doesn’t break the replication. Of course, the heartbeat will not work in this case because the master doesn’t know what is a beat or how to send it :)

What status variables do I have?

The heartbeat check period time and the number of beats received.

mysql_slave > SHOW STATUS LIKE '%heartbeat%';
+---------------------------+-------+
| Variable_name | Value |
+---------------------------+-------+
| Slave_heartbeat_period | 1.000 |
| Slave_received_heartbeats | 1476 |
+---------------------------+-------+

Conclusion

If you need to know when exactly the connection between your Master/Slaves breaks then replication’s heartbeat is the easiest and fastest solution to implement.

About Miguel Angel Nieto

Miguel joined Percona in October 2011. He has worked as a System Administrator for a Free Software consultant and in the supporting area of the biggest hosting company in Spain. His current focus is improving MySQL and helping the community of Free Software to grow. Miguel's roles inside Percona are Senior Support Engineer and Manager of EMEA Support Team.

Comments

  1. Hi Miguel,

    Very useful article, but I’m confused — is the MASTER_HEARTBEAT_PERIOD in milliseconds or seconds? One part of the article states that you can configure the heartbeat in seconds, and the example shows the value ’1′, which would be very aggressive if it were in milliseconds, yet you state that “MASTER_HEATBEAT_PERIOD is a value in seconds in the range between 0 to 4294967 with resolution in milliseconds”.

    Just want to make sure we get this right.

    mysql> SHOW STATUS LIKE ‘%heartbeat%’;
    +—————————+——–+
    | Variable_name | Value |
    +—————————+——–+
    | Slave_heartbeat_period | 15.000 |
    | Slave_received_heartbeats | 147 |
    +—————————+——–+
    2 rows in set (0.00 sec)

    Also, what is the significance of ‘Slave_received_heartbeats’? In our 5.5 environment it seems to be a static number (such as 147, as you see above) — is this the number of missed beats? It doesn’t seem to increment at all.

    Thanks!

  2. Leonardo Cassan says:

    Hi Miguel,

    in my case it doesn’t work out of the box. I performed some tests simulating a network failure and what I observed is that, before considering the first heartbeat failed, it must fail the TCP connection. In order to do so, the system first wait for the TCP keepalive time (/proc/sys/net/ipv4/tcp_keepalive_time) and then it starts to perform /proc/sys/net/ipv4/tcp_keepalive_intvl attempts every /proc/sys/net/ipv4/tcp_keepalive_probes seconds. In total, since the default for the first TCP keepalive is 2 hours (7200s), this would mean that even if you set MASTER_HEARTBEAT_PERIOD=1, in case the network connection fails, the system won’t restart the I/O thread before 2 hours and this is misleading. I checked this simulating the network failure on the master and on the slave and the behaviour is the same.
    Just my contribution for all the other users that might experience an issue like this.

    Cheers,

    Leonardo

  3. Flavian says:

    This worked out really good in Percona 5.6 with Slave_last_heartbeat implemented. Also this really helps in cross dc replication to auto start replication if it doesn’t get any hearbeat reply. It just save me from network blips and slave stopping without any error.

  4. Suraj says:

    Hi Miguel,

    Very Good Article,

    I setup the Master-Master replication as you described in your tutorial but when 1 server goes down then it will not connect to secondary. Means Failover is not working in my structure. Here I have one Web application which is on 192.168.1.68 & I am using two database servers for this application Server 1 : 192.168.1.126 Server 2 : 192.168.1.54 Primarily web application dump the data in 192.168.1.126 but in case this mysql server 1 goes down then it should automatically connected to Server 2 : 192.168.1.54 without noticing to web application user. Please provide me the solustion. Waiting for your reply. Thanks, Suraj

  5. raj kumar says:

    in my case

    —————————+——–+
    | Variable_name | Value |
    +—————————+——–+
    | Slave_heartbeat_period | 30.000 |
    | Slave_received_heartbeats | 0 |
    +—————————+——–+

    slave recived heartbeats are not updating/increasing, what is the root cause in my case, kindly help me asap.

    Thanks

Speak Your Mind

*