Using pt-heartbeat with ProxySQL

ProxySQL and Orchestrator are usually deployed together to achieve high availability with MySQL replication. In a failover (or graceful takeover) scenario, Orchestrator promotes a slave and ProxySQL redirects the traffic. Depending on how your environment is configured, and how long the promotion takes, you could end up in a scenario where manual intervention is needed.

In this post, we are going to talk about some considerations when working with ProxySQL in combination with pt-heartbeat (part of Percona Toolkit), with the goal of making your environment more reliable.

Why Would We Want pt-heartbeat With ProxySQL?

If you have intermediate masters, the seconds_behind_master metric is not good enough. Slave servers that are attached to intermediate masters report seconds_behind_master relative to their own master, not the "real" top-level server receiving the writes. So it is possible ProxySQL will send traffic to second-level slaves that show no latency relative to their own master, but still have stale data. This happens when the intermediate master itself is lagging behind.

Another reason is that the latency value reported by SHOW SLAVE STATUS has a resolution of 1 second. Deploying pt-heartbeat gets us the real latency value in milliseconds, across the entire topology. Unfortunately, ProxySQL rounds the value to seconds, so we cannot fully take advantage of this at the time of this writing.

How Do I Deploy pt-heartbeat for a ProxySQL Environment?

ProxySQL since version 1.4.4 has built-in support to use pt-heartbeat. We only need to specify the heartbeat table as follows:
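A sketch of that configuration, run against the ProxySQL admin interface (port 6032 by default); the percona.heartbeat schema and table name is an assumption, so adjust it to wherever your heartbeat table lives:

```sql
-- Run against the ProxySQL admin interface.
-- 'percona.heartbeat' is an assumed schema.table name; adjust to your setup.
SET mysql-monitor_replication_lag_use_percona_heartbeat = 'percona.heartbeat';
LOAD MYSQL VARIABLES TO RUNTIME;
SAVE MYSQL VARIABLES TO DISK;
```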

Now we need to decide how to deploy pt-heartbeat to be able to update the heartbeat table on the master. The easiest solution is to install pt-heartbeat on the ProxySQL server itself. 

We need to create a file to store the pt-heartbeat configuration, e.g. /etc/percona-toolkit/pt-heartbeat-prod.conf:
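A minimal sketch of that file, assuming pt-heartbeat connects through the local ProxySQL instance on port 6033 as the monitor user, with the heartbeat table in the percona schema (all assumed values):

```ini
# /etc/percona-toolkit/pt-heartbeat-prod.conf
# Assumed host/port/credentials; adjust to your environment.
host=127.0.0.1
port=6033
user=monitor
password=monitor_password
database=percona
table=heartbeat
create-table
update
```

Percona Toolkit configuration files take one option per line, in option=value form, without the leading dashes.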

We point pt-heartbeat to go through ProxySQL and route its traffic to the writer hostgroup. In order to do this, we need a query rule:
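A sketch of such a rule, run against the ProxySQL admin interface; the rule_id, the writer hostgroup (10), the monitor username, and the percona.heartbeat table name are assumptions to adjust for your setup:

```sql
-- Route all queries from the pt-heartbeat user against the heartbeat
-- table to the writer hostgroup (assumed to be 10).
INSERT INTO mysql_query_rules (rule_id, active, username, match_pattern, destination_hostgroup, apply)
VALUES (50, 1, 'monitor', 'percona\.heartbeat', 10, 1);
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;
```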

In the example above, the writer hostgroup is 10 and the user pt-heartbeat connects as is monitor.

Next, we need to create a systemd service that will be in charge of making sure pt-heartbeat is always running. The caveat here is that when the master is set to read-only for any reason (e.g. a master switch), pt-heartbeat will stop (by design) and exit with a non-zero return code. So, we need to tell systemd to catch this and restart the daemon. The way to accomplish this is the Restart=on-failure directive.

Here’s a sample systemd unit script:
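A sketch of such a unit, assuming pt-heartbeat is installed at /usr/bin/pt-heartbeat and reads the configuration file created earlier (saved e.g. as /etc/systemd/system/pt-heartbeat-prod.service):

```ini
[Unit]
Description=pt-heartbeat (production cluster)
After=network.target

[Service]
Type=simple
ExecStart=/usr/bin/pt-heartbeat --config /etc/percona-toolkit/pt-heartbeat-prod.conf
# Restart on any non-zero exit, e.g. when the master goes read-only
# during a master switch and pt-heartbeat stops by design.
Restart=on-failure
RestartSec=5s

[Install]
WantedBy=multi-user.target
```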

We also specify RestartSec=5s, as the default of 100 ms sleep before restarting a service is overkill for this use case.

After the unit file is in place, we need a daemon-reload so systemd picks up the new service:
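```shell
systemctl daemon-reload
```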

At this point we can start and stop the service using:
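Assuming the unit file was named pt-heartbeat-prod.service:

```shell
systemctl start pt-heartbeat-prod
systemctl stop pt-heartbeat-prod
# Optionally, have it start on boot:
systemctl enable pt-heartbeat-prod
```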

Dealing With Master Takeover/Failover

Let’s assume slave servers are configured with max_replication_lag of 5 seconds. Usually, the failover process can take a few seconds, where pt-heartbeat might stop updating the heartbeat table. This means that the slave that is picked as the new master might report replication lag (as per pt-heartbeat), even if it was not really behind the master.

Now, we found out that ProxySQL does not automatically clear the max_replication_lag setting for a server when it becomes a master. When configured to use pt-heartbeat, it will (incorrectly) flag the new master as lagging and shun it, causing writes to this cluster to be rejected! This behavior happens ONLY when using pt-heartbeat; when using the show slave status method to detect latency, everything works as expected. You can check the bug report for more information.

For the time being, one way to deal with the above scenario is to write a script that monitors the mysql_servers table, and if it finds a writer node that has max_replication_lag configured, clear that value so it won’t be shunned.

Here’s some sample code:
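A hedged sketch of such a script. The admin credentials, the admin port (6032), and the writer hostgroup (10) are all assumptions; adjust them to your environment:

```shell
#!/bin/bash
# Sketch: clear max_replication_lag on writer nodes so ProxySQL
# will not shun a newly promoted master.
# Assumed values: admin interface on 127.0.0.1:6032, writer hostgroup 10.
ADMIN="mysql -h 127.0.0.1 -P 6032 -u admin -padmin"
WRITER_HG=10

# Count writer entries that still carry a max_replication_lag setting.
count=$($ADMIN -NB -e "SELECT COUNT(*) FROM mysql_servers
                       WHERE hostgroup_id=$WRITER_HG AND max_replication_lag > 0")

if [ "$count" -gt 0 ]; then
    $ADMIN -e "UPDATE mysql_servers SET max_replication_lag=0
               WHERE hostgroup_id=$WRITER_HG;
               LOAD MYSQL SERVERS TO RUNTIME;
               SAVE MYSQL SERVERS TO DISK;"
fi
```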

We can use ProxySQL’s scheduler to run our script every 5 seconds like this:
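Assuming the monitoring script is saved as /usr/local/bin/clear_writer_lag.sh (a hypothetical path), we can register it in the scheduler table via the ProxySQL admin interface:

```sql
-- interval_ms = 5000 runs the script every 5 seconds.
INSERT INTO scheduler (id, active, interval_ms, filename)
VALUES (1, 1, 5000, '/usr/local/bin/clear_writer_lag.sh');
LOAD SCHEDULER TO RUNTIME;
SAVE SCHEDULER TO DISK;
```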

Conclusion

We have seen how pt-heartbeat is useful to monitor the real latency for an environment using ProxySQL. We also discussed how to deal with the edge case of a new master being shunned because of latency, due to how ProxySQL and pt-heartbeat work.
