September 20, 2014

Your troubles are over, pt-heartbeat 2.1

Poor pt-heartbeat 2.1: it was perfectly fine through Percona Toolkit 2.1.7, but it’s had a rough life since. Its troubles are finally over as of Percona Toolkit 2.1.10. At a conference recently, a user joked to me that “Percona Toolkit is awesome, but you really broke pt-heartbeat.” It’s true, and if you haven’t heard or experienced the story, here it is for the record.

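Since the dawn of time, pt-heartbeat had computed slave lag as the difference between two timestamps: t1, written on the master and replicated to the slave, and t2, the current time when the slave’s copy is checked, so lag = t2 - t1.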

Surprisingly, that worked for many years despite what should have been an obvious problem: different time zones. t1 could be in PST and t2 in EST, and even if replication took effectively zero seconds, lag would be reported because 00:00 PST (t1) is 03:00 EST (t2).
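To make the time zone problem concrete, here is a minimal Python sketch (pt-heartbeat itself is written in Perl, and this is not its code). The names naive_lag and utc_lag are hypothetical, and PST and EST are modeled as fixed offsets of -8 and -5 hours.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset stand-ins for the two time zones (an assumption for illustration).
PST = timezone(timedelta(hours=-8), "PST")
EST = timezone(timedelta(hours=-5), "EST")


def naive_lag(t1_wall: datetime, t2_wall: datetime) -> float:
    """Subtract wall-clock readings while ignoring their time zones."""
    t1 = t1_wall.replace(tzinfo=None)
    t2 = t2_wall.replace(tzinfo=None)
    return (t2 - t1).total_seconds()


def utc_lag(t1_wall: datetime, t2_wall: datetime) -> float:
    """Subtract time zone aware values; Python compares the underlying instants."""
    return (t2_wall - t1_wall).total_seconds()


# One and the same instant, seen from both servers: zero real replication delay.
instant = datetime(2013, 1, 28, 8, 0, 0, tzinfo=timezone.utc)
t1 = instant.astimezone(PST)  # 00:00 PST on the master
t2 = instant.astimezone(EST)  # 03:00 EST on the slave

print(naive_lag(t1, t2))  # 10800.0 seconds of phantom lag
print(utc_lag(t1, t2))    # 0.0
```

The three hours of reported lag come entirely from the offset between the two wall clocks, which is why normalizing to UTC looked like the obvious fix.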

This problem was not reported until bug 886059 in PT 2.1.8. The solution jumped out at us: use only UTC timestamps. We did, and this is when the troubles of pt-heartbeat 2.1 began.

The change in 2.1.8 broke pt-heartbeat in two ways. First, using only UTC timestamps made 2.1.8 backwards-incompatible with 2.1.7 and earlier versions. We’re usually very attentive to maintaining backwards-compatibility within minor versions, but our failing in this case was the mistaken assumption that everyone running 2.1 would upgrade all their servers at once. This is not true: people run mixes of minor versions, relying on developers to maintain backwards-compatibility within the minor version. So those running 2.1.8 alongside 2.1.7 or earlier versions of pt-heartbeat ran into the very problem we were trying to fix, different time zones, because 2.1.8 used only UTC whereas 2.1.7 and earlier versions did not.

The second break was switching from Perl time to MySQL time, which caused a loss of precision. Most MySQL servers have only 1-second timestamps, like 2013-06-28 00:00:01, but Perl time (when using the Time::HiRes module) gives us microsecond timestamps like 2013-06-28 00:00:01.123456. Microsecond timestamps are required to achieve what the tool’s docs state:

pt-heartbeat has a maximum resolution of 0.01 second.

So those running 2.1.8 lost high-precision slave lag readings completely. This made the tool pretty much useless, because MySQL already reports slave lag at 1-second resolution in Seconds_Behind_Master (even though the way it computes that value is different from how pt-heartbeat computes slave lag).
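The precision loss can be illustrated with a small Python sketch. It is not pt-heartbeat’s code; to_whole_seconds is a hypothetical helper that models a timestamp column with 1-second resolution.

```python
from datetime import datetime, timedelta


def to_whole_seconds(ts: datetime) -> datetime:
    """Drop the fractional part, as a 1-second timestamp column would."""
    return ts.replace(microsecond=0)


def lag_seconds(written: datetime, read: datetime) -> float:
    """Lag between the written heartbeat and the moment it is read."""
    return (read - written).total_seconds()


true_delay = timedelta(milliseconds=250)

# Case 1: both readings fall within the same second.
w1 = datetime(2013, 6, 28, 0, 0, 1, 123456)
r1 = w1 + true_delay
print(lag_seconds(w1, r1))                                      # 0.25
print(lag_seconds(to_whole_seconds(w1), to_whole_seconds(r1)))  # 0.0

# Case 2: the same delay happens to cross a second boundary.
w2 = datetime(2013, 6, 28, 0, 0, 1, 900000)
r2 = w2 + true_delay
print(lag_seconds(w2, r2))                                      # 0.25
print(lag_seconds(to_whole_seconds(w2), to_whole_seconds(r2)))  # 1.0
```

With whole-second timestamps the same quarter-second delay shows up as either 0 or 1 second, so nothing finer than one second can be measured.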

We fixed those two bugs in PT 2.1.9, bug 1099665 and bug 1103221 respectively, by adding the --utc option. Now pt-heartbeat 2.1.9 was back to its original 2.1.7 self: microsecond timestamps from Perl were used again, and users could specify --utc to use UTC timestamps. Unfortunately, the troubles of pt-heartbeat 2.1 did not end there.

Certain MySQL functions are “time zone aware”, like UNIX_TIMESTAMP() for which the docs state:

The server interprets date as a value in the current time zone and converts it to an internal value in UTC.

So although we wrote UTC timestamps, we read them back in whatever the system time zone was. The fix: SET time_zone='+0:00' to truly make everything UTC. This was bug 1163372: “pt-heartbeat --utc --check always returns 0”. Given this change and those before it, bug 1137556, “pt-heartbeat docs don’t account for --utc”, was perhaps inevitable. Not only had we been breaking and fixing pt-heartbeat’s handling of time, but reading the docs would have given the wrong information, too.
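The effect of a time zone aware conversion can be modeled in a few lines of Python. This is only a rough stand-in for what the MySQL docs describe, not MySQL itself: unix_timestamp is a hypothetical function, and the session time zone is passed in explicitly as a fixed offset.

```python
from datetime import datetime, timedelta, timezone


def unix_timestamp(text: str, session_tz: timezone) -> int:
    """Interpret a timestamp string as local time in session_tz, return epoch seconds."""
    naive = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    return int(naive.replace(tzinfo=session_tz).timestamp())


# A heartbeat value that was written as UTC.
written_utc = "2013-01-28 08:00:00"

# Read back with the session time zone set to UTC: the intended value.
intended = unix_timestamp(written_utc, timezone.utc)

# Read back with a session time zone of -08:00: the same string is taken
# to be local time, so the resulting epoch value is shifted.
pst = timezone(timedelta(hours=-8))
misread = unix_timestamp(written_utc, pst)

print(misread - intended)  # 28800 seconds, i.e. 8 hours off
```

In this model, reading with a UTC session time zone is the only case where the written and the interpreted value agree, which is the role SET time_zone='+0:00' plays in the fix.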

Finally, those bugs have been fixed in Percona Toolkit 2.1.10 and I think it’s ok to say now: pt-heartbeat 2.1, your troubles are over. 2.1.10 will probably be the last release in the 2.1 series; users should upgrade to 2.2. pt-heartbeat 2.2.4 and later have all the fixes of pt-heartbeat 2.1.10.

Sorry pt-heartbeat 2.1, and sorry users of pt-heartbeat 2.1. We’ve learned from this tragic tale, and we hope never to repeat it.

Comments

  1. William says:

    Slightly off topic, but…

    What is up with the related searches in the side bar?

    They’re all related to 192.168.2.1 in different forms. I didn’t perform a search, just navigated to the page from the main page. Is it just the fact that 2.1 appears in the headline? That’s really odd.

    I would think that related articles would be ones about Percona toolkit or replication.

  2. Hi William, thanks for pointing this out. Yes, I agree it is odd. It looks like all of the version numbers used in the post are causing that. But I’ll see if I can think of anything else that might have done it.
