PagerDuty
Software
1000+
Percona XtraDB Cluster, Percona XtraBackup and Percona Toolkit
MySQL
PagerDuty runs its primary database infrastructure in the Amazon Elastic Compute Cloud (EC2). Prior to adopting Percona Server®, PagerDuty relied on MySQL Community Edition, with the data stored and synchronously replicated on a pair of DRBD servers. A number of async disaster recovery slaves were connected to the master DRBD server. The problem with this setup was that failover required a manual flip that typically resulted in at least 2 minutes of downtime, after which a cold server required even longer to spin up.
“Percona is well known in the industry and offers trusted software, so it just made sense to move to it instead of MySQL Community Edition,” said Doug Barth, operations engineer at PagerDuty. “We started with Percona Server and soon found that the clustering capabilities of Percona XtraDB Cluster offered a number of critical benefits.”
PagerDuty now hosts its primary database on Percona XtraDB Cluster, which integrates Percona Server with the Galera library of MySQL high availability solutions. PagerDuty used Percona XtraDB Cluster to create a three-node cluster in Amazon EC2. Currently, Percona is configured as a single instance, with HAProxy sitting between Percona and the client application to handle automatic failover to another instance. In the near future, PagerDuty will implement a multi-master replication configuration.
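The single-instance arrangement described above can be pictured with a minimal Python sketch. It is an illustration of the routing idea only, not PagerDuty's HAProxy configuration: the node names `db1`, `db2` and `db3` and the function `pick_node` are hypothetical, and the sketch assumes the proxy already knows which nodes pass their health checks.

```python
from typing import AbstractSet, Optional, Sequence

# Hypothetical node names, listed in order of preference.
NODES = ["db1", "db2", "db3"]


def pick_node(nodes: Sequence[str], healthy: AbstractSet[str]) -> Optional[str]:
    """Return the first healthy node in preference order, or None if all are down."""
    for node in nodes:
        if node in healthy:
            return node
    return None


# All nodes healthy: traffic goes to the preferred node.
print(pick_node(NODES, {"db1", "db2", "db3"}))
# Preferred node down: new connections move to the next node.
print(pick_node(NODES, {"db2", "db3"}))
```

Because every client is sent to the same node while it is healthy, the cluster behaves like a single instance from the application's point of view, and failover is simply a matter of the next node in the list taking over.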
“Percona has performed very well in production,” said Barth. “We’ve had no issues with the software, which gives us a high degree of confidence to move forward with our multi-master strategy.”
Percona Technologies
Percona XtraDB Cluster provides PagerDuty with a variety of critical benefits. For example, in the event of failover, instead of the significant downtime and the performance hit that occurred with DRBD and MySQL Community Edition, Barth can simply mark the current preferred master as down, so new connections go to a different node. Percona XtraDB Cluster can then be restarted on the master with no downtime or performance hit. This is then repeated until all three nodes are restarted. “Our Percona XtraDB Cluster remains up the entire time, and there’s no impact on customers,” said Barth.
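The restart sequence Barth describes can be sketched as an ordered plan. The following Python is a rough illustration under stated assumptions, not PagerDuty's tooling: the function `rolling_restart_plan`, the action names, and the node names are hypothetical, and the final `mark_up` step is an assumption the case study does not spell out. The majority check reflects the fact that a Galera-based cluster needs more than half of its nodes online to keep accepting writes.

```python
from typing import List, Sequence, Tuple


def rolling_restart_plan(nodes: Sequence[str]) -> List[Tuple[str, str]]:
    """Build an ordered list of (action, node) steps that restarts one node at a time."""
    remaining = len(nodes) - 1
    # With one node out, the nodes left must still be a strict majority.
    if remaining * 2 <= len(nodes):
        raise ValueError("restarting a node would cost the cluster its majority")

    plan: List[Tuple[str, str]] = []
    for node in nodes:
        plan.append(("mark_down", node))  # new connections go to another node
        plan.append(("restart", node))    # restart the database on this node
        plan.append(("mark_up", node))    # assumed: return the node to service
    return plan


# Hypothetical three-node cluster, as in the case study.
for action, node in rolling_restart_plan(["db1", "db2", "db3"]):
    print(action, node)
```

With three nodes, two remain in service during each restart, which is why the cluster as a whole stays up. The same function refuses a two-node cluster, where taking one node down would leave no majority.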
Higher Performance Storage – The DRBD pair used Amazon Elastic Block Store (EBS) to store data. While EBS is reliable for data storage, companies can experience extended downtime due to network issues. Because Percona XtraDB Cluster makes it easy to spin up a new cluster node and grab data from another existing node, PagerDuty had the confidence to move to Amazon EC2 Instance Store instead of EBS. EC2 Instance Store is local disk storage for the Percona Server virtual machine, and therefore eliminates network issues, but it is “ephemeral” storage: it is available only during the lifetime of the host virtual machine. “This is a huge benefit to our Percona cluster,” said Barth. “We are no longer dependent on EBS to be up and reliable in order for our database to be up and reliable. With Instance Store, our data is on a device that’s physically attached to the computer running our virtual machine, yet we have the confidence that with three active Percona XtraDB Cluster nodes, our data is safe.”
Asynchronous Slave Replication – Percona XtraDB Cluster allows async slaves to be moved between any of the cluster nodes with no impact on performance. Because of the way the cluster annotates the binary logs, replication can begin immediately from a known coordinate point on any node. “Our database is now large enough that a restore from a MySQL dump would take 15 hours and from a binary backup, it would take 2 to 3 hours. The great benefit for us is that now, by simply calculating the coordinates on another node, we can be up and running in a few minutes,” said Barth.
As a secondary backup strategy, PagerDuty uses Percona XtraBackup to create full and incremental binary backups of its database on a dedicated backup host. Once again, the key benefit is recovery time. If necessary, this binary restore would take just a couple of hours instead of the 15 hours required for a MySQL dump.
PagerDuty uses a variety of tools from the Percona Toolkit. For example, pt-query-digest runs every night to identify slow queries; pt-heartbeat monitors async slaves; and pt-slave-delay manages a delayed slave setup for “fat finger errors.”
PagerDuty helps businesses increase uptime with smart alerting and powerful on-call scheduling. With seamless integrations across monitoring tools such as Splunk, New Relic, Nagios, and Zenoss, PagerDuty enables IT and DevOps professionals to reduce response times by ensuring alerts get to the right person, faster.