This post is a follow-up to my November 19 webinar, “Tips from the Trenches: A Guide to Preventing Downtime for the Over-Extended DBA,” during which I described some of the most common reasons DBAs experience avoidable downtime. The session was aimed at the “over-stretched DBA”: the MySQL DBA who is short of time, or an engineer from another discipline without in-depth knowledge of MySQL. The over-stretched DBA may be prone to fundamental mistakes that cause downtime: poor response times, operations that block access to important data, or administrative mishaps stemming from a lack of best-practice monitoring and alerting. (You can download my slides and view the recorded webinar here.)
Monitor the things
One of the aids to keeping the system up and running is keeping your finger on the pulse of the environment. Here on the Percona Managed Services team, we leverage Percona Monitoring Plugins (open source plugins for Nagios, Cacti and Zabbix) to ensure we have visibility of our clients’ operations. Having a handle on basics such as disk space, memory usage and MySQL operational metrics helps us avoid trivial downtime that would affect a client’s uptime or, worse, their bottom line.
One of the most common reasons that an application is unable to serve data to its end user is that access to a table is being blocked due to another ongoing operation. This can be blamed on a variety of sources: backups, schema changes, poor configuration and long running transactions can all lend themselves to costly blocking. Understanding the impact of actions on a MySQL server can be the difference between a happy end user and a frustrated one.
During the webinar I referred to some resources and techniques that can help the over-extended DBA avoid downtime. Here are some highlights.
Monitoring and Alerting
It’s important that you have some indications that something is reaching its capacity. It might be the disk, connections to MySQL or auto_increment limit on a highly used table. There is quite the landscape to cover but here are a handful of helpful tools:
* Percona Monitoring Plugins
* New Relic
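As a rough illustration of the kind of capacity check described above, here is a minimal Python sketch that estimates how much of an auto_increment range has been consumed and maps it to an alert level. The function names and the 80%/90% thresholds are assumptions made for this example; they are not taken from Percona Monitoring Plugins or any other tool listed here.

```python
# Maximum values for MySQL integer column types: (signed, unsigned).
INT_LIMITS = {
    "tinyint": (127, 255),
    "smallint": (32767, 65535),
    "mediumint": (8388607, 16777215),
    "int": (2147483647, 4294967295),
    "bigint": (9223372036854775807, 18446744073709551615),
}


def auto_increment_usage(column_type, next_value, unsigned=False):
    """Return the fraction of the column's range already consumed."""
    signed_max, unsigned_max = INT_LIMITS[column_type.lower()]
    limit = unsigned_max if unsigned else signed_max
    return next_value / limit


def alert_level(used_fraction, warn=0.80, crit=0.90):
    """Map a usage fraction to an alert level (thresholds are illustrative)."""
    if used_fraction >= crit:
        return "CRITICAL"
    if used_fraction >= warn:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # Hypothetical table whose signed INT key is close to its limit.
    usage = auto_increment_usage("int", 2000000000)
    print(f"{usage:.1%} used: {alert_level(usage)}")
```

The same alert_level helper could be applied to any metric expressed as a fraction of its limit, such as disk usage or connections relative to max_connections.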
Poorly performing SQL can be indicative that the configuration is incorrect, that there’s a missing index or that your development team needs a quick lesson on MySQL anti-patterns. Arm yourself with proof that the SQL statements are substandard using these resources and work with the source to make things more efficient:
* Percona Cloud Tools
* pt-query-digest, EXPLAIN, indexes
If you need to ensure that your application survives hiccups such as hardware failure or network impairment, a well deployed HA solution will give you the peace of mind that you can quickly mitigate bumps in the road.
* Percona XtraDB Cluster, Galera
* Percona Replication Manager
A wise man once said, “A backup today saves you tomorrow.” Covering all bases can be the difference between recovering from a catastrophic failure and job hunting. Mixing logical, physical and incremental backups, and adding some offsite copies, gives you a safety net whether you face a small mistake like a dropped table or, worse, the loss of all working copies of data and backups in a SAN failure. It happens, so be prepared.
* Percona XtraBackup
* mysqlbinlog (5.6)
We had some great questions from the attendees and regrettably were unable to answer them all, so here are some of them with my response.
Q: I use MySQL on Amazon RDS. Aren’t many of the operations automated, or do these tips still apply?
A: It’s not completely automated. There are still challenges to address and configuration opportunities, but understanding the limitations of RDS is key. For example, the location and size of the tmpdir is something you are unable to customise on RDS. You would typically review this config in a production environment if your workload required it. Any costly queries that perform operations requiring tmp area to sort (think OLAP) might not be a good fit on RDS due to this limitation. Getting to know the limitations around hosted or DBaaS services is time well spent to avoid explaining what keeps taking the application down in peak hours.
Q: What other parts of Percona Toolkit do you recommend for MySQL operations?
A: Percona Toolkit is a well-evolved suite of tools that all MySQL DBAs should familiarize themselves with. In particular, I fit many of the tools into my weekly workflow:
- pt-duplicate-key-checker
The key with Percona Toolkit is that many common tasks or problems that could cause you to reinvent the wheel are covered, mature and production ready. As with any tool, you should always read the label or in this case the documentation so you’re well aware what the tools can do, the risks and the features that you can make use of.
Q: HA – are there any solutions that you would stay away from?
A: Adopting any particular HA solution is going to be another R&D exercise. You will need to understand the tradeoffs and configuration options, and compare products. Some might have a higher TCO or lack functionality. Once the chosen solution is implemented, it’s important that the engineers understand the technology well enough to troubleshoot it and to use its functionality when a failover needs to be initiated. I prefer HA solutions that fail over quickly; some entail starting MySQL from cold, which slows recovery.
Q: You mentioned having tested backups. How do you perform this?
A: Percona’s method is to use a dedicated host with access to the backup files. Then, with a combination of MySQL Sandbox and pt-table-checksum, we can determine whether we can trust the files we capture for disaster recovery. Many people underestimate the importance of this task.
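To make the verification idea concrete, here is a minimal Python sketch of the comparison step only. It assumes you already have per-table checksums recorded at backup time and per-table checksums from the restored instance, however they were gathered. The function name and the dictionary layout are hypothetical and are not part of MySQL Sandbox or pt-table-checksum.

```python
def compare_checksums(expected, restored):
    """Compare per-table checksums and return a dict of problems found.

    Both arguments map "schema.table" names to checksum values.
    An empty result means every expected table was restored intact.
    """
    problems = {}
    for table, checksum in expected.items():
        if table not in restored:
            problems[table] = "missing from restore"
        elif restored[table] != checksum:
            problems[table] = "checksum mismatch"
    return problems


if __name__ == "__main__":
    # Hypothetical checksum values for illustration only.
    expected = {"shop.orders": "a1b2", "shop.users": "c3d4"}
    restored = {"shop.orders": "a1b2", "shop.users": "ffff"}
    for table, issue in compare_checksums(expected, restored).items():
        print(f"{table}: {issue}")
```

A non-empty result is a signal to investigate the backup before it is needed for a real recovery.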
Q: Percona Cloud Tools – how much does it cost?
A: Right now it’s a free service. Visit cloud.percona.com for more information, but in a nutshell Percona Cloud Tools is a hosted service providing access to query performance insights for all MySQL users.
Q: Is there API access to Percona Cloud Tools for application integration?
A: There is currently not a public API available. It is on the roadmap, though. We’d be interested to hear more about your use case so please sign up for the service and try it out. After signing in, all pages include a Feedback link to share your thoughts and ideas such as how you’d like to use a public API.
Q: Can you use MHA with Percona XtraDB Cluster?
A: MHA is not something that can be used with Percona XtraDB Cluster (PXC). It’s common to partner PXC with HAProxy for making sure your writes are going to the appropriate node.
Q: Can MHA make automatic failover? If MHA has automatic failover, what do you recommend? Configure it for automatic failover?
A: MHA can perform automatic failover. Personally, I prefer managed failover. When working with automated failover, it’s important that failback is manual to avoid “flapping.” “Split brain” is another ailment you don’t want to suffer from, and automatic failover removes human judgment from the decision to relocate operations from a failed node onto a standby node. If you do opt for automatic failover, test all potential failure scenarios and employ a STONITH method to ensure that the unresponsive node is not serving read/write traffic.
Q: What is the best way to detect database blocking from DML statements? Is there a tool that will show blocking after the fact so you don’t have to catch it real-time?
A: Percona Toolkit includes pt-deadlock-logger, which detects and logs deadlocks. Lock waits can be inspected using “SHOW ENGINE INNODB STATUS” or the information_schema.innodb_locks table. Some engineering might be required to log this over time so it can be reviewed after the fact, but those resources exist for use.
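As one possible shape for that engineering, here is a minimal Python sketch. It pairs a query against information_schema.innodb_lock_waits and information_schema.innodb_trx (companion tables to innodb_locks in MySQL 5.5 through 5.7; MySQL 8.0 moved this data to performance_schema) with a small formatter that turns the result rows into log lines. Using these extra tables, the column aliases and the function name are all assumptions made for this example. Running the query on a schedule and appending the output to a file is left to whatever client and scheduler you already use.

```python
# Query pairing each waiting transaction with the transaction blocking it.
# Applies to MySQL 5.5-5.7; these tables were removed in MySQL 8.0.
LOCK_WAIT_QUERY = """
SELECT r.trx_mysql_thread_id AS waiting_thread,
       r.trx_query           AS waiting_query,
       b.trx_mysql_thread_id AS blocking_thread,
       b.trx_query           AS blocking_query
FROM information_schema.innodb_lock_waits w
JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id
"""


def format_lock_waits(rows, timestamp):
    """Turn result rows (dicts keyed by the aliases above) into log lines."""
    lines = []
    for row in rows:
        # A blocker that is idle inside a transaction has no current query.
        blocker = row["blocking_query"] or "(idle in transaction)"
        lines.append(
            f"{timestamp} thread {row['waiting_thread']} "
            f"blocked by thread {row['blocking_thread']}; "
            f"waiting: {row['waiting_query']}; blocker: {blocker}"
        )
    return lines


if __name__ == "__main__":
    # Hypothetical row, as a dict-returning MySQL cursor might produce it.
    sample = [{
        "waiting_thread": 12,
        "waiting_query": "UPDATE t SET a=1",
        "blocking_thread": 7,
        "blocking_query": None,
    }]
    for line in format_lock_waits(sample, "2014-11-19 10:00:00"):
        print(line)
```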
Q: Since you mentioned tinkering with ELK I was wondering if you had any tips on good Kibana dashboards to build to monitor MySQL databases/clusters?
A: ELK is something that I’m looking to publish some information on soon so watch this space!
Thanks again everyone for the great questions! And as a reminder, you can download my slides and view the recorded webinar here.