So your MySQL server has crashed. What do you do now? When a server is down, in my opinion, there are two steps that are essential and both are extremely important and neither should be neglected:
Too many people rush to Step #2 and lose pertinent diagnostics from Step #1. Likewise, too many people will spend too much time on Step #1 and delay getting to Step #2 and restoring service. The goal is to collect diagnostics as quickly as possible for later review while getting service restored as fast as possible.
As a Technical Account Manager (TAM) and assisting on server restoration calls, I have seen both issues at play. Technical resources have a tendency to get so bogged down in trying to understand the cause of the server outage that they forget that the downtime is costing the business money. The desire to crawl through server logs, review metrics, pour-over system metrics, and so on, can be too tempting for some who are concerned that important diagnostic data will be lost when service is restored. This is a valid concern, but there must be a middle ground.
Conversely, many, especially those in management, will demand service is restored immediately to continue business functions. Of course, after the service is back up, the demand for an RCA will come. Sadly, many metrics, and some logs, are lost when a server is bounced. Below are basic guidelines on what metrics to collect for MySQL. The steps are in no particular order.
|
1 |
sudo cp /path/to/datadir/*.log /some/where/safe |
|
1 |
sudo cp /path/to/my.cnf /some/where/safe |
|
1 |
sudo cp /var/log/syslog /some/where/safe/syslog<br>sudo cp /var/log/messages /some/where/safe/messages<br>sudo journalctl -e > /some/where/safe/journalctl.txt |
|
1 |
sudo mysqladmin -i10 -c10 proc > /some/where/safe/mysql_procs.txt<br>mysql> SHOW GLOBAL VARIABLES;<br>sudo mysqladmin -i10 -c10 ext > /some/where/safe/mysql_ext.txt<br>mysql> SHOW ENGINE INNODB STATUSG |
|
1 |
sudo ./pt-stalk --no-stalk --iterations=2 --sleep=30 --dest=/some/where/safe -- --user=root --password=<mysql-root-pass>; |
|
1 |
sudo cp -R /path/to/datadir /some/where/safe/datadir |
|
1 |
sudo cp /path/to/data/dir/GRA* /some/where/safe/datadir/ |
|
1 |
sudo mpstat -a 1 60 > /some/where/safe/mpstat.txt<br>sudo vmstat 1 60 > /some/where/safe/vmstat.txt<br>sudo iostat -dmx 1 60 > /some/where/safe/iostat.txt |
|
1 |
sudo cat /proc/cpuinfo > /some/where/safe/cpuinfo.txt |
|
1 |
sudo pt-summary > /some/where/safe/pt-summary.txt<br>sudo pt-mysql-summary > /some/where/safe/pt-mysql-summary.txt |
|
1 |
# disk info<br>sudo df -k > /some/where/safe/df_k.txt<br>sudo lsblk -o KNAME,SCHED,SIZE,TYPE,ROTA > /some/where/safe/lsblk.txt<br>sudo lsblk --all > $PTDEST/lsblk-all;<br><br># lv/pv/vg only for systems with LVM<br>sudo lvdisplay --all --maps > /some/where/safe/lvdisplau-all-maps.txt<br>sudo pvdisplay --maps > /some/where/safe/pvdisplay-maps.txt<br>sudo pvs -v > /some/where/safe/pvs_v.txt<br>sudo vgdisplay > /some/where/safe/vgdisplay.txt<br><br># nfsstat for systems with NFS mounts <br>sudo nfsstat -m > /some/where/safe/nfsstat_m.txt<br>sudo nfsiostat 1 120 > /some/where/safe/nfsiostat.txt<br><br># Collect hardware information <br>sudo dmesg > /some/where/safe/dmesg.txt<br>sudo dmesg -T free -m > /some/where/safe/dmesg_free.txt <br>sudo dmesg -T > /some/where/safe/dmesg_t.txt<br>sudo ulimit -a > /some/where/safe/ulimit_a.txt<br>sudo cat /proc/sys/vm/swappiness > /some/where/safe/swappiness <br>sudo numactl --hardware > /some/where/safe/numactl-hardware.txt<br> |
It goes without saying, it would be best to script the above into a useful bash script you can run when there is an issue. Just be sure to test the script in advance of an issue.
Again, the goal is to preserve useful diagnostic data that could be useful for determining the root cause of the issue at a later time after the service is restored. Just don’t get caught up in looking through the above diagnostics! Certainly, more data is better but the above is a great starting point. As time goes on, you may realize you wish you had other metrics and can add them to your script or Standard Operating Procedure (SOP).
Naturally, adding monitoring like Percona Monitoring and Management (PMM) would be a great option that can save you a lot of time and collect even more trends over time which can be extremely helpful.
With the above diagnostics, you would have a ton of information in the event of an issue to find the root cause. Now, you can sort through the diagnostics. Of course, if you need help with that, Percona can help you here as well.
Percona Distribution for MySQL is the most complete, stable, scalable, and secure, open-source MySQL solution available, delivering enterprise-grade database environments for your most critical business applications… and it’s free to use!
Resources
RELATED POSTS