So your MySQL server has crashed. What do you do now? When a server is down, in my opinion, there are two steps that are essential and both are extremely important and neither should be neglected:
- Save diagnostic information for determining the root cause analysis (RCA).
- Get the server back up and running.
Too many people rush to Step #2 and lose pertinent diagnostics from Step #1. Likewise, too many people will spend too much time on Step #1 and delay getting to Step #2 and restoring service. The goal is to collect diagnostics as quickly as possible for later review while getting service restored as fast as possible.
As a Technical Account Manager (TAM) and assisting on server restoration calls, I have seen both issues at play. Technical resources have a tendency to get so bogged down in trying to understand the cause of the server outage that they forget that the downtime is costing the business money. The desire to crawl through server logs, review metrics, pour-over system metrics, and so on, can be too tempting for some who are concerned that important diagnostic data will be lost when service is restored. This is a valid concern, but there must be a middle ground.
Conversely, many, especially those in management, will demand service is restored immediately to continue business functions. Of course, after the service is back up, the demand for an RCA will come. Sadly, many metrics, and some logs, are lost when a server is bounced. Below are basic guidelines on what metrics to collect for MySQL. The steps are in no particular order.
- Save a copy of the MySQL Error Log.
Shell1sudo cp /path/to/datadir/*.log /some/where/safe
- Make a copy of the MySQL configuration file.
Shell1sudo cp /path/to/my.cnf /some/where/safe
- Make a copy of system logs and save them somewhere on persistent storage in a location that will not be overwritten. Consider doing something like the following on Linux:
Shell123sudo cp /var/log/syslog /some/where/safe/syslogsudo cp /var/log/messages /some/where/safe/messagessudo journalctl -e > /some/where/safe/journalctl.txt
- If MySQL is running still and you can log in, get some MySQL metrics. You will want to save the output into files somewhere.
Shell1234sudo mysqladmin -i10 -c10 proc > /some/where/safe/mysql_procs.txtmysql> SHOW GLOBAL VARIABLES;sudo mysqladmin -i10 -c10 ext > /some/where/safe/mysql_ext.txtmysql> SHOW ENGINE INNODB STATUS\G
- If MySQL is running and you have Percona Toolkit, you should collect some pt-stalk output.
Shell1sudo ./pt-stalk --no-stalk --iterations=2 --sleep=30 --dest=/some/where/safe -- --user=root --password=<mysql-root-pass>;
- If you have space and time, a copy of the database files (data directory in MySQL) could be helpful. Certainly, for many installations, getting all of the data files will be impossible. If it is a small database and space and time allow, it can be best to get all the files just in case.
Shell1sudo cp -R /path/to/datadir /some/where/safe/datadir
- Copy database logs and save them somewhere safe for later review. Systems like Percona XtraDB Cluster (PXC) will create GRA files during an issue which can be really helpful to look at to determine the root cause. By combining the GRA header file with the contents of the GRA log files, you can use the mysqlbinlog command to get the records of transactions causing issues. More information can be found in one of our older blogs here
Percona XtraDB Cluster (PXC): what about GRA_*.log files?.
Shell1sudo cp /path/to/data/dir/GRA* /some/where/safe/datadir/
- Save system metrics pertaining to CPU, I/O, and memory usage:
Shell123sudo mpstat -a 1 60 > /some/where/safe/mpstat.txtsudo vmstat 1 60 > /some/where/safe/vmstat.txtsudo iostat -dmx 1 60 > /some/where/safe/iostat.txt
- Save system info.
Shell1sudo cat /proc/cpuinfo > /some/where/safe/cpuinfo.txt
- If you have Percona Toolkit, the following would be very helpful:
Shell12sudo pt-summary > /some/where/safe/pt-summary.txtsudo pt-mysql-summary > /some/where/safe/pt-mysql-summary.txt
- Get hardware diagnostics.
Shell12345678910111213141516171819202122# disk infosudo df -k > /some/where/safe/df_k.txtsudo lsblk -o KNAME,SCHED,SIZE,TYPE,ROTA > /some/where/safe/lsblk.txtsudo lsblk --all > $PTDEST/lsblk-all;# lv/pv/vg only for systems with LVMsudo lvdisplay --all --maps > /some/where/safe/lvdisplau-all-maps.txtsudo pvdisplay --maps > /some/where/safe/pvdisplay-maps.txtsudo pvs -v > /some/where/safe/pvs_v.txtsudo vgdisplay > /some/where/safe/vgdisplay.txt# nfsstat for systems with NFS mountssudo nfsstat -m > /some/where/safe/nfsstat_m.txtsudo nfsiostat 1 120 > /some/where/safe/nfsiostat.txt# Collect hardware informationsudo dmesg > /some/where/safe/dmesg.txtsudo dmesg -T free -m > /some/where/safe/dmesg_free.txtsudo dmesg -T > /some/where/safe/dmesg_t.txtsudo ulimit -a > /some/where/safe/ulimit_a.txtsudo cat /proc/sys/vm/swappiness > /some/where/safe/swappinesssudo numactl --hardware > /some/where/safe/numactl-hardware.txt
It goes without saying, it would be best to script the above into a useful bash script you can run when there is an issue. Just be sure to test the script in advance of an issue.
Again, the goal is to preserve useful diagnostic data that could be useful for determining the root cause of the issue at a later time after the service is restored. Just don’t get caught up in looking through the above diagnostics! Certainly, more data is better but the above is a great starting point. As time goes on, you may realize you wish you had other metrics and can add them to your script or Standard Operating Procedure (SOP).
Naturally, adding monitoring like Percona Monitoring and Management (PMM) would be a great option that can save you a lot of time and collect even more trends over time which can be extremely helpful.
With the above diagnostics, you would have a ton of information in the event of an issue to find the root cause. Now, you can sort through the diagnostics. Of course, if you need help with that, Percona can help you here as well.