One of the things I like about consulting at Percona is the opportunity to be exposed to unusual problems. I recently worked with a customer who could not get SST (State Snapshot Transfer) to work with Percona XtraDB Cluster. A simple problem, you would think. After four hours of debugging, my general feeling was that nothing made sense.
I added a bash trace to the SST script and it claimed MySQL died prematurely:
+ [[ -n '' ]]
+ ps -p 11244
+ wsrep_log_error 'Parent mysqld process (PID:11244) terminated unexpectedly.'
+ wsrep_log '[ERROR] Parent mysqld process (PID:11244) terminated unexpectedly.'
++ date '+%Y-%m-%d %H:%M:%S'
+ local readonly 'tst=2017-11-28 22:02:46'
At the same time, the MySQL error log complained that the SST script had died:
2017-11-28 22:02:46 11244 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '172.31.4.179' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --parent '11244' '' : 32 (Broken pipe)
2017-11-28 22:02:46 11244 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
2017-11-28 22:02:46 11244 [ERROR] WSREP: SST script aborted with error 32 (Broken pipe)
2017-11-28 22:02:46 11244 [ERROR] WSREP: SST failed: 32 (Broken pipe)
2017-11-28 22:02:46 11244 [ERROR] Aborting
Clearly, something odd was at play. But what? At that point, I decided to try a few operations as the mysql user. Finally, I stumbled onto something:
[root@db-01 mysql]# su mysql -
bash-4.2$ ps fax
  PID TTY      STAT   TIME COMMAND
11901 pts/0    S      0:00 bash -
11902 pts/0    R+     0:00  \_ ps fax
There are way more than 100 processes on these servers, so why can’t the mysql user see them? The SST script monitors the state of its parent process using “ps”. Look at the bash trace above: 11244 is the mysqld PID. When “ps” cannot see that PID, the script concludes that mysqld has terminated and aborts, and mysqld in turn reports a failed SST. After a little Googling, I found this blog post about the /proc hidepid mount option, which restricts which processes a user can see under /proc. Sure enough, the customer was using this option:
[root@db-02 lib]# mount | grep '^proc'
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime,hidepid=2)
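To check several nodes for this option without reading the mount output by eye, the hidepid value can be extracted from the options string. This is a minimal sketch, assuming the `mount` output format shown above; the function name `hidepid_of` is hypothetical, and a missing option is reported as 0, the kernel default.

```shell
#!/bin/sh
# Hypothetical helper: print the hidepid value found in one line of `mount` output.
# Prints 0 when the option is absent, since 0 is the kernel default.
hidepid_of() {
    line=$1
    case $line in
        *hidepid=*)
            value=${line##*hidepid=}   # drop everything up to and including "hidepid="
            value=${value%%,*}         # drop any options that follow
            value=${value%%')'*}       # drop the closing parenthesis
            printf '%s\n' "$value"
            ;;
        *)
            printf '0\n'
            ;;
    esac
}
```

With the function sourced, `hidepid_of "$(mount | grep '^proc')"` reports the setting of the live /proc mount. Any value other than 0 is worth a closer look when an SST fails with the symptoms described here.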
I remounted /proc with hidepid=0, which disables the process hiding, on all the nodes:
mount -o remount,rw,nosuid,nodev,noexec,relatime,hidepid=0 /proc
This simple command solved the issue. The SST scripts started to work normally. A good lesson learned: do not overlook security settings!
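To confirm the fix from the SST script's point of view, the same `ps -p` probe that appears in the bash trace can be run as the mysql user. This is a minimal sketch: the function names `pid_visible` and `report_pid` are hypothetical, and the code only mirrors the visibility check, not the actual logic of wsrep_sst_xtrabackup-v2.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given PID is visible to the current user.
# Uses the same `ps -p` probe that shows up in the SST script's bash trace.
pid_visible() {
    ps -p "$1" >/dev/null 2>&1
}

# Print a human-readable verdict for the given PID.
report_pid() {
    if pid_visible "$1"; then
        echo "PID $1 is visible"
    else
        echo "PID $1 is NOT visible (check the hidepid option on /proc)"
    fi
}
```

Sourced in a shell opened as the mysql user, `report_pid` followed by the mysqld PID (11244 in the trace above) shows whether the joiner script would be able to see its parent process.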