Debugging MariaDB Galera Cluster SST Problems – A Tale of a Funny Experience

February 12, 2019
Author
Francisco Bordenave
Share this Post:

MariaDB galera cluster starting timeRecently, I had to work on an emergency for a customer who was having a problem restarting a MariaDB Galera Cluster. After a failure in the cluster they decided to restart the cluster entirely following the right path: bootstrapping the first node, and then adding the rest of the members, one by one. Everything went fine until one of the nodes rejected the request to join the cluster.

Given this problem, the customer decided to ask us to help with the problematic node because none of the tests they did worked, and the same symptom repeated over and over: SST started, copied few gigs of data and then it just hanged (apparently) while the node remained out of the cluster.

Identifying the issue…

Once onboard with the issue, initially I just checked that the cluster was trying a SST: given the whole dataset was about 31GB I decided to go directly to a healthy solution: to clean up the whole datadir and start afresh. No luck at all, the symptom was exactly the same no matter what I tried:

After reviewing the logs I noticed few strange things. In the joiner:

In the donor (output has been a obfuscated to avoid sharing private info and the times are not matching deliberately):

So SST starts correctly and then failed. I tried forcing different donors, checked firewall rules, etc. Nothing.

Additionally I noticed that the process was starting over and over, while monitoring, .ssh folder was growing up to certain size (something around 7GB) and then would start over. The logs kept showing the same messages, the init script failed with an error but the process kept running either until I executed service mysql stop or kill -9 to all processes. It was getting stranger every minute.

At this point I was totally lost, scratching my head looking for solutions. More strange still was that trying a manual SST using netcat worked perfectly! So it was definitely a problem with the init script. Systemd journal was not providing any insight…

And then…

MariaDB Cluster dies in the SST process after 90 seconds

Suddenly I noticed that the failure was happening roughly 90 seconds after the start. A short googling later—doing more specific search—I found this page:
https://mariadb.com/kb/en/library/systemd/#ssts-and-systemd which explained precisely my problem.

The MariaDB init script has changed its timeout from 900 seconds to 90 while MySQL Community and Percona Server has this value set to 15 mins. Also it seems that this change has caused some major issues with nodes crashing as documented in MDEV-15607 — the bug is reported as fixed but we still can see timeout problems.

I observed that in case of failure, systemd was killing the mysqld process but not stopping the service. This results in an infinite SST loop that only stops when the service is killed or stopped via systemd command.

The fix was super easy, I just needed to create a file to be used by systemd that sets the timeout to a more useful value as follows:

As you may notice I set the timeout to 15 minutes but I could set it to any time. That was it, the next SST will have plenty of time to finish. Reference to this change is very well documented here

On reflection…

One could argue about this change, and I’m still having some internal discussions about it. In my opinion, a 90 seconds timeout is too short for a Galera cluster. It is very likely that almost any cluster will reach that timeout during SST. Even a regular MySQL server that suffers a crash with a high proportion of dirty pages or many operations to rollback, 90 seconds doesn’t seem to be an feasible time for crash recovery. Why the developers changed it to such a short timeout I have no idea. Luckily, it is very easy to fix now I know the reason.


Photo by Tim Gouw on Unsplash

0 0 votes
Article Rating
Subscribe
Notify of
guest

0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments

Far
Enough.

Said no pioneer ever.
MySQL, PostgreSQL, InnoDB, MariaDB, MongoDB and Kubernetes are trademarks for their respective owners.
© 2026 Percona All Rights Reserved