NDB Cluster is a very interesting solution in terms of high availability, since it has no single point of failure. In an environment like EC2, where a node can disappear almost without notice, one would think it is a good fit.
It is indeed a good fit, but the reality is a bit trickier. The main issue we faced is that IPs in EC2 are dynamic: when an instance restarts, it gets a new IP. What is the problem with a new IP? Just change the IP in the cluster config and perform a rolling restart, no? In fact, this will not work: the cluster is already in degraded mode, and restarting the surviving node of the degraded node group (NoOfReplicas=2) will cause the whole NDB cluster to shut down.
This can be solved by using host names instead of IPs in the config.ini file. What needs to be done is to define, in /etc/hosts, one entry per cluster member. Entries for the API nodes are not required. Here is an example:
```
$ more /etc/hosts
127.0.0.1 localhost.localdomain localhost
10.11.11.11 mgmn1
10.22.22.22 mgmn2
10.33.33.33 data1
10.44.44.44 data2
```
The file must be present and identical, at least for the NDB entries, on all hosts. Next, the NDB configuration must use the hostnames, like this:
```
[NDBD DEFAULT]
NoOfReplicas=2
Datadir=/var/lib/mysql-cluster/
DataMemory=1G
IndexMemory=100M

[NDB_MGMD]
Id=1
Hostname=mgmn1

[NDB_MGMD]
Id=2
Hostname=mgmn2

[NDBD]
Id=3
Hostname=data1

[NDBD]
Id=4
Hostname=data2

[MYSQLD]
[MYSQLD]
[MYSQLD]
[MYSQLD]
[MYSQLD]
```
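The SQL/API nodes can then point at the management servers by name as well. A minimal my.cnf fragment for a SQL node might look like this (assuming the default management port; the hostnames are the ones defined above):

```
[mysqld]
ndbcluster
ndb-connectstring=mgmn1,mgmn2
```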
This is, of course, a very minimalistic configuration, but I am sure you get the point. Let’s go back to our original problem and suppose that data2 went down. You spin up a new host and configure it for NDB. Suppose, for example, that the IP of the new host is 10.55.55.55. The first thing to do is to update all the /etc/hosts files so that they look like this:
```
$ more /etc/hosts
127.0.0.1 localhost.localdomain localhost
10.11.11.11 mgmn1
10.22.22.22 mgmn2
10.33.33.33 data1
10.55.55.55 data2
```
And now, you start the ndbd process (or ndbmtd) on data2. Since the management nodes read the /etc/hosts file before the change to learn the IP from which to expect the connection, you’ll get the “No free node” error on the first attempt. But then the management node, when looping back in its connection-handling code, will re-read the new /etc/hosts file, and the second connection attempt will succeed.
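The /etc/hosts update itself is easy to script. Here is a sketch, run against a scratch copy of the example file so it is safe to try; on a real cluster you would apply the same sed command to /etc/hosts on every host (for example, over ssh). The hostnames and IPs are the example values from above.

```shell
#!/bin/sh
# Sketch: point the data2 entry at the replacement instance's IP.
# Works on a scratch copy of the example /etc/hosts; on a real cluster,
# run the sed command against /etc/hosts on every node.
set -e
f=$(mktemp)
cat > "$f" <<'EOF'
127.0.0.1 localhost.localdomain localhost
10.11.11.11 mgmn1
10.22.22.22 mgmn2
10.33.33.33 data1
10.44.44.44 data2
EOF

# Replace whatever IP currently maps to data2 with the new one.
sed -i 's/^[0-9.]*[[:space:]]\{1,\}data2$/10.55.55.55 data2/' "$f"
grep data2 "$f"
```

Once the file is updated everywhere, start ndbd (or ndbmtd) on the new data2 host as described above.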