November 23, 2014

Percona XtraDB Cluster: Failure Scenarios with only 2 nodes

When designing a new cluster, it is always advised to have at least 3 nodes (this is the case with PXC, but it is also true for PRM). But why, and what are the risks?

The goal of having more than 2 nodes (in fact, an odd number is recommended for this kind of cluster) is to avoid a split-brain situation. This can occur when quorum (which can be simplified as a “majority vote”) is not honoured. A split-brain is a state in which the nodes lose contact with one another and then each tries to take control of shared resources or to provide the cluster service simultaneously.
With PRM the problem is obvious: both nodes will try to run the master and slave IPs and will accept writes. But what could happen with Galera replication on PXC?

OK, first let’s have a look at a standard PXC setup (no special Galera options) with 2 nodes:
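The exact configuration does not matter much for what follows, but as a minimal sketch (the IP addresses are the ones used later in this post; the cluster name and SST method are arbitrary choices), each node carries something like this in my.cnf:

[mysqld]
# path to the Galera library may differ on your system (e.g. /usr/lib64/libgalera_smm.so)
wsrep_provider           = /usr/lib/libgalera_smm.so
wsrep_cluster_name       = pxc_test
wsrep_cluster_address    = gcomm://192.168.70.2,192.168.70.3
wsrep_node_address       = 192.168.70.2   # 192.168.70.3 on percona2
wsrep_sst_method         = xtrabackup
binlog_format            = ROW
default_storage_engine   = InnoDB
innodb_autoinc_lock_mode = 2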

Two running nodes (percona1 and percona2); communication between the nodes is OK

Same output on percona2

Now let’s check the status variables:
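On each node they can be listed with queries like these (only the output is node-specific):

SHOW GLOBAL STATUS LIKE 'wsrep_cluster%';
SHOW GLOBAL STATUS LIKE 'wsrep_local_index';
SHOW GLOBAL STATUS LIKE 'wsrep_ready';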

on percona2:

Only wsrep_local_index differs, as expected.

Now let’s stop the communication between both nodes (using firewall rules):
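For example, the rule used again further below in this post, applied on percona2 (assuming percona1 is 192.168.70.2 and percona2 is 192.168.70.3):

iptables -A INPUT -d 192.168.70.3 -s 192.168.70.2 -j REJECT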

This rule simulates a network outage that makes connections between the two nodes impossible (e.g. a switch or router failure).

We can see that the node appears down, but we can still run some statements on it:

on node1:

on node2:

And if you try to use the MySQL server:

If you try to insert data just as the communication problem occurs, here is what you will get:
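For illustration, assume a small hypothetical test table test.t (say, an auto-increment id plus a msg column) created before the outage; trying to write to it now:

INSERT INTO test.t (msg) VALUES ('during the split');
-- this fails: first with a deadlock error while the partition is still being handled,
-- then with an "unknown command" error once wsrep_ready has gone to OFF
-- (both errors are discussed in the comments below)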

Is it possible to have a two-node cluster anyway?

So, by default, Percona XtraDB Cluster does the right thing; this is how it needs to work, and you don’t suffer critical problems when you have enough nodes.
But how can we deal with this and avoid the service being stopped? If we check the list of parameters on Galera’s wiki, we can see that there are two options relating to this:

pc.ignore_quorum: Completely ignore quorum calculations. E.g. in case master splits from several slaves it still remains operational. Use with extreme caution even in master-slave setups, because slaves won’t automatically reconnect to master in this case.
pc.ignore_sb: Should we allow nodes to process updates even in the case of split brain? This is a dangerous setting in multi-master setup, but should simplify things in master-slave cluster (especially if only 2 nodes are used).
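Both are Galera provider options, so they are passed as semicolon-separated key=value pairs inside wsrep_provider_options; for example, to enable both at once (not that you should):

wsrep_provider_options = "pc.ignore_quorum = true; pc.ignore_sb = true"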

Let’s first try ignoring the quorum:

By ignoring the quorum, we ask the cluster not to perform the majority calculation used to define the Primary Component (PC). A component is a set of nodes which are connected to each other, and when everything is OK the whole cluster is one component. For example, if you have 3 nodes and 1 node gets isolated (2 nodes can see each other and 1 node can see only itself), we then have 2 components: the quorum calculation gives 2/3 (66%) on the 2 nodes communicating with each other and 1/3 (33%) on the isolated one. In that case the service is stopped on the nodes where the majority is not reached. The quorum algorithm helps to select a PC and guarantees that there is never more than one Primary Component in the cluster.
In our 2-node setup, when the communication between the 2 nodes is broken, the quorum is 1/2 (50%) on both nodes, which is not a majority… therefore the service is stopped on both nodes. In this context, service means accepting queries.
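You can check on each node which component it ended up in: wsrep_cluster_status reports Primary inside the PC and non-Primary on a node that lost quorum:

SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';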

Back to our test, we check that the data is the same on both nodes:
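Any simple comparison will do; for example, with the hypothetical test.t table from above, run on both nodes and compare the results:

SELECT COUNT(*) FROM test.t;
CHECKSUM TABLE test.t;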

Add ignore_quorum and restart both nodes:

SET GLOBAL wsrep_provider_options="pc.ignore_quorum=true"; (this currently does not seem to work properly; I needed to set it in my.cnf and restart the nodes)

wsrep_provider_options = "pc.ignore_quorum = true"
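This goes in the [mysqld] section of my.cnf on both nodes; then restart them, for example with the init script (assumed here to be called mysql, as in the PXC packages):

service mysql restart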

Break the connection between the two nodes again:

iptables -A INPUT -d 192.168.70.3 -s 192.168.70.2 -j REJECT

and perform an insert; this first insert will take longer:

and on node2 you can also add records:
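For example, still with the hypothetical test.t table:

-- on node1 (the first statement after the split takes noticeably longer)
INSERT INTO test.t (msg) VALUES ('written on node1');
-- on node2
INSERT INTO test.t (msg) VALUES ('written on node2');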

The wsrep status variables now look like this:
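A few of the relevant ones can be checked on each node; some of them, e.g. wsrep_last_committed, will now differ between the nodes, since each one is committing its own writes:

SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';
SHOW GLOBAL STATUS LIKE 'wsrep_last_committed';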

Then we fix the connection problem:

iptables -D INPUT -d 192.168.70.3 -s 192.168.70.2 -j REJECT

Nothing changes; the data stays different:

You can keep inserting data, but it won’t be replicated, and you will end up with 2 different versions of your inconsistent data!

Also, when we restart a node, it will request an SST or, in certain cases, fail to start like this:

Of course all the data that was written on the node we just restarted is lost after the SST.

Now let’s try with pc.ignore_sb=true:

When the quorum algorithm fails to select a Primary Component, we have a split-brain condition. In our 2-node setup, when a node loses the connection to its only peer, the default is to stop accepting queries to avoid database inconsistency. We can bypass this behaviour and ignore the split-brain by adding

wsrep_provider_options = "pc.ignore_sb = true" in my.cnf

Then we can insert on both nodes without any problem while the connection between the nodes is down:

When the connection is back, the two servers stay independent; they are now two single-node clusters.

We can see it in the log file:

Now, when we set both settings to true on a two-node cluster, it acts exactly as it does when only ignore_sb is enabled.
And if the local state seqno is greater than the group seqno, the node fails to restart. You again need to delete the grastate.dat file to request a full SST, and you lose some data again.
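A sketch of that recovery, assuming the default datadir /var/lib/mysql and the mysql init script:

# discard the local state so the node requests a full SST on startup
# (everything written locally on this node since the split is lost)
rm /var/lib/mysql/grastate.dat
service mysql start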

This is why two-node clusters are not recommended at all. If you only have the storage for 2 nodes, using the Galera arbitrator (garbd) is a very good alternative.

On a third node, instead of running Percona XtraDB Cluster (mysqld) just run garbd:

Currently there is no init script for garbd, but that is easy to write, since it can run in daemon mode using -d.
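A minimal invocation could look like this (the cluster name and addresses match the assumptions made earlier in this post; check garbd --help for the options shipped with your version):

garbd -a gcomm://192.168.70.2,192.168.70.3 -g pxc_test -d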

If the communication between node1 and node2 fails, they will keep communicating through the node running garbd (node3) and eventually send the changes via it. If one node dies, the other one keeps working without any problem, and when the dead node comes back it performs an IST or SST.

In conclusion: a 2-node cluster is possible with Percona XtraDB Cluster, but it is not advised at all, because it will generate a lot of problems in case of an issue on one of the nodes. It is much safer to add a 3rd node, even a fake one running garbd.

If you plan to run a cluster with only 2 nodes anyway, don’t forget that:
– by default, if one peer dies or if the communication between the two nodes becomes unstable, both nodes will stop accepting queries
– if you choose to ignore split-brain or quorum, you can very easily end up with inconsistent data

About Frederic Descamps

Frédéric joined Percona in June 2011. He is an experienced Open Source consultant with expertise in infrastructure projects as well as in development tracks and database administration.

Frédéric is a believer in the devops culture.

Comments

  1. Fred,

    I see that when you’re trying to work with node1 you seem to get different error messages, “deadlock” and “unknown command”. Why is this? Wouldn’t it be more practical if the same error message were used for all cases when the cluster is in a state where it can’t run commands?

  2. Peter: deadlocks are common in Galera replication (for example when you have highly concurrent writes and you perform them on several nodes). In this case it returns a deadlock because the split-brain handling is not yet finished, and to avoid inconsistent data the transaction cannot be committed on the second node. The “unknown command” error is returned when the node is set to no longer allow queries, i.e. the value of wsrep_ready is OFF. Generally the load balancer checks that status and doesn’t send any connections to that node, so the application shouldn’t have to cope with that error.

    So to summarize, the deadlock occurs just because the cluster hasn’t yet finished taking all the nodes “offline”.

  3. Asko Aavamäki says:

    Nice rundown!

    I wonder if another article could demonstrate how that two-node cluster would perform during the IST/SST mentioned at the end, after the other node rejoins the cluster. Would this show that the node that remained operational gets blocked as well in order to serve the SST request to the node that is returning to the cluster, making the whole DB inoperational?

    Maybe the effect could be demonstrated both for mysqldump and rsync SSTs (or even XtraBackup if that could be used to achieve non-blocking in that scenario!)

    The issue is noted in the documentation and is the other big reason for having a minimum of three nodes, in addition to split brain etc., but I think it’d be good to see what that type of situation looks like in the logs, so if someone runs into it, it’s less of a mystery.

  4. Asko,

    thank you for the idea, I’ll try to find some time to write such post.

    cheers,

  5. Balasundaram Rangasamy says:

    Nice one. I had a similar problem when I was working with only a 2-node cluster. This article nicely explains why we should not go for a 2-node cluster.

    When I tried to update a table on both nodes simultaneously, I got deadlock errors, “unknown command” errors, and the second node even became unavailable.

  6. REJECT doesn’t really simulate a failed network connection, as an ICMP reject will be sent. Using DROP is a better test. It probably doesn’t change the results… and a misconfigured router/switch might be more common than a failed network connection.

  7. Jalvin Trivedi says:

    If I am working with two nodes and I issue a query, and after some time node1 fails while the query is still not complete, what happens to the query? Does it keep running or does it restart?
