Where the open source community meets: Secure your spot for Percona Live Amsterdam! - Register

Downloads

Blog

Synchronous Replication in MySQL: Galera vs. Group Replication

February 24, 2017

Author

Przemysław Malkowski

Insight for DBAs

Insight for Developers

MySQL

Share this Post:

UPDATE: Some of the language in the original post was considered overly-critical of Oracle by some community members. This was not my intent, and I’ve modified the language to be less so. I’ve also changed the term “synchronous” (which the use of is inaccurate and misleading) to “virtually synchronous.” This term is more accurate and already used by both technologies’ founders, and should be less misleading.

I also wanted to thank Jean-François Gagné for pointing out the incorrect sentence about multi-threaded slaves in Group Replication, which I also corrected accordingly.

In today’s blog post, I will briefly compare two major virtually synchronous replication technologies available today for MySQL.

More Than Asynchronous Replication

Thanks to the Galera plugin, founded by the Codership team, we’ve had the choice between asynchronous and virtually synchronous replication in the MySQL ecosystem for quite a few years already. Moreover, we can choose between at least three software providers: Codership, MariaDB and Percona, each with its own Galera implementation.

The situation recently became much more interesting when MySQL Group Replication went into GA (stable) stage in December 2016.

Oracle, the upstream MySQL provider, introduced its own replication implementation that is very similar in concept. Unlike the others mentioned above, it isn’t based on Galera. Group Replication was built from the ground up as a new solution. MySQL Group Replication shares many very similar concepts to Galera. This post doesn’t cover MySQL Cluster, another and fully-synchronous solution, that existed much earlier then Galera — it is a much different solution for different use cases.

In this post, I will point out a couple of interesting differences between Group Replication and Galera, which hopefully will be helpful to those considering switching from one to another (or if they are planning to test them).

This is certainly not a full list of all the differences, but rather things I found interesting during my explorations.

It is also important to know that Group Replication has evolved a lot before it went GA (its whole cluster layer was replaced). I won’t mention how things looked before the GA stage, and will just concentrate on the latest available 5.7.17 version. I will not spend too much time on how Galera implementations looked in the past and will use Percona XtraDB Cluster 5.7 as a reference.

Multi-Master vs. Master-Slave

Galera has always been multi-master by default, so it does not matter to which node you write. Many users use a single writer due to workload specifics and multi-master limitations, but Galera has no single master mode per se.

Group Replication, on the other hand, promotes just one member as primary (master) by default, and other members are put into read-only mode automatically. This is what happens if we try to change data on the non-master node:

mysql> truncate test.t1;
ERROR 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement

1 2	mysql> truncate test.t1; ERROR 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement

To change from single primary mode to multi-primary (multi-master), you have to start group replication with the group_replication_single_primary_mode variable disabled.
Another interesting fact is you do not have any influence on which cluster member will be the master in single primary mode: the cluster auto-elects it. You can only check it with a query:

mysql> SELECT * FROM performance_schema.global_status WHERE VARIABLE_NAME like 'group_replication%';
+----------------------------------+--------------------------------------+
| VARIABLE_NAME                    | VARIABLE_VALUE                       |
+----------------------------------+--------------------------------------+
| group_replication_primary_member | 329333cd-d6d9-11e6-bdd2-0242ac130002 |
+----------------------------------+--------------------------------------+
1 row in set (0.00 sec)

mysql> SELECT * FROM performance_schema.global_status WHERE VARIABLE_NAME like 'group_replication%';

+----------------------------------+--------------------------------------+

| VARIABLE_NAME | VARIABLE_VALUE |

+----------------------------------+--------------------------------------+

| group_replication_primary_member | 329333cd-d6d9-11e6-bdd2-0242ac130002 |

+----------------------------------+--------------------------------------+

1 row in set (0.00 sec)

Or just:

mysql> show status like 'group%';                  
+----------------------------------+--------------------------------------+
| Variable_name                    | Value                                |
+----------------------------------+--------------------------------------+
| group_replication_primary_member | 329333cd-d6d9-11e6-bdd2-0242ac130002 |
+----------------------------------+--------------------------------------+
1 row in set (0.01 sec)

mysql> show status like 'group%';

+----------------------------------+--------------------------------------+

| Variable_name | Value |

+----------------------------------+--------------------------------------+

| group_replication_primary_member | 329333cd-d6d9-11e6-bdd2-0242ac130002 |

+----------------------------------+--------------------------------------+

1 row in set (0.01 sec)

To show the hostname instead of UUID, here:

mysql> select member_host as "primary master" from performance_schema.global_status join performance_schema.replication_group_members where variable_name='group_replication_primary_member' and member_id=variable_value;
+----------------+
| primary master |
+----------------+
| f18ff539956d   |
+----------------+
1 row in set (0.00 sec)

mysql> select member_host as "primary master" from performance_schema.global_status join performance_schema.replication_group_members where variable_name='group_replication_primary_member' and member_id=variable_value;

+----------------+

| primary master |

+----------------+

| f18ff539956d |

+----------------+

1 row in set (0.00 sec)

Replication: Majority vs. All

Galera delivers write transactions synchronously to ALL nodes in the cluster. (Later, applying happens asynchronously in both technologies.) However, Group Replication needs just a majority of the nodes confirming the transaction. This means a transaction commit on the writer succeeds and returns to the client even if a minority of nodes still have not received it.

In the example of a three-node cluster, if one node crashes or loses the network connection, the two others continue to accept writes (or just the primary node in Single-Primary mode) even before a faulty node is removed from the cluster.

If the separated node is the primary one, it denies writes due to the lack of a quorum (it will report the error ERROR 3101 (HY000): Plugin instructed the server to rollback the current transaction.). If one of the nodes receives a quorum, it will be elected to primary after the faulty node is removed from the cluster, and will then accept writes.

With that said, the “majority” rule in Group Replication means that there isn’t a guarantee that you won’t lose any data if the majority nodes are lost. There is a chance these could apply some transactions that aren’t delivered to the minority at the moment they crash.

In Galera, a single node network interruption makes the others wait for it, and pending writes can be committed once either the connection is restored or the faulty node removed from cluster after the timeout. So the chance of losing data in a similar scenario is lower, as transactions always reach all nodes. Data can be lost in Percona XtraDB Cluster only in a really bad luck scenario: a network split happens, the remaining majority of nodes form a quorum, the cluster reconfigures and allows new writes, and then shortly after the majority part is damaged.

Schema Requirements

For both technologies, one of the requirements is that all tables must be InnoDB and have a primary key. This requirement is now enforced by default in both Group Replication and Percona XtraDB Cluster 5.7. Let’s look at the differences.

Percona XtraDB Cluster:

mysql> create table nopk (a char(10));
Query OK, 0 rows affected (0.08 sec)

mysql> insert into nopk values ("aaa");
ERROR 1105 (HY000): Percona-XtraDB-Cluster prohibits use of DML command on a table (test.nopk) without an explicit primary key with pxc_strict_mode = ENFORCING or MASTER

mysql> create table m1 (id int primary key) engine=myisam;
Query OK, 0 rows affected (0.02 sec)

mysql> insert into m1 values(1);            
ERROR 1105 (HY000): Percona-XtraDB-Cluster prohibits use of DML command on a table (test.m1) that resides in non-transactional storage engine with pxc_strict_mode = ENFORCING or MASTER

mysql> set global pxc_strict_mode=0;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into nopk values ("aaa");
Query OK, 1 row affected (0.00 sec)

mysql> insert into m1 values(1);
Query OK, 1 row affected (0.00 sec)

mysql> create table nopk (a char(10));

Query OK, 0 rows affected (0.08 sec)

mysql> insert into nopk values ("aaa");

ERROR 1105 (HY000): Percona-XtraDB-Cluster prohibits use of DML command on a table (test.nopk) without an explicit primary key with pxc_strict_mode = ENFORCING or MASTER

mysql> create table m1 (id int primary key) engine=myisam;

Query OK, 0 rows affected (0.02 sec)

mysql> insert into m1 values(1);

ERROR 1105 (HY000): Percona-XtraDB-Cluster prohibits use of DML command on a table (test.m1) that resides in non-transactional storage engine with pxc_strict_mode = ENFORCING or MASTER

mysql> set global pxc_strict_mode=0;

Query OK, 0 rows affected (0.00 sec)

mysql> insert into nopk values ("aaa");

Query OK, 1 row affected (0.00 sec)

mysql> insert into m1 values(1);

Query OK, 1 row affected (0.00 sec)

Before Percona XtraDB Cluster 5.7 (or in other Galera implementations), there were no such enforced restrictions. Users unaware of these requirements often ended up with problems.

Group Replication:

mysql> create table nopk (a char(10));
Query OK, 0 rows affected (0.04 sec)

mysql> insert into nopk values ("aaa");
ERROR 3098 (HY000): The table does not comply with the requirements by an external plugin.

2017-01-15T22:48:25.241119Z 139 [ERROR] Plugin group_replication reported: 'Table nopk does not have any PRIMARY KEY. This is not compatible with Group Replication'

mysql> create table m1 (id int primary key) engine=myisam;
ERROR 3161 (HY000): Storage engine MyISAM is disabled (Table creation is disallowed).

mysql> create table nopk (a char(10));

Query OK, 0 rows affected (0.04 sec)

mysql> insert into nopk values ("aaa");

ERROR 3098 (HY000): The table does not comply with the requirements by an external plugin.

2017-01-15T22:48:25.241119Z 139 [ERROR] Plugin group_replication reported: 'Table nopk does not have any PRIMARY KEY. This is not compatible with Group Replication'

mysql> create table m1 (id int primary key) engine=myisam;

ERROR 3161 (HY000): Storage engine MyISAM is disabled (Table creation is disallowed).

I am not aware of any way to disable these restrictions in Group Replication.

GTID

Galera has it’s own Global Transaction ID, which has existed since MySQL 5.5 and is independent of MySQL’s GTID feature introduced in MySQL 5.6. If MySQL’s GTID is enabled on a Galera-based cluster, both numerations exist with their own sequences and UUIDs.

Group Replication is based on a native MySQL GTID feature and relies on it. Interestingly, a separate sequence block range (initially 1M) is pre-assigned for each cluster member.

WAN Support

The MySQL Group Replication documentation isn’t very optimistic on WAN support, claiming that both “Low latency, high bandwidth network connections are a requirement” and “Group Replication is designed to be deployed in a cluster environment where server instances are very close to each other, and is impacted by both network latency as well as network bandwidth.” These statements are found here and here. However, there is network traffic optimization: Message Compression.

I don’t see group communication level tunings available yet, as we find in the Galera evs.* series of wsrep_provider_options.

Galera founders actually encourage trying it in geo-distributed environments, and some WAN-dedicated settings are available (the most important being WAN segments).

But both technologies need a reliable network for good performance.

State Transfers

Galera has two types of state transfers that allow syncing data to nodes when needed: incremental (IST) and full (SST). Incremental is used when a node has been out of a cluster for some time, and once it rejoins the other nodes has the missing write sets still in Galera cache. Full SST is helpful if incremental is not possible, especially when a new node is added to the cluster. SST automatically provisions the node with fresh data taken as a snapshot from one of the running nodes (donor). The most common SST method is using Percona XtraBackup, which takes a fast and non-blocking binary data snapshot (hot backup).

In Group Replication, state transfers are fully based on binary logs with GTID positions. If there is no donor with all of the binary logs (included the ones for new nodes), a DBA has to first provision the new node with initial data snapshot. Otherwise, the joiner will fail with a very familiar error:

2017-01-16T23:01:40.517372Z 50 [ERROR] Slave I/O for channel 'group_replication_recovery': Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.', Error_code: 1236

2017-01-16T23:01:40.517372Z 50 [ERROR] Slave I/O for channel 'group_replication_recovery': Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.', Error_code: 1236

The official documentation mentions that provisioning the node before adding it to the cluster may speed up joining (the recovery stage). Another difference is that in the case of state transfer failure, a Galera joiner will abort after the first try, and will shut down its mysqld instance. The Group Replication joiner will then fall-back to another donor in an attempt to succeed. Here I found something slightly annoying: if no donor can satisfy joiner demands, it will still keep trying the same donors over and over, for a fixed number of attempts:

[root@cd81c1dadb18 /]# grep 'Attempt' /var/log/mysqld.log |tail
2017-01-16T22:57:38.329541Z 12 [Note] Plugin group_replication reported: 'Establishing group recovery connection with a possible donor. Attempt 1/10'
2017-01-16T22:57:38.539984Z 12 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 2/10'
2017-01-16T22:57:38.806862Z 12 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 3/10'
2017-01-16T22:58:39.024568Z 12 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 4/10'
2017-01-16T22:58:39.249039Z 12 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 5/10'
2017-01-16T22:59:39.503086Z 12 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 6/10'
2017-01-16T22:59:39.736605Z 12 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 7/10'
2017-01-16T23:00:39.981073Z 12 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 8/10'
2017-01-16T23:00:40.176729Z 12 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 9/10'
2017-01-16T23:01:40.404785Z 12 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 10/10'

[root@cd81c1dadb18 /]# grep 'Attempt' /var/log/mysqld.log |tail

2017-01-16T22:57:38.329541Z 12 [Note] Plugin group_replication reported: 'Establishing group recovery connection with a possible donor. Attempt 1/10'

2017-01-16T22:57:38.539984Z 12 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 2/10'

2017-01-16T22:57:38.806862Z 12 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 3/10'

2017-01-16T22:58:39.024568Z 12 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 4/10'

2017-01-16T22:58:39.249039Z 12 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 5/10'

2017-01-16T22:59:39.503086Z 12 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 6/10'

2017-01-16T22:59:39.736605Z 12 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 7/10'

2017-01-16T23:00:39.981073Z 12 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 8/10'

2017-01-16T23:00:40.176729Z 12 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 9/10'

2017-01-16T23:01:40.404785Z 12 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 10/10'

After the last try, even though it fails, mysqld keeps running and allows client connections…

Auto Increment Settings

Galera adjusts the auto_increment_increment and auto_increment_offset values according to the number of members in a cluster. So, for a 3-node cluster, auto_increment_increment will be “3” and auto_increment_offset from “1” to “3” (depending on the node). If a number of nodes change later, these are updated immediately. This feature can be disabled using the wsrep_auto_increment_control setting. If needed, these settings can be set manually.

Interestingly, in Group Replication the auto_increment_increment seems to be fixed at 7, and only auto_increment_offset is set differently on each node. This is the case even in the default Single-Primary mode! this seems like a waste of available IDs, so make sure that you adjust the group_replication_auto_increment_increment setting to a saner number before you start using Group Replication in production.

Multi-Threaded Slave Side Applying

Galera developed its own multi-threaded slave feature, even in 5.5 versions, for workloads that include tables in the same database. It is controlled with the wsrep_slave_threads variable. Group Replication uses a feature introduced in MySQL 5.7, where the number of applier threads is controlled with slave_parallel_workers. Galera will do multi-threaded replication based on potential conflicts of changed/locked rows. Group Replication parallelism is based on an improved LOGICAL_CLOCK scheduler, which uses information from writesets dependencies. This can allow it to achieve much better results than in normal asynchronous replication MTS mode. More details can be found here: http://mysqlhighavailability.com/zooming-in-on-group-replication-performance/

Flow Control

Both technologies use a technique to throttle writes when nodes are slow in applying them. Interestingly, the default size of the allowed applier queue in both is much different:

- gcs.fc_limit (Galera) = 16 (the limit is increased automatically based on number of nodes, i.e. to 28 in 3-node cluster)

- group_replication_flow_control_applier_threshold (Group Replication) = 25000.

Moreover, Group Replication provides separate certifier queue size, also eligible for the Flow Control trigger: group_replication_flow_control_certifier_threshold. One thing I found difficult, is checking the actual applier queue size, as the only exposed one via performance_schema.replication_group_member_stats is the Count_Transactions_in_queue (which only shows the certifier queue).

Network Hiccup/Partition Handling

In Galera, when the network connection between nodes is lost, those who still have a quorum will form a new cluster view. Those who lost a quorum keep trying to re-connect to the primary component. Once the connection is restored, separated nodes will sync back using IST and rejoin the cluster automatically.

This doesn’t seem to be the case for Group Replication. Separated nodes that lose the quorum will be expelled from the cluster, and won’t join back automatically once the network connection is restored. In its error log we can see:

2017-01-17T11:12:18.562305Z 0 [ERROR] Plugin group_replication reported: 'Member was expelled from the group due to network failures, changing member status to ERROR.'
2017-01-17T11:12:18.631225Z 0 [Note] Plugin group_replication reported: 'getstart group_id ce427319'
2017-01-17T11:12:21.735374Z 0 [Note] Plugin group_replication reported: 'state 4330 action xa_terminate'
2017-01-17T11:12:21.735519Z 0 [Note] Plugin group_replication reported: 'new state x_start'
2017-01-17T11:12:21.735527Z 0 [Note] Plugin group_replication reported: 'state 4257 action xa_exit'
2017-01-17T11:12:21.735553Z 0 [Note] Plugin group_replication reported: 'Exiting xcom thread'
2017-01-17T11:12:21.735558Z 0 [Note] Plugin group_replication reported: 'new state x_start'

2017-01-17T11:12:18.562305Z 0 [ERROR] Plugin group_replication reported: 'Member was expelled from the group due to network failures, changing member status to ERROR.'

2017-01-17T11:12:18.631225Z 0 [Note] Plugin group_replication reported: 'getstart group_id ce427319'

2017-01-17T11:12:21.735374Z 0 [Note] Plugin group_replication reported: 'state 4330 action xa_terminate'

2017-01-17T11:12:21.735519Z 0 [Note] Plugin group_replication reported: 'new state x_start'

2017-01-17T11:12:21.735527Z 0 [Note] Plugin group_replication reported: 'state 4257 action xa_exit'

2017-01-17T11:12:21.735553Z 0 [Note] Plugin group_replication reported: 'Exiting xcom thread'

2017-01-17T11:12:21.735558Z 0 [Note] Plugin group_replication reported: 'new state x_start'

The status changes to:

mysql> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+--------------+-------------+--------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+--------------+-------------+--------------+
| group_replication_applier | 329333cd-d6d9-11e6-bdd2-0242ac130002 | f18ff539956d | 3306 | ERROR |
+---------------------------+--------------------------------------+--------------+-------------+--------------+
1 row in set (0.00 sec)

mysql> SELECT * FROM performance_schema.replication_group_members;

+---------------------------+--------------------------------------+--------------+-------------+--------------+

+---------------------------+--------------------------------------+--------------+-------------+--------------+

+---------------------------+--------------------------------------+--------------+-------------+--------------+

1 row in set (0.00 sec)

It seems the only way to bring it back into the cluster is to manually restart Group Replication:

mysql> START GROUP_REPLICATION;
ERROR 3093 (HY000): The START GROUP_REPLICATION command failed since the group is already running.
mysql> STOP GROUP_REPLICATION;
Query OK, 0 rows affected (5.00 sec)

mysql> START GROUP_REPLICATION;
Query OK, 0 rows affected (1.96 sec)

mysql> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+--------------+-------------+--------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+--------------+-------------+--------------+
| group_replication_applier | 24d6ef6f-dc3f-11e6-abfa-0242ac130004 | cd81c1dadb18 | 3306 | ONLINE |
| group_replication_applier | 329333cd-d6d9-11e6-bdd2-0242ac130002 | f18ff539956d | 3306 | ONLINE |
| group_replication_applier | ae148d90-d6da-11e6-897e-0242ac130003 | 0af7a73f4d6b | 3306 | ONLINE |
+---------------------------+--------------------------------------+--------------+-------------+--------------+
3 rows in set (0.00 sec

mysql> START GROUP_REPLICATION;

ERROR 3093 (HY000): The START GROUP_REPLICATION command failed since the group is already running.

mysql> STOP GROUP_REPLICATION;

Query OK, 0 rows affected (5.00 sec)

mysql> START GROUP_REPLICATION;

Query OK, 0 rows affected (1.96 sec)

mysql> SELECT * FROM performance_schema.replication_group_members;

+---------------------------+--------------------------------------+--------------+-------------+--------------+

+---------------------------+--------------------------------------+--------------+-------------+--------------+

+---------------------------+--------------------------------------+--------------+-------------+--------------+

3 rows in set (0.00 sec

Note that in the above output, after the network failure, Group Replication did not stop. It waits in an error state. Moreover, in Group Replication a partitioned node keeps serving dirty reads as if nothing happened (for non-super users):

cd81c1dadb18 {test} ((none)) > SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+--------------+-------------+--------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+--------------+-------------+--------------+
| group_replication_applier | 24d6ef6f-dc3f-11e6-abfa-0242ac130004 | cd81c1dadb18 | 3306 | ERROR |
+---------------------------+--------------------------------------+--------------+-------------+--------------+
1 row in set (0.00 sec)

cd81c1dadb18 {test} ((none)) > select * from test1.t1;
+----+-------+
| id | a |
+----+-------+
| 1 | dasda |
| 3 | dasda |
+----+-------+
2 rows in set (0.00 sec)

cd81c1dadb18 {test} ((none)) > show grants;
+-------------------------------------------------------------------------------+
| Grants for test@% |
+-------------------------------------------------------------------------------+
| GRANT SELECT, INSERT, UPDATE, DELETE, REPLICATION CLIENT ON *.* TO 'test'@'%' |
+-------------------------------------------------------------------------------+
1 row in set (0.00 sec)

cd81c1dadb18 {test} ((none)) > SELECT * FROM performance_schema.replication_group_members;

+---------------------------+--------------------------------------+--------------+-------------+--------------+

+---------------------------+--------------------------------------+--------------+-------------+--------------+

+---------------------------+--------------------------------------+--------------+-------------+--------------+

1 row in set (0.00 sec)

cd81c1dadb18 {test} ((none)) > select * from test1.t1;

+----+-------+

| id | a |

+----+-------+

| 1 | dasda |

| 3 | dasda |

+----+-------+

2 rows in set (0.00 sec)

cd81c1dadb18 {test} ((none)) > show grants;

+-------------------------------------------------------------------------------+

| Grants for test@% |

+-------------------------------------------------------------------------------+

| GRANT SELECT, INSERT, UPDATE, DELETE, REPLICATION CLIENT ON *.* TO 'test'@'%' |

+-------------------------------------------------------------------------------+

1 row in set (0.00 sec)

A privileged user can disable super_read_only, but then it won’t be able to write:

cd81c1dadb18 {root} ((none)) > insert into test1.t1 set a="split brain";
ERROR 3100 (HY000): Error on observer while running replication hook 'before_commit'.
cd81c1dadb18 {root} ((none)) > select * from test1.t1;
+----+-------+
| id | a |
+----+-------+
| 1 | dasda |
| 3 | dasda |
+----+-------+
2 rows in set (0.00 sec)

cd81c1dadb18 {root} ((none)) > insert into test1.t1 set a="split brain";

ERROR 3100 (HY000): Error on observer while running replication hook 'before_commit'.

cd81c1dadb18 {root} ((none)) > select * from test1.t1;

+----+-------+

| id | a |

+----+-------+

| 1 | dasda |

| 3 | dasda |

+----+-------+

2 rows in set (0.00 sec)

I found an interesting thing here, which I consider a bug. In this case, a partitioned node can actually perform DDL, despite the error:

cd81c1dadb18 {root} ((none)) > show tables in test1;
+-----------------+
| Tables_in_test1 |
+-----------------+
| nopk |
| t1 |
+-----------------+
2 rows in set (0.01 sec)

cd81c1dadb18 {root} ((none)) > create table test1.split_brain (id int primary key);
ERROR 3100 (HY000): Error on observer while running replication hook 'before_commit'.

cd81c1dadb18 {root} ((none)) > show tables in test1;
+-----------------+
| Tables_in_test1 |
+-----------------+
| nopk |
| split_brain |
| t1 |
+-----------------+
3 rows in set (0.00 sec)

cd81c1dadb18 {root} ((none)) > show tables in test1;

+-----------------+

| Tables_in_test1 |

+-----------------+

| nopk |

| t1 |

+-----------------+

2 rows in set (0.01 sec)

cd81c1dadb18 {root} ((none)) > create table test1.split_brain (id int primary key);

ERROR 3100 (HY000): Error on observer while running replication hook 'before_commit'.

cd81c1dadb18 {root} ((none)) > show tables in test1;

+-----------------+

| Tables_in_test1 |

+-----------------+

| nopk |

| split_brain |

| t1 |

+-----------------+

3 rows in set (0.00 sec)

In a Galera-based cluster, you are automatically protected from that, and a partitioned node refuses to allow both reads and writes. It throws an error: ERROR 1047 (08S01): WSREP has not yet prepared node for application use. You can force dirty reads using the wsrep_dirty_reads variable.

There many more subtle (and less subtle) differences between these technologies – but this blog post is long enough already. Maybe next time 🙂

Article with Similar Subject

http://lefred.be/content/group-replication-vs-galera/

0 0 votes

Article Rating

18 Comments

Oldest

Newest Most Voted

Admin

Vadim Tkachenko

9 years ago

As a comment, in this post http://lefred.be/content/mysql-group-replication-synchronous-or-asynchronous-replication/ Fred states that MySQL Group replication IS NOT synchronous

ovaistariq

9 years ago

Reply to Vadim Tkachenko

It would be prudent to mention here that PXC is also not synchronous.

Author

Przemysław Malkowski

9 years ago

Reply to ovaistariq

Yes, “synchronous” is too simplified term, as it only applies to the fact that transaction is acknowledged by the other cluster members before commit returns. I used it only because it is so common adjective seen in materials about Galera, even without mentioning about wsrep_sync_wait/causal_reads. And IMHO Group Replication is very similar in that matter. I should probably rather use “virtually synchronous” term, used by both solutions founders quite often. Reference examples:
http://mysqlhighavailability.com/mysql-group-replication-a-quick-start-guide/
http://mysqlhighavailability.com/mysql-group-replication-transaction-life-cycle-explained/
http://galeracluster.com/documentation-webpages/

Jean-François Gagné

9 years ago

Reply to Przemysław Malkowski

Please do not use “virtually synchronous”, this just confuses things. Even semi-synchronous is wrong and leads to people thinking that the semi-sync nodes are the most up-to-date slaves. All of those solutions are actually asynchronous (I would claim, without being able to provide a formal proof, that all solutions with independent failure modes of the nodes can only provide some sort of relaxed memory consistency model at scale). Enough of this synchronous bullshit, let’s find new words !

I suggest using the word “agree”/”agreement” like “all the _active_ nodes in the cluster agree” to commit a transaction, or “the majority of the nodes in the cluster agree”, or “all the remaining nodes in the cluster agree on failure detection”.

If looks to me like the usage of the word synchronous (as opposed to asynchronous) is miss-used to characterize some sort or pseudo-high-availability property of the system. However, as I write below, I am not sure that any supported deployment is currently highly available. The right properties to describe those systems are “durability” and “consistency”, as in a write is durable and a write is consistent.

Author

Przemysław Malkowski

9 years ago

Reply to Jean-François Gagné

It wasn’t me who used “virtually synchronous” name for the Group Replication and Galera for the first time. It is the wording used in the official Oracle team blog, many times. But indeed Paxos used in GR seems to be not quite same as this definition: https://en.wikipedia.org/wiki/Virtual_synchrony
I don’t feel like re-inventing correct names though, I would rather leave it to GR and Galera founders.
And indeed, when I think “synchronous”, I do think first about durability and consistency, and only after that about HA.

Author

Przemysław Malkowski

9 years ago

Reply to Przemysław Malkowski

I am sorry to see you perceived my post as an attack on Group Replication. This was not my intention, and in fact I am quite excited to see this new solution in the ecosystem. I also do not think that Galera was better when it came out years ago – we can all see how many bugs were fixed and how many features added till now. So maybe the unfortunate wording I used contributed to this impression.
I really do not want to judge which solution is better and I wish to limit my opinions to as little as possible. What I am interested in, are pure facts.
In the same time, due to nature of this blog post form, I cannot explain in detail each aspect I touched. So, for instance in “Majority vs. All” section I focused on writes only, more from data durability point of view, and I didn’t really think here about comparing “availability” aspect. It is my personal preference to think first about durability, not any hidden goal of this post.
And to make a thorough comparison of each aspect, each would require a separate blog post.

Author

Przemysław Malkowski

9 years ago

Reply to Przemysław Malkowski

Thank you for pointing out about the multi-threaded slave description inaccuracy. Indeed, parallelism was bound to the size of a single binary log group-commit in earlier 5.7 versions, and later that was improved. I missed also the fact that Group Replication further improved that, which I can see is well explained in: http://mysqlhighavailability.com/zooming-in-on-group-replication-performance/ I will try to correct this section.

Jean-François Gagné

9 years ago

Reply to Vadim Tkachenko

I agree with ovaistariq, PXC/Galera is also not synchronous.

In post above, I can read “Galera delivers write transactions synchronously to ALL nodes in the cluster. (Later, applying happens asynchronously in both technologies.)”. What does that mean ? Synchronously delivering but asynchronously applying: STOP PLAYING WITH WORDS and please actually tell the users what this means in terms of trade-offs !

My understanding of above is that PXC/Galera blocks writes in case of one node failure until agreement by the other nodes on that failure. This means that PXC/Galera favors cluster “consistency” and write “durability” at the expense of write “availability”. Said otherwise, in the face of a single node failure, writes will block/fail on a cluster until failure detection is completed (and failure detection either take seconds if timeout is high or happens too often if timeout is low). In short, I decode that as PXC/Galera is not a true 100% write availability solution.

On the other hand, Group Replication will make progress when a majority of node agrees on committing the transaction (2 of 3 or 3 of 5). Yes, that means that a read might return stale data (which it is also true on PXE/Galera unless you use the obscure wsrep-sync-wait option, which I will come back to below). That also means that after commit succeeds on the majority, you can lose that commit if you lose that majority of nodes (how unlikely is that ?). In terms of trade-offs, I understand that Group Replication favors write “availability” by compromising a little on “durability” (I write a little as you can scale that compromise with the size of the cluster). Said otherwise, in the face of a single node failure, writes can still/will succeed on Group Replication. In short, I decode that Group Replication aims at being a true 100% write availability solution.

I wrote “aims” above because currently, multi-writes are not the default deployment of Group Replication. I trust this will change/improve in the future.

About wsrep-sync-wait, I notice that Mark Callaghan’s comment on [1] is still unanswered (I am partially quoting the question for completeness: “Docs for wsrep_sync_wait leave a lot to be desired.”). The unsatisfying doc is [2]. [3] is more complete and states: “the node blocks new queries”. My understanding in terms of trade-offs is that wsrep-sync-wait allows for data “consistency” at the expense of write “availability” (writes are blocked while data consistency is enforced with wsrep-sync-wait. In short, I decode again that PXC/Galera is not a true 100% write availability solution.

[1]: https://www.percona.com/blog/2017/02/15/group-replication-shipped-early/

[2]: https://www.percona.com/doc/percona-xtradb-cluster/5.7/wsrep-system-index.html#wsrep_sync_wait

[3]: http://galeracluster.com/documentation-webpages/mysqlwsrepoptions.html#wsrep-sync-wait

So for users, knowing all that, pick what you need. And for vendors, please read [4] again replacing “performance” by “feature-set”.

[4]: http://jfg-mysql.blogspot.nl/2015/11/disappointment-at-mysql-ecosystem.html

ovaistariq

9 years ago

‘@Przemek, chances of data loss in a quorum based system is true for any quorum based system and not just GR. You can increase the number of nodes to mitigate it. However there are other benefits of a quorum based system, for example, a single bad node would not impact the cluster in case of a quorum based system.

I am more concerned about the fact that you can continue to do dirty reads even in the case of a partitioned node. Ultimately, I think Galera is more mature as compared to GR and it would take time for GR to reach the same level of maturity.

Valeriy

9 years ago

Thank you for this detailed comparison!

One thing to fix:

s/This doesn’t see to be the case for Group Replication./This doesn’t seem to be the case for Group Replication./

Jean-François Gagné

9 years ago

Diving deeper in this post, I find more errors (details below)! I am not surprised to find false things in there as, to me, the goal of this post looks more an unfair attack on Group Replication (and Oracle/MySQL) than a trustworthy informative post. For me, this post deserves the tagging of FUD.

In the text above, I can read “In MySQL 5.7, this [Multi-Threaded Slave Side Applying] is based on group commit or schema.”. THIS IS NOT TRUE (details below). The author obviously has not done his homework of fully understanding parallel replication in MySQL 5.7 before writing this sentence and I invite him to ask me questions after my Percona Live talk ([1]).

[1]: https://www.percona.com/live/17/sessions/mysqlmariadb-parallel-replication-inventory-use-cases-and-limitations

If the author would have done his homework, he would know/understand that LOGICAL_CLOCK Parallel Replication in MySQL 5.7 is not based on Group Commit but on _intervals_ (I mentioned this in my recent post: [2]). He would also know that Group Replication optimizes “slave” applying with the information used for certification. I refer him to the great FOSDEM talk ([3]) by Oracle/MySQL Senior Performance Engineer: Vitor Oliveira. If in a hurry, you can jump at minute 11 and second 20 for details about “Applier Throughput”.

[2]: http://jfg-mysql.blogspot.com/2017/02/metric-for-tuning-parallel-replication-mysql-5-7.html

[3]: https://fosdem.org/2017/schedule/event/finetune_gr/

Jean-François Gagné

8 years ago

Reply to Jean-François Gagné

More details about how members of a group can replicate faster than a standard slave in the following post:
http://jfg-mysql.blogspot.nl/2018/01/write-set-in-mysql-5-7-group-replication.html

Admin

Peter Zaitsev

9 years ago

Thank you for all your comments. I recognize as Percona has competing product – Percona XtraDB Cluster (PXC) it is easy to assume our goal here is to badmouth MySQL Group Replication. This is not an intent.

It is worth to note Percona Supports BOTH Galera replication (in PXC Product) as well as MySQL Group Replication as part of Percona Server Product and there is no difference for us in terms of compensation which product customers use.

For quite a while, before MySQL Group Replication became GA we have been working on testing it. We have reported many Group Replication bugs, and we appreciate how proactive MySQL team at Oracle has been fixing them.

Having said that there are number of design oversights or not completed features in Group Replication as it is in MySQL 5.7 which this post highlights, such as being unable to prevent stale reads, no built in provisioning system for new nodes or the fact nodes do not rejoin the cluster after temporary network issues.

I’m sorry for technical inaccuracies in this blog post. We have corrected ones which has been discovered so far.

Admin

Peter Zaitsev

9 years ago

I also wanted to comment on few technical issues

1) On Virtually Synchronous – this is how Galera refers to the replication it provides and MySQL Replication has been refereed to in many publication. Our choice in such cases is often to use the form adopted by industry rather than trying to invent something on our own. I believe simply calling it Asynchronous replication is not a great choice either as it has different properties and guarantees compared to MySQL Asynchronous Replication.

2) I think choice of updating ALL the nodes vs Majority of the nodes (Quorum) has significant implications. Galera’s approach where each node in the cluster has at least received transaction allows to implement wsrep_sync_wait mode easily as information about all writes presents in the system. It also ensures you can run SELECT … FOR UPDATE to get the current data guaranteed where stale data is not acceptable.

3) Durability and HA is complex discussion. MySQL Cluster was mentioned here as truly HA solution, however MySQL Cluster is Durable in memory – so if whole cluster looses power some of transactions will be lost. Is this any different from GR or PXC/Galera ? It seems to me it is not. Now if you use wsrep_sync_wait=1 in Galera your application is prevented from reading stale data providing effectively same behavior as MySQL Cluster “Synchronous” replication. Am I missing something obvious here ?

james

9 years ago

Thanks to all contributing to this discussion and I read this blog with interest.

My understanding is: what technology to use depends on your application. I believe that synchronised (or virtual synchronised – for both GR and Galera – not care about the terms that much 🙂 ) is for applications which mainly READs and WRITEs sometimes.

Admin

Vadim Tkachenko

9 years ago

Reply to james

James,

Application is the one aspect.
When the technology is going into production, there are more operational questions to consider.
1. Initial setup – how complicated is to perform the setup. E.g. for your topic – if you add a new node, will be it added automatically or you need manually copy files
2. Diagnostic and troubleshooting – when something does not perform as expected, how hard it is to get to the root cause
3. Backups – we all agree that backups is critically important. How to take backup in production for the given technology?
4. Upgrades – upgrades might be released quite often. how to perform the upgrade for minor release. the same for the major release?
5. Node failures – important for the distributed systems like in this article. How the system handles the failed node? Is application affected by this?
6. “Safety” – is this guaranteed to get a “safe” result, or the system may give you the response that based on dirty reads from the failed node?

These all affects the DBA team significantly and also define the cost of running the technology in the production.

James

9 years ago

Reply to Vadim Tkachenko

Thanks a lot, Vadim. Points taken.

Also I would like to share my experience with PXC/Galera having used the technology for nearly 3 years now. Enough said about benefits of PXC/Galera. Here I just list some thing different:

1). Currently ALTER/OPTIMISE et al locks whole cluster. I am waiting for Galera version 4 since May 2015
2). mysqld got killed by OOM frequently and mysqld_safe can not start it. No evidence of memory leaks, nor of Galera cause, yet though
3). Performance penalty – it depends on application – my experience is about 30%
4). Not fault tolerant – all nodes are coupled and any slow node can bring the whole cluster down
5). Not resilient – error in schema design can largely affect cluster performance and sometimes deadlock

Before choosing a technology, I would suggest to read this blog: https://www.percona.com/blog/2013/05/14/is-synchronous-replication-right-for-your-app

I also have a legacy system which uses Master-Slave architecture and agree with Jean here: https://www.slideshare.net/JeanFranoisGagn/how-bookingcom-avoids-and-deals-with-replication-lag

Comments welcome 🙂
Thanks

john

9 years ago

great post. I look forward to more follow up as gr evolves. I’m wondering what happens in gr if an insert is not applied to a node and in some later case an update or delete comes along for that row. We saw this happen once in pxc (no idea how it happened) and the node crashed (as expected) and the cluster continued working until we did SST to rebuild the node. What would gr do in this case?

Also, for a WAN solution where we have a secondary pxc cluster in another data center is that supported by gr?