From the initial release, one of the biggest complaints I had about Group Replication is that it allowed “stale” reads and there was no way to prevent them or to even know that you read “stale” data. That was a huge limitation. Thankfully, Oracle released features to control the consistency levels, and it was exactly a year ago! I don’t know about you, but I personally was confused by naming it group_replication_consistency=’AFTER’ or ‘BEFORE’.
So now I want to try to make sense of it and share my understanding (even if it is one year later).
Setup:
- 3 Node Percona Server for MySQL 8.0.18 running in Group Replication.
We will start with the default group_replication_consistency=’EVENTUAL’ and work from there. So let’s consider a very simple table:
1 2 3 4 5 |
CREATE TABLE `t1` ( `id` int(11) NOT NULL AUTO_INCREMENT, `cnt` int(11) DEFAULT NULL, PRIMARY KEY (`id`) ) |
With over 10mln rows:
1 2 3 4 5 6 |
select count(*) from t1; +----------+ | count(*) | +----------+ | 10485760 | +----------+ |
And we will do a very simple action. On Node 1 we will update the table and on Node 2 we will read data.
Node 1:
1 2 3 |
UPDATE t1 SET cnt=11; Query OK, 10485760 rows affected (1 min 20.23 sec) Rows matched: 10485760 Changed: 10485760 Warnings: 0 |
And immediately after that on the Node 2:
1 2 3 4 5 6 7 |
select cnt from t1 where id=10001; +------+ | cnt | +------+ | 10 | +------+ 1 row in set (0.00 sec) |
There are a few points to note:
- On Node 1 it took 1min 20 sec for the transaction to execute
- On Node 2 we got the result immediately, but we essentially got a “stale” read in that data was already updated, but we got an old version of it.
How can we get a better outcome:
group_replication_consistency=’BEFORE’;
Let’s look into the consistency level ‘BEFORE’. It says that a transaction on Node 2 will wait until the previously committed transaction on the Node 1 transaction is committed also on Node 2. So let’s change the set session to group_replication_consistency=’BEFORE’; on Node 2 and execute a similar update again.
Node 1 (still group_replication_consistency=’EVENTUAL’;):
1 2 3 |
UPDATE t1 SET cnt=12; Query OK, 10485760 rows affected (1 min 18.66 sec) Rows matched: 10485760 Changed: 10485760 Warnings: 0 |
And after this, on Node 2 (with set session group_replication_consistency=’BEFORE’;):
1 2 3 4 5 6 7 |
select cnt from t1 where id=10001; +------+ | cnt | +------+ | 12 | +------+ 1 row in set (1 min 11.45 sec) |
So there are very notable changes:
- The transaction on Node 2 returned the correct result now.
- But now it took 1 min 11 sec to return it (instead of 0sec as in the previous case). Basically, the transaction waited on the transaction from Node 1 to be applied on Node 2
- The execution time on Node 1 did not change.
This mode allowed us to do exactly what we wanted – prevent stale reads. Great outcome! So what about group_replication_consistency=’AFTER’ ?
group_replication_consistency=’AFTER’;
This mode says that a transaction on Node 1 will not return OK until it makes sure that other nodes applied the transaction. To see it in action, we will put Node 1 into group_replication_consistency=’AFTER’ and Node 2 into group_replication_consistency=’EVENTUAL’;.
Node 1 (in group_replication_consistency=’AFTER’):
1 2 3 |
UPDATE t1 SET cnt=13; Query OK, 10485760 rows affected (3 min 0.46 sec) Rows matched: 10485760 Changed: 10485760 Warnings: 0 |
After that, Node 2 (in group_replication_consistency=’EVENTUAL’):
1 2 3 4 5 6 7 |
select cnt from t1 where id=10001; +------+ | cnt | +------+ | 13 | +------+ 1 row in set (0.00 sec) |
Here the situation is different. Now the execution time on Node 1 doubled, as the transaction waits to be committed on all nodes and after that on Node 2 the execution is immediate.
This mode still avoids “stale” reads, but in this case, we shifted the wait time from Node 2 to Node 1, and this is how we can view the difference between ‘BEFORE’ and ‘AFTER’ consistency modes in Group Replication.
Both modes provide a consistent view, but:
- In ‘BEFORE’ mode: the readers on secondary nodes will be blocked, waiting on the moment when the consistent view is available, and
- In ‘AFTER’ mode the writers are blocked till other nodes get a consistent view.
So which mode to choose? Actually I think it is good to have an option here. You can choose, if you want, to put wait time on your readers or on your writers; the decision is up to how your application is designed.
Thank you for the post Vadim. And indeed it depends of the application. If you do a majority of reads and a smaller amount of writes AND you want to avoid stale reads, then you accept to pay the price at write time and use AFTER. We also have 2 others modes : BEFORE_AND_AFTER (and you pay each time, we are very close to synchronous replication, even it’s not yet) and also BEFORE_ON_PRIMARY_FAILOVER, which works as EVENTUAL but in case of failover it insure that the new PRIMARY applied everything before acting as so.
Something very nice is that you can accept to pay the price only when it’s really needed and switch the consistency level you need per session.
Very interesting,
It also shows what MySQL Group Replication does not have pipelining (or whatever right term would be) – the statement which updates lots of data is not started on other nodes until it is completed on the first node hence you have more than double latency for data propagation.
Also if query itself should take 1m 20 sec on the node to execute and we have 3min for replication this means we also have added 20sec overhead of transferring the data
You are right Peter, transactions are propagated at commit (exactly as PXC and Galera, Galera 4 added streaming replication on demand IIRC and they advise to only use it when specifically needed). For all these distributed systems, the recommendation has always been to use small transactions.
About the time Vadim experienced (3m instead of 1m20), don’t forget that GR uses Mencius like Paxos implementation, where majority is enough to continue the process, with consistency AFTER, all the nodes need to acknowledge the prepare of the transaction. So in this case you have the latency of the network added for the data transfer to all members and the coordination too.
For more info, see https://mysqlhighavailability.com/group-replication-consistent-reads-deep-dive/ and https://lefred.be/content/mysql-innodb-cluster-consistency-levels/
Cheers.
You can use a MySQL Shell Plugin to see where the time is spent… see https://twitter.com/lefred/status/1220638238005760001