While we should aim for small and fast transactions with Galera, at some point you may still want to run a single large transaction. What is involved?
First, transaction size is supposed to be controlled by two settings: wsrep_max_ws_rows and wsrep_max_ws_size. The first variable is not yet enforced and has no effect – see here and here – so don’t bother tuning this knob just yet. In my opinion, only one of the two is worth implementing anyway: a limit by rows is hard for a DBA to reason about, since row sizes can differ wildly between workloads.
The second variable restricts the writeset size in bytes and gives much better control over cluster performance: if your network and CPUs can only process N bytes per second, this is the variable to enforce. The maximum allowed value for this setting is 2GB, with a default of 1GB. You can actually set it higher than 2GB, but in my tests only 2GB is enforced.
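For example, to make a cap persistent across restarts you could set it in my.cnf. A minimal sketch – the 128MB figure here is a made-up placeholder, not a recommendation; pick a limit that matches your own workload and network:

[mysqld]
# Cap replication writesets at 128MB (134217728 bytes); transactions
# that produce a larger writeset will fail at COMMIT.
wsrep_max_ws_size = 134217728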
Let’s see what happens when we adjust this setting to large values. I have a sysbench table with 10M rows below and will update all rows in a single transaction.
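The exact schema isn’t shown in this post, but a standard sysbench OLTP table looks roughly like this (renamed to t to match the file listing below; k is the column updated in the tests):

CREATE TABLE t (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  k INT UNSIGNED NOT NULL DEFAULT '0',
  c CHAR(120) NOT NULL DEFAULT '',
  pad CHAR(60) NOT NULL DEFAULT '',
  PRIMARY KEY (id),
  KEY k_1 (k)
) ENGINE=InnoDB;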
-rw-rw---- 1 mysql mysql 8.5K Oct 20 03:25 t.frm
-rw-rw---- 1 mysql mysql 2.4G Oct 20 03:46 t.ibd

node1 mysql> set global wsrep_max_ws_size=1024*1024*1024*2;
Query OK, 0 rows affected (0.00 sec)

node1 mysql> update t set k=k+1;
ERROR 1180 (HY000): Got error 5 during COMMIT
In the MySQL error log, this translates to:
2015-10-20 04:27:14 10068 [Warning] WSREP: transaction size limit (2147483648) exceeded: 2147516416
2015-10-20 04:27:14 10068 [ERROR] WSREP: rbr write fail, data_len: 0, 2
Let’s see what happens if we set this to 4GB:
node1 mysql> set global wsrep_max_ws_size=1024*1024*1024*4;
Query OK, 0 rows affected, 1 warning (0.00 sec)

node1 mysql> show warnings;
+---------+------+-----------------------------------------------------------+
| Level   | Code | Message                                                   |
+---------+------+-----------------------------------------------------------+
| Warning | 1292 | Truncated incorrect wsrep_max_ws_size value: '4294967296' |
+---------+------+-----------------------------------------------------------+
1 row in set (0.00 sec)

node1 mysql> select @@wsrep_max_ws_size;
+---------------------+
| @@wsrep_max_ws_size |
+---------------------+
|          4294901759 |
+---------------------+
1 row in set (0.00 sec)
We got a warning that the value exceeds what is allowed – but it was truncated not to 2GB, but to 4294901759, which is 65536 below the 32-bit UNSIGNED maximum of 4294967295. So let’s try the transaction again:
node1 mysql> update t set k=k+1;
ERROR 1180 (HY000): Got error 5 during COMMIT
We got the same error on the client, but in the MySQL error log it looks a bit different:
2015-10-20 04:54:14 10068 [ERROR] WSREP: Maximum writeset size exceeded by 1737995426: 90 (Message too long)
	 at galera/src/write_set_ng.hpp:check_size():662
2015-10-20 04:54:14 10068 [ERROR] WSREP: transaction size exceeded
If you really must process a large number of rows, one way to reduce the writeset size is to set binlog_row_image to minimal. Let’s test this again with a 2GB wsrep_max_ws_size and see how it goes:
node1 mysql> set global wsrep_max_ws_size=1024*1024*1024*2;
Query OK, 0 rows affected (0.00 sec)

node1 mysql> select @@wsrep_max_ws_size;
+---------------------+
| @@wsrep_max_ws_size |
+---------------------+
|          2147483648 |
+---------------------+
1 row in set (0.00 sec)

node1 mysql> SET GLOBAL binlog_row_image=minimal;
Query OK, 0 rows affected (0.00 sec)

node1 mysql> select @@binlog_row_image;
+--------------------+
| @@binlog_row_image |
+--------------------+
| MINIMAL            |
+--------------------+
1 row in set (0.00 sec)

node1 mysql> show global status like 'wsrep_replicated_bytes';
+------------------------+-------------+
| Variable_name          | Value       |
+------------------------+-------------+
| wsrep_replicated_bytes | 13211964556 |
+------------------------+-------------+
1 row in set (0.00 sec)

node1 mysql> update t set k=k+1;
Query OK, 10000000 rows affected (11 min 18.33 sec)
Rows matched: 10000000  Changed: 10000000  Warnings: 0

node1 mysql> show global status like 'wsrep_replicated_bytes';
+------------------------+-------------+
| Variable_name          | Value       |
+------------------------+-------------+
| wsrep_replicated_bytes | 13402597135 |
+------------------------+-------------+
1 row in set (0.00 sec)

node1 mysql> select (13402597135-13211964556)/1024/1024 as ws_size;
+--------------+
| ws_size      |
+--------------+
| 181.80139446 |
+--------------+
1 row in set (0.01 sec)
In our tests, this cut the writeset to less than a tenth of its original size (about 182MB instead of over 2GB), since MINIMAL row images log only the primary key and the changed columns instead of full before and after row images. Thanks to my colleague Przemyslaw for pointing this out.
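As a side note, binlog_row_image is a dynamic variable that can also be set at the session level. A sketch like the following (assuming you want other sessions to keep the default FULL images) would scope the smaller row images to just the one big transaction:

-- Apply MINIMAL row images only to this session's large update,
-- leaving other sessions on the default FULL row image.
SET SESSION binlog_row_image = 'minimal';
UPDATE t SET k = k + 1;
SET SESSION binlog_row_image = 'full';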
Now we know how big we can go in terms of writeset size, and how to go further with minimal row images – although this does not mean you can or should allow it. Make sure to set a sane limit on writeset size based on your workload and realistic performance expectations. Large transactions will not only cause unexpected performance issues with your cluster, but will also lead to usability issues in the form of increased deadlocks. Lastly, make sure to review the limitations that come with switching to Galera-based clusters here for an enjoyable experience :).
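If you really do have to touch every row, a common alternative is to split the work into many small transactions. A rough sketch, assuming the auto-increment primary key id from the sysbench schema above and autocommit enabled (each UPDATE then commits and replicates as its own small writeset):

DELIMITER //
CREATE PROCEDURE update_in_chunks()
BEGIN
  DECLARE max_id BIGINT;
  DECLARE start_id BIGINT DEFAULT 0;
  SELECT MAX(id) INTO max_id FROM t;
  -- With autocommit=1, each UPDATE below commits as its own transaction
  -- and replicates as a small, independent writeset.
  WHILE start_id < max_id DO
    UPDATE t SET k = k + 1
     WHERE id > start_id AND id <= start_id + 100000;
    SET start_id = start_id + 100000;
  END WHILE;
END//
DELIMITER ;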
Jervin,
I think there is another interesting thing to test when it comes to large transaction sizes.
If I’m not mistaken, writeset certification is serialized, which means that if you have a transaction that modifies 10M rows and it takes 2 seconds to certify (for example), the cluster will have to pause all other write traffic for that duration. Or was this changed in a more recent version?
Peter,
This is correct. Once that large transaction reaches certification on the remote nodes, nothing else can be certified for the duration, in which case those nodes will likely send flow control.
Jervin,
Is this instrumented in some way? Is it possible to see how long certification is taking?
Peter,
Nothing specific, really. You can check flow control and wsrep_cert_deps_distance, but these are only approximations. Based on this idea I opened https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1512940
Perhaps we could also extend slow query logging to include certification time.
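For what it’s worth, a quick look at those approximations (these are standard Galera status counters; how closely they track certification time is workload-dependent):

-- Flow control activity on this node (time paused, FC messages sent/received):
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control%';
-- Average distance between the highest and lowest seqnos that can
-- potentially be applied in parallel:
SHOW GLOBAL STATUS LIKE 'wsrep_cert_deps_distance';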
Hi,
is the only way to avoid the error “Got error 90 ‘Message too long’ during COMMIT” to split the transaction into smaller chunks?