Apologies for the silly title, but the issue is a real one, even though it is not a new thing. Schema upgrades are not an ordinary operation in Galera. For the subject at hand, the bottom line is: under the default Total Order Isolation (TOI) method, “the cluster replicates the schema change query as a statement before its execution.” What this means in practice is that a DDL issued in one node is replicated to other nodes in the cluster before it is even executed in the source node, let alone completed successfully.
As a result, a DDL may fail on one node and succeed on another, without raising loud alerts or stopping nodes to protect against data inconsistency. This is not a bug in itself but rather a design compromise. With the recent support for atomic DDLs added in MySQL 8.0, we may see improvements in this area in Galera over time as well. For now, the warning stands: DDLs are not atomic in Galera, even in 8.0.
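To see which schema upgrade method a node is using, you can check the wsrep_OSU_method variable. The sketch below assumes a Galera-based cluster (e.g. Percona XtraDB Cluster) where these standard wsrep variables are available:

```sql
-- TOI (Total Order Isolation) is the default schema upgrade method
SHOW VARIABLES LIKE 'wsrep_OSU_method';

-- The method can be changed per session, e.g. to RSU
-- (Rolling Schema Upgrade), which applies the DDL locally
-- only and does not replicate it to the other nodes
SET SESSION wsrep_OSU_method = 'RSU';
```

Under RSU the DDL must then be applied manually on each node, which trades the TOI consistency risk described above for operational effort.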
How Big of an Issue Is This?
To be fair, it’s rare – we do not see this happening often.
How Does the Problem Manifest Itself in Practice? Give Us an Example!
We recently investigated a case where a DBA issued a TRUNCATE TABLE statement on one node, and it was taking some time to complete. His reaction was to hit Ctrl+C in the session where the DDL was issued. That session seemed to hang, so he ended up opening a new session and issuing a full DELETE on that same table. It also took time to complete; this time, however, he was more patient. The problem was: once the statement completed, he noticed that the two other nodes had disappeared from the cluster. Upon closer inspection of the logs, both nodes reported the following error before being evicted from the cluster:
[ERROR] Slave SQL: Could not execute Delete_rows event on table mydb.target; Can't find record in 'target', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 1073, Error_code: 1032
The DELETE was successful on the source node but failed to complete on the other two nodes. Why? Because the target table was already empty there: the previous TRUNCATE TABLE had effectively been executed on those nodes. But was the TRUNCATE TABLE also successful on the source node? It turned out it wasn't. In fact, the TRUNCATE was recorded in the slow query log of the source node like this:
# Time: (...)
# User (...)
# Schema: building Last_errno: 1317 Killed: 0
# Query_time: 145.753247 Lock_time: 0.000000 Rows_sent: 0 Rows_examined: 0 Rows_affected: 0
# Bytes_sent: 44 Tmp_tables: 0 Tmp_disk_tables: 0 Tmp_table_sizes: 0
# QC_Hit: No Full_scan: No Full_join: No Tmp_table: No Tmp_table_on_disk: No
# Filesort: No Filesort_on_disk: No Merge_passes: 0
# No InnoDB statistics available for this query
Note the Last_errno indicated above:
$ perror 1317
MySQL error code 1317 (ER_QUERY_INTERRUPTED): Query execution was interrupted
Our colleague Przemek reproduced a similar problem, which he described in bug PXC-3536: Modify processing to not allow threads/queries to be killed if the thread is in TOI. The Percona Engineering team has since fixed the issue, leaving the following note on the bug tracker:
Fixed in 5.7[.33-31.49] and 8.0[.22-13.1]. Any attempt to kill the query or connection while in TOI will fail. Note that due to timing, the “kill query” may sometimes return success, but the operation will still have succeeded.
In the case we investigated, the MySQL error log of the source node had the answer: there was some other higher-priority transaction already in place on that node that prevented it from executing the TRUNCATE:
[Note] WSREP: MDL conflict db=mydb table=target ticket=MDL_SHARED_WRITE solved by abort
To clarify, this cluster was receiving writes from multiple nodes, and the higher-priority transaction was applied by a replication thread, which usually has precedence over local events when a conflict is detected.
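One way to spot this kind of situation is the wsrep_local_bf_aborts status counter, which tracks how often local transactions have been aborted in favor of replicated, high-priority ones. A minimal check, assuming a Galera-based node:

```sql
-- Number of local transactions aborted by high-priority
-- ("brute force") replicated transactions on this node;
-- a growing counter indicates conflicts between local writes
-- and writes replicated from other nodes
SHOW GLOBAL STATUS LIKE 'wsrep_local_bf_aborts';
```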
As you may have suspected, the TRUNCATE was not found in the binary log of the source node; since the statement did not complete, it was not logged. Only the subsequent DELETE statement was there.
As for the binary logs of the other two nodes, they included both the TRUNCATE as well as the DELETEs covering all rows in the target table. Those nodes no longer had the rows in place and thus were unable to remove them, which caused the data inconsistency that followed.