Recently I came across an interesting investigation about long-running transactions getting killed on an Aurora reader instance. In this article, I will explain why it is advisable to avoid long-running transactions on Aurora readers when frequent DDL operations are executed on the writer, or at least why you should be aware of how a DDL can impact your Aurora readers.
Aurora uses a shared storage volume, called the cluster volume, that holds the data for all DB instances in the cluster, whether that is a single Aurora instance or multiple instances (a writer and one or more Aurora Read Replicas).
Aurora replicas connect to the same storage volume as the primary DB instance and support only read operations. Adding a new Aurora replica therefore does not create a new copy of the table data; the replica simply attaches to the shared cluster volume that already contains all the data.
This design can cause issues on replica instances when DDL operations are executed on the writer.
Below is one such example.
```
mysql> SELECT AURORA_VERSION();
+------------------+
| AURORA_VERSION() |
+------------------+
| 3.02.2           |
+------------------+
1 row in set (0.22 sec)
```
Start a transaction on the reader:
```
mysql> SELECT connection_id();
+-----------------+
| connection_id() |
+-----------------+
|              21 |
+-----------------+
1 row in set (0.27 sec)

mysql> SELECT * FROM t WHERE old_column not like '%42909700340-70078987867%';
```
While the transaction is ongoing on the reader, execute any DDL against the same table on the writer:
```
mysql> ALTER TABLE t ADD COLUMN new_column VARCHAR(32);
```
Check the status on the reader; the transaction will have been terminated forcefully:
```
mysql> SELECT * FROM t WHERE old_column not like '%42909700340-70078987867%';
ERROR 2013 (HY000): Lost connection to MySQL server during query

mysql> SELECT connection_id();
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id: 22
Current database: db
+-----------------+
| connection_id() |
+-----------------+
|              22 |
+-----------------+
1 row in set (3.19 sec)
```
Now, let’s see what happens when a backup is running on a reader node and the writer receives a DDL for the table being backed up.
Take a logical backup of a table using mydumper:
```
mydumper --success-on-1146 --outputdir=/backups/ --verbose=3 \
         --host=aurora-reader --ask-password --tables-list=db.t
```
While the backup is ongoing on the reader, execute any DDL against the same table on the writer.
```
mysql> ALTER TABLE t ADD COLUMN new_column VARCHAR(32);
```
Check the status of the backup:
```
** Message: 16:04:51.108: Thread 1 dumping data for `db`.`t` into /backups/db.t.00000.sql | Remaining jobs: 6
..
..
** Message: 16:04:51.941: Waiting threads to complete
** Message: 16:04:51.941: Thread 2 shutting down
** (mydumper:44955): CRITICAL **: 16:04:55.268: Could not read data from db.t: Lost connection to MySQL server during query
```
So what is the issue?
As stated above, Aurora does not use binary log-based replication to replicate data to the readers. The underlying storage is shared by all the instances (writer and readers) within a cluster, and Aurora keeps the readers in sync with, let’s say, “magic”.
Now, because of this “magic”, when you perform a DDL operation on the writer instance, the reader instances are forced to terminate any long-running transactions that hold a conflicting metadata lock on the table, so that the DDL operation can continue on the writer instance.
Hence, if you are using Aurora replicas for logical backups (mysqldump/mydumper), or if you are running long-running queries or jobs on a reader instance, you may encounter the issue shown above.
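One pragmatic client-side mitigation is to treat a lost connection on the reader as a retryable event and re-run the query on a fresh connection. The sketch below is a minimal, hypothetical retry wrapper: the exception class and the query function are stand-ins I made up for illustration, not any specific driver's API, and note that a retried query sees the table as it exists *after* the writer's DDL, so this only helps when that is acceptable (e.g. idempotent reads).

```python
import time

# Stand-in for the driver error raised when the reader kills the
# connection (e.g. MySQL error 2013 "Lost connection ... during query").
class LostConnectionError(Exception):
    pass

def run_with_retry(query_fn, retries=3, backoff=0.5):
    """Re-run query_fn on a fresh connection if the reader drops us.

    query_fn is assumed to open its own connection, so calling it again
    after a dropped connection starts a brand-new transaction.
    """
    attempt = 0
    while True:
        try:
            return query_fn()
        except LostConnectionError:
            attempt += 1
            if attempt > retries:
                raise
            time.sleep(backoff * attempt)  # simple linear backoff

# Demo: a query that fails once (simulating a DDL-triggered kill),
# then succeeds on the retry.
calls = {"n": 0}

def flaky_query():
    calls["n"] += 1
    if calls["n"] == 1:
        raise LostConnectionError("Lost connection to MySQL server during query")
    return ["row1", "row2"]

print(run_with_retry(flaky_query))  # retries once, then returns the rows
```

This does not prevent the reader from killing the session; it only makes batch jobs resilient to it. Backups are harder to retry piecemeal, which is why the solution below suggests moving them off Aurora readers entirely.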
To understand this better, let’s compare what happens when a DDL operation is executed in a binary log-based replication environment versus in Aurora. The following are the high-level steps when a DDL gets executed.
Binary log-based replication:

- The DDL is executed and committed on the source (writer) instance.
- The DDL statement is written to the source’s binary log.
- The replica copies the event to its relay log and applies the DDL locally, asynchronously.
- If a long-running transaction on the replica holds a metadata lock on the table, the DDL simply waits on the replica (replication lag grows); the replica’s transaction is not killed.
Aurora replication:

- The DDL is executed on the writer instance against the shared cluster volume.
- Because the readers serve data from that same volume, they must pick up the new table definition almost immediately.
- To do so, each reader needs the metadata lock on the table; any long-running transaction or backup holding that lock is forcibly terminated (the connection is killed) so the DDL can proceed.
What are the issues?

- Long-running queries or transactions on Aurora readers can be killed at any time by a DDL on the writer.
- Logical backups (mysqldump/mydumper) taken from a reader can fail mid-dump, as shown above.
What is the solution?
Create an external replica of the Aurora cluster using binary log-based replication. This replica can be used to take logical backups or to execute some long-running queries that will not be interrupted by the DDL operation on the Aurora writer instance.
You may follow the Percona blog post on creating an external replica from Aurora using MyDumper, or review the AWS documentation page on replication between Aurora and an external MySQL instance.