Migrating to Amazon Aurora: Optimize for Binary Log Replication

Migrating to Amazon Aurora 1In this Checklist for Success series, we will discuss reducing unknowns when hosting in the cloud using and migrating to Amazon Aurora. These tips might also apply to other database as a service (DBaaS) offerings.

In our previous article, we discussed the importance of continuous query performance analysis, especially in Amazon Aurora where there is less diagnostic visibility compared to running on EC2 or on-premise. Aside from uptime though, we need a lot more from our data, and we definitely cannot isolate it in Aurora.

Next on our checklist is that at one point or another, we will need to use asynchronous replication. Amazon Aurora has an excellent reputation for absorbing intense amounts of writes, and for many cases where you need an asynchronous replica, any replica can have potential issues catching up.

Different Clusters for Different Workloads

Critical workloads and datasets cannot rely on a single copy of their data. With Amazon Aurora, predictable performance means avoiding mixing workloads within your production cluster. While read heavy workloads might fit easily into read-replicas, reporting or analytics workloads might not be a good fit to execute on your main cluster where read-what-you-write profiles are normally found. You can either delegate this on a separate asynchronous replica as separate Amazon Aurora cluster, or another that runs on an EC2 instance (as an example). This is also true if, say, your analytics or reporting workload generate a significant amount of disk IOPs.

Amazon Aurora IO bills operations per million. You might save some money running disk-heavy analytics operations on a replica running on an i3 instance with a local NVMe for example. Similarly, running an async replica on an EC2 instance or on-premise allows you to take your own independent backups, or just an extra level of redundancy.

Multi-Threaded Replication

It is a known fact that MySQL asynchronous replication performance is subject to some limitations. The biggest one is that, by default, it is single-threaded. MySQL 5.6 introduced multi-threaded replication at one database per thread. This did not apply to the majority of use cases, as workloads vary per database and therefore create an imbalance. With MySQL 5.7 (Aurora 2.0), there have been additional improvements such as an alternative algorithm in parallelizing thread execution that depends on certain behaviors regarding how acting primary servers write binary log entries.

With that said, certain multi-threaded replication variables (transaction_write_set_extraction) require that the binlog format is set to ROW. This might sound counter-intuitive because the ROW binlog format actually can increase the replication workload. While ROW format reduces the ambiguity from potentially non-deterministic statements that could cause a replica to drift and become inconsistent, critical operations (schema changes) and optimizations (MTS) requires that you use ROW binlog format.

It should be apparent by now that the only reasonable path forward to improving asynchronous replication is via the multi-threaded approach. Along with that, there is the need for ROW binlog format. Any design effort should always include this fact if async replication lag is considered a risk. For the basics, configuration options like slave_compressed_protocol  and binlog_row_image can reduce network churn. In the deep end, reducing dataset hotspots, ensuring tables have PRIMARY KEYs and embracing multi-threaded optimization can also go a long way.

Additional Benefits

While running certain read-heavy queries on an async replica, or ensuring you have access to physical datafiles are common use cases, being able to switch to another location (region) or just simply another cluster might also be necessary for some instances.

  • Adding or dropping a column/index on a large table can be achieved with either pt-online-schema-change or gh-ost. But for cases where time is a constraint, applying schema changes on an asynchronous cluster and switching to that sometimes pays for itself.
  • Configuration changes or upgrades that require a cluster restart can take seconds or even minutes. Wouldn’t it be nice if you already had a cluster with all these changes ready to take over at a moment’s notice? Or even fail back to the original if there was an unforeseen issue?

Stay “tuned” for part three.

Meanwhile, we’d like to hear your success stories in Amazon Aurora in the comments below!

Don’t forget to come by and see us at AWS re:Invent, November 26-30, 2018 in booth 1605! Percona CEO Peter Zaitsev will deliver a keynote on MySQL High Availability & Disaster Recovery, Tuesday, November 27 at 1:45 PM – 2:45 PM in the Bellagio Hotel, Level 1, Gauguin 2

Share this post