In this Checklist for Success series, we will discuss reducing unknowns when hosting in the cloud using and migrating to Amazon Aurora. These tips might also apply to other database as a service (DBaaS) offerings.
Previous blogs in the migrating to Amazon Aurora series:
- Migrating to Amazon Aurora: Reduce the Unknowns
- Migrating to Amazon Aurora: Optimize for Binary Log Replication
The whole premise of a database as a service offering is that you do not need to worry about the operating the service, you just need to use it. But all DBaaS offerings have limitations as well as strengths. You should not get too comfortable with all the niceties of such services. You need to remain flexible and ultimately design to prevent database failure.
Have a Grasp of Your Cluster Behavior
Disable lab mode. You should not depend on this setting to take advantage of new features. It can break and leave you in a bad position. If you rely on this feature, and designed your application around it, you might find yourself working around the same problem if, for example, you are running the same queries on a non-Aurora deployment. This is not to say that you shouldn’t take advantage of all Aurora features. Lab mode, however, is “lab mode” and should not be enabled on a production environment.
Separate parameter group per cluster to keep your configuration changes isolated. In some cases, you might have a group of clusters that operate the same workload, But this should be rare, and also prohibits you from making rolling changes against each cluster.
Some might point out that syncing the parameter groups in this situation might be difficult. It really isn’t, and you don’t need any complicated tools to do it. For example, you can use pt-config-diff to regularly inspect the differences between the runtime config on each cluster and identify or resolve differences.
While it is ideal that your clusters are always up to date, it can be intrusive to let them run on their own. Especially if your workload is not dependent on high/low traffic periods. I recommend having more control over the upgrade process, and this excellent community blog post from Renato on how to do just that is worth a read.
Don’t Put All Your Eggs in One Basket
On another note, Aurora can hold up to 64TB of data. Yes, that’s big. It might not be a problem and some of you might even be excited about this potential. But when you think about it, do you really want to store that amount of data in a single basket? What if you need to analyze this data for a particular time period. Is the cost worth it? Surely at some point, you will need to transport that data somewhere.
We’ve seen problems even at sizes less than 2TB. If you need to rebuild an asynchronous replica, for example, it takes a while. You have to be really ahead of capacity planning to ensure that you add new read-replicas when needed. This can be a challenge when you are on the spot. A burst of traffic might already be over before the replica provisioning is complete no matter how fast Aurora replica provisioning is.
Another challenge with datasets that are too big is when you have large tables. Schema changes become increasingly difficult in these situations, especially when such tables are subject to highly concurrent reads and writes. Recall that in the blog Migrating to Amazon Aurora: Optimize for Binary Log Replication we recommend setting binlog_format to ROW to be able to use tools like gh-ost in these types of situations.
High Availability On Your Terms
One limitation with Aurora cluster instances is that there is no easy way of taking a misbehaving read-replica out of rotation. Sure, you can delete the read replica. That leads to transient errors to the application, however, and impacts performance due to the time lag required to replace it to cover the workload.
Similarly, a misbehaving query can easily spoil the whole cluster, even if that query is spread out evenly to the read-replicas. Depending on how quickly you can disable the query, it might result in losing some business in the process. It would be nice if you could blackhole, rewrite or redirect such queries on demand so as to isolate the impact (or even fix it immediately).
Lastly, certain situations require that you restart the cluster. However, doing so could violate your uptime SLA. These situations can occur when you need to apply a non-dynamic cluster parameter, or you need to perform a cluster upgrade.
You can avoid most of these problems by not solely relying on Aurora’s own implementation of high availability. I say this because they are continuously improving this process. For now, however, you can use tools like ProxySQL to redirect traffic both in-cluster and between clusters replicating asynchronously. Percona has existing blog posts on this topic: Leveraging ProxySQL with AWS Aurora to Improve Performance, Or How ProxySQL Out-performs Native Aurora Cluster Endpoints and How to Implement ProxySQL with AWS Aurora.
Meanwhile, we’d like to hear your success stories in migrating to Amazon Aurora in the comments below!
Don’t forget to come by and see us at AWS re:Invent, November 26-30, 2018 in booth 1605! Percona CEO Peter Zaitsev will deliver a keynote on MySQL High Availability & Disaster Recovery, Tuesday, November 27 at 1:45 PM – 2:45 PM in the Bellagio Hotel, Level 1, Gauguin 2