Using Parallel Query with Amazon Aurora for MySQL

parallel query amazon aurora for mysqlParallel query execution is my favorite, non-existent, feature in MySQL. In all versions of MySQL – at least at the time of writing – when you run a single query it will run in one thread, effectively utilizing one CPU core only. Multiple queries run at the same time will be using different threads and will utilize more than one CPU core.

On multi-core machines – which is the majority of the hardware nowadays – and in the cloud, we have multiple cores available for use. With faster disks (i.e. SSD) we can’t utilize the full potential of IOPS with just one thread.

AWS Aurora (based on MySQL 5.6) now has a version which will support parallelism for SELECT queries (utilizing the read capacity of storage nodes underneath the Aurora cluster). In this article, we will look at how this can improve the reporting/analytical query performance in MySQL. I will compare AWS Aurora with MySQL (Percona Server) 5.6 running on an EC2 instance of the same class.

In Short

Aurora Parallel Query response time (for queries which can not use indexes) can be 5x-10x better compared to the non-parallel fully cached operations. This is a significant improvement for the slow queries.

Test data and versions

For my test, I need to choose:

  1. Aurora instance type and comparison
  2. Dataset
  3. Queries

Aurora instance type and comparison

According to Jeff Barr’s excellent article (https://aws.amazon.com/blogs/aws/new-parallel-query-for-amazon-aurora/) the following instance classes will support parallel query (PQ):

“The instance class determines the number of parallel queries that can be active at a given time:

  • db.r*.large – 1 concurrent parallel query session
  • db.r*.xlarge – 2 concurrent parallel query sessions
  • db.r*.2xlarge – 4 concurrent parallel query sessions
  • db.r*.4xlarge – 8 concurrent parallel query sessions
  • db.r*.8xlarge – 16 concurrent parallel query sessions
  • db.r4.16xlarge – 16 concurrent parallel query sessions”

As I want to maximize the concurrency of parallel query sessions, I have chosen db.r4.8xlarge. For the EC2 instance I will use the same class: r4.8xlarge.

Aurora:

MySQL on ec2

Table

I’m using the “Airlines On-Time Performance” database from http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time  (You can find the scripts I used here: https://github.com/Percona-Lab/ontime-airline-performance).

The table is very wide, 84 columns.

Working with Aurora PQ (Parallel Query)

Documentation: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-mysql-parallel-query.html

Aurora PQ works by doing a full table scan (parallel reads are done on the storage level). The InnoDB buffer pool is not used when Parallel Query is utilized.

For the purposes of the test I turned PQ on and off (normally AWS Aurora uses its own heuristics to determine if the PQ will be helpful or not):

Turn on and force:

Turn off:

The EXPLAIN plan in MySQL will also show the details about parallel query execution statistics.

Queries

Here, I use the “reporting” queries, running only one query at a time. The queries are similar to those I’ve used in older blog posts comparing MySQL and Apache Spark performance (https://www.percona.com/blog/2016/08/17/apache-spark-makes-slow-mysql-queries-10x-faster/ )

Here is a summary of the queries:

  1. Simple queries:
    • select count(*) from ontime where flightdate > '2017-01-01'
    • select avg(DepDelay/ArrDelay+1) from ontime
  2. Complex filter, single table:

3. Complex filter, join “reference” table