Shard-Query adds parallelism to queriesJustin Swanhart
Preamble: On performance, workload and scalability:
MySQL has always been focused on OLTP workloads. In fact, both Percona Server and MySQL 5.5.7rc have numerous performance improvements which benefit workloads that have high concurrency. Typical OLTP workloads feature numerous clients (perhaps hundreds or thousands) each reading and writing small chunks of data. The recent improvements to MySQL make it scale better for this workload when more resources (such as additional CPUs) are added. By scaling better I mean that it is able to take advantage of all available resources in order to handle the workload. One reason that it works best for this workload is that a single query executes in a single thread. MySQL never takes advantage of more than a single CPU when aggregating data and fetching rows from the buffer pool, with respect to a single query, but it can run many queries at once.
There are workloads other than OLTP and the recent optimizations to MySQL still leave a lot of low hanging fruit where these are concerned. This is particularly true for OLAP workloads. While I’m not going to diverge into a discussion of how OLAP varies from OLTP, it suffices to say that a typical OLAP workload features a low number of concurrent queries which each examine large amounts of data. Since a single query is single threaded in MySQL, the new optimizations don’t really help with this workload.
The following tests assume a workload consisting of a small number of concurrent queries (or only one) to demonstrate how much improvement could be made to MySQL so that is could better utilize all available resources (that is, scale better) when running small numbers of queries which examine a lot of data.
What is Shard-Query?
Shard-Query was initially conceived as a utility to add parallelism to horizontally partitioned data sets by running queries against each host in parallel, with the added feature of supporting aggregation. Then I hit upon the idea of taking SQL constructs like IN and BETWEEN and making these queries execute in parallel on a each host. If you have a sharded data set, then this gives you the opportunity for additional parallelism for each query. If you have only a single server, but it has enough resources to answer queries in parallel, then it can be used to add parallelism to queries which use IN or BETWEEN clauses. This added parallelism can have significant performance advantages as demonstrated in this blog post.
Many database servers can add this parallelism natively, but most of those are not open source. In the future, Shard-Query can be extended to other database servers such as PostgreSQL or Firebird fairly easily.
What machine did I use?
I used MySQL 5.5.7rc on a powerful Cisco UCS server with 12 real cores and 384GB of ram. The amount of ram is significantly larger than my already hefty 55GB testing data set, so this means that if MySQL could fully utilize all cores for my workload, then this test would be CPU bound.
What data did I use?
I loaded 55GB of the same data used in this earlier post into a partitioned InnoDB table.
/*!50500 PARTITION BY RANGE COLUMNS(FlightDate)
(PARTITION p2 VALUES LESS THAN ('1988-01-01') ENGINE = InnoDB,
PARTITION p3 VALUES LESS THAN ('1988-02-01') ENGINE = InnoDB,
PARTITION p253 VALUES LESS THAN ('2008-12-01') ENGINE = InnoDB,
PARTITION p254 VALUES LESS THAN (MAXVALUE) ENGINE = InnoDB) */
What queries did I use?
I used a version of the queries in that same blog post. The original queries tend to filter on the Year column. I partitioned the table into months using the FlightDate column using the improved MySQL 5.5 partitioning options which work directly on columns without the need to use TO_DAYS. To accommodate my partitioning schema I modified the queries to use the FlightDate column instead of the Year column. See the “full disclosure” section at the end for the complete SQL.
These tests were done using ‘run_query.php’, which is the example application which comes with Shard-Query. As the name implies, it takes a list of queries in a file (or stdin) and a config file. It runs the SQL via ShardQuery and it prints the results.
This set of queries tests the most basic aggregation (count(*)) on a range of records. This table is partitioned by month which means that MySQL can use partition pruning to reduce the amount of data which must be examined. With this in mind, I modified Vadim’s queries to use the FlightDate column in the WHERE clause instead of Year.
Each iteration reads an additional year of data. That is, the first query reads one year, and the last query reads 21 years of collected flight data.
For example (the final query):
where FlightDate BETWEEN '1988-01-01' and '2008-12-31';
The reason that Shard-Query performs better is that it turns the OLAP query into something more like OLTP. Instead of getting one big chunk of data in one query, it runs many smaller queries each requesting significantly less data. On the other hand, MySQL 5.5.7 does not do this on its own. This is the low hanging fruit I was talking about. Even though the data is partitioned, MySQL will examine each partition serially. In the end, this means that things get slower as the query has to examine larger volumes of data.
Regarding the performance of Shard-Query, this machine has 12 real cores and 12 virtual cores, so we don’t see any advantage after increasing the number of workers past 24. The query becomes CPU bound at that point. If I needed more performance I could divide the data between two or more shards, or if possible, I could add more CPU cores. Regardless, even with a single server Shard-Query will perform much better than regular MySQL as the volume of data grows. Remember that this workload fits entirely in the buffer pool so adding CPUs will help only until we run out of memory bandwidth.
My second test involved the next four queries on Vadim’s list. The purpose of this test is to demonstrate that Shard-Query works with GROUP BY and other constructs.
Q1 - SELECT count(*) from ontime.ontime where FlightDate between '1988-01-01' and '1988-12-31';
Q2 - SELECT DayOfWeek, count(*) AS c FROM ontime WHERE FlightDate BETWEEN '2000-01-01' AND '2008-12-31' GROUP BY DayOfWeek ORDER BY c DESC;
Q3 - SELECT Origin, count(*) AS c FROM ontime WHERE DepDelay>10 AND FlightDate BETWEEN '2000-01-01' AND '2008-12-31' GROUP BY Origin ORDER BY c DESC LIMIT 10;
Q4 - SELECT carrier, count(*) FROM ontime WHERE DepDelay>10 AND FlightDate between '2007-01-01' and '2007-12-31' GROUP BY carrier ORDER BY 2 DESC;
As you can see, each of the queries runs significantly faster than just running the SQL via MySQL.
The remainder of Vadim’s queries use subqueries in the FROM clause, which Shard-Query does not yet add parallelism to. I plan to add support for those queries though, and I’ll post a follow-up when I do.
— Full Disclosure —
This is a text file containing the information.