Posted by Bradley C. Kuszmaul and David Wells
Executive Summary: A MySQL straight join can speed up a query that is very similar to TPC-H Q2 by a factor of 159 on MySQL.
Recently, we began looking at TPC-H performance on MySQL. Our early
tests yielded unexpectedly poor performance for MyISAM, InnoDB and the
Tokutek storage engine. So we decided to take at look at each query
individually to see what could be done. This post is about Query 2.
Before going further, let us be clear – this is NOT “TPC-H”
benchmarking. The TPC prescribes methods and procedures for measuring
performance, and we didn’t follow the rules (which you can read at
http://www.tpc.org/tpch/spec/tpch2.8.0.pdf). This exercise is about
understanding what can be done to improve queries that TPC-H
represents.
Our tests were run on dual-core 2GHz AMD64 boxes w/ 2 GB memory,
running Centos5.1 and MySQL 5.1.30, using our Fractal Tree Storage
Engine. We built a small (SF=10, that is 10GB) database using the
standard TPC-H scripts on a 1-TB SATA drive. The hardware is Wimpy,
but available. The problem size is small, but it makes it easier to
run experiments.
We focused on one of the slow-running queries, Q2. Our first test of
Q2 finished in 1830 seconds. Ugh. Terrible. We’ve heard that other
databases can run this query orders of magnitude faster.
The TPC-H document describes Q2 as
The Minimum Cost Supplier Query finds, in a given region, for each part of a certain type and size, the supplier who can supply it at minimum cost. If several suppliers in that region offer the desired part type and size at the same (minimum) cost, the query lists the parts from suppliers with the 100 highest account balances. For each supplier, the query lists the supplier’s account balance, name and nation; the part’s number and manufacturer; the supplier’s address, phone number and comment information.
The MySQL query is
|
1 |
<br>select<br> s_acctbal,<br> s_name,<br> n_name,<br> p_partkey,<br> p_mfgr,<br> s_address,<br> s_phone,<br> s_comment<br>from<br> part,<br> supplier,<br> partsupp,<br> nation,<br> region<br>where<br> p_partkey = ps_partkey<br> and s_suppkey = ps_suppkey<br> and p_size = 35<br> and p_type like '%STEEL'<br> and s_nationkey = n_nationkey<br> and n_regionkey = r_regionkey<br> and r_name = 'EUROPE'<br> and ps_supplycost = (<br> select<br> min(ps_supplycost)<br> from<br> partsupp,<br> supplier,<br> nation,<br> region<br> where<br> p_partkey = ps_partkey<br> and s_suppkey = ps_suppkey<br> and s_nationkey = n_nationkey<br> and n_regionkey = r_regionkey<br> and r_name = 'EUROPE'<br> )<br>order by<br> s_acctbal desc,<br> n_name,<br> s_name,<br> p_partkey<br>limit 100;<br> |
MySQL came up with this query plan:
|
1 |
<br>+----+--------------------+----------+--------+----------------------+--------------+---------+-----------------------------------------------------------------+------+----------------------------------------------+<br>| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |<br>+----+--------------------+----------+--------+----------------------+--------------+---------+-----------------------------------------------------------------+------+----------------------------------------------+<br>| 1 | PRIMARY | region | ALL | PRIMARY | NULL | NULL | NULL | 5 | Using where; Using temporary; Using filesort |<br>| 1 | PRIMARY | nation | ref | PRIMARY,nation_fk1 | nation_fk1 | 4 | tpch10G_tokudb.region.r_regionkey | 3 | |<br>| 1 | PRIMARY | supplier | ref | PRIMARY,supplier_fk1 | supplier_fk1 | 4 | tpch10G_tokudb.nation.n_nationkey | 1200 | |<br>| 1 | PRIMARY | partsupp | ref | PRIMARY,partsupp_fk2 | partsupp_fk2 | 4 | tpch10G_tokudb.supplier.s_suppkey | 128 | |<br>| 1 | PRIMARY | part | eq_ref | PRIMARY | PRIMARY | 4 | tpch10G_tokudb.partsupp.ps_partkey | 1 | Using where |<br>| 2 | DEPENDENT SUBQUERY | region | ALL | PRIMARY | NULL | NULL | NULL | 5 | Using where |<br>| 2 | DEPENDENT SUBQUERY | nation | ref | PRIMARY,nation_fk1 | nation_fk1 | 4 | tpch10G_tokudb.region.r_regionkey | 3 | Using index |<br>| 2 | DEPENDENT SUBQUERY | supplier | ref | PRIMARY,supplier_fk1 | supplier_fk1 | 4 | tpch10G_tokudb.nation.n_nationkey | 1200 | Using index |<br>| 2 | DEPENDENT SUBQUERY | partsupp | eq_ref | PRIMARY,partsupp_fk2 | PRIMARY | 8 | tpch10G_tokudb.part.p_partkey,tpch10G_tokudb.supplier.s_suppkey | 1 | |<br>+----+--------------------+----------+--------+----------------------+--------------+---------+-----------------------------------------------------------------+------+----------------------------------------------+<br>9 rows in set (0.00 sec)<br> |
In this query plans, we found MySQL was starting with a very small table (REGION) first, joining it’s way toward the larger tables.
A better query plan is to start at the PART table, then prune the
list of rows to get rid of parts that aren’t size 35 or made of steel,
and then joining with the smaller tables. The result : 11.5 seconds
average query time.
How did we do it?
First, we took control of query plan by specifying “STRAIGHT_JOIN” for both the main query and the dependent subquery, and then rearrange the
table order in the FROM clause:
OLD:
|
1 |
<br>from<br> part,<br> supplier,<br> partsupp,<br> nation,<br> region<br> |
NEW:
|
1 |
<br>from<br> part,<br> partsupp,<br> supplier,<br> nation,<br> region<br> |
We swapped the partsupp and supplier.
The resulting query plan is:
|
1 |
<br>mysql> explain select STRAIGHT_JOIN <br> s_acctbal, s_name, n_name, p_partkey, p_mfgr, s_address, s_phone, s_comment<br> from part, partsupp, supplier, nation, region<br> where p_partkey = ps_partkey and s_suppkey = ps_suppkey <br> and p_size = 35 and p_type like '%STEEL'<br> and s_nationkey = n_nationkey and n_regionkey = r_regionkey and r_name = 'EUROPE'<br> and ps_supplycost = (<br> select STRAIGHT_JOIN min(ps_supplycost)<br> from partsupp, supplier, nation, region<br> where p_partkey = ps_partkey and s_suppkey = ps_suppkey <br> and s_nationkey = n_nationkey and n_regionkey = r_regionkey and r_name = 'EUROPE')<br> order by s_acctbal desc, n_name, s_name,p_partkey;<br>+----+--------------------+----------+--------+----------------------+---------+---------+-------------------------------------+---------+----------------------------------------------+<br>| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |<br>+----+--------------------+----------+--------+----------------------+---------+---------+-------------------------------------+---------+----------------------------------------------+<br>| 1 | PRIMARY | part | ALL | PRIMARY | NULL | NULL | NULL | 2000000 | Using where; Using temporary; Using filesort |<br>| 1 | PRIMARY | partsupp | ref | PRIMARY,partsupp_fk2 | PRIMARY | 4 | tpch10G_tokudb.part.p_partkey | 80000 | Using where |<br>| 1 | PRIMARY | supplier | eq_ref | PRIMARY,supplier_fk1 | PRIMARY | 4 | tpch10G_tokudb.partsupp.ps_suppkey | 1 | |<br>| 1 | PRIMARY | nation | eq_ref | PRIMARY,nation_fk1 | PRIMARY | 4 | tpch10G_tokudb.supplier.s_nationkey | 1 | |<br>| 1 | PRIMARY | region | ALL | PRIMARY | NULL | NULL | NULL | 5 | Using where; Using join buffer |<br>| 2 | DEPENDENT SUBQUERY | partsupp | ref | PRIMARY,partsupp_fk2 | PRIMARY | 4 | tpch10G_tokudb.part.p_partkey | 800001 | |<br>| 2 | DEPENDENT SUBQUERY | supplier | eq_ref | PRIMARY,supplier_fk1 | PRIMARY | 4 | tpch10G_tokudb.partsupp.ps_suppkey | 1 | |<br>| 2 | DEPENDENT SUBQUERY | nation | eq_ref | PRIMARY,nation_fk1 | PRIMARY | 4 | tpch10G_tokudb.supplier.s_nationkey | 1 | |<br>| 2 | DEPENDENT SUBQUERY | region | ALL | PRIMARY | NULL | NULL | NULL | 5 | Using where; Using join buffer |<br>+----+--------------------+----------+--------+----------------------+---------+---------+-------------------------------------+---------+----------------------------------------------+<br>9 rows in set (0.00 sec)<br> |
The query optimizer doesn’t like it much, but in fact the are only a few thousand size 35 steel parts, so even though the query plan must
look at all the parts, it does it fast (since the primary key in our storage engine is clustering).
Although we didn’t try it, this query rewrite should work for InnoDB and MyISAM as well. (Since InnoDB is a clustered key, and MyISAM is well optimized for table scans.)
There might be some additional performance left on the table, but
a jump of 1830/11.5 = 159x seemed worth mentioning.
If you try this at home, don’t forget to turn off the query cache:
|
1 |
<br>mysql> set query_cache_type=0;<br>Query OK, 0 rows affected (0.00 sec)<br><br>mysql> show variables like 'query_cache_type';<br>+------------------+-------+<br>| Variable_name | Value |<br>+------------------+-------+<br>| query_cache_type | OFF |<br>+------------------+-------+<br>1 row in set (0.00 sec)<br> |
We conclude this post with a brief discussion of whether TPC-H is useful for evaluating MySQL.
We plan on blogging about more TPC-H-like queries as we get to them.