Improving TPC-H-like Queries – Q2

Posted by Bradley C. Kuszmaul and David Wells

Executive Summary: A MySQL straight join can speed up a query that is very similar to TPC-H Q2 by a factor of 159 on MySQL.

Recently, we began looking at TPC-H performance on MySQL. Our early
tests yielded unexpectedly poor performance for MyISAM, InnoDB and the
Tokutek storage engine. So we decided to take at look at each query
individually to see what could be done. This post is about Query 2.

Before going further, let us be clear – this is NOT “TPC-H”
benchmarking. The TPC prescribes methods and procedures for measuring
performance, and we didn’t follow the rules (which you can read at
http://www.tpc.org/tpch/spec/tpch2.8.0.pdf). This exercise is about
understanding what can be done to improve queries that TPC-H
represents.

Our tests were run on dual-core 2GHz AMD64 boxes w/ 2 GB memory,
running Centos5.1 and MySQL 5.1.30, using our Fractal Tree Storage
Engine. We built a small (SF=10, that is 10GB) database using the
standard TPC-H scripts on a 1-TB SATA drive. The hardware is Wimpy,
but available. The problem size is small, but it makes it easier to
run experiments.

We focused on one of the slow-running queries, Q2. Our first test of
Q2 finished in 1830 seconds. Ugh. Terrible. We’ve heard that other
databases can run this query orders of magnitude faster.

The TPC-H document describes Q2 as

The Minimum Cost Supplier Query finds, in a given region, for each part of a certain type and size, the supplier who can supply it at minimum cost. If several suppliers in that region offer the desired part type and size at the same (minimum) cost, the query lists the parts from suppliers with the 100 highest account balances. For each supplier, the query lists the supplier’s account balance, name and nation; the part’s number and manufacturer; the supplier’s address, phone number and comment information.

The MySQL query is

MySQL came up with this query plan:

In this query plans, we found MySQL was starting with a very small table (REGION) first, joining it’s way toward the larger tables.

A better query plan is to start at the PART table, then prune the
list of rows to get rid of parts that aren’t size 35 or made of steel,
and then joining with the smaller tables. The result : 11.5 seconds
average query time.

How did we do it?

First, we took control of query plan by specifying “STRAIGHT_JOIN” for both the main query and the dependent subquery, and then rearrange the
table order in the FROM clause:

OLD:


NEW:

We swapped the partsupp and supplier.

The resulting query plan is:

The query optimizer doesn’t like it much, but in fact the are only a few thousand size 35 steel parts, so even though the query plan must
look at all the parts, it does it fast (since the primary key in our storage engine is clustering).

Although we didn’t try it, this query rewrite should work for InnoDB and MyISAM as well. (Since InnoDB is a clustered key, and MyISAM is well optimized for table scans.)

There might be some additional performance left on the table, but
a jump of 1830/11.5 = 159x seemed worth mentioning.

If you try this at home, don’t forget to turn off the query cache:

We conclude this post with a brief discussion of whether TPC-H is useful for evaluating MySQL.

  • TPC-H has no real insertion or deletion workload. The main TPC-H insertion workload is a bulk load, which has been heavily optimized. Bulk loads are easy because you can presort the data, and then load it into your storage engine’s B-tree (or Fractal Tree) data structures sequentially. We find that people are interested in how fast you can trickle-load data into a database, and since that’s one of our strengths we don’t like TPC-H. We like iiBench. If you care about trickle-loading data and deleting data, TPC-H may not be helpful when you are evaluating databases.
  • The TPC-H rules don’t allow you to create interesting indexes. Maybe that’s because B-trees cannot maintain very many indexes. If your database can automagically define some indexes, you get to use them. Since MySQL doesn’t have that magic, MySQL is at a disadvantage. If you think you can define useful indexes, then official TPC-H results may not be helpful for your evaluation.
  • The TPC-H rules don’t allow you much freedom to change the query text (so specifying a straight join, as we did, is probably forbidden). If you think you can improve the query by rewriting it, then official TPC-H results are probably not helpful for your evaluation.
  • MySQL’s limited set of query plans may cause problems on some TPC-H queries. This is a real shortcoming of MySQL compared to other databases, and TPC-H may help identify whether those limitations are a problem. But before we can get to that, we must get as much out of MySQL as we can by using tricks like straight joins.

We plan on blogging about more TPC-H-like queries as we get to them.

Share this post

Leave a Reply