Improving TPC-H-like queries – Q17

Improving TPC-H-like queries – Q17


Executive Summary: A query like TPC-H Query 17 can be sped up by large factors by using straight_joins and clustering indexes. (This entry posted by Dave.)

In a previous post, we wrote about queries like TPC-H query 2, and the use of straight_join to improve performance.
This week, we consider Query 17, described by the TPC-H documentation as

“The Small-Quantity-Order Revenue Query considers parts of a given brand and with a given container type and determines the average lineitem quantity of such parts ordered for all orders (past and pending) in the 7-year database. What would the average yearly gross (undiscounted) loss in revenue if orders for these parts with a quantity of this average were no longer taken?”

Our initial run on Q17 (same hardware as before) timed out after 3600 seconds.

The MySQL query is:

The query plan was:

This plan does not take advantage of the selectivity of picking a specific brand and container to reduce the search set.
Using STRAIGHT_JOIN (as described in our post on Q2) and ordering the FROM clause to select the part table first sets the plan on a better course.

However, the query bogs down when searching the lineitem table.
Although an index was available in the LINEITEM table to quickly search the table (lineitem_fk3), the subsequent data lookup (along the primary key) caused a long(!) series of point queries.

Unlike many storage engines, TokuDB supports clustering indexes on any index (not just the primary key), and further supports clustering indexes on more than one index per table.

This results in the following query/plan:

Although it appears to look at more rows, each table is scanned efficiently down a key that provides all of the information needed to address the query.

We measured 101 sec on this query. Maybe some day we’ll be patient enough to let the original plan complete, but we know this approach is at least 36 (3600/101) times faster.


Share this post

Comments (9)

  • Jay Pipes Reply

    You should get much better performance if you rewrote the query to the following:

    sum(li.l_extendedprice) / 7.0 as avg_yearly
    from lineitem li
    inner join part p
    on li.l_partkey = p.p_partkey
    inner join (
    select l_partkey, 0.2 * avg(l_quantity) as quantity
    from lineitem
    group by l_partkey
    ) as quantities
    on li.l_partkey = quantities.l_partkey
    and li.quantity < quantities.quantity where p.p_brand = 'Brand#33' p.p_container = 'WRAP PACK'; and put an index on lineitem (l_partkey, l_quantity) Not sure if TPC-H allows you to optimize the terrible (for MySQL) queries it produces, but the above would likely cut the performance by a factor of 100 or more.. I'd be interested to see what EXPLAIN comes up with for the above rewritten query + additional index. Cheers, Jay Pipes Cheers, Jay

    June 16, 2009 at 1:28 am
  • Dave Reply


    Thanks for the ideas. Will try them in the next couple days and let you know how it goes.


    June 17, 2009 at 1:57 pm
  • Jay Pipes Reply

    Hi! Any feedback on the query above?



    June 23, 2009 at 4:34 pm
  • Dave Reply


    I spent some time on your recommendations, and ran enough experiments that I thought it would be worth another post. It should be up shortly.

    Regarding the question of TPC-H rules for defining indexes – my point for the post was to show how clustering indexes can help, not how to get valid TPC-H performance measurement. “TPC-H-like” was used deliberately to avoid contending with the TPC rules. All good ideas are welcome here.


    June 23, 2009 at 9:17 pm
  • Jay Pipes Reply

    Awesome, Dave! Looking forward to your post. And, yeah, as I understand it, TPC doesn’t allow for a number of things. This is one of the reasons I strongly support an open source, openly discussed, and vendor-neutral set of benchmarks 🙂



    June 23, 2009 at 9:43 pm
  • Justin Swanhart Reply

    In the future call it DBT-3 query #17. DBT-3 is the open source OSDB version of TPC-H(tm).

    I take it this was SF1? Looks like you have about 6M rows in the fact table.

    June 26, 2009 at 6:04 am
  • Justin Swanhart Reply

    I added a key to:
    lineitem(l_partkey, l_quantity)
    and too
    part(p_brand, p_container)

    and then I analyzed both tables. Without any hints MySQL comes up with this:

    mysql> explain
    -> select sum(l_extendedprice) / 7.0 as avg_yearly
    -> from lineitem , part
    -> where p_partkey = l_partkey
    -> and p_brand = ‘Brand#33’
    -> and (p_container = ‘WRAP PACK’ )
    -> and l_quantity < ( -> select 0.2 * avg(l_quantity)
    -> from lineitem
    -> where l_partkey = p_partkey
    -> );
    | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
    | 1 | PRIMARY | part | ref | PRIMARY,p_brand | p_brand | 20 | const,const | 157 | Using where |
    | 1 | PRIMARY | lineitem | ref | lineitem_fk3,part_qty_idx | lineitem_fk3 | 4 | tpch1g_myisam.part.p_partkey | 2376 | Using where |
    | 2 | DEPENDENT SUBQUERY | lineitem | ref | lineitem_fk3,part_qty_idx | lineitem_fk3 | 4 | tpch1g_myisam.part.p_partkey | 2376 | |
    3 rows in set (0.00 sec)

    mysql> select sum(l_extendedprice) / 7.0 as avg_yearly from lineitem , part where p_partkey = l_partkey and p_brand = ‘Brand#33’ and (p_container = ‘WRAP PACK’ ) and l_quantity < ( select 0.2 * avg(l_quantity) from lineitem where l_partkey = p_partkey ); +---------------+ | avg_yearly | +---------------+ | 312967.325714 | +---------------+ 1 row in set (0.23 sec) I don't have a SF10 myisam lying around, but you might want to give it a whirl if you do.

    June 26, 2009 at 9:15 am
  • Justin Swanhart Reply

    these settings may make a difference:

    This query plan doesn’t use a filesort so it shouldn’t generate a temporary table. It should scale pretty well as long as key_buffer_size is >= the size of the new composite index on the lineitem table. You can check the size of that index with the information schema.

    Setting PACK_KEYS=1 on lineitem will probably result in even improved results as you will get compression on repeating integer key values in the composite key.

    June 26, 2009 at 9:43 am
  • Another look at improving TPC-H-like queries – Q17 | Tokutek Reply

    […] An alternate approach, offered in response to our original post, provides excellent improvements for smaller databases, but clustered indexes offer better […]

    August 19, 2010 at 3:40 pm

Leave a Reply