Summary: An alternate approach, offered in response to our original post, provides excellent improvements for smaller databases, but clustered indexes offer better performance as database size increases. (This posting is by Dave.)
Jay Pipes suggested an alternate approach to improving MySQL performance of Query 17 on a TPC-H-like database.
```sql
select
    sum(li.l_extendedprice) / 7.0 as avg_yearly
from
    lineitem li
    inner join part p on li.l_partkey = p.p_partkey
    inner join (
        select l_partkey, 0.2 * avg(l_quantity) as quantity
        from lineitem
        group by l_partkey
    ) as quantities
        on li.l_partkey = quantities.l_partkey
        and li.l_quantity < quantities.quantity
where
    p.p_brand = 'Brand#33'
    and p.p_container = 'WRAP PACK';
```
I ran this suggestion on MySQL with Tokutek’s TokuDB storage engine (as before) against a 10G Scale Factor database. After waiting several hours for the EXPLAIN to complete, I punted and moved to a smaller, 1G Scale Factor database. This database nearly fits in memory, which allowed me to examine the query plan.
Create the index:
```
mysql> alter table lineitem add index li_pkey_quan_idx (l_partkey, l_quantity);
```
Explain:
```
mysql> explain select sum(li.l_extendedprice) / 7.0 as avg_yearly from lineitem li
    -> inner join part p on li.l_partkey = p.p_partkey
    -> inner join ( select l_partkey, 0.2 * avg(l_quantity) as quantity from lineitem group by l_partkey ) as quantities
    -> on li.l_partkey = quantities.l_partkey and li.l_quantity < quantities.quantity
    -> where p.p_brand = 'Brand#33' and p.p_container = 'WRAP PACK';
+----+-------------+------------+--------+-------------------------------+--------------+---------+----------------------+---------+-------------+
| id | select_type | table      | type   | possible_keys                 | key          | key_len | ref                  | rows    | Extra       |
+----+-------------+------------+--------+-------------------------------+--------------+---------+----------------------+---------+-------------+
|  1 | PRIMARY     | <derived2> | ALL    | NULL                          | NULL         | NULL    | NULL                 |  200000 |             |
|  1 | PRIMARY     | p          | eq_ref | PRIMARY                       | PRIMARY      | 4       | quantities.l_partkey |       1 | Using where |
|  1 | PRIMARY     | li         | ref    | lineitem_fk3,li_pkey_quan_idx | lineitem_fk3 | 4       | quantities.l_partkey |   60012 | Using where |
|  2 | DERIVED     | lineitem   | index  | NULL                          | lineitem_fk3 | 8       | NULL                 | 6001215 |             |
+----+-------------+------------+--------+-------------------------------+--------------+---------+----------------------+---------+-------------+
4 rows in set (1 min 18.85 sec)
```
Notice that although the planner sees li_pkey_quan_idx as a possible key, it does not use it.
Run the Query:
```
mysql> select sum(li.l_extendedprice) / 7.0 as avg_yearly from lineitem li
    -> inner join part p on li.l_partkey = p.p_partkey
    -> inner join ( select l_partkey, 0.2 * avg(l_quantity) as quantity from lineitem group by l_partkey ) as quantities
    -> on li.l_partkey = quantities.l_partkey and li.l_quantity < quantities.quantity
    -> where p.p_brand = 'Brand#33' and p.p_container = 'WRAP PACK';
+---------------+
| avg_yearly    |
+---------------+
| 312967.325714 |
+---------------+
1 row in set (1 min 6.73 sec)
```
The resulting query time (66.73s) is an improvement on the default query (155.83s). Nonetheless, when you run this query against the larger, 10G Scale Factor (SF) database, it does not complete after several hours. I observed that the intermediate tables can be held in memory for SF = 1G, but are written to disk for SF = 10G. This is a well-known peril of intermediate tables.
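One way to confirm this kind of spill (a diagnostic sketch of my own, not something from Jay's post) is to snapshot MySQL's temporary-table counters before and after running the query; a jump in `Created_tmp_disk_tables` relative to `Created_tmp_tables` means the derived table no longer fit in memory:

```sql
-- Run before and after the query and compare the two snapshots.
-- Created_tmp_tables counts all internal temporary tables;
-- Created_tmp_disk_tables counts the ones that went to disk.
show global status like 'Created_tmp%';
```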
Undeterred, I tried forcing the query planner to use li_pkey_quan_idx. I found that by telling the planner to use the index in both the 3rd and 4th lines of the plan, I could get substantial speed-ups:
Explain:
```
mysql> explain select sum(li.l_extendedprice) / 7.0 as avg_yearly from lineitem li use index (li_pkey_quan_idx)
    -> inner join part p on li.l_partkey = p.p_partkey
    -> inner join ( select l_partkey, 0.2 * avg(l_quantity) as quantity from lineitem use index (li_pkey_quan_idx) group by l_partkey ) as quantities
    -> on li.l_partkey = quantities.l_partkey and li.l_quantity < quantities.quantity
    -> where p.p_brand = 'Brand#33' and p.p_container = 'WRAP PACK';
+----+-------------+------------+--------+------------------+------------------+---------+----------------------+---------+-------------+
| id | select_type | table      | type   | possible_keys    | key              | key_len | ref                  | rows    | Extra       |
+----+-------------+------------+--------+------------------+------------------+---------+----------------------+---------+-------------+
|  1 | PRIMARY     | <derived2> | ALL    | NULL             | NULL             | NULL    | NULL                 |  200000 |             |
|  1 | PRIMARY     | p          | eq_ref | PRIMARY          | PRIMARY          | 4       | quantities.l_partkey |       1 | Using where |
|  1 | PRIMARY     | li         | ref    | li_pkey_quan_idx | li_pkey_quan_idx | 4       | quantities.l_partkey |   60012 | Using where |
|  2 | DERIVED     | lineitem   | index  | NULL             | li_pkey_quan_idx | 11      | NULL                 | 6001215 | Using index |
+----+-------------+------------+--------+------------------+------------------+---------+----------------------+---------+-------------+
4 rows in set (5.74 sec)
```
Query:
```
mysql> select sum(li.l_extendedprice) / 7.0 as avg_yearly from lineitem li use index (li_pkey_quan_idx)
    -> inner join part p on li.l_partkey = p.p_partkey
    -> inner join ( select l_partkey, 0.2 * avg(l_quantity) as quantity from lineitem use index (li_pkey_quan_idx) group by l_partkey ) as quantities
    -> on li.l_partkey = quantities.l_partkey and li.l_quantity < quantities.quantity
    -> where p.p_brand = 'Brand#33' and p.p_container = 'WRAP PACK';
+---------------+
| avg_yearly    |
+---------------+
| 312967.325714 |
+---------------+
1 row in set (12.38 sec)
```
Now we’re talking. 12.38s vs. 155.83s (12x improvement). I’m sure this is closer to what Jay was hoping for. Emboldened, I ran the modified query against the SF = 10G database:
```
mysql> use tpch10G_tokudb;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> explain select sum(li.l_extendedprice) / 7.0 as avg_yearly from lineitem li use index (li_pkey_quan_idx)
    -> inner join part p on li.l_partkey = p.p_partkey
    -> inner join ( select l_partkey, 0.2 * avg(l_quantity) as quantity from lineitem use index (li_pkey_quan_idx) group by l_partkey ) as quantities
    -> on li.l_partkey = quantities.l_partkey and li.l_quantity < quantities.quantity
    -> where p.p_brand = 'Brand#33' and p.p_container = 'WRAP PACK';
+----+-------------+------------+--------+------------------+------------------+---------+----------------------+----------+-------------+
| id | select_type | table      | type   | possible_keys    | key              | key_len | ref                  | rows     | Extra       |
+----+-------------+------------+--------+------------------+------------------+---------+----------------------+----------+-------------+
|  1 | PRIMARY     | <derived2> | ALL    | NULL             | NULL             | NULL    | NULL                 |  2000000 |             |
|  1 | PRIMARY     | p          | eq_ref | PRIMARY          | PRIMARY          | 4       | quantities.l_partkey |        1 | Using where |
|  1 | PRIMARY     | li         | ref    | li_pkey_quan_idx | li_pkey_quan_idx | 4       | quantities.l_partkey |   599860 | Using where |
|  2 | DERIVED     | lineitem   | index  | NULL             | li_pkey_quan_idx | 11      | NULL                 | 59986052 | Using index |
+----+-------------+------------+--------+------------------+------------------+---------+----------------------+----------+-------------+
4 rows in set (57.61 sec)

mysql> select sum(li.l_extendedprice) / 7.0 as avg_yearly from lineitem li use index (li_pkey_quan_idx)
    -> inner join part p on li.l_partkey = p.p_partkey
    -> inner join ( select l_partkey, 0.2 * avg(l_quantity) as quantity from lineitem use index (li_pkey_quan_idx) group by l_partkey ) as quantities
    -> on li.l_partkey = quantities.l_partkey and li.l_quantity < quantities.quantity
    -> where p.p_brand = 'Brand#33' and p.p_container = 'WRAP PACK';
+----------------+
| avg_yearly     |
+----------------+
| 3253728.844286 |
+----------------+
1 row in set (1 hour 7 min 8.55 sec)
```
So Jay’s ideas offer a solution that does not involve clustering keys and produces a result in a manageable timeframe. Relative to the clustering-key approach, however, it is still 4028s / 101s = 40x slower.
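For reference, the clustering-key approach from our original post relies on TokuDB’s `CLUSTERING` extension to MySQL’s index syntax, which stores a full copy of every row’s columns inside the index so the join and aggregate never have to go back to the primary key. A sketch of such an index (the exact name and column list here are illustrative, not necessarily what the original post used):

```sql
-- TokuDB-specific syntax: CLUSTERING makes the index a covering
-- index for all columns, at the cost of extra storage and write work.
alter table lineitem add clustering index li_cluster_idx (l_partkey, l_quantity);
```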