The problem I am going to describe is likely to be around since the very beginning of MySQL, however unless you carefully analyse and profile your queries, it might easily go unnoticed. I used it as one of the examples in our talk given at phpDay.it conference last week to demonstrate some pitfalls one may hit when designing schemas and queries, but then I thought it could be a good idea to publish this on the blog as well.
To demonstrate the issue let’s use a typical example – a sales query. Our data is a tiny store directory consisting of three very simple tables:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
CREATE TABLE `products` ( `prd_id` int(10) unsigned NOT NULL AUTO_INCREMENT, `prd_name` varchar(32) NOT NULL, PRIMARY KEY (`prd_id`), KEY `name` (`prd_name`) ) CREATE TABLE `tags` ( `tag_prd_id` int(10) unsigned NOT NULL, `tag_name` varchar(32) NOT NULL, PRIMARY KEY (`tag_name`, `tag_prd_id`) ) CREATE TABLE `items_ordered` ( `itm_id` int(10) unsigned NOT NULL AUTO_INCREMENT, `itm_prd_id` int(10) unsigned NOT NULL, `itm_order_timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP, PRIMARY KEY (`itm_id`), KEY `itm_prd_id__and__itm_order_timestamp` (`itm_prd_id`,`itm_order_timestamp`) ) |
“Please excuse the crudity of this model, I didn’t have time to build it to scale or to paint it.” — Dr. Emmett Brown
I populated these tables with enough data to serve our purpose.
Our hypothetical sales query could be to figure out how many LCD TVs were sold yesterday.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
SELECT COUNT(1) FROM tags t JOIN products p ON p.prd_id = t.tag_prd_id JOIN items_ordered i ON i.itm_prd_id = p.prd_id WHERE t.tag_name = 'lcd' AND i.itm_order_timestamp >= '2010-05-16 00:00:00' AND i.itm_order_timestamp < '2010-05-17 00:00:00' +----------+ | COUNT(1) | +----------+ | 4103 | +----------+ |
Seems like a very successful day! 🙂
When we look at the data structures it looks quite good – there is index on
1 |
`tag_name` |
in
1 |
`tags` |
, there is index on
1 |
(`itm_prd_id`, `itm_order_timestamp`) |
in
1 |
`items_ordered` |
and indexes on other columns used in joins. Let’s verify how the query performed in greater detail:
1 2 3 4 5 6 7 8 9 10 11 12 |
SHOW STATUS LIKE 'Handler_read%'; +-----------------------+--------+ | Variable_name | Value | +-----------------------+--------+ | Handler_read_first | 0 | | Handler_read_key | 3 | | Handler_read_next | 118181 | | Handler_read_prev | 0 | | Handler_read_rnd | 0 | | Handler_read_rnd_next | 0 | +-----------------------+--------+ |
Somehow this does not look as good as the sales numbers. Query matched 4103 rows, but almost 120000 were scanned. And we have proper indexes on all necessary columns! What does EXPLAIN have to say about this?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 |
*************************** 1. row *************************** id: 1 select_type: SIMPLE table: t type: ref possible_keys: PRIMARY key: PRIMARY key_len: 98 ref: const rows: 1 Extra: Using where; Using index *************************** 2. row *************************** id: 1 select_type: SIMPLE table: p type: eq_ref possible_keys: PRIMARY key: PRIMARY key_len: 4 ref: example_db.t.tag_prd_id rows: 1 Extra: Using index *************************** 3. row *************************** id: 1 select_type: SIMPLE table: i type: ref possible_keys: itm_prd_id__and__itm_order_timestamp key: itm_prd_id__and__itm_order_timestamp key_len: 4 ref: example_db.p.prd_id rows: 10325 Extra: Using where; Using index |
To remind – our structure design is:
1 2 3 |
`itm_prd_id` int(10) unsigned NOT NULL `itm_order_timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP KEY `itm_prd_id__and__itm_order_timestamp` (`itm_prd_id`,`itm_order_timestamp`) |
In 3rd row key_len is only 4 bytes, while the full key length is 4 bytes for itm_prd_id
plus 4 bytes for itm_order_timestamp
, so 8 bytes in total! Also ref shows only one column being used by the last join.
How should we understand this then? Database reads all ordered items where tag is ‘lcd’, which totals to about 120000 rows as shown by the counters in SHOW STATUS output above, and then filters out those not matching the date range. A very inefficient approach! MySQL was unable to optimize those simple conditions to match both product id and date range by index and read only the relevant rows.
This affects joins only. When you use a range condition on the first (or the only) table, it works as expected: