Joining on range? Wrong!

May 17, 2010

Author

Maciej Dobrzanski

The problem I am going to describe has likely been around since the very beginning of MySQL; however, unless you carefully analyse and profile your queries, it can easily go unnoticed. I used it as one of the examples in our talk at the phpDay.it conference last week to demonstrate some pitfalls one may hit when designing schemas and queries, and then I thought it would be a good idea to publish it on the blog as well.

To demonstrate the issue, let's use a typical example: a sales query. Our data is a tiny store directory consisting of three very simple tables:

CREATE TABLE `products` (
  `prd_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `prd_name` varchar(32) NOT NULL,
  PRIMARY KEY (`prd_id`),
  KEY `name` (`prd_name`)
);

CREATE TABLE `tags` (
  `tag_prd_id` int(10) unsigned NOT NULL,
  `tag_name` varchar(32) NOT NULL,
  PRIMARY KEY (`tag_name`, `tag_prd_id`)
);

CREATE TABLE `items_ordered` (
  `itm_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `itm_prd_id` int(10) unsigned NOT NULL,
  `itm_order_timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`itm_id`),
  KEY `itm_prd_id__and__itm_order_timestamp` (`itm_prd_id`,`itm_order_timestamp`)
);

“Please excuse the crudity of this model, I didn’t have time to build it to scale or to paint it.” — Dr. Emmett Brown

I populated these tables with enough data to serve our purpose.

Our hypothetical sales query could be to figure out how many LCD TVs were sold yesterday.

SELECT        COUNT(1)
       FROM   tags t
              JOIN products p
              ON     p.prd_id = t.tag_prd_id
              JOIN items_ordered i
              ON     i.itm_prd_id    = p.prd_id
       WHERE  t.tag_name             = 'lcd'
       AND    i.itm_order_timestamp >= '2010-05-16 00:00:00'
       AND    i.itm_order_timestamp  < '2010-05-17 00:00:00'
+----------+
| COUNT(1) |
+----------+
|     4103 | 
+----------+

Seems like a very successful day! 🙂

When we look at the data structures, everything looks quite good: there is an index on `tag_name` in `tags`, there is an index on (`itm_prd_id`, `itm_order_timestamp`) in `items_ordered`, and there are indexes on the other columns used in joins. Let's verify how the query performed in greater detail:

SHOW STATUS LIKE 'Handler_read%';                   

+-----------------------+--------+
| Variable_name         | Value  |
+-----------------------+--------+
| Handler_read_first    | 0      | 
| Handler_read_key      | 3      | 
| Handler_read_next     | 118181 | 
| Handler_read_prev     | 0      | 
| Handler_read_rnd      | 0      | 
| Handler_read_rnd_next | 0      | 
+-----------------------+--------+
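
The post does not show how the counters were collected. A minimal sketch, assuming a dedicated session on the MySQL versions of that era, is to reset the session counters, run the query under test, and read the counters back. `FLUSH STATUS` is a real statement, but it requires the RELOAD privilege and its status in much newer releases should be checked before relying on it.

```sql
-- Reset this session's status counters so earlier statements do not
-- pollute the numbers (assumes the RELOAD privilege is available).
FLUSH STATUS;

-- Run the query under test.
SELECT COUNT(1)
FROM   tags t
       JOIN products p
       ON     p.prd_id = t.tag_prd_id
       JOIN items_ordered i
       ON     i.itm_prd_id = p.prd_id
WHERE  t.tag_name             = 'lcd'
AND    i.itm_order_timestamp >= '2010-05-16 00:00:00'
AND    i.itm_order_timestamp  < '2010-05-17 00:00:00';

-- Read the handler counters accumulated by this session since the reset.
SHOW SESSION STATUS LIKE 'Handler_read%';
```

Roughly speaking, `Handler_read_key` counts index lookups and `Handler_read_next` counts the index entries read sequentially after each lookup, so a large gap between `Handler_read_next` and the number of matching rows hints at wasted reads.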

Somehow these counters do not look as good as the sales numbers. The query matched 4103 rows, but almost 120000 were scanned, even though we have proper indexes on all the necessary columns! What does EXPLAIN have to say about this?

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t
         type: ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 98
          ref: const
         rows: 1
        Extra: Using where; Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: p
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: example_db.t.tag_prd_id
         rows: 1
        Extra: Using index
*************************** 3. row ***************************
           id: 1
  select_type: SIMPLE
        table: i
         type: ref
possible_keys: itm_prd_id__and__itm_order_timestamp
          key: itm_prd_id__and__itm_order_timestamp
      key_len: 4
          ref: example_db.p.prd_id
         rows: 10325
        Extra: Using where; Using index

As a reminder, the relevant part of our table structure is:

  `itm_prd_id` int(10) unsigned NOT NULL
  `itm_order_timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP
  KEY `itm_prd_id__and__itm_order_timestamp` (`itm_prd_id`,`itm_order_timestamp`)

In the third row, key_len is only 4 bytes, while the full key length is 8 bytes in total: 4 bytes for itm_prd_id plus 4 bytes for itm_order_timestamp. Also, ref shows only one column being used by the last join.

How should we understand this? The database reads all ordered items for products tagged ‘lcd’, which totals about 120000 rows as shown by the counters in the SHOW STATUS output above, and only then filters out those outside the date range. A very inefficient approach! MySQL was unable to optimize these simple conditions to match both the product id and the date range through the index and read only the relevant rows.
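
To see the size of the gap on your own data, a rough sketch like the following could be used. It is not part of the original test: it counts every order row belonging to an ‘lcd’ product next to the subset that falls inside the date range. The column aliases are made up for illustration, and the query relies on MySQL evaluating a comparison to 1 or 0.

```sql
-- Hypothetical diagnostic: compare all rows reachable through the
-- itm_prd_id prefix of the index with the rows the query really needs.
SELECT COUNT(*) AS rows_for_lcd_products,
       SUM(i.itm_order_timestamp >= '2010-05-16 00:00:00'
           AND i.itm_order_timestamp < '2010-05-17 00:00:00') AS rows_in_date_range
FROM   tags t
       JOIN items_ordered i
       ON     i.itm_prd_id = t.tag_prd_id
WHERE  t.tag_name = 'lcd';
```

The first number should be in the same ballpark as Handler_read_next for the slow plan, while the second is what an ideal plan would have to read.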

This affects joins only. When you use a range condition on the first (or the only) table, it works as expected:

EXPLAIN
SELECT        COUNT(1)
       FROM   items_ordered i
       WHERE  i.itm_prd_id           = 5
       AND    i.itm_order_timestamp >= '2010-05-16 00:00:00'
       AND    i.itm_order_timestamp  < '2010-05-17 00:00:00'

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: i
         type: range
possible_keys: itm_prd_id__and__itm_order_timestamp
          key: itm_prd_id__and__itm_order_timestamp
      key_len: 8
          ref: NULL
         rows: 1306
        Extra: Using where; Using index

In this case ref is NULL because there is no join, but notice that key_len is 8 bytes, the full index length. This means both index columns are used to execute the query.

There are many possible workarounds, and the right one depends on the specific case you need to solve. Essentially it always comes down to removing the range condition from the join one way or another. For our example query, this could mean introducing an additional DATE column and filtering on it instead:

ALTER TABLE items_ordered ADD itm_order_date DATE NOT NULL, ADD INDEX itm_prd_id__and__itm_order_date (itm_prd_id, itm_order_date);
UPDATE items_ordered SET itm_order_date = DATE(itm_order_timestamp);
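
Note that a plain DATE column is not maintained automatically: the UPDATE only backfills rows that already exist, so whatever inserts new orders also has to set itm_order_date. A simple, hedged way to check that the two columns still agree is a query along these lines (the alias `mismatched_rows` is only illustrative):

```sql
-- Count rows whose stored date is out of step with the timestamp.
-- A non-zero result means some rows were written without the date
-- column being set, and the rewritten query would miss them.
SELECT COUNT(*) AS mismatched_rows
FROM   items_ordered
WHERE  itm_order_date <> DATE(itm_order_timestamp);
```

This check scans the whole table, so it is better suited to an occasional audit than to regular use.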

Now the rewritten query:

EXPLAIN
SELECT        COUNT(1)
       FROM   tags t
              JOIN products p
              ON     p.prd_id = t.tag_prd_id
              JOIN items_ordered i
              ON     i.itm_prd_id = p.prd_id
       WHERE  t.tag_name          = 'lcd'
       AND    i.itm_order_date    = '2010-05-16'

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t
         type: ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 98
          ref: const
         rows: 1
        Extra: Using where; Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: p
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: example_db.t.tag_prd_id
         rows: 1
        Extra: Using index
*************************** 3. row ***************************
           id: 1
  select_type: SIMPLE
        table: i
         type: ref
possible_keys: itm_prd_id__and__itm_order_timestamp,itm_prd_id__and__itm_order_date
          key: itm_prd_id__and__itm_order_date
      key_len: 7
          ref: example_db.p.prd_id,const
         rows: 206494
        Extra: Using where; Using index

This query uses 7 bytes of the `itm_prd_id__and__itm_order_date` index: 4 bytes for `itm_prd_id` and 3 bytes for `itm_order_date` (the DATE type uses 3 bytes). Also, ref shows two columns used in the join.

SHOW STATUS LIKE 'Handler_read%';                   
+-----------------------+-------+
| Variable_name         | Value |
+-----------------------+-------+
| Handler_read_first    | 0     | 
| Handler_read_key      | 3     | 
| Handler_read_next     | 4104  | 
| Handler_read_prev     | 0     | 
| Handler_read_rnd      | 0     | 
| Handler_read_rnd_next | 0     | 
+-----------------------+-------+

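The statistics also look much better: Handler_read_next dropped from 118181 to 4104, which is very close to the 4103 rows the query actually matches.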

But remember: a different query will likely need a different solution.
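
For instance, a window of several days cannot be expressed as a single equality on itm_order_date. One workaround that is sometimes suggested, sketched here under assumptions rather than measured, is to join through a small calendar table so that the range condition sits on that table and items_ordered is still matched by equality on both index columns. The `calendar` table and its `cal_date` column are hypothetical names, and the query assumes the itm_order_date column and index added earlier.

```sql
-- Hypothetical helper table: one row per calendar day.
CREATE TABLE calendar (
  cal_date DATE NOT NULL,
  PRIMARY KEY (cal_date)
);

-- Populate only the days needed for this illustration.
INSERT INTO calendar (cal_date) VALUES
  ('2010-05-14'), ('2010-05-15'), ('2010-05-16');

-- The range condition applies to calendar; items_ordered is joined
-- by equality on (itm_prd_id, itm_order_date).
EXPLAIN
SELECT COUNT(1)
FROM   tags t
       JOIN products p
       ON     p.prd_id = t.tag_prd_id
       JOIN calendar c
       ON     c.cal_date >= '2010-05-14'
       AND    c.cal_date <= '2010-05-16'
       JOIN items_ordered i
       ON     i.itm_prd_id     = p.prd_id
       AND    i.itm_order_date = c.cal_date
WHERE  t.tag_name = 'lcd';
```

The optimizer is free to pick a different join order, so it is worth confirming in the EXPLAIN output that the row for `i` reports a key_len of 7 and two entries in ref before trusting this approach.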

You can find several bug reports regarding this problem (e.g. #8569, #19548). Some replies from MySQL indicate this may eventually be fixed in 6.0 or some future version. Others say “it’s a documented behaviour, deal with it”. But in the real world this is a serious bug, not a feature, and it needs fixing.

8 Comments

zerkms

16 years ago

This doesn’t relate to the point of the article (which is good), but for simplicity you could remove “JOIN products p ON p.prd_id = t.tag_prd_id”. That part is unnecessary in the examined query, and I’m pretty sure it’s simpler to keep one join in mind instead of two 😉

Laurens

16 years ago

It should be possible to trick MySQL.
Basically, you create a date table with a date_id (surrogate key) in it.
You store that id instead of the date field, and you should then be able to join properly on the date table, where you can use the range condition. Yes, it is more work, and it would be nice if this was fixed, but that could be a while.

Peter Zaitsev

16 years ago

Laurens,

Yes. That does the trick. I actually wrote about a similar issue 1.5 years ago:
http://www.mysqlperformanceblog.com/2008/08/01/how-adding-another-table-to-join-can-improve-performance/

Willam

16 years ago

Would it also work to use a subquery?

SELECT COUNT(1)
FROM tags t
JOIN ( SELECT products.prd_id
       FROM products
       JOIN items_ordered
       ON items_ordered.itm_prd_id = products.prd_id
       WHERE items_ordered.itm_order_timestamp >= '2010-05-16 00:00:00'
       AND items_ordered.itm_order_timestamp < '2010-05-17 00:00:00' ) p
ON p.prd_id = t.tag_prd_id
WHERE t.tag_name = 'lcd'

Nikolay

16 years ago

Well, this problem happens often; that’s why on big projects you’d better “double” any timestamp / datetime column with an additional date column.
Then, if you need the date without the time, it’s better to query the date column:

create table x(
id int,
something char(40),
created datetime,
created_date date,
primary key(id),
key(created),
key(created_date)
);

then:

select * from x where created_date = '2010-01-06';

and not:
select * from x where created >= '2010-01-06' and created < '2010-01-07';

Jeffrey Gilbert

16 years ago

William, subqueries create temp tables, which is not a route you want to take if you’re looking for performance gains. I’m curious how this example compares with BETWEEN x AND y versus the range selector used. It’s important to note that this example only works where you’re comparing a range of one item to the selection of one item. Once you start getting more days in there, your queries would look ridiculous.