Where the open source community meets: Secure your spot for Percona Live Amsterdam! - Register

Downloads

Blog

The MySQL optimizer, the OS cache, and sequential versus random I/O

April 29, 2008

Author

Baron Schwartz

Insight for Developers

Share this Post:

In my post on estimating query completion time, I wrote about how I measured the performance on a join between a few tables in a typical star schema data warehousing scenario.

In short, a query that could take several days to run with one join order takes an hour with another, and the optimizer chose the poorer of the two join orders. Why is one join order so much slower than the other, and why did the optimizer not choose the faster one? That’s what this post is about.

Let’s start with the MySQL query optimizer. The optimizer tries to choose the best join order based on its cost metric; it tries to estimate the cost for a query, then choose the query plan that has the lowest cost. The unit of cost for the MySQL query optimizer is a single random 4k data page read. In general, it’s a pretty good metric, but it has one major weakness: the server doesn’t know whether a read will be satisfied from the operating system cache, or whether it’ll have to go to disk. (This distinction is abstracted away by the storage engine; the optimizer doesn’t know how a given storage engine stores its data).

I’ll try to omit the details and keep this clean. Let’s take a look at the tables.

mysql> show table status like 'fact'G
*************************** 1. row ***************************
Name: fact
Engine: MyISAM
Rows: 147045493
Avg_row_length: 117
Data_length: 17217646764
Index_length: 11993816064

mysql> show table status like 'dim1'G
*************************** 1. row ***************************
Name: dim1
Engine: MyISAM
Rows: 453193
Avg_row_length: 122
Data_length: 55605116
Index_length: 93812736

mysql> show table status like 'dim2'G
*************************** 1. row ***************************
Name: dim2
Engine: MyISAM
Rows: 811
Avg_row_length: 105
Data_length: 85368
Index_length: 154624

mysql> show table status like 'fact'G

*************************** 1. row ***************************

Name: fact

Engine: MyISAM

Rows: 147045493

Avg_row_length: 117

Data_length: 17217646764

Index_length: 11993816064

mysql> show table status like 'dim1'G

*************************** 1. row ***************************

Name: dim1

Engine: MyISAM

Rows: 453193

Avg_row_length: 122

Data_length: 55605116

Index_length: 93812736

mysql> show table status like 'dim2'G

*************************** 1. row ***************************

Name: dim2

Engine: MyISAM

Rows: 811

Avg_row_length: 105

Data_length: 85368

Index_length: 154624

It’s a big fact table and two fairly small dimension tables, which is normal. Here is the query:

select fact.col1, min(fact.col2) as min_col2
from fact, dim1, dim2
where fact.col4 = dim1.col4
and dim1.col3 <> 'hello world'
and dim2.col5 = 1
and fact.dim2_id = dim2.dim2_id
and fact.col2 > some_const
group by fact.col1

select fact.col1, min(fact.col2) as min_col2

from fact, dim1, dim2

where fact.col4 = dim1.col4

and dim1.col3 <> 'hello world'

and dim2.col5 = 1

and fact.dim2_id = dim2.dim2_id

and fact.col2 > some_const

group by fact.col1

There are indexes on all the columns in all the ways you’d expect: all the dimension columns are indexed on every table, and there’s a separate index on every column in the WHERE clause. Here’s the query plan initially.

*************************** 1. row ***************************
table: dim1
type: range
key_len: 195
rows: 18790
Extra: Using where; Using temporary; Using filesort
*************************** 2. row ***************************
table: fact
type: ref
key_len: 4
rows: 606
Extra: Using where
*************************** 3. row ***************************
table: dim2
type: eq_ref
key_len: 2
rows: 1
Extra: Using where

*************************** 1. row ***************************

table: dim1

type: range

key_len: 195

rows: 18790

Extra: Using where; Using temporary; Using filesort

*************************** 2. row ***************************

table: fact

type: ref

key_len: 4

rows: 606

Extra: Using where

*************************** 3. row ***************************

table: dim2

type: eq_ref

key_len: 2

rows: 1

Extra: Using where

This query will run for days and never complete. No one ever let it finish to see how long it would run.

How do I know it will run for days? Here’s my train of thought:

- It’s performing index lookups into the fact table, which is big.

- An index lookup is a random I/O.

- A modern disk can do about 100 random I/O’s per second, as a rule of thumb.

- If you do the math with the rows column in EXPLAIN, you realize that this equates to about 18790 * 606 = 11386740 I/O operations, assuming that the indexes are fully in memory.

- When you divide this by 100 I/O operations per second, and then divide that by 86400 seconds in a day, you get about 2.6 days.

Ouch! That’s slow.

Now let’s look at the alternative: table-scan the fact table, and do index lookups in the two dimension tables. MySQL doesn’t want to choose this join order, so we’ll force it with STRAIGHT_JOIN:

explain select STRAIGHT_JOIN  ....
+-------+-----------+-----------+---------------------------------+
| table | type      | rows      | Extra                           |
+-------+-----------+-----------+---------------------------------+
| fact  | ALL       | 147367284 | Using temporary; Using filesort |
| dim1  | eq_ref    | 1         | Using where                     |
| dim2  | eq_ref    | 1         | Using where                     |
+-------+-----------+-----------+---------------------------------+

explain select STRAIGHT_JOIN ....

+-------+-----------+-----------+---------------------------------+

+-------+-----------+-----------+---------------------------------+

| fact | ALL | 147367284 | Using temporary; Using filesort |

| dim1 | eq_ref | 1 | Using where |

| dim2 | eq_ref | 1 | Using where |

+-------+-----------+-----------+---------------------------------+

As we saw in the previous post, which I linked at the top of this post, we can scan the fact table in less than an hour. And it turns out that joining to the dimension tables doesn’t slow the query perceptibly, because these tables are small and they stay in memory, in the OS cache. (They don’t get evicted from memory by the cache’s LRU policy, because they are frequently used — once per row in the fact table. The LRU policy evicts old blocks from the fact table instead, which is perfect — these blocks are used only once and not needed again, so they can be replaced).

The difference between the two queries — 55 minutes and 2.6 days — is basically the difference between scanning data sequentially on disk and random disk I/O.

So now you know why one join order is faster than the other. But why didn’t the optimizer know this, too? The optimizer does know that random access is slower than sequential access, but it doesn’t know that the dimension tables will stay in memory, and this is an important distinction.

Let’s put ourselves into the mindset of the optimizer. We’ll assume that every join to the dimension tables will go to disk instead of being read from cache. Now the STRAIGHT_JOIN becomes a table scan of about 313 sequential reads (150 million rows / 117 bytes per row / 4096 bytes per read), plus about 150 million random I/Os for the first dimension table, plus 150 million random I/Os for the second dimension table. That’s 300 million random I/O operations.

In contrast, the optimizer chose a plan that it thought would cause only 11.3 million random I/O operations.

The optimizer was being smart, given its lack of knowledge about the OS cache. This is why an expert is sometimes needed to provide the missing information. If the MySQL optimizer were right and each of these had to go to disk, our STRAIGHT_JOIN plan would take more than a month to complete! Good thing we know the difference between cache and disk!

0 0 votes

Article Rating

14 Comments

Oldest

Newest Most Voted

plantegg

18 years ago

Good passage! Maybe you should say : if indexs can not be loaded in memory, you query maybe slower still using index :)ï¼
If about this case, you have a server had 128G Memory, I think the result is different! Maybe the first query be finished short than 1 Minute.

Evert

18 years ago

Great post.. Very concise

Roland Bouman

18 years ago

Baron,

this (and the one about estimating query execution time) is absolutely great stuff! Well written, well explained, and very applicable to the real world.

Kudos for a great article and for sharing your insight 😉

kind regards,
Roland

Author

Baron Schwartz

18 years ago

Thanks Roland 🙂

Clint Byrum

18 years ago

Yeah this is great stuff. I recall that MS SQL Server would actually figure out that a table and its indexes could fit into RAM, and load them into their equivilent of a memory engine temp table, allowing the optimizer to figure this stuff out. Seems like that would be a worthwhile feature for the query optimizer which would perform better than the underlying OS cache.

Marki

18 years ago

Nice post. But is this still true when we have lot of different queries running in paralel? Won’t they also use the disk and “interrupt” the sequential reads with their own reads and in fact the disk will be making random reads? If this is true, there is no reason in trying to optimize one particular query to use table scan (unless it is the only query running).

Author

Baron Schwartz

18 years ago

Marki, to some extent that is true, but the effect isn’t as severe as you think. Even in the worst case, if the scan is interrupted very often, it will still read many rows each time the disk services this task, and a full scan broken into thousands of pieces is still much faster than millions of random reads. Also, both the OS and the disk controller have a lot of intelligence to recognize and handle these scenarios (batching, reordering, read-ahead, etc), so they end up being more efficient than just random reads. Full scans for this type of query under a concurrent workload are still a big win.

Triffids

18 years ago

And why MySQL doesn’t use scattered read (multiblock read) as it do rdbms leaders ?

Author

Baron Schwartz

18 years ago

Triffids, see http://www.mysqlperformanceblog.com/2008/03/25/mysql-60-vs-51-in-tpc-h-queries/

Triffids

18 years ago

Cool feature ! and how MySQL optimizer chose random read (and read some blocks from cache) or use MRR (all blocks from HDD) ?

Author

Baron Schwartz

18 years ago

Triffids, I don’t know much about the multi-range read feature. (I say that about lots of things until I have studied it a lot). It’s unfinished, and I would bet the final implementation will be somewhat different than it is now.

Yasufumi Kinoshita

18 years ago

I have tried to fix 5.0 optimizer. (But I have focused on InnoDB)
My patch may solve this type of problems.

http://bugs.mysql.com/bug.php?id=36681

Regards.

Shiv

17 years ago

Baron- wonderful article. One of those articles that establishes faith that the beast can be understood.

Question: I am not clear when at the very end you say “313 sequential reads (150 million rows / 117 bytes per row / 4096 bytes per read)”. Should’nt the math be “(150 M rows*117 bytes per row)/block_size(=4096)= 4.2 M sequential reads”.

If so, is there a rule of thumb for how fast sequential reads can be done by commodity hardware ? (I am assuming that the number of 100 IOs per second refers to random I/O).

Thanks,
Shiv

Author

Baron Schwartz

17 years ago

Oh, wow, did I really screw the math up that badly? I believe you are right. Funny that no one else caught it yet.

You can assume that each random I/O has to wait for the head to move, for the disk to rotate the starting point of the read under the head (1/2 of a rotation on average) and then the disk to rotate all the data under the head. So the spindle rotation speed really dominates, and you can just estimate the random I/O off that. You can get an absolute upper bound easily, and in practice you won’t get that performance.

For sequential I/O, I have to think about that for a moment, and suddenly I’m not confident in my math skills.