The MySQL optimizer, the OS cache, and sequential versus random I/O

April 29, 2008
Author
Baron Schwartz
Share this Post:

In my post on estimating query completion time, I wrote about how I measured the performance on a join between a few tables in a typical star schema data warehousing scenario.

In short, a query that could take several days to run with one join order takes an hour with another, and the optimizer chose the poorer of the two join orders. Why is one join order so much slower than the other, and why did the optimizer not choose the faster one? That’s what this post is about.

Let’s start with the MySQL query optimizer. The optimizer tries to choose the best join order based on its cost metric; it tries to estimate the cost for a query, then choose the query plan that has the lowest cost. The unit of cost for the MySQL query optimizer is a single random 4k data page read. In general, it’s a pretty good metric, but it has one major weakness: the server doesn’t know whether a read will be satisfied from the operating system cache, or whether it’ll have to go to disk. (This distinction is abstracted away by the storage engine; the optimizer doesn’t know how a given storage engine stores its data).

I’ll try to omit the details and keep this clean. Let’s take a look at the tables.

It’s a big fact table and two fairly small dimension tables, which is normal. Here is the query:

There are indexes on all the columns in all the ways you’d expect: all the dimension columns are indexed on every table, and there’s a separate index on every column in the WHERE clause. Here’s the query plan initially.

This query will run for days and never complete. No one ever let it finish to see how long it would run.

How do I know it will run for days? Here’s my train of thought:

  • It’s performing index lookups into the fact table, which is big.
  • An index lookup is a random I/O.
  • A modern disk can do about 100 random I/O’s per second, as a rule of thumb.
  • If you do the math with the rows column in EXPLAIN, you realize that this equates to about 18790 * 606 = 11386740 I/O operations, assuming that the indexes are fully in memory.
  • When you divide this by 100 I/O operations per second, and then divide that by 86400 seconds in a day, you get about 2.6 days.

Ouch! That’s slow.

Now let’s look at the alternative: table-scan the fact table, and do index lookups in the two dimension tables. MySQL doesn’t want to choose this join order, so we’ll force it with STRAIGHT_JOIN:

As we saw in the previous post, which I linked at the top of this post, we can scan the fact table in less than an hour. And it turns out that joining to the dimension tables doesn’t slow the query perceptibly, because these tables are small and they stay in memory, in the OS cache. (They don’t get evicted from memory by the cache’s LRU policy, because they are frequently used — once per row in the fact table. The LRU policy evicts old blocks from the fact table instead, which is perfect — these blocks are used only once and not needed again, so they can be replaced).

The difference between the two queries — 55 minutes and 2.6 days — is basically the difference between scanning data sequentially on disk and random disk I/O.

So now you know why one join order is faster than the other. But why didn’t the optimizer know this, too? The optimizer does know that random access is slower than sequential access, but it doesn’t know that the dimension tables will stay in memory, and this is an important distinction.

Let’s put ourselves into the mindset of the optimizer. We’ll assume that every join to the dimension tables will go to disk instead of being read from cache. Now the STRAIGHT_JOIN becomes a table scan of about 313 sequential reads (150 million rows / 117 bytes per row / 4096 bytes per read), plus about 150 million random I/Os for the first dimension table, plus 150 million random I/Os for the second dimension table. That’s 300 million random I/O operations.

In contrast, the optimizer chose a plan that it thought would cause only 11.3 million random I/O operations.

The optimizer was being smart, given its lack of knowledge about the OS cache. This is why an expert is sometimes needed to provide the missing information. If the MySQL optimizer were right and each of these had to go to disk, our STRAIGHT_JOIN plan would take more than a month to complete! Good thing we know the difference between cache and disk!

Subscribe
Notify of
guest

0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments

Far
Enough.

Said no pioneer ever.
MySQL, PostgreSQL, InnoDB, MariaDB, MongoDB and Kubernetes are trademarks for their respective owners.
© 2026 Percona All Rights Reserved