Where the open source community meets: Secure your spot for Percona Live Amsterdam! - Register

Downloads

Blog

Are You Forcing MySQL to Do Twice as Many JOINs as Necessary?

September 29, 2011

Author

Percona

MySQL

Share this Post:

Baron Schwartz

This guest post is from our friends at Percona. They’re hosting Percona Live London from October 24-25, 2011. Percona Live is a two day summit with 100% technical sessions led by some of the most established speakers in the MySQL field.

In the London area and interested in attending? We are giving away two free passes in the next few days. Watch our @tokutek twitter feed for a chance to win.

Did you know that the following query actually performs a JOIN? You can’t see it, but it’s there:

SELECT the_day, COUNT(*), SUM(clicks), SUM(cost)
FROM ad_clicks_by_day
WHERE the_day >= '2005-07-01' AND the_day < '2005-07-07'
GROUP BY the_day;

SELECT the_day, COUNT(*), SUM(clicks), SUM(cost)

FROM ad_clicks_by_day

WHERE the_day >= '2005-07-01' AND the_day < '2005-07-07'

GROUP BY the_day;

Let me explain.

Suppose you define the table as follows:

CREATE TABLE ad_clicks_by_day (
customer INT  NOT NULL,
the_day  DATE NOT NULL,
clicks   INT  NOT NULL DEFAULT 0,
cost     INT  NOT NULL DEFAULT 0,
PRIMARY KEY(customer, the_day),
INDEX(the_day)
) ENGINE=InnoDB;

CREATE TABLE ad_clicks_by_day (

customer INT NOT NULL,

the_day DATE NOT NULL,

clicks INT NOT NULL DEFAULT 0,

cost INT NOT NULL DEFAULT 0,

PRIMARY KEY(customer, the_day),

INDEX(the_day)

) ENGINE=InnoDB;

What happens when MySQL executes this query? Here’s the EXPLAIN:

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ad_clicks_by_day
         type: range
possible_keys: the_day
          key: the_day
      key_len: 3
          ref: NULL
         rows: 368
        Extra: Using where

*************************** 1. row ***************************

id: 1

select_type: SIMPLE

table: ad_clicks_by_day

type: range

possible_keys: the_day

key: the_day

key_len: 3

ref: NULL

rows: 368

Extra: Using where

That looks fine, doesn’t it? It’s using an index range scan to find approximately 368 matching rows, and adding up the clicks and cost for them. Can it be done any more efficiently than this?

In fact, it turns out that it’s possible to execute this query much more efficiently. What happens internally when the server executes this query is that it begins reading from the ‘the_day’ index, and for each row it finds, it performs a lookup in the primary key (which, in InnoDB, stores the whole table) to find the other columns mentioned in the query. That is, it’s joining the index to the primary key! This is not just an ivory-tower abstraction. In real-life workloads, the random I/O caused by such index-to-table joins slows queries dramatically.

The “join” here isn’t really the same as a true table-to-table SQL JOIN in some ways, and it has a little less overhead at the server level than a table-to-table join, but in terms of the way an index-to-row lookup works, it’s quite similar in many ways.

If you’re skilled at logical and physical database design, you already noticed that my example can be improved by indexing differently: just put the primary key on (the_day, customer) instead of the other way around. Then the query will use the primary key instead of the secondary key, and the whole row is available to the query without doing a join to another index. This is called a covering index, which means that the index covers the whole query–there is no need for any data outside the index. You’ll see “Using index” in the Extra column from EXPLAIN when an index covers a query.

But then what if we want to group the results by customer? No problem, we can just add another index. To achieve the same results for ad-hoc querying, we’ll need to index every column:

ALTER TABLE ad_clicks_by_day ADD INDEX(the_day, customer, clicks, cost);

1	ALTER TABLE ad_clicks_by_day ADD INDEX(the_day, customer, clicks, cost);

That works, but isn’t that going to be kind of expensive? We’re essentially duplicating the whole table. And what if we have some other columns we want to target for grouping, or filtering, or sorting, or any other operation that can be optimized with an index? Uh-oh, that’s going to be a lot of indexes.

The overhead of maintaining indexes is often mentioned, but rarely measured. It is sometimes exaggerated as a result. (For further reading on this point, see the book Relational Database Index Design and the Optimizers by Tapio Lahdenmaki.) However, if you did measure the overhead, you would notice two important things: increased disk I/O for modifications to the table, and increased disk space consumption. Even if it’s sometimes blown out of proportion, it’s real.

That’s why I like TokuDB’s indexing technology and modifications to the storage engine API. Indexes are much cheaper in TokuDB than they are in InnoDB, for several reasons:

- TokuDB compresses its data quite well, which reduces the footprint both in space consumption and in I/O operations.

- TokuDB lets you define multiple clustering indexes–indexes that sort and compare only on the key columns, but which include all columns in the table, transparently. That avoids invisible JOINs. I want this feature in InnoDB.

- TokuDB uses a different data structure for indexes, which is significantly more efficient. Instead of the classic B-Tree structure, TokuDB uses Fractal Trees, which don’t slow down when they get bigger than memory.

As a result of these three properties of TokuDB, you can maintain a lot of indexes on a lot of data relatively cheaply. And that means that you can index your data to match your query patterns, and get much faster performance by removing a lot of random I/O operations caused by finding data and then looking up the rest of the row.

Disclosure: Tokutek paid Percona to evaluate the technology, but it is our practice to offer only our honest opinion in blog posts such as this one.

About the Author

Baron is Percona’s Chief Performance Architect. He’s the lead author of High Performance MySQL, and creator of Maatkit. He consults with Percona’s customers, and develops tools and practices for Percona’s team.

0 0 votes

Article Rating

4 Comments

Oldest

Newest Most Voted

Shlomi Noach

14 years ago

Baron,
“…This is called a covering index, which means that the index covers the whole query–there is no need for any data outside the index. …”
I wish to argue this is not exact: the query mentions “SUM(clicks), SUM(cost)”, both columns are not covered by suggested index.

One can argue (*) that in InnoDB a primary key is always a covering index, just because it’s a clustering index. Nevertheless, access is required to data (leaf) pages, which, in covering index, would otherwise be accessible one level above. I do not know if current InnoDB implementation supports such optimization or not; but it is not clear that the index is indeed covering.

(*) I may argue back, then, that by same reasoning, any full table scan on InnoDB tables is a ‘covering index’ query. I’m not implying you make that argument.

Regards

Mark Callaghan

14 years ago

Great point but I think the example can be much simpler (without aggregation). Non-index only queries for InnoDB are joins between the secondary and cluster index.

TokuDB can make all/most index range scan queries index-only. This is a huge deal.

Baron Schwartz

14 years ago

Shlomi,

InnoDB doesn’t support returning values from non-leaf nodes. In MySQL, anyway, “covering” just means that the leaf nodes have the columns. Secondary indexes on non-primary-key columns actually cover more columns than those mentioned in the index explicitly, because the leaf nodes carry the primary key columns as well. These “included columns” are still useful to the query, even though they aren’t in the non-leaf nodes.

An index that covers the query and can be accessed sequentially is a huge win, even if it’s less selective. The Lahdenmaki and Leach book I mentioned lays the math out in detail, along with a model for rating indexes by “stars” to indicate their benefit for a particular query. A “three-star” index lets the query avoid sorting, covers the index, and is highly selective. But even an index that’s not very selective is a huge win if it avoids sorting or random I/O and lookups outside the index.

Mark,

You are right that I could have chosen a much simpler example. This one came to mind because it’s a sort of stereotypical query that I see a lot and could be optimized well by TokuDB.

Anonymous Hacker

9 years ago

I’d like some clarification about the following part of your post: “example can be improved by indexing differently: just put the primary key on (the_day, customer) instead of the other way around. Then the query will use the primary key instead of the secondary key, and the whole row is available to the query without doing a join to another index.”

Why does reversing the key order matter? Put another way, why is the query planner unable to reverse the terms if that would be advantageous–IIRC, logical predicates don’t associatively short-circuit in SQL, so one would think terms would be reordered to fit the available indices best.

Is this understanding incorrect, or simply inapplicable to the example given, or is it a deficiency in the query planner being used (as it exist(ed) at the time of this post)?