Multi Column indexes vs Index Merge

The mistake I commonly see among MySQL users is how indexes are created. Quite commonly people just index individual columns as they are referenced in where clause thinking this is the optimal indexing strategy. For example if I would have something like AGE=18 AND STATE=’CA’ they would create 2 separate indexes on AGE and STATE columns.

The better strategy is often to have combined multi-column index on (AGE,STATE). Lets see why it is the case.

MySQL indexes are (with few exceptions) BTREE indexes – this index type is very good to be able to quickly lookup the data on any its prefix and traversing ranges between values in sorted order. For example when you query AGE=18 with single column BTREE index MySQL will dive into the index to find first matching row and when will continue scanning index in order until it runs into the value of AGE more than 18 when it stops doing so assuming there are no more matching. The RANGES such as AGE BETWEEN 18 AND 20 work similar way – MySQL just stops at different value.

The enumerated ranges such as AGE IN (18,20,30) are more complicated as this is in fact several separate index lookups.

So we spoke about how MySQL uses the index but not exactly what it gets from the index – typically (unless it is covering index) MySQL gets a “row pointer” which can be primary key value (for Innodb tables) physical file offset (for MyISAM tables) or something else. It is important internally storage engine can use that value to lookup the full row data corresponding to the given index entry.

So what choices does MySQL have if you have just 2 separate indexes ? I will either use just one of them to look up the data (and check remaining portion on WHERE clause after data is read) or it can lookup the row pointers from all indexes, intersect them and when lookup the data.

Which method works better depends on selectivity and correlation. If where clause from first column selects 5% of the rows and applying where clause on second column brings it down to 1% using intersection makes sense. If second where clause just brings it down to 4.5% it is most likely better to use single index only and simply filter out rows we do not need after lookup.

Lets look at some examples now:

I made columns i1 and i2 independent in this case each selecting about 1% rows from this table which contains about 10M rows.

As you can see MySQL picks to use combined index and indeed the query completes in less than 10ms

Now what if we pretend we only have single column indexes (by hinting optimizer to ignore combined index)

As you can see in this case MySQL does the Intersection for the index values (the process a mentioned above) and the query now runs in about 70 ms which is quite a difference.

Now lets see if using just single index and post filtering is any better:

Now this query scans a lot of rows and completed in about 290ms
So we can see index merge indeed improves performance compared to single index only but it is by far better to use multi column indexes.

But the problems with Index Merge do not stop there. It is currently rather restricted in which conditions it would do the index merge: