A quite common beginner's mistake is not understanding how indexing works, and so indexing all the columns used in the queries… separately. You end up with a table that has, say, 20 indexes, but all single-column ones. This can be spotted at a glance. If you have queries with multiple column restrictions in the WHERE clause, you will most likely need multi-column indexes for optimal performance. But wait: do not go ahead and index all combinations. That would likely be a poor choice too 🙂
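As an illustrative sketch (the table and column names here are made up), a query restricting on two columns is usually served better by one composite index than by two single-column ones:

```sql
-- Hypothetical example: instead of two single-column indexes,
--   CREATE INDEX idx_last ON users (last_name);
--   CREATE INDEX idx_city ON users (city);
-- one composite index matches the WHERE clause directly:
CREATE INDEX idx_last_city ON users (last_name, city);

SELECT * FROM users
WHERE last_name = 'Smith' AND city = 'Boston';
```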

27 Comments
Rob Wultsch

I’d rather see 20 separate indexes than an index which covers 15 columns….
(I’ve seen both)

charles

Yeah, I'd also rather see 20 separate indexes than one covering 15 columns.

Nacho

What do you think would be a better approach for a query with as many restrictions in the WHERE clause as this filter? http://browseusers.myspace.com/Browse/browse.aspx?MyToken=633549779643841239 (in the case that all the data is in the same table)
I guess a multi-column index on the main columns would be the best choice.

tgabi

There are only a few particular cases when multi-column indexes are used by MySQL. Most of the time MySQL will scan just the first column of the index (check EXPLAIN to verify). With two separate indexes there is at least a chance that MySQL will use both indexes together.
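A minimal sketch of what this looks like in practice (table and column names are hypothetical): with two single-column indexes, EXPLAIN may report an index-merge intersection.

```sql
CREATE INDEX idx_a ON t (a);
CREATE INDEX idx_b ON t (b);

-- MySQL *may* combine both indexes; look for type = index_merge
-- and Extra = Using intersect(idx_a,idx_b) in the output:
EXPLAIN SELECT * FROM t WHERE a = 1 AND b = 2;
```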

Angelo Mandato

I suspect no one “tests” scenarios based on the comments left here. There are situations that can warrant any number of scenarios. I can tell you that I’ve seen a use for creating one index over 5 columns, and it made a huge difference. In my case, queries went from taking over an hour to 5 minutes by adding 3 columns to my original index.

The last column I was searching on in my WHERE clause was a string. Here’s how I took that 5 minutes and brought it down to 2 seconds: I added a new unsigned INT column where I stored the string’s CRC32 value. Now instead of doing WHERE string = ‘value’ I do WHERE string_crc32 = ‘4958494034’, for example. Since I added string_crc32 to my index (now with 5 columns), my queries execute in seconds rather than minutes.
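A sketch of the technique described above, assuming a hypothetical items table (MySQL's built-in CRC32() function computes the hash):

```sql
-- Store the CRC32 of the string in an unsigned INT column:
ALTER TABLE items ADD COLUMN string_crc32 INT UNSIGNED NOT NULL DEFAULT 0;
UPDATE items SET string_crc32 = CRC32(string);

-- Make it the last column of the multi-column index:
CREATE INDEX idx_lookup ON items (account_id, product_id, string_crc32);

-- Search by the integer hash; re-checking the string guards
-- against the (rare) CRC32 collisions:
SELECT * FROM items
WHERE account_id = 1 AND product_id = 2
  AND string_crc32 = CRC32('value') AND string = 'value';
```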

You can index a VARCHAR, but from my tests it just doesn’t match the performance of indexing an INT, unsigned or not.

You can take the advice or not, but even advice is just that. The real way to know whether a change makes a difference in your situation is to create a development environment with the same database on a slower machine/server, set up query logging, and use EXPLAIN on the queries that are slow. Once you get your creative juices flowing with knowledge like the above, you can take just about any slow query and make it super fast. This is how I did it: I tried every combination, testing each with EXPLAIN. Eventually you’ll find the right combination. For me, the primary key had to be first, the other three columns in my index had to be in the middle, and the string_crc32 was last.

I was somewhat clever with my queries until I attended a presentation by Jay Pipes on Join Fu (http://www.jpipes.com/). If you think this discussion is enlightening, then you definitely want to check out his Join Fu slides on his blog.

tgabi

As I said: a few particular cases. Of course everybody does (or should) test their queries. Of course integer indexes with high cardinality will beat string indexes – unless you need to search by partial matches like ‘abc%’. But we’re talking here about MySQL behaviour. I suspect that in your case the cardinality of the index was pretty bad and that it improved by adding the extra 3 columns. I bet you could drop a few columns from that index without losing any performance. And there’s the problem of MySQL choosing single-column indexes where multi-column indexes are present, wasting memory and disk space.
I still think that single-column indexes make more sense in MySQL most of the time. Index intersection should be more efficient than a multi-column index for searching. Ordering is a different issue.

Angelo Mandato

Like I said, in my case I tried multiple combinations, including removing some of the columns you recommend I remove, and that had performance consequences. I join 2 tables together, plus my WHERE is a bit more complicated, referencing account and product IDs, which I didn’t include in the example above. Again, try every combination in a test environment. Another tip: turn off query caching on your test MySQL server so that queries are re-executed. Cached results can make you think you have an optimized query when you really don’t.

Thinking in terms of the process of finding the records: if your first indexed column cuts the results from billions to millions, and your second, third and fourth columns cut that down to about a million records, you definitely want to keep those 3 additional columns in your index. When that final extra column moves you from a million records to 20, it makes all the difference. Then your database doesn’t have to walk over a million records to find those 20.

One other thought: if you’re not working with millions of records, you may get better results just by tweaking the MySQL config settings rather than spending time optimizing your queries, especially if you’re trying to improve queries that take less than 5–10 seconds. For readability purposes, you can just index one column without putting much thought into how the data will be retrieved by the database. This lets you focus on the end goal: the web site or application’s results. Again, it depends on your situation.

I will agree with tgabi: most of the time multi-column indexes are not necessary. If you’re not having issues, then you most likely have no need to read this post either. I’m just trying to help others by making them aware of the reasons for particular solutions (a multi-column index in this case) in MySQL.

tgabi

Yeah, all queries in testing should use SQL_NO_CACHE.
Is the table with the index you’re talking about the pivot of the join? I agree that indexes are good for cutting through useless data; still, so many values for each key of the first index makes me think the data is not normalized. Not that that’s a bad thing – sometimes de-normalizing the data improves performance, it’s just not common.
I’m glad we agree on the per-case usage of multi-column indexes – which contradicts Peter’s view, BTW.
Billions of records? Wow. And 5-column indexes? Double wow.
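The SQL_NO_CACHE modifier mentioned above goes right after SELECT; for example (the table here is hypothetical):

```sql
-- Bypass the query cache so each test run measures real
-- execution time rather than returning a cached result:
SELECT SQL_NO_CACHE * FROM orders WHERE order_date >= '2010-01-01';
```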

tgabi

Peter,
You said: “in case there is good cardinality on both columns and the full index can be used”. I’d call that a best-case scenario. In my line of work I’m searching through many low-cardinality columns in the same query. The query is very specific, yet each column has only a few values. Sometimes I wish I could find a multi-column index that works. I have a few multi-column indexes that do work, particularly when used for ordering (ORDER BY a,b LIMIT 5).
Index merge is the only thing that works – when it works. I do agree that nobody should rely on index merge for now.
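The ordering case mentioned above is one where a compound index clearly pays off; a sketch with a hypothetical table:

```sql
-- An index in the same column order as the ORDER BY lets MySQL
-- read the first 5 rows straight off the index, avoiding a filesort:
CREATE INDEX idx_ab ON t (a, b);

SELECT * FROM t ORDER BY a, b LIMIT 5;
```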

tgabi

I think there are many cases where multi-column indexes don’t work. That’s all. By looking at the index structure alone you cannot draw any conclusion about the design quality. You need to look at the data too.
It seems that we disagree on index merge versus multi-column indexes. I’m saying that I see index merge working where multi-column indexes don’t. You’re saying it cannot be used – you don’t say why, but I don’t disagree. One thing is clear: single-column indexes are not enough, no doubt about that.
Regarding billions of records: it’s just awesome. It’s not every day you hear something like that. Well, maybe you do; I don’t. I’m a few months away from the “billion records club”, BTW.

tgabi

Peter,
you said “assuming you can use index prefix which is selective enough”. What do you mean by that?
In a specific case – derived from Nacho’s example: say you have millions of users with gender, race and hair color. Many males, many blacks, many blondes, very few with all three. How do you see this search in MySQL?

Ruslan Zakirov

More about Nacho’s example

1) It looks like all the fields are linked one-to-one to users, so they can all be stored in the users table (maybe except looking_for, but let’s keep it simple) – something like (gender, age, available, looking_for, location, has_foto)
2) Booleans – gender, has_foto; enums – available, looking_for, location; integer – age
3) All queries fall into X IN (…) AND Y IN (…) AND …

Some conditions fall into X = const – ref access, which may be preferable over range. The condition on age may have up to 60 values, so it can be converted to IN instead of using BETWEEN [1].

As all the boolean operators are ‘AND’, MySQL can use either indexes built on multiple columns or an intersection of multiple index accesses. When we talk about index_merge intersection, we should understand that MySQL finds one set using one index, finds another set using another, and then finds the records which exist in both sets [2]. Consider “age = 18 AND available = ‘divorced’”. I used Russia as the target for investigation, and here are the results: both conditions – 58, divorced – 623, 18-year-olds – 3000 (this looks like a display limit, and most probably the real number is much higher). It’s pretty obvious that using an index on (age, available) is better all the time, as it’s always more selective than either of its columns alone. If you don’t have such a compound index, then MySQL will most probably use the index on the “available” column only, with ref access, and filter by age afterwards. That is the optimal choice. So in any case index intersection sucks. When doesn’t it suck? In something like “age = 68 AND available = ‘engaged’” – but even in that case the compound index quickly returns a few rows.
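The compound index from this example might be declared as follows (the table name is assumed):

```sql
-- (age, available) is always at least as selective as either
-- column alone, so it beats an index-merge intersection here:
CREATE INDEX idx_age_available ON users (age, available);

SELECT * FROM users
WHERE age = 18 AND available = 'divorced';
```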

Enough about intersections. About booleans: has_foto? How many of your users have a photo? 100%? 10%? Most probably it’s more than 70%, or even close to 90%. So it’s a very unselective column. Should we avoid using it in indexes? Yes, if it’s an index based on this column only. No, if it’s part of a compound index. Why? Because more than 70% of people will check “show people with fotos only”, and MySQL can use the index to filter records before fetching any rows. How much will that gain? Not much – worth investigating. The impact may even be negative, especially when almost all people have fotos.

Gender? You can use it everywhere, as Peter suggested, since 90% of searches will use either ‘male’ or ‘female’ and it will halve the result set. However, you most probably should cheat a little and give MySQL a hint by using “gender IN (‘male’,’female’)” when people are looking for any gender.
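The hint would look something like this (hypothetical schema); keeping a condition on gender lets an index whose prefix includes gender stay usable even for "any gender" searches:

```sql
-- Searching "any gender": spell the condition out anyway so a
-- (gender, ...) index prefix can still be used:
SELECT * FROM users
WHERE gender IN ('male', 'female') AND location = 5 AND age = 25;
```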

Other columns: location is mandatory, with quite good selectivity compared with the other enums. Age is mandatory as well, but it can select few rows or many – that depends.

Final set?

1) (City, Age, Gender) or (City, Gender, Age) – you should benchmark and check EXPLAINs to see which one is better for all combinations of (male, female, both) and for different ranges on Age with high and low selectivity. In the end you may end up with one index, both, or maybe consider using (City, Age). Whatever you select can be used as a prefix for other indexes, because the conditions on these fields are mandatory.

2) You’ve selected a mandatory prefix; for example, (City, Gender, Age) is the most effective. Drop it 🙂 and instead create (City, Gender, Age, LookingFor) and (City, Gender, Age, Available). Then MySQL can choose whichever of the two is more selective.

3) You could even use (City, Gender, Age, LookingFor, Available) – needs investigation.
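Sketching step 2) as DDL (column names assumed):

```sql
-- Shared mandatory prefix (city, gender, age), each index extended
-- with one optional filter column; the optimizer picks whichever
-- index is more selective for a given query:
CREATE INDEX idx_cga_looking   ON users (city, gender, age, looking_for);
CREATE INDEX idx_cga_available ON users (city, gender, age, available);
```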

Aaaa… such long indexes. Let’s say you’re superb and have 1 billion users. If you’re smart enough and read MySQL performance articles, then you’ll store all those fields as enums or tiny ints with a dictionary (in your app or in a table). 4 (id) + 1 + 1 + 1 + 1 + 1 + 1 = 10 bytes per record => >= 10GB of data. Each index will be between 5GB and the size of the data (for InnoDB). Maybe I’m wrong here, but I do think that the size of the index (City, Gender, Age) will be smaller than the combined size of the three indexes (City), (Gender) and (Age). Considering the quite low standalone selectivity of those columns, it’s up to you to decide what to do.

1. http://www.mysqlperformanceblog.com/2006/08/10/using-union-to-implement-loose-index-scan-to-mysql/#comment-1695
2. http://dev.mysql.com/doc/refman/5.1/en/index-merge-intersection.html

Ruslan Zakirov

I was describing myspace’s search referenced by Nacho, where country (sorry, it’s not city) is mandatory. Anyway, the condition on age is mandatory and it’s a range, so sorting will surely be a problem, especially when the result set is big. At least it will be a filesort and not necessarily a temporary table, since we select from one table.

To solve the sorting issue on big sets, people can use a pre-EXPLAIN, then decide whether to use a hint to force ordered access or access based on conditions. To avoid the EXPLAINs, local in-memory statistics can be used to calculate condition selectivity.

Anyway, the question was about sets of indexes in a compound vs. single-column context. Sorting falls out of scope 🙂

David Drouin

Here’s hoping that eventually MySQL’s InnoDB becomes close to on par with the competition – in particular that other open source db 😉 – for relational database support. Going from any of those systems to MySQL is like switching from driving a modern automobile to one from the Flintstones. Okay, it’s not that bad, but it can be downright abysmal in some respects. The index requirement for declarative referential integrity is one of the truly frustrating areas.

Let’s say you have a clue and design your OLTP database using at least 3rd normal form and integrity constraints. You realize that ISAM was perhaps a reasonable choice in 1965 for a production database involving inserts, updates and deletes, but not so much in 2010. MySQL, surprisingly, supports declarative integrity constraints. But wait – this being MySQL, there must be something about it that’s bizarre and going to bite you somehow. Well, it also requires that an index be created on each referencing column. At first this may seem tolerable, especially for cases where you require that deletes/updates cascade from parent to child tables – naturally an index on the referencing columns in the child tables would help there. Well, maybe. Often, though, the database is designed such that this is not the case: updates and deletes of primary key values are instead restricted – that is, not permitted. So the indexes on the referencing columns are quite useless. Actually, they’re worse than useless: they’re detrimental. Not only do these indexes slow down insert operations, but the MySQL optimizer sometimes appears to become overwhelmed by the number of indexes and tables to choose from for moderately complex queries and selects an extraordinarily inefficient plan, resulting in horrendous performance: it uses one of these foreign key indexes with low cardinality and a fairly even distribution first, instead of a primary or unique key, even when such a key exists and would match exactly 1 row. To circumvent that you need to help it out with optimizer hints like STRAIGHT_JOIN to have it look at tables in the right order and come up with a sane choice (this is with version 5.0.45, BTW). This is great fun when you’re using an ORM like SQLAlchemy and want to avoid writing SQL directly in application code or introducing things that tie you to a particular DBMS.

So, as this thread started out by alluding to, lots of single-column indexes are generally not a good idea – yet MySQL forces them upon you if you actually implement a proper database design and use declarative syntax to do so.

Here’s an example table

create table orders(
order_id int unsigned not null auto_increment primary key,
item_type_id tinyint unsigned not null,
locale_id smallint unsigned not null,
currency_id tinyint unsigned not null,
operator_id smallint unsigned not null,
order_amount decimal(10, 4) not null,
order_date datetime not null,
last_updated timestamp not null default current_timestamp on update current_timestamp,
foreign key(item_type_id) references item_types(item_type_id),
foreign key(locale_id) references items(locale_id),
foreign key(currency_id) references items(currency_id),
foreign key(operator_id) references items(operator_id)
);

create index order_date_idx on orders(order_date);

Okay, so we know that we’ll never delete or change item_type IDs, locale IDs, currency IDs or operator IDs. And even if we did, it would be a big deal – requiring system downtime, etc. New ones may be added, but that’s it. Let’s say there are about 10 million rows in this orders table. Of the referenced columns, there are 10 types of items, 5 locales they can currently be sold in, 3 currencies and 20 operators.

MySQL will create the following indexes:
unique index on order_id
index on item_type_id
index on locale_id
index on currency_id
index on operator_id
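You can verify which indexes MySQL actually created on the example orders table with:

```sql
-- Lists every index on the table, including those created
-- automatically for the foreign key columns:
SHOW INDEX FROM orders;
```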

You have no control over this at all – aside from not using declarative integrity constraints. You could resort to the 1980s/early-90s way of writing triggers to implement referential integrity, but I hear triggers are pretty slow in MySQL too, and writing a few hundred of those isn’t going to be enjoyable at all.

Additionally I need an index on order_date to help with finding orders for a given date.

8 columns, 6 single-column indexes. This is silly. You can imagine a table with more columns and FKs, and more indexes. Sure, you can vertically partition the data, but you’ll end up with the same indexes regardless.

The foreign key indexes are all low-cardinality with roughly even distributions – nearly always a poor choice for any query. This is why most, if not all, widely used databases do not require indexes on FKs. The database designer should be able to choose what to do. Having them by default might be OK, but they should not be required.

David Drouin

I realize I pressed submit too soon and had some typos in the FK definitions:

create table orders(
order_id int unsigned not null auto_increment primary key,
item_type_id tinyint unsigned not null,
locale_id smallint unsigned not null,
currency_id tinyint unsigned not null,
operator_id smallint unsigned not null,
order_amount decimal(10, 4) not null,
order_date datetime not null,
last_updated timestamp not null default current_timestamp on update current_timestamp,
foreign key(item_type_id) references item_types(item_type_id),
foreign key(locale_id) references locales(locale_id),
foreign key(currency_id) references currencies(currency_id),
foreign key(operator_id) references operators(operator_id)
);

Will

Thanks for the great blog. How would you index a product table with columns like price, unitprice, weight, color, created_on, …, where the SQL is dynamically generated from the web to look up products? For example:

select * from products where price > 10 and price < … and unitprice > 5 and weight < 70 and color = 1 … order by created_on desc limit 10;

Note that all the conditions are dynamically generated, and any of them may or may not be present; e.g., some people filter by price, some by color, some by unitprice – numerous possible combinations. How would you index this type of table? Appreciate it!