November 22, 2014

How to find wrong indexing with glance view

Quite common beginners mistake is not to understand how indexing works and so index all columns used in the queries…. separately. So you end up with table which has say 20 indexes but all single column ones. This can be spotted with a glance view. If you have queries with multiple column restrictions in WHERE clause you most likely will need to have multiple column indexes for optimal performance. But wait. Do not go ahead and index all combinations. This would likely be poor choice too :)

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. Rob Wultsch says:

    I’d rather see 20 separate indexes than an index which covers 15 columns….
    (I’ve seen both)

  2. peter says:

    Hehe. Indeed. Putting all columns you want to lookup on in the index (in any order) is another mistake you can see. Though I think it is less frequent one.

  3. charles says:

    Yeah i want to see 20 separate indexes too than 15 columns.

  4. Nacho says:

    how do you think it would be better approach with a query with so many restrictions in WHERE clause like this filter ? http://browseusers.myspace.com/Browse/browse.aspx?MyToken=633549779643841239 (in case that all data is on the same table)
    i guess a multiple index in the main columns would be the best choice

  5. tgabi says:

    There are few particular cases when multi-column indexes are used by mysql. Most of the time mysql will scan the first column on the index (check EXPLAIN to verify). At least with 2 separate indexes there is a chance that mysql will use both indexes together.

  6. I suspect no one “tests” scenarios based on the comments left here. There are situations which can warrant any number of scenarios. I can tell you that I’ve seen a use for creating one index for 5 columns and it made a huge difference. In my case, queries went from taking over an hour to 5 minutes by adding 3 columns to my original one index.

    The last column I was searching for in my WHERE clause was a string. Here’s how I took that 5 minutes and brought it down to 2 seconds: I made a new column that was an unsigned INT where I stored the string’s crc32 value. Now instead of doing WHERE string = ‘value’ I do WHERE string_crc32 = ‘4958494034’ for example. Since I now added string_crc32 to my index (now with 5 columns) my queries execute in seconds rather than minutes.

    You can index a VARCHAR but from my tests it just doesn’t match the performance of indexing an INT, unsigned or not.

    You can take the advice or not but even advice is just that. Real way to know if a change makes a difference in your situation is to create a development environment with the same database on a slower machine/server, setup log-queries and use EXPLAIN on queries that are slow. Once you get your creative juices flowing with knowledge like above you can take just about any slow query and make it super fast. This is how I did it, I tried every combination testing each with EXPLAIN. Eventually you’ll find the right combination. For me, the primary key had to be first, the other three columns in my index had to be in the middle and the string_crc was last.

    I was somewhat cleaver with my queries till I attended a presentation by Jay Pipes on Join Fu. (http://www.jpipes.com/) You think this discussion is enlightening, then you definitely want to check out his Join Fu slides on his blog.

  7. tgabi says:

    As I said: few particular cases. Of course everybody do (or should) test the queries. Of course integer indexes with high cardinality will beat string indexes – unless you need to search by partial matches like ‘abc%’. But we’re talking here about mysql behaviour. I suspect that in your case the cardinality of the index was pretty bad and that improved by adding extra 3 columns. I bet that you can drop a few columns from that index without losing any performance. And there’s the problem of mysql choosing single column indexes where multi-column indexes are present, and wasting memory and disk space.
    I still think that single column indexes make more sense in mysql most of the time. Index intersection should be more efficient than multi-column index for searching. Ordering is a diffent issue.

  8. Like I said in my case I tried multiple combinations, including removing some of the columns that you recommend that I remove and that had performance consequences. I join 2 tables to together plus my WHERE is a bit more complicated referencing account and product id’s, which I didn’t include in the example above. Again, try every combination in a test environment. Another tip, turn off caching on your testing environment MySQL that way the queries are re-executed. Cached results can make you think you got an optimized query when you really don’t.

    Thinking in terms of the process of finding the records, if your first indexed column cuts the results from billions to millions, then your second, third and fourth column cut that down to about a million records, you definitely want to keep those 3 additional columns in your index. When that 4th extra column moves from millions to 20 records, it makes all the difference. Then your database doesn’t have to walk over a million records to find those 20 records.

    One other thought, if your not working with millions of records, you may have better results just by tweaking the MySQL config settings rather than spending time optimizing your queries, especially if your trying to improve queries that are less than 5-10 seconds. For readability purposes, you can just index one column without putting much thought into how the data will be retrieved by the database. This allows you to focus on the end goal, the web site or application’s results. Again, depends on your situation.

    I will agree with tgabi, most of the time multi-column indexes are not necessary. If you’re not having issues then you most likely have no need to read this post either. I’m just trying to help others by making aware reasons for particular solutions (multi-column index in this case) in MySQL.

  9. peter says:

    tgabi,

    MySQL Will only use only first column of multiple column index in case you have Range on the first column like col1 BETWEEN 5 and 10 – it is exactly skill of creating multi column indexes to ensure they are created smart way, so they can be used, as well as queries are written appropriate way (for example IN vs BETWEEN makes hell a lot of difference)

    Saying MySQL can’t use multiple column indexes is simular to saying MySQL always does full table scan – if you write write you queries suboptimal way or if you have poor choice of indexes (like having index on gender for general population) that is exactly what will happen.

  10. peter says:

    Angelo,

    You pick one particular query and for that query you indeed do not need the index because the extra column does not really improve selectivity. In this case you’re you exactly should not extend the index unless you’re planning to use index covered queries for this query.

  11. peter says:

    tgabi,

    Multiple column indexes will be a lot faster than multiple single column indexes in case there is good cardinality on both columns and full index can be used. Relaying on index-merge is actually very typical design mistake.

  12. peter says:

    The point:

    My point is not what single column indexes are bad and you always need multiple column indexes but the fact it is quite typical there are some queries which will benefit from multiple column indexes. If I see all tables have only single column indexes this typically means someone does not know about multiple column indexes and how MySQL can use them :)

  13. tgabi says:

    Yeah, all the queries in testing should have SQL_NO_CACHE.
    The table with the index you’re talking about is the pivot of the join ? I agree that indexes are good to cut trough useless data, still, so many values for each key of the first index makes me think the data is not normalized. Not that’s a bad thing, sometimes de-normalizing the data improves performance, just not common.
    I’m glad we agree on the per-case usage of multi-column indexes – contradicts Peter’s view BTW.
    Billions of records ? Wow. And 5 columns indexes ? Double wow.

  14. tgabi says:

    Peter,
    You said:”in case there is good cardinality on both columns and full index can be used”. I’d call that a “best case scenario”. In my line of work I’m searching trough many low cardinality columns in the same query. The query is very specific, yet each column has only few values.Sometimes I wish I can find a multi-column index that works. I have few multi-column that do work, particular when used for ordering (ORDER BY a,b LIMIT 5).
    Index merge is the only thing that works – when it works. I do agree that nobody should rely on index merge for now.

  15. peter says:

    I think adding to the column to the index (for selection not sorting) has the same question as adding the index for selection all together – if cardinality is poor it is better not to. Though increasing index length usually has less impact than going from full table scan to index ref scan (causing a lot of IOs)

    In case of low selectivity indexes you too can often build multiple column indexes, assuming you can use index prefix which is selective enough. Though there are times when bitmap indexes would be much better.

  16. peter says:

    tgabi,

    I do not see what you really disagree with. It is the question of multiple column indexing speeding up your queries. If they do not. There is surely no point using them. Billions of records ? So what. If you your main queries can use all key parts in 5 column index it will be faster than 5 single column indexes. And also much easier to maintain.
    Really it may look as columns are scary but 5 ints for example is just 20 bytes so the index would be same as index on char(20) in terms of side.

  17. tgabi says:

    I think there are many cases where multiple column indexes don’t work. That’s all. By looking at the index structure alone you cannot draw any conclusion about the design quality. You need to look at the data too.
    It seems that we disagree on index merge versus multiple column index. I’m saying that I see index merge working when multiple column indexes don’t. You’re saying it cannot be used – you don’t say why, but I don’t disagree. One thing is clear: single column indexes are not enough, no doubt about that.
    Regarding billions of records: it’s just awesome. It’s not every day you hear something like that. Well maybe you do, I don’t. I’m few months away from “billion records club” BTW.

  18. peter says:

    tgabi,

    I have cases when billions records are loaded daily so it no more excites me.

    Speaking about index merge – indeed there are cases when index merge works when multiple column index does not however if they both work multiple column index are most likely to be the winner.

    There are always exceptions. Even running with MySQL defaults can be good fit for particular case. I’m just saying this is very typical red flag thing. Sure you need to look in details to check if it is really the case but it is called red flag exactly because it shows there are the problem in say 90% of all cases. If you made deliberate choice going with multiple column indexes good for you – if you can make this deliberate choice you already stand away from the crowd :)

  19. peter says:

    Nacho,

    For cases like you mentioned – Dating site search I would rather go with Sphinx all together. It handles cases like this beautifully (even if you do not have any full text search) though if you want to stick to MySQL you’ve got to look at the cardinality for different columns and query pattern and handle it appropriately. For example gender may not be selective but specified in 99% of the cases which makes it good first column in the index.

  20. tgabi says:

    Peter,
    you said “assuming you can use index prefix which is selective enough”. What do you mean by that ?
    In specific case – derived from Nacho’s example: say you have millions of users with gender, race and hair color. Many males, many blacks, many blondes, very few all three. How do you see this search in Mysql ?

  21. peter says:

    tgabi,

    I mean if you have Index (A,B,C,D,E) and MySQL can use (A,B,C) index only for this query. Considering it is selective enough (enough meaning query response time is acceptable) you’re good. Though as your data size growths the acceptable selectivity tends to grow as well. If you have 100000 profiles selectivity of 1/10 is often good enough as scanning through 10000 rows in memory is often acceptable. With 50.000.000 profiles this would not be the case.

    With Nachos case if you have “WHERE GENDER=’M’ AND RACE=’B’ AND HAIR=’Blond’ and it is very selective index on (GENDER,RACE,HAIR) would work quite well to resolve it. Though you really need to be looking at your best queries while picking solutions and so your worst cardinaliy.

  22. Ruslan Zakirov says:

    More about Nachos’ example

    1) Looks like all fields are linked one-to-one to users so can be all stored in users table(may be except looking_for, but let’s keep it simple), some thing like (gender, age, available, looking_for, location, has_foto)
    2) booleans – gender, has_foto; enums – available, looking_for, location; integer – age
    3) All queries fall into X IN (…) AND Y IN (…) AND …

    Some conditions fall into X = const – ref access that may be preferable over range. Condition on Age may have up to 60 values so it can be converted to IN instead of using BETWEEN [1].

    As all boolean operators are ‘AND’, mysql can use either indexes built on multiple columns or intersection of multiple accesses by indexes. When we talk about index_merge_intersection, we should understand that mysql finds one set using an index, another set and then finds records which exist in both sets [2]. Consider this “age = 18 and available = ‘divorced'”. I’ve used Russia as target for investigation and here is results: both conditions – 58, divorced – 623, 18th – 3000 (looks like it’s limit and most probably it’s much more than that). It’s pretty obvious that using index (age, available) is much better all the time as it’s always more selective than any of its columns alone. If you don’t have such compound index then mysql most probably use index on “available” column only using ref access and after that filter by age. This is optimal choice. So in any case index intersection sucks. When it doesn’t suck? In something like “age = 68 AND available = ‘engaged'”, but even in this case compound index quickly returns a few rows.

    Enough about intersections. About booleans. has_foto? How many of your users have a foto? 100%? 10%? Most probably it’s more than 70% or even close to 90%. So it’s very not selective column. Should we avoid using it in indexes? Yes, if it’s an index based on this column only. No, if it’s part of compound index. Why? Because more than 70% percents of people will mark “show people with fotos only” and mysql can use index to filter records before fetching any rows. How much it will give? Not many, worth investigation. May be impact will be negative especially when almost all people have fotos.

    Gender? you can use it everywhere as peter suggested as 90% of searches will use one either ‘male’ or ‘female’ and it will give you twice smaller result set. However, you most probably should cheat a little and give mysql a hint by using “gender IN (‘male’,’female’)” when people looking for any gender.

    Other. Location is mandatory with quite good selectivity comparing with other enumns. Age as well is mandatory, but it can select few rows or many that depends.

    Final set?

    1) (City, Age, Gender) or (City, Gender, Age) – you should benchmark and check EXPLAINS which one is better for all combinations of (male, female, both) and different ranges on Age with high selectivity and low. At the end you can end up with one index, both or may be consider using (City, Age). Whatever you select to use can be used as prefix for other indexes cuz conditions on these fields are mandatory.

    2) You’ve selected mandatory prefix, for example (City, Gender, Age) is the most effective. Drop it :) and instead create (City, Gender, Age, LookingFor) and (City, Gender, Age, Available). So mysql can choose between two which is more selective.

    3) You can even use (City, Gender, Age, LookingFor, Available) – needs investigation.

    Aaaa… so long indexes. Let’s consider you’re superb and have 1 billion of users. If you’re smart enough and read mysql performance related articles then you’ll store all those fields as enums or tiny int with dictionary (in your up or table). 4(id)+1+1+1+1+1+1 = 10 bytes per record => >=10GB for data. Each index will be between 5GB and size of data (for innodb). May be I’m wrong here, but I do think that size of index(City, Gender, Age) will be smaller than size of three (City), (Gender) and (Age). Considering quite low standalone selectivity of those columns, it’s up to you decide what to do.

    1. http://www.mysqlperformanceblog.com/2006/08/10/using-union-to-implement-loose-index-scan-to-mysql/#comment-1695
    2. http://dev.mysql.com/doc/refman/5.1/en/index-merge-intersection.html

  23. peter says:

    Ruslan,

    Thanks for good explanations. The City in the start would work only in case you always specify it… As you can’t use IN trick for cities as there are too many of them.
    The problem with IN in such cases is generally related to sorting – results typically need to be sorted some way and as soon as you use IN on any key parts sorting can’t be done using index and filesort may be too slow for many matches. Though in reality problem can get even more complicated because sorting can be done by the function, for example by distance or using some sort of scoring system.

  24. Ruslan Zakirov says:

    I was describing myspace’s search referenced by Nachos where country (sorry, it’s not city) is mandatory. Anyway, condition on age is mandatory and it’s a range, so sorting sure will be problem. Especially when result set is big. At least it will be filesort and not necessary temporary table as we select from one table.

    To solve sorting issue on big sets, people can use pre-explain. Then decide either use hint to force ordered access or access based on conditions. To avoid explains local in memory statistics can be used to calculate conditions selectivity.

    Anyway, question was about set of indexes with compound vs. single context. Sorting falls out of scope :)

  25. David Drouin says:

    Here’s to hoping that eventually MySQL’s InnoDB becomes close to on par with the competition – in particular that other open source db ;) for relational database support. Going from any of those systems to MySQL is like switching from a driving modern automobile to one from the Flintstones presently. Okay it’s not that bad, but it can be downright abysmal in some respects. Declarative referential integrity index requirements is one of the areas that’s truly frustrating.

    Lets say you have a clue and design your OLTP database using at least 3rd normal form and integrity constraints. You realize that ISAM was perhaps a reasonable choice in 1965 for a production database involving inserts, updates and deletes but not so much in 2010. MySQL surprisingly supports declarative integrity constraints. But wait being MySQL there must be something about it that’s bizarre and going to bite you somehow. Well it also requires that an index be created on each referencing column. At first this may seem to be tolerable, especially for cases where you’re requiring that deletes / updates cascade from the parent to child tables – naturally an index on the referencing columns in the child tables would help there – well maybe. Often though, the database is designed such that this is not the case. Updates and deletes of primary key values are instead restricted – that is not permitted. So the indexes on the referencing columns are quite useless. Actually they’re more than useless they’re detrimental. Not only do these indexes slow down insert operations, the MySQL optimizer sometimes appears to become rather overwhelmed by the number of indexes and tables to choose from for moderately complex queries and selects an extraordinarily inefficient plan resulting in horrendous performance by using one of these foreign key index with low cardinality and somewhat even distribution first instead of a primary or unique key even if such a key exists and would result in 1 row matching. So to circumvent that you need to help it out with optimizer hints like straight_join to have it look at tables in the right order and come up with a sane choice (this is using version 5.0.45 btw). This great fun when you’re using an ORM like SQL Alchemy and want to avoid writing sql directly in application code or introducing things that tie you to a particular dbms.

    So as this thread started out by alluding to lots of single column indexes generally aren’t a good idea, well MySQL forces this upon you if you actually implement a proper database design and use declarative syntax to do so.

    Here’s an example table

    create table orders(
    order_id int unsigned not null auto_increment primary key,
    item_type_id tinyint unsigned not null,
    locale_id smallint unsigned not null,
    currency_id tinyint unsigned not null,
    operator_id smallint unsigned not null,
    order_amount decimal(10, 4) not null,
    order_date datetime not null,
    last_updated timestamp not null default current_timestamp on update current_timestamp,
    foreign key(item_type_id) references item_types(item_type_id),
    foreign key(locale_id) references items(locale_id),
    foreign key(currency_id) references items(currency_id),
    foreign key(operator_id) references items(operator_id)
    );

    create index order_date_idx on orders(order_date);

    Okay so we know that we’ll never delete or change item_type ids, locale ids, currency ids or operator ids. And even if we did it would be a big deal – requiring system down time etc. New ones may be added but that’s it. Lets say there are about 10 million rows in this orders table. Of the referenced columns there are 10 types of items, 5 locales they could be sold in presently, 3 currencies and 20 operators.

    MySQL will create the following indexes:
    unique index on order_id
    index on item_type_id
    index on locale_id
    index on currency_id
    index on operator_id

    You have no control over this at all – aside from not using declarative integrity constraints. You could resort to the 1980’s early 90’s way of writing triggers to implement referential integrity but I hear triggers are also pretty slow themselves in MySQL too and writing a few hundred of those isn’t going to be to enjoyable at all.

    Additionally I need an index on order_date to help with finding orders for a given date.

    8 columns 6 indexes on single columns. This is silly. You can imagine a table with more columns and FKs and more indexes. Sure you can vertically partition the data but you’ll end up with the same indexes regardless.

    The foreign key indexes are all low cardinality with roughly even distributions. Nearly always a poor choice to use for any query. This is why most if not all widely used databases do not require indexes on FKs. The database designer should be able to choose what to do. Having them by default might be ok but they should not be required.

  26. David Drouin says:

    realize I pressed submit too soon and had some typos on the FKs definitions:

    create table orders(
    order_id int unsigned not null auto_increment primary key,
    item_type_id tinyint unsigned not null,
    locale_id smallint unsigned not null,
    currency_id tinyint unsigned not null,
    operator_id smallint unsigned not null,
    order_amount decimal(10, 4) not null,
    order_date datetime not null,
    last_updated timestamp not null default current_timestamp on update current_timestamp,
    foreign key(item_type_id) references item_types(item_type_id),
    foreign key(locale_id) references locales(locale_id),
    foreign key(currency_id) references currencies(currency_id),
    foreign key(operator_id) references operators(operator_id)
    );

  27. Will says:

    Thanks for the great blog. How to index a product table with columns like price, unitprice, weight, color, created_on … and sql were dynamically generated from the web to lookup products. for example:

    select * from products where price > 10 and price 5 and weight <70 and color = 1 … order by created_on desc limit 10;

    Note that all the conditions are dynamically generated and some of them may or may not be there, e.g., some people filter by price, some by color, some by unitprice, … numerous possible combinations. How would you index this type of tables? Appreciate it!

Speak Your Mind

*