November 27, 2014

Do you always need an index on a WHERE column?

I believe we wrote about this before, but this topic pops up again and again.
Today I read an opinion that if we have a clause WHERE has_something=1, we should have an index on the column has_something (the column has only two values, 0 and 1).

In reality, the right answer is not so simple.

Let’s look at the following table with 20,000,000 records.
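The original table definition did not survive extraction; here is a minimal sketch consistent with the post and the comments below (the table name and the id column are assumptions; has_something with values 0/1 and name declared NOT NULL are from the text):

```sql
-- Hypothetical reconstruction of the test table; only the
-- has_something column (values 0/1) and the NOT NULL name
-- column are given in the text, the rest is assumed.
CREATE TABLE test_table (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  name VARCHAR(32) NOT NULL,
  has_something TINYINT NOT NULL,  -- only values 0 and 1
  PRIMARY KEY (id),
  KEY idx_has_something (has_something)
) ENGINE=InnoDB;
```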

In the first case, has_something=0 for 90% of rows (with random distribution).

Let’s check the execution time with and without the index.
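The original timing runs were lost in formatting; the comparison can be reproduced with index hints along these lines (table and index names are assumptions from the sketch above; actual timings will vary):

```sql
-- Force a full table scan by ignoring the index:
SELECT count(name) FROM test_table
  IGNORE INDEX (idx_has_something)
  WHERE has_something = 0;

-- Force the index to be used:
SELECT count(name) FROM test_table
  FORCE INDEX (idx_has_something)
  WHERE has_something = 0;
```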

As you can see, with the index the query is 3.5 times slower.

Fortunately, MySQL in this case chooses not to use the index.

Let’s look at the case when has_something = 0 for 50% of rows.

The query with the index is still 2 times slower, and this time MySQL does use the index in the execution plan:
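The EXPLAIN output is missing here; the plan can be checked with something like the following (the query shape follows the sketch above; the annotated output values are illustrative, not the original screenshot):

```sql
EXPLAIN SELECT count(name) FROM test_table WHERE has_something = 0\G
-- key: idx_has_something (with type: ref) means the index is used;
-- key: NULL with type: ALL means a full table scan.
```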

What about 30% of rows with has_something=0?

The query without the index is still faster.

And finally, the case with 20% of rows with has_something=0:

So only in the last case do we really need the index on the column has_something.

About Vadim Tkachenko

Vadim leads Percona's development group, which produces Percona Cloud Tools, Percona Server, Percona XtraDB Cluster and Percona XtraBackup. He is an expert in solid-state storage, and has helped many hardware and software providers succeed in the MySQL market.

Comments

  1. Christoph says:

    Very surprising for me! Can you explain the result?

  2. Vadim says:

    Christoph,

    The reason the index scan is slower than the table scan is that for an index scan MySQL needs to perform more operations:
    read the index, get a pointer to the row, get the data from the row.
    That adds overhead to the operation.

    In the case of a table scan, MySQL just goes through the table and reads all rows sequentially.

    When the count of scanned rows is more than 20% (in our example), the overhead of the index scan makes it ineffective.
    In real queries the exact percentage depends on many factors, but for random, uniformly distributed values I would say it is in the 20-30% range.

  3. peter says:

    What do you find surprising?

    A full table scan takes constant time no matter the cardinality; index traversal speed depends on the index cardinality.

    Note that in this case it is not a pure index scan (“Using index” in EXPLAIN) but access to a large portion of the rows in the table in non-sequential order.

    The 3.5x difference we see here matches the given case – the data fits in memory, so we’re mainly measuring CPU overhead. If we were hitting random disk I/O, the performance difference could be as large as 100 or even 1000 times!

    There is another interesting issue – note that name is declared NOT NULL, so count(name) should be the same as count(*), and a covering index could therefore be used, but MySQL is not smart enough for this.

  4. pcasey says:

    Does anyone know if there are plans to include index statistics in the optimizer decisions in the future? Seems like this is a clear case where an optimizer with index cardinality stats should have recognized that the index wasn’t helpful and done a full table scan instead.

  5. Thanks for covering this, Peter. This is one area where many people get confused and stay confused.

    Indexes are of no use if selectivity is low (which happens when there are only a few unique values for the column). Anyone confused about this should read more on selectivity.

    Frank

  6. Christoph says:

    Thanks a lot for the detailed answers! Now it’s much clearer for me.

  7. There is a snippet on MySQL Forge to find these useless/harmful indexes: http://forge.mysql.com/snippets/view.php?id=85

  8. peter says:

    Thanks Mark,

    Even though this is a post by Vadim, not me :)

    The other important thing to remember about selectivity – it is actually the selectivity of the value that matters. It is quite possible to have an indexed column with only 2 values, where one of them appears in 99% of rows and the other in 1%, and have the index used effectively to retrieve the data for that 1%.

    Too bad MySQL does not have partial indexes, though (meaning only that 1% could actually be indexed).
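    Peter’s point about value selectivity can be sketched like this (table and column names continue the earlier assumed schema): with 99% of rows at 0 and 1% at 1, the same index is useless for one value and effective for the other:

    ```sql
    -- Matches ~99% of rows: a full table scan is cheaper,
    -- and the optimizer should skip the index.
    SELECT * FROM test_table WHERE has_something = 0;

    -- Matches only ~1% of rows: here the index pays off.
    SELECT * FROM test_table WHERE has_something = 1;
    ```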

  9. Johhny says:

    Very interesting website. Keep up the outstanding work and thank you…

  10. Son Nguyen says:

    What are the possible alternatives to this particular example? When skipping index is faster but not the best? Creating a summary table? Or something else? I’d like to learn from the experts.

  11. Sergio says:

    That’s why my queries took some time (with & without an index)…

    I have a tinyint column that takes only 1 and 0 values.
    There are about 164,000+ rows. Almost all have the value 0; only some of the rows have 1 (in general < 10).

    To be more specific, the column name is “file_hidden”. If the value is 1, then that row is not shown to the public:

    SELECT * FROM files WHERE file_hidden = 0

    I guess the only solution in this situation is to use two tables with the same structure – one for the “1” (hidden from public) rows, one for the “0” (visible to public) rows.

  12. RNT says:

    Thanks for sharing.

    R&T Solutions

  13. Nognir says:

    Couldn’t replicate your results on my machine (MySQL 5.1.61). I have a modest DB with about 1M records, with a table in which one column indicates whether the row is_deleted or not. Usually the is_deleted value is 0, and I index it. count(*) with IGNORE INDEX takes about 0.16 s. With the index, searching for 0 takes about 0.13 s and for 1 takes about 0.0 s. If I increase the number of 1s to about 50%, the time taken to count 1s and 0s is approximately 0.6-0.7 s. So indexing is actually faster in all cases.
    Maybe it is the size of the DB.
