September 1, 2014

MySQL Indexing Best Practices: Webinar Questions Followup

I had a lot of questions on my MySQL Indexing: Best Practices Webinar (both recording and slides are available now) We had lots of questions. I did not have time to answer some and others are better answered in writing anyway.

Q: One developer on our team wants to replace longish (25-30) indexed varchars with an additional bigint column containing the crc64, and change the indexing to be on that column. That would clearly save indexing space. Is this a reasonable performance optimization. (Keep in mind that the prefix adaptive hashing would fail here, because the first 10 or so characters usually are the same). Of course UNIQUE index optimizations can no longer be applied either.

A: This is good optimization in many cases. When you apply it though remember hash can have collisions so you will need to have your queries do something like SELECT * FROM TBL WHERE hash=crc32(‘string’) AND string=’string’ The other thing you need to consider is string comparison in MySQL is case insensitive by default while hash comparison will be done case sensitive unless you lowercase string before hashing. I also would note 25-30 bytes length is rather short for such hack as BIGINT itself is 8 bytes and the difference in the index length with all overhead is not going to be huge. I think this technique is best when you’re working with 100 bytes+ strings. Is say bytes as it is string length at which it makes sense is collation specific.

Q: ORDER By optimization issues: select * from table where A=xxx and B between 100 and 200 order by B Very common for a date range to also need to be ordered. The question is how can one have optimized indexes and sorting in such a scenario, since inequality ends index usage.

Actually in this case index on (A,B) would work well. If however you would need to sort by some 3rd column, say C index (A,B,C) would not work as range will prevent sorting from using the index. In this case you can use trick mentioned in the presentation to convert sort to the union for small ranges.

Q: In the case of a junction table, would indexing on (foreignkey1,foreignkey2) AND on (foreignkey2,foreignkey1) be a good idea?

A: Yes. This is a good practice. Normally I’d do something like CREATE TABLE LINK (id1 int unsigned not null ,id2 int unsigned not null, PRIMARY KEY(id1,id2), KEY K(id2)) engine=INNODB; when table has to be traversed in both directions for different queries. This will use fast primary key for some queries and use key K as covering index for lookup in other direction.
For Innodb Table id2 is not needed pas part of second key as PRIMARY key is appended to it internally anyway. For MyISAM table you should use K(id2,id1) in the same case. Some people would prefer to define second key as UNIQUE this has benefits and drawbacks. Benefit being you can get extra optimizations by optimizer knowing index is UNIQUE the drawback is insert buffer will not be able to be used, which can be important for large, heavily written tables.

Q: in trick #1 will “WHERE a IN (2-4)” be worse then “WHERE a IN (2,3,4)”? Another word is range for IN clause better than BETWEEN?

A: IN(2-4) will not do what you’re implying here. 2-4 will be evaluated as math expression and the result will be IN(-2) which is not what you’re looking for.

Q: I have a primary index on an int (ID) and other indexes on columns idx1(X,A,B,C) idx2(Y,A,B,C) etc (there are 5) would I be better off making the primary A,B,C,ID and Having other indexes on one column, idx1(X) idx2(Y) etc?

A: I would wonder whenever it is best setup for 5 indexes to differ only by first column. Regarding changing primary key to include such column prefix it depends on what you’re looking for a lot. This will cause data clustering done by these columns which can be helpful if you’re doing a lot of range scans on what would be primary key but it also can slow down your inserts and make primary key significantly fragmented. I also would note there are some MySQL optimizer restrictions in how well it can deal with primary key appended to the index column, especially in such case as you’re suggesting. In the end I would seek for a lot of performance gains before I move to such unusual setup.

Q: Table1 has a primary key. Table2 joins to table1 using Table1′s primary key. Should table2 have an index on the field that is used to join the two tables?

A: The question in this case is how MySQL will execute the join. If it will fist lookup Table2 using some other index and when go to Table1 to lookup row by primary key when you do not need an index on a field which is used to join tables in table2

Q: In regards to extending an index being better than adding a new one: Let’s say I have a table named PO that has a primary key of PO # and 2 additional fields for vendor_id and order_id. If I have an index on vendor_id, order_id but my query is only selecting on vendor, will the index have any impact on the speed of the query?

A: If you extend the index from (vendor_id) to (vendor_id,order_id) you will make it 4 bytes longer (assuming order_id is int) which will impact your queries which only use vendor_id but unlikely significantly. It is likely to be a lot less expensive than having another index on (vendor_id,order_id) in addition to index on (vendor_id) alone. The cases when you really should worry about performance impact of extending index is when you increase its length dramatically, for example adding long varchar column. In such cases it indeed might be better to add another index.

Q: We have a database that has about 400GB of indexes. The indexes can’t fit in memory anymore. How does this affect performance?

A: Typically you do not need all your indexes to be in memory only those portions of them which are accessed frequently. The size of this “working set” can greatly depend on application and can range from 5% of total size or less to almost 100%. When your go from working set what fits in memory to the one which does not any more performance can degrade 10 or more times.

Q: In which cases should auto-increment be used as primary key?

A: Auto-increment is a good default primary key. You should pick something else if you have a good reason to do it – if you would benefit from data clustering in the different way or if you have some other natural candidate for primary key which gets a lot of lookups is frequent reason to use something else as primary key.

Q: How many indexes is too many?

A: There is a hard limit on amount of indexes you can have, which is 64 per table in recent MySQL versions. However it is often too many. Instead of thinking about hard limit of indexes I prefer to add indexes only in cases which provides positive impact to performance. At some point the gains from indexes you add will be less than performance loss because of having too many indexes.

Q: Is there a difference between “`id` = 5″ and “`id` IN (5)” regarding indexes and performance?

A: Recent MySQL versions are smart enough to convert id IN (5) to ID=5 (for single item in-lists). There were the times when when it would make a difference though.

Q: Would you recommend creating an index in every table you create? Example: CREATE TABLE user_competition_entry user_id (INT), competition_id(INT);
The table is only used to record a user_id and competition_id, nothing more. Would doing a SELECT competition_id, COUNT(user_id) AS user_count FROM user_competition_entry GROUP BY competition_id; be slower without an index?

A: I would define (competition_id,user_id) as a PRIMARY KEY for such table. It also will help the query you’re mentioning allowing group by to be performed without temporary table or external sort.

Q: How can we manage indexes on servers from DBA point of view ? Is there any management required or server does everything itself. Especially when using a CMS where DB structure is prepdefined

A: MySQL Server will not automatically define any indexes for you. Hopefully your CMS already comes with reasonable set of indexes, if not you will need to add indexes manually.

Q: What are some methods to overcome vastly differing cardinality on a primary key. After running an analyze on a table with 11M rows I’ve seen cardinality range from 19 to over 19,000?

Cardinality is property of data so you usually would deal with it not overcome it. The best thing to start with is looking at the queries around “outliers” – the keys which have a lot of values to see if you can make them work well. You might need to redesign schema and queries to make it work well.

Q: how does a index be used having an index on one column and using order by on another column. do i need to add the index on column using order by clause.

A: If index is used for ORDER BY the same index must be used for selection for the same table, not other index, also you only can have equality comparison as a search condition WHERE A=5 ORDER BY B will use index (A,B) for sorting optimization, for more complicated conditions you will need to use something like Trick “Unionizing Order by” described in presentation.

Q: what is the impact on indexing to use wider UUID such as VARCHAR(36) instead of auto-increment

A: If you’re using UUID it is at least good to convert it in binary form and store as VARBINARY(16) for performance reasons. In any case you would likely to get table which is larger than if you would use auto increment. Having said that there are many people using UUID rather successful in applications which do not need to be optimized for peak performance or the cases when this does not become the bottleneck. Also check out my old article on the topic which goes into a lot more details.

Q: how mysql use index for group by?

A: If you have Index on the column MySQL can avoid temporary table or filesort for group by by this column. This works because by scanning data in index order MySQL gets data in already sorted order and looks at “one group at the time”, computing aggregate functions as needed.

Q: Is there any special concerns or tricks for selecting using some date ranges? or between dates? or after a date?

A: Date comparisons work very similar to other comparison and same tricks may apply, for example you may benefit to convert BETWEEN into IN-ranges in some cases for better index usage.

Q: Is the b+ tree innodb index a single or double linked list at the leaf nodes? your slide showed single but the fact that you can use and index for “order by desc” indicates a double linked list.

Innodb has double linked list – each leaf page contains pointers to both previous and next pages in index order. Note however it is not really requirement for ORDER BY DESC optimization – you can still traverse BTREE in any direction, even if there are no page leaf pointers, it is just what it becomes relatively more expensive.

Everyone, Thank you for attending and your questions!

Check out more MySQL Webinars from Percona!

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. Jose Luis Martinez says:

    Hi,

    In the case of junction tables I always have doubts with the optimal reverse index for InnoDB:
    – should I only index (id2) because Innodb will already store id1 in the index (after all, it is part of the PK, so InnoDB will include it)
    – should I index (id2, id1) with UNIQUE to help out with the execution plans

    What do you think is the best way to index these junction tables?

  2. Nicolas Chiappero says:

    I guess question :

    Q: in trick #1 will “WHERE a IN (2-4)” be worse then “WHERE a IN (2,3,4)”? Another word is range for IN clause better than BETWEEN?

    should be read:

    Q: in trick #1 will “WHERE a BETWEEN 2 AND 4” be worse then “WHERE a IN (2,3,4)”? Another word is range for IN clause better than BETWEEN?

  3. Jose,

    I think you’re most likely getting close to splitting hair in these cases. With primary key (id1,id2) indeed index on (id2) should be enough in many cases as it is storing primary key already. Good catch. For MyISAM you would like to have setup with complete reverse. UNIQUE will allow you to avoid one read_next call but will prevent insert buffer from working.

  4. Nicolas,

    The questions have been typed in, so I replied to the question as it read. I have mentioned in presentations there are cases when IN clause can work a lot faster compared to BETWEEN range. Reverse though might also be true for very large ranges.

Speak Your Mind

*