As part of our support services, we do a lot of query optimization. This is where most performance gains come from. Here’s an example of the work we do.
A few days ago a customer came to us with the following table:
CREATE TABLE `infamous_table` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `member_id` int(11) NOT NULL DEFAULT '0',
  `email` varchar(200) NOT NULL DEFAULT '',
  `msg_type` varchar(255) NOT NULL DEFAULT '',
  `t2send` int(11) NOT NULL DEFAULT '0',
  `flag` char(1) NOT NULL DEFAULT '',
  `sent` varchar(100) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  KEY `f` (`flag`),
  KEY `email` (`email`),
  KEY `msg_type` (`msg_type`(5)),
  KEY `t_msg` (`t2send`,`msg_type`(5))
) ENGINE=InnoDB DEFAULT CHARSET=latin1
And a query that looked like this:
SELECT COUNT(*)
FROM `infamous_table`
WHERE `t2send` > 1234
AND `msg_type` LIKE 'prefix%';
The table had an index, t_msg, that wasn't helping at all: the EXPLAIN for our 1,000,000-row test table looked like this:
           id: 1
  select_type: SIMPLE
        table: infamous_table
         type: range
possible_keys: t_msg
          key: t_msg
      key_len: 4
          ref: NULL
         rows: 107478
        Extra: Using where
You can see the index used is the one that was expected: “t_msg”. But the key_len is 4. This indicates that the INT part was used while the msg_type(5) part was ignored, which resulted in examining 100k+ rows. If you have MySQL 5.6, you can see it more clearly with EXPLAIN FORMAT=JSON under used_key_parts:
EXPLAIN: {
  "query_block": {
    "select_id": 1,
    "table": {
      "table_name": "infamous_table",
      "access_type": "range",
      "possible_keys": [
        "t_msg"
      ],
      "key": "t_msg",
      "used_key_parts": [
        "t2send"
      ],
      "key_length": "4",
      "rows": 107478,
      "filtered": 100,
      "index_condition": "(`test`.`infamous_table`.`t2send` > 1234)",
      "attached_condition": "(`test`.`infamous_table`.`msg_type` like 'prefix%')"
    }
  }
}
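For reference, JSON output like the above comes from running the same query through EXPLAIN FORMAT=JSON in the mysql client; the \G terminator simply gives the vertical layout shown here:

EXPLAIN FORMAT=JSON
SELECT COUNT(*)
FROM `infamous_table`
WHERE `t2send` > 1234
AND `msg_type` LIKE 'prefix%'\G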
The customer had multi-valued strings like “PREFIX:INT:OTHER-STRING” stored in the msg_type column, and that made it impossible to convert it to an ENUM or similar field type that would have allowed changing the LIKE to an equality.
So the solution was rather simple: just as with point and range conditions over numeric values, you must define the index with the ranged field as the rightmost part. This means the correct index would have looked like msg_type(5),t2send.
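The post doesn't show the exact DDL, but adding the new index (named better_multicolumn_index in the EXPLAIN output below) would look roughly like this:

ALTER TABLE `infamous_table`
  ADD KEY `better_multicolumn_index` (`msg_type`(5), `t2send`);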
The EXPLAIN for the new index provided the customer with some happiness:
           id: 1
  select_type: SIMPLE
        table: infamous_table
         type: range
possible_keys: t_msg,better_multicolumn_index
          key: better_multicolumn_index
      key_len: 11
          ref: NULL
         rows: 4716
        Extra: Using where
You can see the key_len is now what we would have expected: four bytes for the INT and another seven bytes for the VARCHAR (five for our chosen prefix plus two for the prefix length). More importantly, the rows count decreased by a factor of roughly 22.
To avoid downtime, we used pt-online-schema-change on the customer's environment to apply the ALTER. This made it an easy and painless solution, and the query effectively executed in under 1/20 of the original time!
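An invocation along these lines would apply that change online. This is only a sketch: the exact DSN, credentials, and safety options depend on the environment, and the test schema name is taken from the EXPLAIN output above:

pt-online-schema-change \
  --alter "ADD KEY better_multicolumn_index (msg_type(5), t2send)" \
  D=test,t=infamous_table --execute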
So, all fine and dandy? Well, almost. We did a further test, and the query looked like this:
SELECT COUNT(*)
FROM `infamous_table`
WHERE `t2send` > 1234
AND `msg_type` LIKE 'abc%';
So where's the difference? The length of the string used in the LIKE condition is shorter than the prefix length we chose for the VARCHAR part of the index (the customer intended to look up strings of only three characters, so we needed to check this). This query also scanned 100k rows, and EXPLAIN showed the key_len was 4, meaning the VARCHAR part was being ignored once again.
This means the index prefix needed to be shorter. We ALTERed the table and made the prefix four characters long, counting on the fact that the multi-valued strings used “:” to separate the values, and we suggested the customer include the colon in the look-up string for the shortest strings. In this case, 'abc%' becomes 'abc:%' (which is also four characters long).
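A sketch of that change and the adjusted look-up; the exact statements aren't shown in the post, and keeping the same index name is our assumption:

ALTER TABLE `infamous_table`
  DROP KEY `better_multicolumn_index`,
  ADD KEY `better_multicolumn_index` (`msg_type`(4), `t2send`);

SELECT COUNT(*)
FROM `infamous_table`
WHERE `t2send` > 1234
AND `msg_type` LIKE 'abc:%';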
As a final optimization, we suggested dropping old indexes that were covered by the new better_multicolumn_index and that were most likely created by the customer while testing optimizations.
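The post doesn't list which indexes those were; purely as an illustration, the single-column prefix index on msg_type is the kind of index that becomes redundant once the multi-column index leads with the same column:

ALTER TABLE `infamous_table` DROP KEY `msg_type`;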
Conclusion
Just like in point-and-range queries, the right order for a multi-column index is to put the ranged part last. Equally important, the string prefix in the index must not be longer than the shortest string you intend to look up. Just remember that you can't make the prefix too short, either, or you'll lose selectivity and the query will end up scanning rows unnecessarily.