In this blog post, we’ll compare the performance of pattern matching queries vs. full-text indexes.
In my previous blog post, I looked for a solution on how we can search only a part of the email address and how can we make faster queries where the condition is email LIKE '%n.pierre%'. I showed two possible ways that could work. Of course, they had some pros and cons as well but were more efficient and faster than a like '%n.prierre%'.
But you could also ask why I would bother with this? Let’s add a FULLTEXT index, and everybody is happy! Are you sure about that? I’m not. Let’s investigate and test a bit. (We have some nice blog posts that explain how FULLTEXT indexes work: Post 1, Post 2, Post 3.)
Let’s see if it works in our case where we were looking for email addresses. Here is the table:
|
1 |
CREATE TABLE `email` (<br> `id` int(10) unsigned NOT NULL AUTO_INCREMENT,<br> `email` varchar(120) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',<br> PRIMARY KEY (`id`),<br> KEY `idx_email` (`email`)<br>) ENGINE=InnoDB AUTO_INCREMENT=318465 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci |
Add the default full-text index:
|
1 |
ALTER TABLE email ADD FULLTEXT KEY (email); |
It took only five seconds for 320K email addresses.
Let’s run a search:
|
1 |
SELECT id, email FROM email where MATCH(email) AGAINST ('n.pierre' IN NATURAL LANGUAGE MODE);<br>+--------+--------------------------------+<br>| id | email |<br>+--------+--------------------------------+<br>| 2940 | pierre.west@example.org |<br>| 10775 | pierre.beier@example.org |<br>| 24267 | schroeder.pierre@example.org |<br>| 26285 | bode.pierre@example.org |<br>| 27104 | pierre.franecki@example.org |<br>| 31792 | pierre.jaskolski@example.com |<br>| 39369 | kuphal.pierre@example.org |<br>| 58625 | olson.pierre@example.org |<br>| 59526 | larkin.pierre@example.net |<br>| 64718 | boyle.pierre@example.com |<br>| 72033 | pierre.wolf@example.net |<br>| 90587 | anderson.pierre@example.org |<br>| 108806 | fadel.pierre@example.org |<br>| 113897 | jacobs.pierre@example.com |<br>| 118579 | hudson.pierre@example.com |<br>| 118798 | pierre.wuckert@example.org |<br>| 118937 | green.pierre@example.net |<br>| 125451 | hauck.pierre@example.net |<br>| 133352 | friesen.pierre@example.net |<br>| 134594 | windler.pierre@example.com |<br>| 135406 | dietrich.pierre@example.org |<br>| 190451 | daugherty.pierre@example.org |<br>... |
Immediately, we have issues with the results. It returns 43 rows, but there are only 11 rows with string n.pierre. Why? It is because of . The manual says:
The built-in
FULLTEXTparser determines where words start and end by looking for certain delimiter characters; for example,(space),,(comma), and.(period).
The parser believes that a . starts a new word, so it is going to search for pierre instead of n.pierre. That’s not good news as many email addresses contain .. What can we do? The manual says:
It is possible to write a plugin that replaces the built-in full-text parser. For details, see Section 28.2, “The MySQL Plugin API”. For example parser plugin source code, see the
plugin/fulltextdirectory of a MySQL source distribution.
If you are willing to write your own plugin in C/C++, you can try that route. Until then, it is going to give us back a lot of irrelevant matches.
We can order the results by relevancy:
|
1 |
SELECT id,email,MATCH(email) AGAINST ('n.pierre' IN NATURAL LANGUAGE MODE)<br> AS score FROM email where MATCH(email) AGAINST <br>('n.pierre' IN NATURAL LANGUAGE MODE) ORDER BY 3 desc limit 10;<br>+-------+------------------------------+-------------------+<br>| id | email | score |<br>+-------+------------------------------+-------------------+<br>| 2940 | pierre.west@example.org | 14.96491813659668 |<br>| 10775 | pierre.beier@example.org | 14.96491813659668 |<br>| 24267 | schroeder.pierre@example.org | 14.96491813659668 |<br>| 26285 | bode.pierre@example.org | 14.96491813659668 |<br>| 27104 | pierre.franecki@example.org | 14.96491813659668 |<br>| 31792 | pierre.jaskolski@example.com | 14.96491813659668 |<br>| 39369 | kuphal.pierre@example.org | 14.96491813659668 |<br>| 58625 | olson.pierre@example.org | 14.96491813659668 |<br>| 59526 | larkin.pierre@example.net | 14.96491813659668 |<br>| 64718 | boyle.pierre@example.com | 14.96491813659668 |<br>+-------+------------------------------+-------------------+ |
This does not guarantee we get back the lines that we are looking for, however. I tried to change innodb_ft_min_token_size as well, but it did not affect the results.
Let’s see what happens when I search for williamson pierre. Two separate words. I know there is only one email address with these names.
|
1 |
SELECT id,email,MATCH(email) AGAINST <br>('williamson.pierre' IN NATURAL LANGUAGE MODE) AS score <br>FROM email where MATCH(email) AGAINST <br>('williamson.pierre' IN NATURAL LANGUAGE MODE) ORDER BY 3 desc limit 50;<br>+--------+---------------------------------+-------------------+<br>| id | email | score |<br>+--------+---------------------------------+-------------------+<br>| 238396 | williamson.pierre@example.net | 24.08820343017578 |<br>| 2940 | pierre.west@example.org | 14.96491813659668 |<br>| 10775 | pierre.beier@example.org | 14.96491813659668 |<br>| 24267 | schroeder.pierre@example.org | 14.96491813659668 |<br>| 26285 | bode.pierre@example.org | 14.96491813659668 |<br>| 27104 | pierre.franecki@example.org | 14.96491813659668 |<br>| 31792 | pierre.jaskolski@example.com | 14.96491813659668 |<br>| 39369 | kuphal.pierre@example.org | 14.96491813659668 |<br>| 58625 | olson.pierre@example.org | 14.96491813659668 |<br>... |
The first result is that we still got another 49 addresses. How can the application decide which email address is relevant and which is not? I am still not happy.
Can I somehow tell the parser to use n.pierre as one word? The manual says:
A phrase that is enclosed within double quote (
") characters matches only rows that contain the phrase literally, as it was typed. The full-text engine splits the phrase into words and performs a search in theFULLTEXTindex for the words. Nonword characters need not be matched exactly: Phrase searching requires only that matches contain exactly the same words as the phrase and in the same order. For example,"test phrase"matches"test, phrase".
I can use double quotes, but it will still split at . and the results are the same. I did not find a solution except writing your own plugin. If someone knows a solution, please write a comment.
The built-in MySQL full-text parser uses delimiters between words, but we can create an Ngram-based full-text index.
|
1 |
mysql> alter table email ADD FULLTEXT KEY (email) WITH PARSER ngram;<br>Query OK, 0 rows affected (20.10 sec)<br>Records: 0 Duplicates: 0 Warnings: 0 |
Before that, I changed the ngram_token_size to 3.
|
1 |
mysql> SELECT id,email,MATCH(email) AGAINST ('n.pierre' IN NATURAL LANGUAGE MODE) AS score FROM email where MATCH(email) AGAINST ('n.pierre' IN NATURAL LANGUAGE MODE) ORDER BY 3 desc;<br>+--------+----------------------------------+--------------------+<br>| id | email | score |<br>+--------+----------------------------------+--------------------+<br>| 58625 | olson.pierre@example.org | 16.56794548034668 |<br>| 59526 | larkin.pierre@example.net | 16.56794548034668 |<br>| 90587 | anderson.pierre@example.org | 16.56794548034668 |<br>| 118579 | hudson.pierre@example.com | 16.56794548034668 |<br>| 118937 | green.pierre@example.net | 16.56794548034668 |<br>| 133352 | friesen.pierre@example.net | 16.56794548034668 |<br>| 200608 | wilkinson.pierre@example.org | 16.56794548034668 |<br>| 237928 | johnson.pierre@example.org | 16.56794548034668 |<br>| 238396 | williamson.pierre@example.net | 16.56794548034668 |<br>| 278384 | monahan.pierre@example.net | 16.56794548034668 |<br>| 306718 | rohan.pierre@example.com | 16.56794548034668 |<br>| 226737 | warren.pfeffer@example.net | 12.156486511230469 |<br>| 74278 | stiedemann.perry@example.net | 11.52701187133789 |<br>| 75234 | bogan.perry@example.org | 11.52701187133789 |<br>...<br>4697 rows in set (0.03 sec) |
Finally, we are getting somewhere. But it gives back 4697 rows. How can the application decide which results are relevant? Should we just use the score?
I dropped the Ngram FULLTEXT index and created a normal one because that gives me back only 43 results instead of 4697. I thought a full-text search might be good to narrow down the results from a million to a few thousand, and then we can run a select based on that. Example:
|
1 |
mysql> Select e2.id,e2.email from <br>(SELECT id,email FROM email where MATCH(email) <br>AGAINST ('n.pierre' IN NATURAL LANGUAGE MODE)) <br>as e2 where e2.email like '%n.pierre%';<br>+--------+-------------------------------+<br>| id | email |<br>+--------+-------------------------------+<br>| 58625 | olson.pierre@example.org |<br>| 59526 | larkin.pierre@example.net |<br>| 90587 | anderson.pierre@example.org |<br>| 118579 | hudson.pierre@example.com |<br>| 118937 | green.pierre@example.net |<br>| 133352 | friesen.pierre@example.net |<br>| 200608 | wilkinson.pierre@example.org |<br>| 237928 | johnson.pierre@example.org |<br>| 238396 | williamson.pierre@example.net |<br>| 278384 | monahan.pierre@example.net |<br>| 306718 | rohan.pierre@example.com |<br>+--------+-------------------------------+<br>11 rows in set (0.00 sec) |
Wow, this can work and it looks quite fast as well. BUT (there is always a but), if I run the following query (searching for ierre):
|
1 |
mysql> Select e2.id,e2.email from <br>(SELECT id,email FROM email where MATCH(email) <br>AGAINST ('ierre' IN NATURAL LANGUAGE MODE)) <br>as e2 where e2.email like '%ierre%';<br>Empty set (0.00 sec) |
It gives back nothing because the default full-text parser uses only full words! In our case, that is not very helpful. Let’s switch back to Ngram and re-run the query:
|
1 |
mysql> Select e2.id,e2.email from <br>(SELECT id,email FROM email where MATCH(email) <br>AGAINST ('ierre' IN NATURAL LANGUAGE MODE)) <br>as e2 where e2.email like '%ierre%';<br>+--------+--------------------------------+<br>| id | email |<br>+--------+--------------------------------+<br>| 2940 | pierre.west@example.org |<br>| 10775 | pierre.beier@example.org |<br>| 16958 | pierre68@example.com |<br>| 24267 | schroeder.pierre@example.org |<br>...<br>65 rows in set (0.05 sec)<br><br>+-------------------------+----------+<br>| Status | Duration |<br>+-------------------------+----------+<br>| starting | 0.000072 |<br>| checking permissions | 0.000006 |<br>| Opening tables | 0.000014 |<br>| init | 0.000027 |<br>| System lock | 0.000007 |<br>| optimizing | 0.000006 |<br>| statistics | 0.000013 |<br>| preparing | 0.000006 |<br>| FULLTEXT initialization | 0.006384 |<br>| executing | 0.000012 |<br>| Sending data | 0.020735 |<br>| end | 0.000014 |<br>| query end | 0.000014 |<br>| closing tables | 0.000013 |<br>| freeing items | 0.001383 |<br>| cleaning up | 0.000024 |<br>+-------------------------+----------+ |
It gives us back 65 rows, and it takes between 0.02-0.05s because the subquery results in many rows.
With my “shorting method”:
|
1 |
select e.email from email as e right join email_tib as t <br>on t.email_id=e.id where t.email_parts like "ierre%";<br>+--------------------------------+<br>| email |<br>+--------------------------------+<br>| anderson.pierre@example.org |<br>| bode.pierre@example.org |<br>| bode.pierre@example.org |<br>| boyle.pierre@example.com |<br>| bradtke.pierre@example.org |<br>| bradtke.pierre@example.org |<br>...<br>65 rows in set (0.00 sec)<br><br>mysql> show profile;<br>+----------------------+----------+<br>| Status | Duration |<br>+----------------------+----------+<br>| starting | 0.000069 |<br>| checking permissions | 0.000011 |<br>| checking permissions | 0.000003 |<br>| Opening tables | 0.000020 |<br>| init | 0.000021 |<br>| System lock | 0.000008 |<br>| optimizing | 0.000009 |<br>| statistics | 0.000070 |<br>| preparing | 0.000011 |<br>| executing | 0.000001 |<br>| Sending data | 0.000330 |<br>| end | 0.000002 |<br>| query end | 0.000007 |<br>| closing tables | 0.000005 |<br>| freeing items | 0.000014 |<br>| cleaning up | 0.000010 |<br>+----------------------+----------+ |
It reads and gives back exactly 65 rows and it takes 0.000s.
When it comes to pattern matching queries vs. full-text indexes, it looks like full-text index can be helpful, and it is built in. Unfortunately, we do not have many metrics regarding full-text indexes. We cannot see how many rows were read, etc. I don’t want to make any conclusions on which one is faster. I still have to run some tests with our favorite benchmark tool sysbench on a much bigger dataset.
I should mention that full-text indexes and my previous solutions won’t solve all the problems. In this and my other blog I was trying to find an answer to a specific problem, but there are cases where my solutions would not work that well.