Pattern Matching Queries vs. Full-Text Indexes

Pattern Matching Queries vs. Full-Text IndexesIn this blog post, we’ll compare the performance of pattern matching queries vs. full-text indexes.

In my previous blog post, I looked for a solution on how we can search only a part of the email address and how can we make faster queries where the condition is email LIKE '%n.pierre%'. I showed two possible ways that could work. Of course, they had some pros and cons as well but were more efficient and faster than a like '%n.prierre%'.

But you could also ask why I would bother with this? Let’s add a FULLTEXT index, and everybody is happy! Are you sure about that? I’m not. Let’s investigate and test a bit. (We have some nice blog posts that explain how FULLTEXT indexes work: Post 1, Post 2, Post 3.)

Let’s see if it works in our case where we were looking for email addresses. Here is the table:

Add the default full-text index:

It took only five seconds for 320K email addresses.

Let’s run a search:

Immediately, we have issues with the results. It returns 43 rows, but there are only 11 rows with string n.pierre. Why? It is because of . The manual says:

The built-in FULLTEXT parser determines where words start and end by looking for certain delimiter characters; for example,   (space), , (comma), and . (period).

The parser believes that a . starts a new word, so it is going to search for pierre instead of n.pierre. That’s not good news as many email addresses contain ..  What can we do? The manual says:

It is possible to write a plugin that replaces the built-in full-text parser. For details, see Section 28.2, “The MySQL Plugin API”. For example parser plugin source code, see the plugin/fulltext directory of a MySQL source distribution.

If you are willing to write your own plugin in C/C++, you can try that route. Until then, it is going to give us back a lot of irrelevant matches.

We can order the results by relevancy:

This does not guarantee we get back the lines that we are looking for, however. I tried to change innodb_ft_min_token_size as well, but it did not affect the results.

Let’s see what happens when I search for williamson pierre. Two separate words. I know there is only one email address with these names.

The first result is that we still got another 49 addresses. How can the application decide which email address is relevant and which is not? I am still not happy.

Are there any other options without writing our own plugin?

Can I somehow tell the parser to use n.pierre as one word? The manual says:

A phrase that is enclosed within double quote (") characters matches only rows that contain the phrase literally, as it was typed. The full-text engine splits the phrase into words and performs a search in the FULLTEXT index for the words. Nonword characters need not be matched exactly: Phrase searching requires only that matches contain exactly the same words as the phrase and in the same order. For example, "test phrase" matches "test, phrase".

I can use double quotes, but it will still split at . and the results are the same. I did not find a solution except writing your own plugin. If someone knows a solution, please write a comment.

With Parser Ngram

The built-in MySQL full-text parser uses delimiters between words, but we can create an Ngram-based full-text index.