Pattern Matching Queries vs. Full-Text Indexes

Pattern Matching Queries vs. Full-Text Indexes

PREVIOUS POST
NEXT POST

Pattern Matching Queries vs. Full-Text IndexesIn this blog post, we’ll compare the performance of pattern matching queries vs. full-text indexes.

In my previous blog post, I looked for a solution on how we can search only a part of the email address and how can we make faster queries where the condition is email LIKE '%n.pierre%'. I showed two possible ways that could work. Of course, they had some pros and cons as well but were more efficient and faster than a like '%n.prierre%'.

But you could also ask why I would bother with this? Let’s add a FULLTEXT index, and everybody is happy! Are you sure about that? I’m not. Let’s investigate and test a bit. (We have some nice blog posts that explain how FULLTEXT indexes work: Post 1, Post 2, Post 3.)

Let’s see if it works in our case where we were looking for email addresses. Here is the table:

Add the default full-text index:

It took only five seconds for 320K email addresses.

Let’s run a search:

Immediately, we have issues with the results. It returns 43 rows, but there are only 11 rows with string n.pierre. Why? It is because of . The manual says:

The built-in FULLTEXT parser determines where words start and end by looking for certain delimiter characters; for example,   (space), , (comma), and . (period).

The parser believes that a . starts a new word, so it is going to search for pierre instead of n.pierre. That’s not good news as many email addresses contain ..  What can we do? The manual says:

It is possible to write a plugin that replaces the built-in full-text parser. For details, see Section 28.2, “The MySQL Plugin API”. For example parser plugin source code, see the plugin/fulltext directory of a MySQL source distribution.

If you are willing to write your own plugin in C/C++, you can try that route. Until then, it is going to give us back a lot of irrelevant matches.

We can order the results by relevancy:

This does not guarantee we get back the lines that we are looking for, however. I tried to change innodb_ft_min_token_size as well, but it did not affect the results.

Let’s see what happens when I search for williamson pierre. Two separate words. I know there is only one email address with these names.

The first result is that we still got another 49 addresses. How can the application decide which email address is relevant and which is not? I am still not happy.

Are there any other options without writing our own plugin?

Can I somehow tell the parser to use n.pierre as one word? The manual says:

A phrase that is enclosed within double quote (") characters matches only rows that contain the phrase literally, as it was typed. The full-text engine splits the phrase into words and performs a search in the FULLTEXT index for the words. Nonword characters need not be matched exactly: Phrase searching requires only that matches contain exactly the same words as the phrase and in the same order. For example, "test phrase" matches "test, phrase".

I can use double quotes, but it will still split at . and the results are the same. I did not find a solution except writing your own plugin. If someone knows a solution, please write a comment.

With Parser Ngram

The built-in MySQL full-text parser uses delimiters between words, but we can create an Ngram-based full-text index.

Before that, I changed the ngram_token_size to 3.

Finally, we are getting somewhere. But it gives back 4697 rows. How can the application decide which results are relevant? Should we just use the score?

Subselect?

I dropped the Ngram FULLTEXT index and created a normal one because that gives me back only 43 results instead of 4697. I thought a full-text search might be good to narrow down the results from a million to a few thousand, and then we can run a select based on that. Example:

Wow, this can work and it looks quite fast as well. BUT (there is always a but), if I run the following query (searching for ierre):

It gives back nothing because the default full-text parser uses only full words! In our case, that is not very helpful. Let’s switch back to Ngram and re-run the query:

It gives us back 65 rows, and it takes between 0.02-0.05s because the subquery results in many rows.

With my “shorting method”:

It reads and gives back exactly 65 rows and it takes 0.000s.

Conclusion

When it comes to pattern matching queries vs. full-text indexes, it looks like full-text index can be helpful, and it is built in. Unfortunately, we do not have many metrics regarding full-text indexes. We cannot see how many rows were read, etc. I don’t want to make any conclusions on which one is faster. I still have to run some tests with our favorite benchmark tool sysbench on a much bigger dataset.

I should mention that full-text indexes and my previous solutions won’t solve all the problems. In this and my other blog I was trying to find an answer to a specific problem, but there are cases where my solutions would not work that well.

PREVIOUS POST
NEXT POST

Share this post

Comments (4)

  • Jannes Reply

    This for me to looking into fulltext indexes again, thanks. Unfortunately still not a fit for my use case. Anyway, I’m probably missing something, but why the sub query at all?

    SELECT id,email FROM email
    where MATCH(email) AGAINST (‘ierre’ IN NATURAL LANGUAGE MODE))
    AND email like ‘%ierre%’;

    seems to give the same result, but faster.

    April 7, 2018 at 6:00 pm
    • Tibor Korocz Reply

      I have already wrote this post before your comments on the other one. I might going to write a third one with all the new inputs and ideas. 🙂

      Thanks.

      April 9, 2018 at 6:31 am
  • Jouni Järvinen Reply

    What the fuck were the devs thinking when they designed the §FULLTEXT§ parser ?! This is retarded, no less !

    April 13, 2018 at 1:35 pm
  • vinod Reply

    how to configure ngram_token_size

    October 6, 2018 at 4:13 am

Leave a Reply