InnoDB MySQL full text search: Part 2, the queries!

March 4, 2013
Author
Ernie Souhrada
Share this Post:
InnoDB Full-text Search in MySQL 5.6: Part 2, The Queries!
InnoDB Full-text Search in MySQL 5.6: Part 2, The Queries!

This is part 2 in a 3 part series on InnoDB MySQL full-text searches. In part 1, we took a quick look at some initial configuration of InnoDB full-text search and discovered a little bit of quirky behavior; here, we are going to run some queries and compare the result sets. Our hope is that the one of two things will happen; either the results returned from a MyISAM FTS query will be exactly identical to the same query when performed against InnoDB data, OR that the results returned by InnoDB FTS will somehow be “better” (as much as it’s actually possible to do this in a single blog post) than what MyISAM gives us.

Recall that we have two different sets of data, one which is the text of roughly 8000 SEO-stuffed webpage bodies (we call that one SEO) and the other, which we call DIR, that is roughly 800,000 directory records with name, address, and the like. We are using MySQL 5.5.30 and MySQL 5.6.10 with no configuration tuning other than to set innodb_ft_min_token_size to 4 (rather than the default of 3) so that it matches MyISAM’s default ft_min_word_length.

First, MyISAM, with MySQL 5.5, on the SEO data set:

The same query, run against InnoDB on 5.6.10:

Wow. I’m not sure if I should be concerned so much that the *scores* are different, but the *matches* are COMPLETELY DIFFERENT between 5.5/MyISAM and 5.6/InnoDB. Now, we know that MyISAM FTS does have the caveat with natural language searches whereby a word that’s present in 50% or more of the rows is treated as a stopword, so does that account for our problem? It might, because the word ‘arizona’ appears in over 6900 of the 7150 rows, and the word ‘records’ appears in 7082 of them. So let’s try something else that’s less likely to have that issue. The word “corporation” appears in 143 of the documents; the word “forms” appears in 439 of them, and the word “commission” appears in 130. There might be some overlap here, but even if there isn’t, 143+130+439 < 0.5 * 7150, so none of these should be treated as stopwords in MyISAM.

With 5.5:

With 5.6:

OK, now I’m starting to get a little worried. The docs do tell us that the default stopword list is substantially different between InnoDB and MyISAM, and as it turns out, there are only 36 stopwords in the default InnoDB list, but there are 543 stopwords in the default MyISAM list. What happens if we take the MyISAM stopwords, insert them into a table, and configure that table to be our stopword list for InnoDB?

This is the table that we’re trying to emulate:

The docs tell us that we need to create an *InnoDB* table with a single VARCHAR column named “value”. OK, sounds easy enough:

But, when we try to use this table, here’s what comes back:

And here’s what appeared in the server’s error log:

Uh… Does this mean that my next blog post should be entitled, “When is a VARCHAR Not Really a VARCHAR?” Thinking that maybe this was a case of GEN_CLUST_INDEX causing me issues, I tried adding a second column to the table which was an integer PK, and in another attempt, I tried just making the “value” column the PK, but neither of those worked. Also, trying to set innodb_ft_user_stopword_table produced the same error. I submitted a bug report (68450), and as you can see from the bug discussion, it turns out that this table is character-set-sensitive. If you’re going to use your own stopword table for InnoDB FTS, at least for the moment, this table must use the latin1 character set.

As far as I can tell, this little gotcha doesn’t appear to be mentioned anywhere in the MySQL 5.6 documentation; every place where it talks about creating one of these stopword tables, it simply mentions the table engine and the column name/type, so I’m not sure if this is an intentional restriction that just needs to be better documented or if it’s a limitation with the InnoDB FTS feature that will be removed in a later version.

Now that we’ve sorted this out, let’s drop and rebuild our FT index on the InnoDB table and try the above queries one more time. We already know what the MyISAM results are going to be; do our InnoDB results change? No, they are exactly the same, although the scores did change slightly.

What about a Boolean mode query? The docs tell us that if we use Boolean mode, and we put a “+” in front of our search term, then that term *must* appear in the search results. But does it?

With 5.5:

And with 5.6:

There’s only one row in the table that actually matches all three search terms, and in this case, both MyISAM and InnoDB FTS performed identically and found it. I’m not really concerned about the fact that the next four rows are completely different; the scores are zero, which means “no match.” This looks promising, so let’s explore further. Again, from the docs, if we run a boolean mode query where some of the search terms are prefixed with “+” and others have no prefix, results that have the unprefixed term will be ranked higher than those with out it. So, for example, if we change the above query to be “+james +peterson arizona” then we might expect to get back multiple matches containing the words “James” and “Peterson”, and we should expect the record from Arizona to be towards the top of the list.

With 5.5, this is exactly what happens:

With 5.6, we’re not so fortunate.

These results aren’t even close to identical. As it turns out, the full record for “Alphonso Lee Peterson Sr” does also contain the name “James”, and the word “Peterson” is listed in there several times, but “Arizona” is not present at all, whereas the record for “James R Peterson” had all three search terms and no significant repetition of any of them. Using this particular query, “James R Peterson” is #15 on the list.

At this point, it’s pretty obvious that the way MyISAM is calculating the scores is much different from the way that InnoDB is doing it, and given what I said earlier about the repetition of words in the “Alphonso Lee Peterson Sr” record versus the “James R Peterson” one, we might argue that InnoDB is actually behaving more correctly than MyISAM. Imagine if we were searching through newspaper articles or something of that sort, and we were looking for queries containing the word “MySQL” – odds are that an article which has 10 instances of “MySQL” might be more desirable to us than an article which only has it mentioned once. So if I look at these results from that persepctive, I can understand the how and the why of it. My concern is that there are likely going to be people who believe that switching to InnoDB FTS is simply a matter of upgrading to 5.6 and running ALTER TABLE foo ENGINE=InnoDB. In theory, yes. In practice, not even close.

I tried one more Boolean search, this time looking for someone’s full name, which I knew to be present only once in the database, and I used double quotes to group the search terms as a single phrase:

With 5.5:

Looks good, there he is. Now what happens under 5.6?

In the immortal words of Homer J. Simpson, “D’OH!!” Why is MyISAM able to locate this record but InnoDB cannot find it at all? I suspect that the “B” is causing problems for InnoDB, because it’s only a single character and we’ve set innodb_ft_min_token_size to 4. Thus, when InnoDB is parsing the data and building the word list, it’s completely ignoring Mr. Smith’s middle initial. To test this hypothesis, I reset innodb_ft_min_token_size to 1, dropped/rebuilt the InnoDB index, and tried again.

Aha, there he is! Based on that result, I would caution anyone designing an application that’s going to use InnoDB FTS to be quite mindful of the types of queries that you’re expecting your users to run. In particular, if you expect or are going to allow users to enter search phrases that include initials, numbers, or any other string of length less than 3 (the default), I think you’re going to be forced to set innodb_ft_min_token_size to 1. Otherwise you’ll run into the same problem as our Mr. Smith here. [This does raise the question of why it works with MyISAM when ft_min_word_length defaults to 4, but that is a topic for another day.]

Note that there may or may not be some performance implications to cranking this value all the way down; that is something I have not yet tested but will be reporting on in part 3 of this series. I can, however, confirm that the on-disk size of my DIR dataset is exactly the same with a setting of 1 versus a setting of 4. This may or may not be the case with multi-byte character sets or with ideographic languages such as Japanese, although Japanese poses its own unique problems for FTS of any kind due to its lack of traditional word boundaries.

In any event, it appears that we’ve solved the Boolean-mode search issue, but we still have vastly different results with the natural-language-mode search. For those of you who are expecting and need to have the MyISAM-style search results, there is at least one potential escape hatch from this rabbit hole. When defining a FULLTEXT KEY, you can use the “WITH PARSER” modifier to specify the name of a UDF which references your own custom-written fulltext parser plugin. Thus I am thinking that it may be possible to take the MyISAM full-text parser code, convert it to a plugin, and use it for InnoDB FT indexes where you’re expecting MyISAM-style results. Verifying or refuting this conjecture is left as an exercise for the reader. 🙂

A quick recap of what we’ve learned so far:

  • There are parts of InnoDB FTS configuration which are both letter-case and character-set sensitive. Watch out!
  • When you add your first FULLTEXT KEY to an InnoDB table, be prepared for a table rebuild.
  • Calculation of match score is completely different between the two engines; sometimes this leads to wildly different results.
  • If you were hoping to use InnoDB FTS as a simple drop-in replacement for your current MyISAM FTS, the results may surprise you.

That last point bears particular emphasis, as it also illustrates an important best practice even if FTS isn’t involved. Always test how your application behaves as a result of a major MySQL version upgrade before rolling it into production! Percona has tools (pt-upgrade and Percona Playback) that can help you with this. These tools are free and open source, please use them. You, and your users, will be happy that you did.

In the third and final installment of this series, we will take a look at performance. How does the speed of InnoDB FTS compare to its MyISAM counterpart, and what kinds of tricks can we use to make it go faster? Stay tuned to find out in the next blog on MySQL full-text searches!

0 0 votes
Article Rating
Subscribe
Notify of
guest

0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments

Far
Enough.

Said no pioneer ever.
MySQL, PostgreSQL, InnoDB, MariaDB, MongoDB and Kubernetes are trademarks for their respective owners.
© 2026 Percona All Rights Reserved