InnoDB Full-text Search in MySQL 5.6 (part 1)

I’ve never been a very big fan of MyISAM; I would argue that in most situations, any possible advantages to using MyISAM are far outweighed by the potential disadvantages and the strengths of InnoDB. However, up until MySQL 5.6, MyISAM was the only storage engine with support for full-text search (FTS). And I’ve encountered many customers for whom the prudent move would be a migration to InnoDB, but due to their use of MyISAM FTS, the idea of a complete or partial migration was, for one reason or another, an impractical solution. So, when FTS for InnoDB was first announced, I thought this might end up being the magic bullet that would help these sorts of customers realize all of the benefits that have been engineered into InnoDB over the past few years and still keep their FTS capability without having to make any significant code changes.

Unfortunately, I think that hope may be premature. While it is true that InnoDB full-text search in MySQL 5.6 (part 1) in MySQL 5.6 is syntactically identical to MyISAM full-text search, in the sense that the SQL required to run a MATCH .. AGAINST is the same (modulo any new features introduced with InnoDB full-text search), that’s largely where the similarities end.

NOTE 1: I was originally planning to cover everything I wanted to discuss with respect to InnoDB full-text search in this one post, but I think there’s a lot of interesting stuff here, so I will break it into three pieces instead. The first part (this one) will be a very quick overview of FTS in InnoDB and some observations that I’ve made while getting it configured. The second part will compare query results between MyISAM FTS and InnoDB FTS over the same data sets, and then finally in the third installment, we’ll look at query performance. In the event that a new release of 5.6 appears between now and part 3, I’ll also revisit some of the “quirks” from parts 1 and 2 to see if the behavior has changed.

NOTE 2: For purposes of this discussion, I used two separate data sets. The first one is a set of about 8K very SEO-stuffed web pages, wherein the document title is the page title, and the document body is the HTML-tag-stripped body of the page. We’ll call this data set “SEO” – it’s about 20MB of actual data. The other one is a set of about 790K directory records, each one containing the name, address, and some other public-records-type information about each person. We’ll call this data set “DIR”, and it’s about 155MB of actual data.

NOTE 3: Also, keep in mind that I used the community editions of MySQL 5.5.30 and MySQL 5.6.10 with no tuning whatsoever (with one exception that I’ll explain below) – the idea behind this investigation wasn’t to find out how to make InnoDB FTS blazingly-fast, but simply to get a sense of how it works compared to traditional MyISAM FTS. We’ll get to performance in the third installment. For now, the important number here is to note that the InnoDB buffer pool for my 5.6 instance is 128MB – smaller than the size of my DIR data.

So, with all of that out of the way, let’s get to it.

Here is our basic table definition for the DIR dataset. The table for the SEO dataset looks identical, except that we replace “full_name” with a VARCHAR(255) “title” and “details” with a TEXT “body”.

We also have identical tables created in 5.5.30 where, of course, the only difference is that the engine is MyISAM rather than InnoDB. Loading the data was done via a simple Perl script, inserting one row at a time with AutoCommit enabled – remember, the focus here isn’t on performance just yet.

Having loaded the data, the first thing we notice is that there are a lot of “new” InnoDB tablespace files in our database directory:

By comparison, this is what we see on the MyISAM side:

I also observed that if I simply load the data into an InnoDB table that has never had a full-text index on it, and then I create one, the following warning is generated:

This doesn’t make a lot of sense to me. Why does InnoDB need to add a hidden column (similar to GEN_CLUST_INDEX when you don’t define a PRIMARY KEY, I assume) when I already have an INT UNSIGNED PK that should suffice as any sort of document ID ? As it turns out, if you create a column called FTS_DOC_ID which is a BIGINT UNSIGNED NOT NULL with a unique index on it, your table doesn’t need to be rebuilt. The most important item to note here – FTS_DOC_ID must be spelled and specified exactly that way – IN ALL CAPS. If you try “fts_doc_id” or any other mixture of lettercase, you’ll get an error:

Various points in the documentation mention the notion of a “document ID” that “might reflect the value of an ID column that you defined for the underlying table, or it can be a sequence value generated by InnoDB when the table does not contain a suitable column,” but there are only a handful of references to FTS_DOC_ID found when searching the MySQL 5.6 manual, and the only page which appears to suggest how using an explicitly-defined column is done is this one, which discusses improving bulk insert performance. At the very bottom, the page claims that you can speed up bulk loading into an InnoDB FT index by declaring a column called FTS_DOC_ID at table creation time of type BIGINT UNSIGNED NOT NULL with a unique index on it, loading your data, and then creating the FT index after the data is loaded.

One obvious problem wih those instructions is that if you define a column and a unique key as they suggest, your data won’t load due to unique key constraint violations unless you also do something to provide some sort of sequence value for that column, whether as an auto_increment value or via some other means, but the bit that troubles me further is the introduction of a column-level case-sensitivity requirement that only seems to actually matter at table creation time. Once I’ve got a table with an explicit FTS_DOC_ID column, however, MySQL apparently has no problem with either of the following statements:

Philosophically, I find that kind of behavior unsettling. I don’t like case-sensitivity in my table or column names to begin with (I may be one of the few people that likes lower_case_table_names = 1 in /etc/my.cnf), but I think it’s even worse that the case-sensitivity only matters some of the time. That strikes me as a good recipe for DBA frustration.

Now, let’s return to all of those FTS_*.ibd files. What, exactly, are they? In short, the _CONFIG.ibd file contains configuration info about the FT index (the same sort of configuration data that can be queried out of the I_S.INNODB_FT_CONFIG table, as I’ll discuss momentarily), and the others contain document IDs of new rows that are added to or removed from the table and which need to be merged back into or removed from the main index. I’m not entirely sure about the _STOPWORDS.ibd file just yet; I thought it might be somehow related to a custom stopwords table, but that doesn’t seem to be the case (or at least not in the way that I had thought), so I will need to look through the source to figure out what’s going on there.

In any case, for each new FULLTEXT KEY that you create, you’ll get a corresponding FTS_*_DOC_ID.ibd file (but none of the others), and if you drop a FT index, the its corresponding FTS_*_DOC_ID.ibd file will also be removed. HOWEVER, even if you drop all of the FT indexes for a given table, you’re still left with all of the other FTS_*.ibd files, and it appears that the only way to get rid of them is to actually rebuild the table.