MySQL 5.6 Full Text Search Throwdown: Webinar Q&ABill Karwin
Thanks to everyone who attended, and especially to folks who asked the great questions. I answered as many as we had time for during the session, but here are all the questions with my complete answers:
Q: Does Solr automatically maintain its index against MySQL? Do you have to hit the Solr server with a specific query to keep the index ‘warm’?
There are several strategies for updating a Solr index. In my examples, I showed only a “full import” which is what you would do to create an index by reading all the source data.
You can also perform a “delta import” periodically, to add a subset of the source data to an existing index, for example to add data that has changed since the last time you updated the Solr index. See the documentation for Using delta-import command and also Using query attribute for both full and delta import.
The delta import would typically be something you would invoke from a cron job, perhaps every hour. But that means that a Solr search might not find data that has changed in MySQL more recently than the last delta import. Depending on the application, a delay of up to 60 minutes might be acceptable, or else maybe you have strict requirements that all data must be in sync instantly.
You could also update the Solr index one document at a time using its Java API or web service API. This would require you to write code in your application. Every time you INSERT or UPDATE or DELETE a document in MySQL that you want to be kept in sync with the Solr index, you would write more code to do a similar operation in the Solr index. That way every single text change would be searchable nearly immediately.
Q: Did you test Elasticsearch? (several people asked about this)
I did not test Elasticsearch, but according to their technology overview: “Elasticsearch uses Lucene under the covers.” So I expect that this part of Elasticsearch performs similarly to what I saw from Apache Solr, which also uses Lucene internally.
Q: One question I could not understand, how to maintain Sphinx index in sync with data? Can be it in real time?
The Sphinx Search index does not automatically refresh as your MySQL data changes. You would have to write application code to invoke the indexing process. There’s a page in the Sphinx Search documentation about Live Index Updates, that gives an overview of the two methods, and links to further reading.
This is definitely the most inconvenient aspect of Sphinx Search. Queries are very fast, but it’s expensive to do incremental updates to an index. So it’s ideal for indexing an archive of text that doesn’t change very frequently, but not as easy to use it for indexing rapidly-changing content.
Q: I have over 800,000 PDF documents to index (in several languages), any recommendations?
I said during the webinar that I recalled there exists tools to extract searchable text from a PDF file. I found one such project called Apache PDFBox includes this capability, and they have a page describing a helper class for doing PDF parsing and extraction combined with Lucene indexing. I haven’t used it myself, so I can’t comment on its performance for indexing 800,000 PDF documents, but it seems like you could write a Java program to iterate over your collection of PDF’s, and index them using this class.
Q: What is your suggestion to use Sphinx Search for single column searches?
You can use any SQL query in the sphinx.conf to define the source data to index. You can select one column, multiple columns, or even multiple columns from joined tables. The result from any SQL query you write can be used as the data source.
Q: Which modules did you use with Sphinx Search? Did you use its built-in stemmers and metaphone package, etc.?
I installed the default modules. I don’t know if there is a significant performance difference from using optional packages.
Q: What about quality of results from each solution? I remember reading an article on percona.com several months ago comparing MyISAM fulltext vs InnoDB fulltext, and there were concerns about the results from InnoDB. Did you do any testing on this?
Indeed, here’s a link to the excellent blog post by my colleague Ernie Souhrada in which he found some surprises in the results from InnoDB FTS: InnoDB Full-text Search in MySQL 5.6: Part 2, The Queries!
I was just doing some comparison for performance in the current MySQL 5.7 milestone. I didn’t compare the query results this time.
Q: Is there any full text search in Percona Server with XtraDB?
Percona Server is based on the upstream MySQL Community Edition of the respective version number. So Percona Server has the builtin FULLTEXT index types for MyISAM and InnoDB, and we have not changed this part of the code. Percona Server does not bundle Sphinx Search, but it’s not too difficult to install Sphinx Search as a complementary technology, just as you would install other packages that are commonly used parts of an application infrastructure, for example Memcached or HA-proxy.
Q: Is MySQL going to improve the built-in InnoDB FTS in the near future?
They are continuing to add features that improve FTS, for example:
- You can now write your own plugins for fulltext parsing (that is, parsing the input data to identify “words” to index; you may have your own idea about how to split text into words).
- Both B-tree and full-text types now uses bulk-loading to make it faster and more efficient to build the index.
I’m not aware of any work to improve the performance of fulltext queries significantly.
Q: What is the performance comparison between MyISAM and InnoDB for inline index updating?
I didn’t test performance of incremental index updates this time. I only populated my tables from the StackOverflow data using LOAD XML, and then I created fulltext indexes on the populated tables. But I generally favor moving all important data to InnoDB, and not using MyISAM tables. It’s hard to imagine that the performance of index updates would be so much better that would convince me to use MyISAM. It’s more likely that the accuracy of search results would be a good reason to use MyISAM. Even then, I’d keep the original data in InnoDB and use MyISAM only as a copy of the data, to create a disposable fulltext index.
Thanks again for attending my webinar! For more great content, please join Percona and the MySQL community at our conference events. The next one is Percona Live London 2014 on November 3-4. We also look forward to the Open Stack Live 2015 in Santa Clara, California April 13-14, in the same venue with Percona Live MySQL Conference and Expo 2015, April 13-16.
Also watch more webinars from Percona in the future!