EmergencyEMERGENCY? Get 24/7 Help Now!

MySQL 5.6 Full Text Search Throwdown: Webinar Q&A

 | October 23, 2014 |  Posted In: Insight for Developers, MySQL, Technical Webinars

PREVIOUS POST
NEXT POST

MySQL 5.6 Full Text Search Throwdown: Webinar Q&AYesterday (Oct. 22) I gave a presentation titled “MySQL 5.6 Full Text Search Throwdown.” If you missed it, you can still register to view the recording and my slides.

Thanks to everyone who attended, and especially to folks who asked the great questions. I answered as many as we had time for during the session, but here are all the questions with my complete answers:

Q: Does Solr automatically maintain its index against MySQL? Do you have to hit the Solr server with a specific query to keep the index ‘warm’?

There are several strategies for updating a Solr index. In my examples, I showed only a “full import” which is what you would do to create an index by reading all the source data.

You can also perform a “delta import” periodically, to add a subset of the source data to an existing index, for example to add data that has changed since the last time you updated the Solr index. See the documentation for Using delta-import command and also Using query attribute for both full and delta import.

The delta import would typically be something you would invoke from a cron job, perhaps every hour. But that means that a Solr search might not find data that has changed in MySQL more recently than the last delta import. Depending on the application, a delay of up to 60 minutes might be acceptable, or else maybe you have strict requirements that all data must be in sync instantly.

You could also update the Solr index one document at a time using its Java API or web service API. This would require you to write code in your application. Every time you INSERT or UPDATE or DELETE a document in MySQL that you want to be kept in sync with the Solr index, you would write more code to do a similar operation in the Solr index. That way every single text change would be searchable nearly immediately.

Q: Did you test Elasticsearch? (several people asked about this)

I did not test Elasticsearch, but according to their technology overview: “Elasticsearch uses Lucene under the covers.” So I expect that this part of Elasticsearch performs similarly to what I saw from Apache Solr, which also uses Lucene internally.

Q: One question I could not understand, how to maintain Sphinx index in sync with data? Can be it in real time?

The Sphinx Search index does not automatically refresh as your MySQL data changes. You would have to write application code to invoke the indexing process. There’s a page in the Sphinx Search documentation about Live Index Updates, that gives an overview of the two methods, and links to further reading.

This is definitely the most inconvenient aspect of Sphinx Search. Queries are very fast, but it’s expensive to do incremental updates to an index. So it’s ideal for indexing an archive of text that doesn’t change very frequently, but not as easy to use it for indexing rapidly-changing content.

Q: I have over 800,000 PDF documents to index (in several languages), any recommendations?

I said during the webinar that I recalled there exists tools to extract searchable text from a PDF file. I found one such project called Apache PDFBox includes this capability, and they have a page describing a helper class for doing PDF parsing and extraction combined with Lucene indexing. I haven’t used it myself, so I can’t comment on its performance for indexing 800,000 PDF documents, but it seems like you could write a Java program to iterate over your collection of PDF’s, and index them using this class.

Q: What is your suggestion to use Sphinx Search for single column searches?

You can use any SQL query in the sphinx.conf to define the source data to index. You can select one column, multiple columns, or even multiple columns from joined tables. The result from any SQL query you write can be used as the data source.

Q: Which modules did you use with Sphinx Search? Did you use its built-in stemmers and metaphone package, etc.?

I installed the default modules. I don’t know if there is a significant performance difference from using optional packages.

Q: What about quality of results from each solution? I remember reading an article on percona.com several months ago comparing MyISAM fulltext vs InnoDB fulltext, and there were concerns about the results from InnoDB. Did you do any testing on this?

Indeed, here’s a link to the excellent blog post by my colleague Ernie Souhrada in which he found some surprises in the results from InnoDB FTS: InnoDB Full-text Search in MySQL 5.6: Part 2, The Queries!

I was just doing some comparison for performance in the current MySQL 5.7 milestone. I didn’t compare the query results this time.

Q: Is there any full text search in Percona Server with XtraDB?

Percona Server is based on the upstream MySQL Community Edition of the respective version number. So Percona Server has the builtin FULLTEXT index types for MyISAM and InnoDB, and we have not changed this part of the code. Percona Server does not bundle Sphinx Search, but it’s not too difficult to install Sphinx Search as a complementary technology, just as you would install other packages that are commonly used parts of an application infrastructure, for example Memcached or HA-proxy.

Q: Is MySQL going to improve the built-in InnoDB FTS in the near future?

They are continuing to add features that improve FTS, for example:

  • You can now write your own plugins for fulltext parsing (that is, parsing the input data to identify “words” to index; you may have your own idea about how to split text into words).
  • Both B-tree and full-text types now uses bulk-loading to make it faster and more efficient to build the index.

I’m not aware of any work to improve the performance of fulltext queries significantly.

Q: What is the performance comparison between MyISAM and InnoDB for inline index updating?

I didn’t test performance of incremental index updates this time. I only populated my tables from the StackOverflow data using LOAD XML, and then I created fulltext indexes on the populated tables. But I generally favor moving all important data to InnoDB, and not using MyISAM tables. It’s hard to imagine that the performance of index updates would be so much better that would convince me to use MyISAM. It’s more likely that the accuracy of search results would be a good reason to use MyISAM. Even then, I’d keep the original data in InnoDB and use MyISAM only as a copy of the data, to create a disposable fulltext index.

Thanks again for attending my webinar! For more great content, please join Percona and the MySQL community at our conference events. The next one is Percona Live London 2014 on November 3-4. We also look forward to the Open Stack Live 2015 in Santa Clara, California April 13-14, in the same venue with Percona Live MySQL Conference and Expo 2015, April 13-16.

Also watch more webinars from Percona in the future!

PREVIOUS POST
NEXT POST
Bill Karwin

Bill Karwin has been a software professional for over 20 years. He's helped thousands of developers with SQL technology. Bill authored the book "SQL Antipatterns," collecting frequent blunders and showing better solutions.

5 Comments

  • A colleague asked me to clarify on the topic of updating a Sphinx Search index. There’s a perception that Sphinx Search is a “magic bullet” because queries against a Sphinx Search index are so fast.

    The reality is that adding a document to an existing Sphinx Search index is approximately as expensive as reindexing the whole collection from scratch. They offer two solutions for handling changing data:

    1. Maintain two Sphinx Search indexes, one large index that is mostly unchanging for legacy data, and one small index for recent changes. Updates to the small index still require the costly reindexing, but since it is small that might be bearable overhead. Periodically, you merge the small index into the large index. Merging two on-disk indexes is not so expensive, and Sphinx Search provides tools to do it. Then start over with an empty small “recent” index.

    You would have to write code in your application to do two searches, and then write code to merge the results.

    2. Similar to above, in that you use two indexes, but the secondary index is an “RT index” which is an alternative implementation that Sphinx Search provides. RT is for real-time. It’s an in-memory (volatile) index. Adding documents to it is reasonably quick, but it’s not quite as efficient for searches. Also, it cannot be merged into the on-disk index, so you would just have to periodically reindex the whole on-disk index, and drop the RT index.

    This solution also requires that you write custom application code to run two searches and merge the results.

    Neither of these solutions is good if you have a very high rate of change to your searchable content. And solution 1 really works only if you have a pattern of old data becoming static (like a forum or chatroom). Not so good if the changes can happen randomly to old and new content (like a wiki).

  • Just to clarify some things about Sphinx Real Time index. Starting from 2.1.1-beta you can issue OPTIMIZE INDEX command which will optimize on-disk parts of the real-time index so search efficiency is the same as for on-disk indexes or faster as recently added part of data is stored in memory. Of course if you plan to search against large document collections you may want to combine RT and on-disk indexes to support load balancing, failover, etc,

  • Hi Bill,

    I finally got around to watching the recorded version of this, and I wanted to ask about using SpinxSE with the Percona Server packages (e.g. for Debian).

    The documentation says mysql needs to be (re)compiled from source to enable the Sphinx module. What are the chances of Percona providing pre-built packages with SphinxSE available?

    Thanks!

    Stephen

  • Thanks for the comment Stephen. This has been a wishlist item for some time: https://bugs.launchpad.net/percona-server/+bug/671614

    If you log in to launchpad.net, you can register your interest on that page by clicking the link for “Does this bug affect you?” As people register interest in a bug/feature request, this increases the “heat” and raises the priority for implementing it in a future version of Percona Server.

    That said, you can use Sphinx Search directly from any application. You don’t necessarily need the Sphinx Search SE plugin enabled in the instance of MySQL/Percona Server. You might even want your Sphinx Search instance on a different host than your MySQL host. You need the SphinxSE plugin only if you want to query the Sphinx Search index on the same connection that you use for other SQL queries against conventional tables.

Leave a Reply