Full Text Search Webinar Questions Followup

PREVIOUS POST
NEXT POST

I presented a webinar this week to give an overview of several Full Text Search solutions and compare their performance.  Even if you missed the webinar, you can register for it, and you’ll be emailed a link to the recording.

During my webinar, a number of attendees asked some good questions.  Here are their questions and my answers.

Adrian B. commented:

Q: Would’ve been a good idea to retrieve the same number of rows on each benchmark (I noticed 100 rows on SQL and 20 on Sphinx). Also Sphinx does the relevance sorting by default, adding relevance sorting to the MySQL queries would make them even slower, I’m sure.

Indeed, the result set of 20 rows from SphinxQL queries is merely the default.  And sorting by relevance must have some effect on the speed of the query.  I didn’t control for these variations, but that would be worthwhile to do.

Geoffrey L. asked:

Q: Do you have any data on memory usage for these tests?

No, I didn’t measure the memory usage for each test.  I did allocate buffers in what I hope was a realistic way.  For instance, when testing the InnoDB FT index, I made sure to increase innodb_buffer_pool_size to 50% of my RAM, and then when I tested MyISAM FT index, I reallocated that memory to into key_buffer_size.

The other thing that happened with regards to memory was that when I tried to create an InnoDB FT index before declaring the FTS_DOC_ID primary key column properly, MySQL crashed with an out of memory fatal error, and the Windows 7 server I was testing on prompted me to shut down the operating system.  So be careful about declaring the correct primary key!

Mario asked:

Q: Did you compare relevancy of search results between each engine?  If yes, which one seems the best?

No, I didn’t compare relevancy.  I was focused solely on query response time for this test. You’re right that the specific results probably vary based on the search implementation, and that’s an important factor in choosing a solution.

David S. asked:

Q: If searching on multiple terms, can you get Sphinx to report which matched?

The SHOW META command in Sphinx shows how many rows each keyword matched, and then a total for the search expression.  But I don’t know a way to report exactly which rows matched, without doing additional searches for each keyword.

Jessy B. asked:

Q: With respect to Solr and Sphinx, do indexes stay up to date with changes (inserts, updates and deletes)?

That’s a good question, because both of these external indexing technologies depend on being able to reindex as data in the source database changes.  This is an important consideration for choosing a full-text solution, because updating the index can become quite complex.

Solr is easier to add documents to, either individually as you add new data to the MySQL database or else periodically batch-insert data that has changed since last time you updated the index.  You can use the DataImportHandler to import the result of any SQL query, and if you can form a SELECT query that returns the “new” data (for example, WHERE updated_at > ‘2012-08-22 00:00:00′, when you did the prior update), you can do this anytime.

Sphinx Search is a bit harder, because it’s quite costly to update an index incrementally — it’s basically as expensive as creating the whole index.  For that reason, there are a couple of strategies used by Sphinx Search users to support changing data.  One is to store two indexes, one for historical data, and the other for the “delta” of recently-changed data.  Your application would have to be coded to search both indexes to find all matches among the most current data.  You would merge the delta index with the main index periodically.

Sphinx Search also supports a supplementary in-memory RT index type that supports direct row-by-row updates.  But you would still have to update the RT index as data changes, using application code.  Since RT indexes are in volatile memory, not stored on disk, you are responsible for integrating new data with the on-disk Sphinx index periodically by reindexing the whole collection.  There doesn’t currently seem to be a function to merge an RT index with an on-disk index, so integrating recent changes with the full index may require reindexing the whole dataset from time to time.

Mike W. asked:

Q: If the Sphinx index isn’t in sync, will the out-of-sync rows not be found?

No, only the documents included in the Sphinx Search indexes will be returned by Sphinx Search queries.

Mike W. also asked:

Q: What about MemSQL and indexes.  Have you benchmarked it?

According to their documentation, MemSQL supports hash indexes and skip list indexes, but not full-text indexes, so comparisons would not be meaningful.

Since MemSQL is an in-memory database, you can get a lot of speed improvement because you’re searching data without touching the disk, but I assume the search would necessarily do table-scans.

Jessy B. also asked:

Q: Were these tests performed on a single machine and a common/share set of disk?

The test machine I used is a Windows 7 tower with an Intel i3 processor, 8GB of RAM, and two SSD drives: one for the Windows partition, and one for the MySQL data partition.  I performed all the tests on this machine.  I realize this isn’t representative of modern production systems, but hopefully by performing all the tests on the same hardware, I got results that are least comparable to each other.

Hernan S. commented on the blog post where I announced the webinar:

We evaluated MySQL vs Solr.  I was able to index all the data from the database into Solr and make it queryable from a browser within four days plus some customization on the search algorithm. It would have taken me two to three weeks to do something equivalent with MySQL and it wouldn’t be as flexible and customizable as Solr. With Solr, I was able to fine tune search and I still feel there are tons of additional features that will help me address future needs.

Great points Hernan.  Each application project is different, and has different requirements for the FT functionality.  So one solution may include advanced features that are must-have for your application, but are not so important for another application.  I tried to test only the functionality all these solutions had in common, so I tested only simple queries without customizing the indexing.

Jeffrey S. asked:

Q: Do you know if there’s a good reference that discusses what technologies might be most adequate for various application types? (I can only think of my company’s video library search).  Talking about the relevance of search results, sorry if it wasn’t clear.

No, I don’t know of a reference that compares these technologies for different application types.  There are books that describe how to use one technology or the other, and some may compare one technology to the other, but typically these comparisons are made with respect to individual features, not in the context of application types.  You’d have to evaluate how relevant the search results are for your application needs, this isn’t something a benchmark can tell you.

Thanks for all the questions!

I’d like to see some of the folks who viewed my Full-Text Search Throwdown webinar when I present the popular Percona Training in Salt Lake City, September 24-27.  See http://www.percona.com/training/ for details on our training offerings and the schedule for upcoming events.

PREVIOUS POST
NEXT POST

Comments

  1. says

    Hey Bill,

    many thanks for mentioning Sphinx, and everything else seems good, but this particular bit:

    > Since RT indexes are in volatile memory, not stored on disk, you are responsible for integrating new data with the on-disk Sphinx index periodically by reindexing the whole collection.

    is, well, not entirely right. ;)

    First, by default RT indexes are safe: they do their magic in RAM indeed, but they save incoming writes into a binary write-ahead log, just like a database would. You can explicitly disable that for performance reasons, but that is not the out-of-the-box behavior.

    Second, you never have to somehow forcibly store RT to disk manually and “integrate” new data. RT should handle that automatically.

    Third, one of the ultimate goals with RT you should never have to reindex. As of 2.0 you might indeed occasionally want to do that to handle fragmentation. However as of 2.1 we are adding OPTIMIZE INDEX to do that.

    > There doesn’t currently seem to be a function to merge an RT index with an on-disk index, so integrating recent changes with the full index may require reindexing the whole dataset from time to time.

    And, for the record, there never will be. Full batch rebuilds are still very useful and are *not* going anywhere, of course. But the suggested solution for this particular case is the other way round just as well. ;) Instead of merging RT into a disk based index, you would (quickly) batch build a disk index, and then convert it to RT using ATTACH INDEX statement.

  2. says

    And one more thing that I initially missed…

    > You’d have to evaluate how relevant the search results are for your application needs, this isn’t something a benchmark can tell you.

    In fact, this is exactly something a relevancy “benchmark” could tell you. But you can’t get that benchmark easily.

    There are well-defined relevancy metrics that let you compare different ranking formulas (and therefore different systems). All those metrics are however based on hand-made relevancy judgements. Lots of manual labor, so they’re pretty difficult to produce at a large scale. And on a tiny scale (read: not too many judgments), they would only be marginally usable, not really telling one very much about the actual search quality.

    So the problem that there just isn’t a proper test that can evaluate your relevance. It very well can. However there are not many test collections with relevance judgments available at all, not to mention collections tailored for many different domains. And I can’t recall a single free/open collection. Duh.

  3. says

    Thanks Andrew! I’m glad to get some of your insight into the nuances of Sphinx. It remains true that keeping Sphinx indexes synchronized with the source data is a more complex story than keeping MySQL’s built-in FULLTEXT indexes synchronized with source data. When one is evaluating FTS solutions, one should have in mind the tradeoff between absolute best performance on the one hand versus simplicity for operations and maintenance on the other hand.

  4. says

    Sure, keeping data in sync with *any* external FT solution will be a bit more complicated than doing nothing with built-in FT indexes. My point was that even with RT indexes in Sphinx it’s not *that* complicated really.

  5. says

    Very nice post. I just stumbled upon your weblog and wanted to mention that I
    have truly enjoyed surfing around your weblog posts. In any case
    I’ll be subscribing to your rss feed and I am hoping you write once more soon!

Leave a Reply

Your email address will not be published. Required fields are marked *