Full Text Search Webinar Questions FollowupBill Karwin
I presented a webinar this week to give an overview of several Full Text Search solutions and compare their performance. Even if you missed the webinar, you can register for it, and you’ll be emailed a link to the recording.
During my webinar, a number of attendees asked some good questions. Here are their questions and my answers.
Adrian B. commented:
Q: Would’ve been a good idea to retrieve the same number of rows on each benchmark (I noticed 100 rows on SQL and 20 on Sphinx). Also Sphinx does the relevance sorting by default, adding relevance sorting to the MySQL queries would make them even slower, I’m sure.
Indeed, the result set of 20 rows from SphinxQL queries is merely the default. And sorting by relevance must have some effect on the speed of the query. I didn’t control for these variations, but that would be worthwhile to do.
Geoffrey L. asked:
Q: Do you have any data on memory usage for these tests?
No, I didn’t measure the memory usage for each test. I did allocate buffers in what I hope was a realistic way. For instance, when testing the InnoDB FT index, I made sure to increase innodb_buffer_pool_size to 50% of my RAM, and then when I tested MyISAM FT index, I reallocated that memory to into key_buffer_size.
The other thing that happened with regards to memory was that when I tried to create an InnoDB FT index before declaring the
FTS_DOC_ID primary key column properly, MySQL crashed with an out of memory fatal error, and the Windows 7 server I was testing on prompted me to shut down the operating system. So be careful about declaring the correct primary key!
Q: Did you compare relevancy of search results between each engine? If yes, which one seems the best?
No, I didn’t compare relevancy. I was focused solely on query response time for this test. You’re right that the specific results probably vary based on the search implementation, and that’s an important factor in choosing a solution.
David S. asked:
Q: If searching on multiple terms, can you get Sphinx to report which matched?
The SHOW META command in Sphinx shows how many rows each keyword matched, and then a total for the search expression. But I don’t know a way to report exactly which rows matched, without doing additional searches for each keyword.
Jessy B. asked:
Q: With respect to Solr and Sphinx, do indexes stay up to date with changes (inserts, updates and deletes)?
That’s a good question, because both of these external indexing technologies depend on being able to reindex as data in the source database changes. This is an important consideration for choosing a full-text solution, because updating the index can become quite complex.
Solr is easier to add documents to, either individually as you add new data to the MySQL database or else periodically batch-insert data that has changed since last time you updated the index. You can use the DataImportHandler to import the result of any SQL query, and if you can form a SELECT query that returns the “new” data (for example, WHERE updated_at > ‘2012-08-22 00:00:00’, when you did the prior update), you can do this anytime.
Sphinx Search is a bit harder, because it’s quite costly to update an index incrementally — it’s basically as expensive as creating the whole index. For that reason, there are a couple of strategies used by Sphinx Search users to support changing data. One is to store two indexes, one for historical data, and the other for the “delta” of recently-changed data. Your application would have to be coded to search both indexes to find all matches among the most current data. You would merge the delta index with the main index periodically.
Sphinx Search also supports a supplementary in-memory RT index type that supports direct row-by-row updates. But you would still have to update the RT index as data changes, using application code. Since RT indexes are in volatile memory, not stored on disk, you are responsible for integrating new data with the on-disk Sphinx index periodically by reindexing the whole collection. There doesn’t currently seem to be a function to merge an RT index with an on-disk index, so integrating recent changes with the full index may require reindexing the whole dataset from time to time.
Mike W. asked:
Q: If the Sphinx index isn’t in sync, will the out-of-sync rows not be found?
No, only the documents included in the Sphinx Search indexes will be returned by Sphinx Search queries.
Mike W. also asked:
Q: What about MemSQL and indexes. Have you benchmarked it?
According to their documentation, MemSQL supports hash indexes and skip list indexes, but not full-text indexes, so comparisons would not be meaningful.
Since MemSQL is an in-memory database, you can get a lot of speed improvement because you’re searching data without touching the disk, but I assume the search would necessarily do table-scans.
Jessy B. also asked:
Q: Were these tests performed on a single machine and a common/share set of disk?
The test machine I used is a Windows 7 tower with an Intel i3 processor, 8GB of RAM, and two SSD drives: one for the Windows partition, and one for the MySQL data partition. I performed all the tests on this machine. I realize this isn’t representative of modern production systems, but hopefully by performing all the tests on the same hardware, I got results that are least comparable to each other.
Hernan S. commented on the blog post where I announced the webinar:
We evaluated MySQL vs Solr. I was able to index all the data from the database into Solr and make it queryable from a browser within four days plus some customization on the search algorithm. It would have taken me two to three weeks to do something equivalent with MySQL and it wouldn’t be as flexible and customizable as Solr. With Solr, I was able to fine tune search and I still feel there are tons of additional features that will help me address future needs.
Jeffrey S. asked:
Q: Do you know if there’s a good reference that discusses what technologies might be most adequate for various application types? (I can only think of my company’s video library search). Talking about the relevance of search results, sorry if it wasn’t clear.
No, I don’t know of a reference that compares these technologies for different application types. There are books that describe how to use one technology or the other, and some may compare one technology to the other, but typically these comparisons are made with respect to individual features, not in the context of application types. You’d have to evaluate how relevant the search results are for your application needs, this isn’t something a benchmark can tell you.
Thanks for all the questions!
I’d like to see some of the folks who viewed my Full-Text Search Throwdown webinar when I present the popular Percona Training in Salt Lake City, September 24-27. See https://www.percona.com/training/ for details on our training offerings and the schedule for upcoming events.