
Using Sphinx as MySQL data retrieval accelerator

 | October 30, 2006 |  Posted In: Insight for Developers


I ran into the following thread a couple of days ago:

Basically, someone is using Sphinx to search purely on attributes (date, group, etc.) and get a sorted result set, and claims it is far faster than getting the same from MySQL. Honestly, I can well believe it in cases where you need to know the number of matching rows, or where you can’t build efficient indexes that both provide good selectivity and resolve the ORDER BY.

Funnily enough, Sphinx does not use indexes to filter by attributes or to sort; indexes are used only for full-text matching. It is still extremely fast at crunching the data, though.

I just tested this: a search for “the”, which matched 100,000,000 documents out of a 200,000,000-document collection (200GB), completed in 0.7 seconds. This is a system we’re building for one of our clients, which uses a cluster of 3 nodes to handle search. No shortcuts are taken in this case: all 100,000,000 matching documents are traversed, and priority-queue sorting is performed to generate the 1,000 best-matching results. Quite impressive.
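For illustration (this is a sketch of the general technique, not Sphinx’s actual code), here is how a bounded priority queue picks the best N matches out of a huge stream without ever sorting the whole stream:

```python
import heapq
import random

def top_n(scored_docs, n):
    """Keep only the n best-scoring documents in a bounded min-heap.

    Memory stays O(n) however many documents stream through, and the
    cost is O(total * log n) instead of the O(total * log total) a
    full sort would take.
    """
    heap = []  # min-heap of (score, doc_id); worst kept item is on top
    for score, doc_id in scored_docs:
        if len(heap) < n:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))  # evict current worst
    return sorted(heap, reverse=True)  # best match first

# Simulate a stream of (score, doc_id) matches
random.seed(1)
stream = ((random.random(), i) for i in range(1_000_000))
best = top_n(stream, 1000)
print(len(best))
```

The heap never holds more than N entries, which is why a million-row (or hundred-million-row) match set can be reduced to a top-1,000 list in a single cheap pass.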

Yes, I know “the” should be a stop word, but I currently have an index without stop words for testing purposes.

Now, here is what I’m hoping for as future developments:

Andrew: continue improving Sphinx so it has more advanced filtering clauses and more attribute types, plus an option to retrieve by filters only, with no full-text query. Sphinx should not be a replacement for a database server, but for many data retrieval needs it will work great, especially as it can be used with other databases which may be slower than MySQL.

MySQL: make it so one does not need to use Sphinx to get great performance for this kind of query. This includes parallel processing, fast COUNT(*), and bitmap indexes to help with non-selective clauses. Also, some form of fast sort, such as a priority queue, could be used when only the first few elements are needed.

Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.


  • Dathan,

    Sphinx is an external full-text search engine which can be hooked up as a storage engine (acting as a client), but it is not a full-featured storage engine as you might think.

    What it does however it does really fast.

  • Apachez,

    The question is not about a “single word” but the fact that 100,000,000 matches are sorted by a given attribute, with the top matches returned.

    I know you will bring up the ever-fast TBGsearch, which stops after finding 1,000 matches and does not sort them at all. These are, however, completely different tasks.

  • Actually, Sphinx is not sorting the results at all, and that can be proven by changing the source code to actually select 100,000,000 “rows” instead of just the 20 which are statically compiled. You will then see a major increase in processing time…

    Searching for just “the” (or any other indexed word) takes far less than 0.7 seconds with TBGsearch or any search engine which uses MySQL as a backend and, like Sphinx, stores the frequency for each word. Actually, the number of “matched” documents doesn’t matter in this case: the time will be the same with 100 million documents as with 1 billion, because the lookup is against the word table, not the hit table itself. The time is affected more by the number of unique indexed “words” than by how many documents have been indexed.

  • Apachez,

    Please re-read what I wrote: it DOES perform sorting; however, it does not return you ALL rows. It uses a priority queue to deliver the top 1,000 matching results. For example, if sorting is done by relevance, you may see matches returned at the top which contain “the” both in the title and in the text body.

    Now about TBGSearch: again, LIKE “%the%” LIMIT 100 will also normally find you 100 matches almost instantly at almost any data size, but it cannot guarantee these are the best matches by relevance.

    Finally, Sphinx does NOT store the number of matches for the word “the”; it actually computes the true number of matches in this case. That may not always be smart to do, but as it sorts the result set anyway, the overhead is limited.

    “The” is just a simple example: you can do phrase searches and apply a number of filters, so having a precomputed match count for a single keyword does not help much.

  • Actually, neither can Sphinx guarantee any true relevance matching. And since when does Sphinx support wildcard searches? 😛

    In SQL, the query would be more like word = ‘the’ and not word LIKE ‘%the%’ (even though TBGsearch, among others, does support wildcards).

    In the case of TBGsearch, “relevance is irrelevant”: relevance is based on date, but it can just as easily be based on any number of your own relevance metrics (as for how you position one message ahead of another, there are many different types of “relevance”…).

    Another note: IF Sphinx actually performs a sort each time, and does that on 100 million rows in just 0.7 seconds, that would be way faster than MySQL itself performs sorting, which should give you a clue that what you said the first time is perhaps not the full story…

  • Apachez,

    I do not understand what you’re arguing about. Sphinx DOES sort according to relevance. It may not rank things the same way you personally would like, but the same applies to Google. In any case, it is normally much better than just returning random matching values.

    Sphinx does not support wildcards at this point. But again, that is not the point. You can make it text LIKE “% the %” or text LIKE “the %” or maybe some regular expression. My point is just that for locating the first few matches of a keyword with such poor selectivity (over 50% matching rows), a simple full table scan with LIMIT would actually offer the best performance.

    Again, I’m not sure what you’re trying to argue about. If you’re saying TBGSearch is faster when you do not need relevance, I agree with you. Everything has its purpose; Sphinx is designed exactly for uses where relevance is very important.

    Now, speaking about why Sphinx is faster (by the way, if you check the link I provided at the top of the story, you will see it is not just my words), I can explain why:

    – It does not do a full sort but uses priority-queue sorting to get the top 1,000 results. This is faster. MySQL simply does not have this feature, even if you only need the first few rows from the result set.

    – As a specially designed solution, it is faster at what it does. This is quite typical.

    – The sort is performed in parallel. This search runs on 4 nodes with 2 CPUs each. MySQL has no way to use multiple CPUs for such a query. Even with MySQL Cluster, the sort would be done on the MySQL server: a single node.

    And again, take a look at the original post: people are interested in SORTING. They do not want full-text search; on the contrary, they are asking for a query that returns the full result set from the keyword-matching standpoint and then uses it for extra filtering and sorting.

  • Well, to my knowledge, ORDER BY is pretty common when using an SQL server such as MySQL, so this would perhaps be a hint to the MySQL developers to adopt new sorting algorithms.

    Regarding relevance: in my case I use the date as the major relevance component, which is not “just returning random matching values” as you call it.

    The problem with “relevance” is that there is no way to check whether you really get the hits with the highest relevance in return. Take Google as an example: they claim to use PageRank as relevance ranking among other parameters, which means there most likely exist matches more relevant than what you get from a single search query against Google. Google has simply decided to cheat in such a way that the search results are almost, or at least, good enough on the relevance scale.

    A search based on a static measurement such as a date column, on the other hand, makes it a lot harder to cheat with the results (unless you do like Google and hide the actual source).

    It’s also interesting that you mentioned 4 nodes (wasn’t it just 3 nodes in the original post? :P) x 2 CPUs, which makes 8 calculation units. On average, each one sorted 100/8 = 12.5 million rows, not the 100 million claimed in the original post. It would also be interesting to see how many rows each processing unit returned to the main server. Was it 1,000 rows out of 12.5 million, and then the main server took 8 * 1,000 rows and performed an additional sort to return just 1,000 rows to your client?

  • Right. ORDER BY is important and can be optimized and parallelized; however, it is not done yet, and it is not on the roadmap.

    Speaking about relevance: please correct me if I’m wrong, but my understanding was that with TBGSearch you get 1,000 rows from the search and only sort these by date. This does not guarantee sorting by date. I can, however, see you building indexes in date order. In that case you can retrieve data from the index pretty fast, but it does not allow other sorts to be done efficiently.

    Anyway, you’re trying to pull me into a comparison with TBGSearch, and that is not the point of this post.

    Regarding cheating and full-text search in general: you generally want to present the results most relevant to the user first. Users searching for something do not care whether the results were selected automatically, moderated, or “cheated” as you say. Webmasters may scream “I should be on top, it is not fair”, but it is not Google’s obligation to give you free traffic. There may be some obligations, e.g. for paid advertisements, but that is a whole other story.

    Sorry, I made a typo: it is 3 nodes. In the post I mention the total number of documents, which is right, but that is what is important: being able to easily build large-scale systems. You can do the same things with MySQL, but that requires you to do it manually, which is pretty ugly.

    Yes, you’re right. Nodes do a local search to provide the requested number of top rows, and the results are later merged together.
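    As a sketch of that scatter-gather step (an illustration of the general technique, not Sphinx’s actual protocol): each node returns its local top-k sorted best-first, and the coordinator merges the sorted lists and keeps the global top k:

```python
import heapq
from itertools import islice

def merge_node_results(per_node_results, k):
    """Merge per-node top-k lists (each sorted best-first, i.e. by
    descending score) into the global top k, as a coordinator would.
    heapq.merge only ever compares the list heads, so the merge is
    cheap regardless of how many rows each node scanned locally.
    """
    merged = heapq.merge(*per_node_results, key=lambda r: r[0], reverse=True)
    return list(islice(merged, k))

# Three nodes each return their local 3 best (score, doc_id), best first
node_a = [(0.9, 101), (0.7, 102), (0.4, 103)]
node_b = [(0.8, 201), (0.6, 202), (0.5, 203)]
node_c = [(0.95, 301), (0.3, 302), (0.2, 303)]
print(merge_node_results([node_a, node_b, node_c], 3))
# → [(0.95, 301), (0.9, 101), (0.8, 201)]
```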

  • Hi to the experts here,

    I have a simple question that I couldn’t find an answer to until now:
    is it possible to tell MySQL how many rows I want from the query result, so that on each specific event (for example, the user clicking the “next” button) it returns the next X rows?
    The thing is, if the complete result contains, say, 100,000 rows, I only need 20 rows at a time and don’t want MySQL to compute the entire result at once, to save time and space. Instead, I want it to compute only the first 20, then the next 20, and so on…

    Thanks in advance,

  • Roy,

    There is LIMIT: you can do LIMIT 10, LIMIT 10,10, etc.

    If you’re not familiar with this functionality, you really should get a good MySQL book.
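    A minimal pagination sketch (using SQLite from Python so the demo is self-contained; MySQL accepts the same LIMIT offset, count syntax, and the table and column names here are made up for illustration):

```python
import sqlite3

# In-memory demo table with 100 rows
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO posts (title) VALUES (?)",
                 [(f"post {i}",) for i in range(1, 101)])

page_size = 20
for page in range(2):
    offset = page * page_size
    # LIMIT offset, count: page 0 fetches rows 1-20, page 1 rows 21-40
    rows = conn.execute(
        "SELECT id FROM posts ORDER BY id LIMIT ?, ?",
        (offset, page_size)).fetchall()
    print(rows[0][0], rows[-1][0])
```

    One caveat: with a large OFFSET the server still has to generate and discard all the skipped rows, so for deep pages a keyset condition such as WHERE id > last_seen_id ORDER BY id LIMIT 20 is usually cheaper.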

  • I’m very happy with Sphinx (Andrew Aksyonoff, thank you a lot for that!). It doesn’t take too much CPU, while MySQL’s MyISAM full-text search takes a lot (maybe there’s some bug in 5.1, I don’t know). After migrating, MySQL takes half as much CPU.

    I really recommend Sphinx to everyone, a very, very fast thing!

  • Ed,

    No, there are no bitmap indexes. Why are they not implemented? Well, there are a lot of features which are not (yet): hash joins, sort-merge joins, a lot of subquery optimizations. They should come, but over time.
