Sphinx search performance optimization: multi-threaded search

Posted on:



Share Button

Queries in MySQL, Sphinx and many other database or search engines are typically single-threaded. That is when you issue a single query on your brand new r910 with 32 CPU cores and 16 disks, the maximum that is going to be used to process this query at any given point is 1 CPU core and 1 disk. In fact, only one or the other.

Seriously, if query is CPU intensive, it is only going to be using 3% of the available CPU capacity (for the same 32-core machine). If disk IO intensive – 6% of the available IO capacity (for the 16-disk RAID10 or RAID0 for that matter).

Let me put it another way. If your MySQL or Sphinx query takes 10s to run on a machine with a single CPU core and single disk, putting it on a machine with 32 such cores and 16 such disks will not make it any better.

But you knew this already. Question is – can you do something about it?

In case of Sphinx – indeed you can! And with very little effort. In fact, it does not require any changes to your application or database whatsoever, it is only a matter of small changes to the sphinx configuration.

The Plan

First of all, let me explain what we are trying to achieve here.

Sphinx has the ability to do distributed search out of the box – it was designed to scale out that way very early on. If your sphinx index does not fit to one machine, you would index different parts of it from different machines and then you would have an aggregator node that receives the request from application, issues search requests to all data nodes in parallel, merges results from all of the data nodes and returns results back to the application as if it was just one server serving the request in the first place.

Well, guess what – you can actually utilize this feature to your advantage even if your data can easily fit into one machine and all you want is your queries to be many times faster. Even more so, Sphinx now supports this out of the box, so you don’t need to pretend you are querying remote nodes.

Also, you get a bonus: once you configure server for distributed search, you can do indexing in parallel too!

Word of caution: while this technique will improve most types of search queries, there are some that aren’t going to benefit greatly from parallel execution. The reason is that while search is done on data nodes (even if local ones) and in parallel, merging of results is done by the aggregator and therefore it is single-threaded. Merging includes some CPU-intensive operations such as ranking, ordering or even COUNT with GROUP BY and if data nodes return large amounts of data to post-process, aggregator may well become a bottle-neck due to its single-threaded nature.

This is however easy to check – look at your Sphinx query log and look at the number of results matched per query – this should give you a clue.


Let us assume you have this one index on one server with the following basic configuration (many irrelevant details omitted):

And now we want it to utilize 3 CPU cores and/or disks on a local machine for this index of ours – idx1. Here’s how we would change the configuration:

And you’re done. Of course, you need to reindex all of the indexes, but you can now do it in parallel – just run a separate indexer for every index idx1p0 through idx1p2.

BTW, using div operator is not necessarily the best way to split data, you may want these to be ranges by using a helper table in MySQL to define those ranges, mixing it with sql_query_range or something else, depending on how your data looks like.

Finishing line

I always loved how Sphinx scales out easily with as many machines you need and have been running it this way for many years now, however I think I don’t utilize this feature to make queries even faster on a one-machine show nearly as often as I should. Well, it’s not like it is slow or anything, but queries are never too fast, are they? :)

Share Button

Aurimas Mikalauskas

Aurimas joined Percona in 2006, a few months after Peter and Vadim founded the company. His primary focus is on high performance, but he also specializes in full text search, high availability, content caching techniques and MySQL data recovery.


, , ,



  • Thank for this great post!

    What do you think? Does it make sense to distribute my index?
    I have an 4 Core CPU (Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz) with 32GB of RAM
    but only with an RAID-1 system (containing 2 disks).

    My sphinx index contains about 12 millions records with a disk size of about 2,7 GB.


  • Stefan, –

    If this machine is running sphinx only, you are definitely going to benefit. Even if it’s a shared box – if you don’t have many disk reads hence sphinx serves data from OS cache, it should help just as well – making 3-4 partitions should improve performance of sphinx queries 2-3 times. Note this also assumes that your queries are rather selective and do not return millions of records.


  • Is the above config example correct? I’m trying to replicate it on my end and I see some issues.

    ERROR: index ‘idx1p0’: source ‘src0’ not found.

    I don’t see src0 defined anywhere in the above config.

    indx1p1 as configured is going to contain the entire dataset, not just 1/3rd of it as expected because it is using src1 which is defined without a % in the where query.

    Same error for the 3rd index:
    indexing index ‘idx1p2’…
    ERROR: index ‘idx1p2’: source ‘src2’ not found.
    ERROR: index ‘idx1p2’: no valid sources configured; skipping.

  • Brian, –

    thanks – good catch. That was an error on my side while making the pseudo-config. You should replace “source = src0” to “source = src1p0”, “source = src1” to “source = src1p1” and “source = src2” to “source = src1p2”. I have fixed that in the post already.

    Let me know if that fixes the issue.


  • Thanks, the config works perfectly now.

    I’m troubleshooting an issue where we get zero results back after sharding the index into 4 chunks. We’ve been running a single index for a few years now, and we have 16 24 core servers handling the search side of things. For whatever reason, when we shard the index, we no longer get any results back in production. Nothing changes software side, just the sphinx config, index, and a searchd restart.

  • Brian, –

    I would try to query the individual shards and see if they return results. If not, chances are something went wrong with indexing and indexes are just empty. If they do, then it could be a bug on the sphinx side.

    If you’re unable to figure it out, please create a forum entry and just post a link here:



  • How would you merge the indexes after this? I assume you would want to get 1 index. So, I can run the indexer for idx1p0, idx1p1, idx1p2 but how would I then get the idx1 index? If I try to run indexer for idx1, I get “skipping non-plain index ‘idx1’…


  • Hi Robert, –

    you don’t actually need to merge sphinx indexes – it does that on the fly. If you are using sphinxql, you can actually just list all the indexes you want to merge on the fly (e.g. SELECT cols FROM idx1p0,idx1p1,idx1p2 WHERE …), otherwise if it’s more convenient that the index looks as if it was one index, you can create a virtual index with type = distributed that has all the local indexes “merged”. See here for some explanation.

    And btw, you don’t want to run indexer for the virtual index, you want to index the local parts instead, hence only run indexer on idx1p0,idx1p1,idx1p2.

    Have a nice weekend!

  • Thanks Aurimas, I misunderstood the nature of how the indexes would work. Thanks for posting this also, it was a great help to me. I really does speed up the searches.

  • Thanks for the feedback, Robert. There are some types of queries where it does not help (those that do a lot of post-processing after matching the search), but in many simple cases it should help indeed.

  • Nice Idea,
    i tested it on my local machine ( core i3 + 3G ram + sphinx 2.1.1) on 300K rows with an index of 48 M.B and with using stopwords list , min_word_len = 1 , searching only one row and storing 5 rows as attr_string , with 6 languages in charset_table

    i developed a script that search for 2000 random queries and return the mean time for each single query and the result was as follow:
    1 thread >>maximum (0.019923) and minimum (0.019618) second per query
    2 threads >>maximum (0.018417) and minimum (0.017915) second per query

  • Osama, I realize in your case the improvement is not significant, but it really depends on quite a few things:
    – first of all, naturally the difference will be more significant for IO intensive workloads and in your case the test is definitely CPU intensive
    – secondly, search selectivity is very important here. The more selective the search is, the more benefit you get from parallel processing (because there’s less work remaining to combine the results from different threads). In your case, considering the response times, search queries are likely not very selective and so naturally post-processing (which is single-threaded) is where most time is spent
    – and third, type of queries is important. simple search queries are more likely to benefit than those doing a lot of post-processing for other reasons such as group-by with some aggregate functions.

  • Hi, coming at this question from an ‘Architect’ perspective so pardon any non-tech misunderstandings; will this be helpful in the case of using sphinx as auto-complete? We have a search form where each input pushes the characters to sphinx index to return the field into a selection container (Am= American/Armenia/Amsterdam). We do this since we don’t want to cache our tables client side but do want a responsive Auto-Suggest.

    In load-testing we consistently find we max our user at about 20% of CPU (meaning the response times for the inputs are no longer acceptable after that cpu utilization). This is whether on VPS with 2 cores or dedicated with 64 cores.

    Would this solution help in this case? In other words would making sure Sphinx was involving all cores help make sure that the cpu is not overloaded at 20%?

  • Michael, –

    if your benchmark is multi-threaded, then having it max out at 20% CPU utilization does not look right, you should be able to get it way higher as you increase the number of parallel users and the response time should only increase slightly until you reach 70-80% CPU utilization.

    That said, I’m guessing that either your benchmark is serializing requests (or there’s serialization at any other level), or your bottle-neck is not CPU, rather Disk IO. It is especially suspicious that 2 -vs 64 cores makes no difference.

    Making a single request execute on multiple CPU cores (if that is possible depends on the design of your request) may help to increase CPU utilization (and decrease response time) for a single request and with that you will also increase the overall utilization of machine CPU (as well as you will be able to handle more user requests per given time if they are serialized). However, if you are happy wih the current response time and just want to handle more requests, then (a) make sure your benchmark is multi-threaded, (b) make sure you’re not saturating disks (and if you are, double check you are saturating a capacity of multiple disks and not one disk, which, again, realtes to request queue depth (or number of parallel requests you execute)).

    I hope I’m making sense.

  • Hi Aurimas, thanks for the detailed response. More cores does give us a higher # of concurrent sessions before the response times goes haywire. Mainly I’ve noticed it is when the cpu/networks stop scaling in a smooth way and fluctuate the response times get killed. I agree I was looking for 75% or before things got overloaded. In all tests the memory usage was pretty flat and very low (on 32GB + boxes) at less then 10%.

    I wonder though since there are about a dozen inputs, perhaps 4-5 ‘core’ inputs, if explicitly separating those into groups might help? For instance push Input A to Core 1, Input B to Core 2, etc. Otherwise as I see it we have multiple users typing into different inputs and each keystroke in each input for each user in theory is overloading the same CPU. Is this even possible? If so I could take the 4 main inputs user search from and push to a core or cores.

    I am not in love with the current response time which is, when not being slammed, 350ms to 600ms. I’ve noticed to return auto-suggest in a perfectly timely manner is around 250ms. Since we create urls from the input chracters e.g. /fieldA=ABC and then push that to Sphinx, maybe we can cache those somewhere.

    If you think it is possible somehow to push each field input to different cores then we will try that.

    Thanks again!

  • Thanks for your response Michael. To answer your comment, unless I’m missing something, I don’t think separating inputs to different groups would do any good. Every new request should end up on a different CPU (and depending on what threading model you are using, you should see either a few threads with ps auxH or few processes with ps aux). If they end up overloading the same CPU thread, something is wrong.

    As for response time, splitting data into several chunks and querying them all in parallel should help you with that, as long as most of the work is done by the data nodes and not by the aggregator (or proxy).

    Caching is a different story, but yes – you can absolutely implement it. It’s not straight forward, you would most likely want to implement something like memcached cache in front, but it is definitely doable and should give you great results especially if people tend to search same stuff.

Leave a Reply

Your email address will not be published. Required fields are marked *