Sphinx search performance optimization: attribute-based filters

One of the most common causes of a poor Sphinx search performance I find our customers face is misuse of search filters. In this article I will cover how Sphinx attributes (which are normally used for filtering) work, when they are a good idea to use and what to do when they are not, but you still want to take advantage of otherwise superb Sphinx performance.

The Problem

While Sphinx is great for full text search, you can certainly go beyond full text search, but before you go there, it is a good idea to make sure you’re doing it the right way.

In Sphinx, columns are basically one of two kinds:

a) full text
b) attributes

Speaking in MySQL terms, Full text columns are always indexed and using the very powerful extended query syntax you can do search against certain columns only, or against all of them – the fewer results your full text query matches, the faster the query will be. That’s self-evident, I guess. Every time a keyword matches a document, Sphinx needs to resolve an appropriate document and evaluate the result. If your keywords match all of your 100M records, it is going to be a lot of work to do this. However with just a few hundred thousand records it is going to be much much faster.

Attributes on the other hand are sort of like unindexed MySQL columns. They are the extra details you may want to filter by, which is usually things like gender, status, age, group_id etc. The effect of them being unindexed is that whenever you are using attributes – it is a full scan of this attribute for all the records that matched the full text search query. For few hundred thousand of matched records, checking attributes is not going to slow down performance of queries significantly. But – and this is the misuse that I see a lot – when you are NOT doing full text search, only using attribute-based filters means a full scan of all records for that attribute.

Because attributes are not B-tree structured (and therefore are slow to work with), by default Sphinx actually stores them in memory. In fact, it requires that all attributes fit in memory or the server performance will be simply unbearable. However, that still does not mean that you can use attributes to find all the records that match group_id=10 – that query will have to check all 100M of records.

BTW internally there’s some differences between numeric attributes and character based as well as multi-value attributes (MVAs), but for the purpose of our discussion it’s enough to know that attributes are not indexed.

For example..

Now let me give you few examples so it’s not just an empty talk. For the examples below I will be using SphinxQL protocol which looks like talking to MySQL server, but it’s not. It is me talking to Sphinx server.

First of all, let us see how many records we have in this index and how long does it take to do a full scan:

Note this is a real index used in production – a catalog of books, so if same query happens to give slightly different results it could be because the indexing occurred between different iterations.

If you are seeing this SphinxQL output first time, it maybe a little confusing, but let me explain. Query returned 20 rows because unless you specify an explicit LIMIT, it defaults to 20. Total says 1000 because by default query is limited to 1000 best results to process (it still searches the entire index though).

Otherwise, takeaway is that this index has 10M records and it takes 700ms to do a full scan.

Now, let us find all records that match user_id = 50:

Pretty bad, isn’t it? 287 records returned in 155ms. Doing the same thing in MySQL, assuming user_id is indexed and in cache, would take less than a millisecond, so it is definitely not the best use case for Sphinx.

When you have full text search keywords that match many documents (and therefore are slow), using attributes may reduce the number of results matched significantly. But not the amount of time it takes to do that:

Solution

Solution may not be obvious first, but you will see that it makes sense. So, the strength of Sphinx is full text search. I suggest we exploit that to get good performance on attributes that are highly selective, such as the user_id example above. In fact, I’ve been using this technique with great success for many years now.

First, I would add the following extra item to fetch when indexing the catalog:

CONCAT(‘userkey_’, user_id) userkey

And now I have an extra column in a full text index that I can use for filtering:

That looks much better and I can mix it with other search keywords:

Highly selective columns only

I thought I would emphasize – while it is a neat performance optimization for highly selective attributes, this is certainly not something you would want to use for every attribute. There’s few reasons for that:

  • it does use more disk space for the index (although it’s not as bad as you might think)
  • attributes are still a good way to filter out data when your search queries don’t match many records
  • in fact, it could reduce performance of queries that otherwise match few records

Let me illustrate that. I have created another full text indexed column for a skewed boolean attribute “ancient” which identifies books published before year 1500 and after, and now I will run some different queries against the two: