One of the most common causes of a poor Sphinx search performance I find our customers face is misuse of search filters. In this article I will cover how Sphinx attributes (which are normally used for filtering) work, when they are a good idea to use and what to do when they are not, but you still want to take advantage of otherwise superb Sphinx performance.
The Problem
While Sphinx is great for full text search, you can certainly go beyond full text search, but before you go there, it is a good idea to make sure you’re doing it the right way.
In Sphinx, columns are basically one of two kinds:
a) full text
b) attributes
Speaking in MySQL terms, Full text columns are always indexed and using the very powerful extended query syntax you can do search against certain columns only, or against all of them – the fewer results your full text query matches, the faster the query will be. That’s self-evident, I guess. Every time a keyword matches a document, Sphinx needs to resolve an appropriate document and evaluate the result. If your keywords match all of your 100M records, it is going to be a lot of work to do this. However with just a few hundred thousand records it is going to be much much faster.
Attributes on the other hand are sort of like unindexed MySQL columns. They are the extra details you may want to filter by, which is usually things like gender, status, age, group_id etc. The effect of them being unindexed is that whenever you are using attributes – it is a full scan of this attribute for all the records that matched the full text search query. For few hundred thousand of matched records, checking attributes is not going to slow down performance of queries significantly. But – and this is the misuse that I see a lot – when you are NOT doing full text search, only using attribute-based filters means a full scan of all records for that attribute.
Because attributes are not B-tree structured (and therefore are slow to work with), by default Sphinx actually stores them in memory. In fact, it requires that all attributes fit in memory or the server performance will be simply unbearable. However, that still does not mean that you can use attributes to find all the records that match group_id=10 – that query will have to check all 100M of records.
BTW internally there’s some differences between numeric attributes and character based as well as multi-value attributes (MVAs), but for the purpose of our discussion it’s enough to know that attributes are not indexed.
For example..
Now let me give you few examples so it’s not just an empty talk. For the examples below I will be using SphinxQL protocol which looks like talking to MySQL server, but it’s not. It is me talking to Sphinx server.
First of all, let us see how many records we have in this index and how long does it take to do a full scan:
| 1 2 3 4 5 6 7 8 9 10 11 12 13 | sphinx> select * from catalog; ... 20 rows in set (0.70 sec) sphinx> show meta; +---------------+----------+ | Variable_name | Value    | +---------------+----------+ | total         | 1000     | | total_found   | 10309972 | | time          | 0.700    | +---------------+----------+ 3 rows in set (0.00 sec) | 
Note this is a real index used in production – a catalog of books, so if same query happens to give slightly different results it could be because the indexing occurred between different iterations.
If you are seeing this SphinxQL output first time, it maybe a little confusing, but let me explain. Query returned 20 rows because unless you specify an explicit LIMIT, it defaults to 20. Total says 1000 because by default query is limited to 1000 best results to process (it still searches the entire index though).
Otherwise, takeaway is that this index has 10M records and it takes 700ms to do a full scan.
Now, let us find all records that match user_id = 50:
| 1 2 3 4 5 6 7 8 9 10 11 12 13 | sphinx> select * from catalog where user_id = 50; ... 20 rows in set (0.16 sec) sphinx> show meta; +---------------+-------+ | Variable_name | Value | +---------------+-------+ | total         | 287   | | total_found   | 287   | | time          | 0.155 | +---------------+-------+ 3 rows in set (0.00 sec) | 
Pretty bad, isn’t it? 287 records returned in 155ms. Doing the same thing in MySQL, assuming user_id is indexed and in cache, would take less than a millisecond, so it is definitely not the best use case for Sphinx.
When you have full text search keywords that match many documents (and therefore are slow), using attributes may reduce the number of results matched significantly. But not the amount of time it takes to do that:
| 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | sphinx> select * from catalog where match('Great') and user_id = 50; ... 5 rows in set (0.10 sec) sphinx> show meta; +---------------+--------+ | Variable_name | Value  | +---------------+--------+ | total         | 5      | | total_found   | 5      | | time          | 0.096  | | keyword[0]    | great  | | docs[0]       | 200084 | | hits[0]       | 216948 | +---------------+--------+ 6 rows in set (0.00 sec) | 
Solution
Solution may not be obvious first, but you will see that it makes sense. So, the strength of Sphinx is full text search. I suggest we exploit that to get good performance on attributes that are highly selective, such as the user_id example above. In fact, I’ve been using this technique with great success for many years now.
First, I would add the following extra item to fetch when indexing the catalog:
CONCAT(‘userkey_’, user_id) userkey
And now I have an extra column in a full text index that I can use for filtering:
| 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | sphinx> select * from catalog where match('@userkey userkey_50'); ... 20 rows in set (0.00 sec) sphinx> show meta; +---------------+------------+ | Variable_name | Value      | +---------------+------------+ | total         | 287        | | total_found   | 287        | | time          | 0.000      | | keyword[0]    | userkey_50 | | docs[0]       | 287        | | hits[0]       | 287        | +---------------+------------+ 6 rows in set (0.00 sec) | 
That looks much better and I can mix it with other search keywords:
| 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 | sphinx> select * from catalog where match('Great @userkey userkey_50'); ... 5 rows in set (0.01 sec) sphinx> show meta; +---------------+------------+ | Variable_name | Value      | +---------------+------------+ | total         | 5          | | total_found   | 5          | | time          | 0.013      | | keyword[0]    | great      | | docs[0]       | 200088     | | hits[0]       | 216952     | | keyword[1]    | userkey_50 | | docs[1]       | 287        | | hits[1]       | 287        | +---------------+------------+ 9 rows in set (0.00 sec) | 
Highly selective columns only
I thought I would emphasize – while it is a neat performance optimization for highly selective attributes, this is certainly not something you would want to use for every attribute. There’s few reasons for that:
- it does use more disk space for the index (although it’s not as bad as you might think)
- attributes are still a good way to filter out data when your search queries don’t match many records
- in fact, it could reduce performance of queries that otherwise match few records
Let me illustrate that. I have created another full text indexed column for a skewed boolean attribute “ancient” which identifies books published before year 1500 and after, and now I will run some different queries against the two:
| 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 | sphinx> select * from catalog where ancient = 1; | total_found   | 1499266 | 14% | time          | 0.552   | sphinx> select * from catalog where ancient = 0; | total_found   | 8852086 | 86% | time          | 0.662   | sphinx> select * from catalog where match('ancientkey_1'); | total_found   | 1499266 | | time          | 0.227   | sphinx> select * from catalog where match('ancientkey_0'); | total_found   | 8852086 | | time          | 1.309   | sphinx> select * from catalog where match('solar'); | total_found   | 2510  | | time          | 0.001 | sphinx> select * from catalog where match('solar @ancientkey ancientkey_0'); | total_found   | 2176  | | time          | 0.434 | sphinx> select * from catalog where match('solar @ancientkey ancientkey_1'); | total_found   | 334   | | time          | 0.077 | sphinx> select * from catalog where match('solar') and ancient = 1; | total_found   | 334   | | time          | 0.002 | sphinx> select * from catalog where match('solar') and ancient = 0; | total_found   | 2176  | | time          | 0.003 | | 
What you can see here is that while there’s very little difference when using only the check against “ancient” – it is very slow in both cases – when doing a selective search and then filtering, using attribute is orders of magnitude better than using a fulltext key.
That being said, for highly selective columns, even more so if they will be used in queries without full text search keywords, having them full text indexed is a very good way to improve such Sphinx search query performance.
 
 
 
 
						 
						 
						 
						 
						 
						
Thanks for the explanations. This didn’t seem obvious from the Sphinx docs. It seems this is a side-effect that demands careful consideration of stop words (something I was hoping to be able to simply disable).
Too bad the query match(‘solar @ancientkey ancientkey_0’) can’t optimize on the high selectivity of the ‘solar’ keyword (checking for ‘@ancientkey ancientkey_0’ only for documents that already matched ‘solar’).
By default, match(‘solar @ancientkey ancientkey_0’) uses boolean AND right? That’s why thought the optmization would be possible.
A full text index maps lexical tokens to documents. So when you do a boolean AND against a FTS index of just about any type the FTS much intersect the document ids which match each term to return the correct results. Efficient intersection would probably be O(a + b + … + z) time complexity where each operand is the number of results for a search term. A search over two terms should be O( a + b) complexity, where a is the number of results that match the first term and b the number of results that match the second.
Edmar,
Consider looking at fastbit if you have low cardinality columns that you need to intersect with high ones. Fastbit uses WAH compressed bitmaps which can be compared without decompressing the bitmaps. This makes multidimensional lookups very efficient.
https://sdm.lbl.gov/fastbit/
Thanks Justin. Fastbit does seem cool,
http://crd-legacy.lbl.gov/~kewu/ps/LBNL-61768.pdf . I wonder if something like Sphinx could be made to automagically use Fastbit to index low-selectivity words…
For my relatively-low-volume insert-only dataset (private trouble ticket messages), I’ll probably go for regular full-text indexing (probably Sphinx or MySQL), sharded by year. And let users know about low-selectivity/”common” search terms significantly increasing query times, regardless of the presence of high-selectivity terms (argh).
Aurimas,
You mention attributes in sphinx are like unindexed columns (hence full table scan) yet I remember sphinx did have some optimization like storing attributes in blocks and having ability to skip the whole block if there are for sure no matching rows in it ?
The details of this implementation do not seem to be available in the docs though
http://sphinxsearch.com/docs/2.0.6/attributes.html
Aurimas,
Thx for this article. We have been facing the issue of attribute-based filtering and the full scan it generates. You speak of misuse for many of your clients but, in our case, it’s something we cannot avoid for full functionality. We have been using the technique of fake keyword as you describe.
It works but does not fulfill the case where the value(s) of the attribute changes over time. Unless you re-index your documents accordingly which is costly and somehow inefficient.
It would be great that Sphinx tackles this general issue by proposing to index attributes on a case-by-case basis with a B-tree structure. We’re even discussing this evolution with the Sphinx dev team and, if you know some companies interested, it would be great if they could join the discussion.
Peter,
you remember right 😉 we do indeed have a small block index. We save min/max column values for every 128 rows. When doing a full scan, we scan blocks first, then maybe scan the rows within the block. That isn’t a generic solution but lets us speed up particular scans where the target column is more or less correlated with the ID.
Peter, Andrew – thanks, that’s good to know. I’m sure in some cases this can be a big help.
Laurent – I’m not sure I understand why it will not work properly with delta index? I assume you would reindex all of the documents that have changed since last full indexing (even if it was only attributes data that changed) in which case any such changes should be available after delta reindexing? I agree B-tree on attributes would be great, but I’m not sure it’s not too much to ask from a full text search engine.
Anyway, if I get to work with any customers who would benefit from that, I’ll definitely mention it.
Aurimas,
It could properly work with delta index but only to a certain extent. Said differently, it does not scale up if the value(s) of the (multi-value) attribute often change (like potentially several times a day) and/or the volume of documents impacted is high (x00 thousands docs impacted over a corpus of x00 millions docs for every change). In such a situation, the time and resources taken by producing (and eventually merging) the delta index is a show-stopper.
In our case, we provide a kind of ‘dynamic grouping’. Let’s say each document has an author. We’ve got 1 (big) index for all the docs. But our users are free to group authors arbitrarily and create new groups on demand, when they want. They want to search inside each group indenpently, immediately after they created it. While we’re clearly very satisfied with Sphinx functionality and performance, we have not been able to fulfill this requirement based on the current feature set. We have thought of B-tree indexed attribute as the solution to this need.
we use sphinx to our website where we have about 1200 tables each with hundreds of thousands of records, for we use sphinx based on attributes, but memory consumption is exaggerated, so we applied the technique described here, but the search results have gone from milliseconds to more than 43 seconds
Fabian, –
please note that, as stated in the article, this technique only works well for highly selective attributes. If attributes only have few distinct values (e.g status, gender, group_id with only 10 groups per 10M records), then you don’t want to use this. Please see “Highly selective columns only” section for examples where it doesn’t work well for us too.
We can use fastbit bitmap index, from mysql.
Check this https://github.com/greenlion/FastBit_UDF
Though peripherally related I thought I’d add this note since I recently ran into a related issue. I was pushing a fairly large # of ids in by attribute but the issue as they needed to be part of an “OR” query which, unfortunately, can’d be done with attributes (yet). In the index therefor I did an extra select field_id as field_id_text which allowed me to then push the same block into the full-text query (i.e. inside MATCH(”) vs after it). This helped solve the limitation but it is interesting to note in this article that I’m adding a fair amount of overhead I don’t need into the full_text to find the field_ix_text when that part of the query is reducing my results to 100s vs 100s of 1000s of records. I wonder how long then it will be before Sphinx allows for MATCH(”) and attri in (1,2) OR attr2 in (2,3) since clearly that would speed up the query I was forced to put inside the full text
Just wanted to share my experience, when you are using integer MVA or integer attributes, there are no performance increase if you put them into “fields” (indexed). Maybe “full text indexing” just doesnt work for integer attributes and only for strings – so if you are using Sphinx/Manticoresearch for filtering just by attributes, there are no performance gains.