I ran into the following thread a couple of days ago:

Basically, someone is using Sphinx to perform searches simply on attributes (date, group, etc.), get a sorted result set, and claims it is much faster than getting it from MySQL. Honestly, I can well believe it for cases when you want to know the number of matching rows, as well as when you can’t build efficient indexes so that selectivity is handled by an index and an index is used to resolve ORDER BY.

Funnily enough, Sphinx does not use indexes to filter by attributes or to sort – indexes are only used for full text search matching – but it is still extremely fast at crunching data.

I just tested this: a search for “the”, which matched 100,000,000 documents out of a 200,000,000-document collection (200GB), completed in 0.7 seconds. This is a system we’re building for one of our clients, which uses a cluster of 3 nodes to handle search. In this case no shortcuts are taken – all 100,000,000 matching documents are traversed, and priority-queue sorting is performed to generate the 1,000 best-matching results. Quite impressive.
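The priority-queue trick mentioned above is what makes top-1,000-of-100-million tractable: you never sort the full match list, you only maintain a bounded heap of the current best N. A minimal sketch in Python (the scored matches here are made up for illustration):

```python
import heapq
import random

def top_n(matches, n=1000):
    """Keep the n best (doc_id, weight) pairs using a bounded min-heap.

    Memory stays O(n) no matter how many documents match, and each
    match costs at most O(log n) – far cheaper than a full sort.
    """
    heap = []  # min-heap: the weakest of the current top-n sits at heap[0]
    for doc_id, weight in matches:
        if len(heap) < n:
            heapq.heappush(heap, (weight, doc_id))
        elif weight > heap[0][0]:
            # Better than the weakest retained match: swap it in.
            heapq.heapreplace(heap, (weight, doc_id))
    # Return best-first as (weight, doc_id) pairs.
    return sorted(heap, reverse=True)

# Illustrative stream of a million scored matches.
matches = ((i, random.random()) for i in range(1_000_000))
best = top_n(matches, n=1000)
```

The key point is that the 100M matching documents are streamed through once; only 1,000 of them are ever held for sorting.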

Yeah, I know it should be a stop word, but I currently have an index without stop words for testing purposes.

Now, here is what I’m hoping for in terms of future developments:

Andrew continuing to improve Sphinx so it has more advanced filtering clauses and more attribute types, plus an option to retrieve by filters only, with no full text query. Sphinx should not be a replacement for a database server, but for many data retrieval needs it will work great – especially as it can be used with other databases, which may be slower than MySQL.

MySQL making it so one would not need to use Sphinx to get great performance for these kinds of queries. This includes parallel processing, fast COUNT(*), and bitmap indexes to help with non-selective clauses. Also, some form of fast sort, such as a priority queue, could be used when only the first few elements are needed.

24 Comments
Dathan Pattishall

Nice! Sphinx sounds like a great storage engine! I’ll give it a try.

Apachez

However, 0.7s is kinda slow for just performing a count lookup for that specific word 😛

Apachez

Actually, sphinx is not sorting the results at all, and that can be proven by just changing the source code to actually select 100,000,000 “rows” and not just the 20 which are statically compiled. You will then see a major increase in processing time…

Searching on just “the” (or any other indexed word) takes far less than 0.7 seconds with TBGsearch or any search engine which uses mysql as a backend and, like sphinx, stores the frequency for each word. Actually, the number of “matched” documents in this case doesn’t matter – the time will be the same with 100 million documents as with 1 billion documents, because the lookup will be against the word table and not the hit table itself. The time will be affected more by the number of unique indexed “words” than by how many documents have been indexed.
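The word-table idea described here can be sketched in a few lines: if the indexer records each word’s document frequency at index time, answering “how many documents match?” is a single key lookup, independent of collection size. (The table contents below are made up for illustration.)

```python
# Hypothetical word table built at index time: word -> document frequency.
word_table = {
    "the": 100_000_000,
    "mysql": 1_500_000,
    "sphinx": 40_000,
}

def match_count(word):
    """Constant-time count lookup against the word table.

    No hit-list scan is needed; cost depends only on the number of
    unique words, not on how many documents were indexed.
    """
    return word_table.get(word, 0)
```

This is why a pure count can return instantly even for very frequent words – the per-document hit table is never touched.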

Apachez

Actually, neither can sphinx guarantee any true relevance matching. And since when does sphinx support wildcard searches? 😛

In SQL the query would be more like “word = ‘the'” and not “word LIKE ‘%the%'” (even though TBGsearch, among others, does support wildcards).

In the case of TBGsearch, “relevance is irrelevant”: the “relevance” is based on date, but it could just as easily be based on any number of your own relevance measures (when it comes to how you position one message in front of another, there are many different types of “relevance”…)

Another note: IF sphinx actually performs a sort each time, and does it on 100 million rows in just 0.7 seconds, that would be way faster than mysql itself performs sorting – which should give you a clue that what you said the first time is perhaps not the full story…

Apachez

Well, to my knowledge “ORDER BY” is pretty common when using an SQL server such as mysql, so this would perhaps be a hint to the mysql developers to adopt new sorting algorithms.

Regarding relevance: in my case I use date as the major relevance component, which is not “just returning random matching values” as you call it.

The problem with “relevance” is that there is no way to check whether you really get the hits with the highest relevance in return. Take Google as an example: they claim to use PageRank as relevance ranking, among other parameters, which means there most likely exist more relevant matches than you get with a single search query against Google; however, Google has decided to cheat in such a way that the search result is almost, or at least, good enough on the relevance scale. (Phew, long sentence, but it’s early in the morning – the reader is free to put in the dots on their own 😉

Meanwhile, a search based on a static measurement, such as a date column, makes it a lot harder to cheat with the results (unless you do like Google and hide the actual source).

It’s also interesting that you mentioned 4 nodes (wasn’t it just 3 nodes in the original post? :P) x 2 CPUs, which makes 8 calculation units, which on average means each sorted 100/8 = 12.5 million rows and not 100 million rows as was claimed in the original post. It would also be interesting to see how many rows each processing unit returned to the main server. Was it 1,000 rows out of 12.5 million, and then the main server took 8 * 1,000 rows and performed an additional sort to return just 1,000 rows to your client?
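The two-level scheme the comment above speculates about is a standard distributed top-N pattern, and it is easy to sketch: each processing unit returns only its local best N, and the coordinator merges those small candidate lists. (Node counts and result sizes here are illustrative, not a description of how Sphinx actually does it.)

```python
import heapq

def merge_top_n(per_node_results, n=1000):
    """Merge each node's local top-n lists into a global top-n.

    The global top-n must be a subset of the union of the local
    top-n lists, so the coordinator only ever sorts n * node_count
    candidate rows (e.g. 8 * 1000), not the full match set.
    """
    candidates = [row for node in per_node_results for row in node]
    # Rows are (weight, doc_id) pairs; nlargest returns best-first.
    return heapq.nlargest(n, candidates)
```

So even if each unit scans 12.5 million rows, the cross-node merge step stays tiny.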

Andrew Aksyonoff

Hi guys (especially Apachez),

So I’m just wondering what the conclusion is – does Sphinx honestly sort those 100M matches or not?

Out of pure curiosity, you know…

🙂

Apachez

Andrew, isn’t it a bad thing if the developer of sphinx doesn’t know how sphinx works? 😛

Andrew Aksyonoff

I’m not (yet) asking how it works – I’m just asking what the conclusions were 🙂

Roy

Hi to the experts here,

I have a simple question that I couldn’t find an answer for until now:
Is it possible to tell MySQL the number of rows I want to get from the query result, so that on each specific event (for example: a user clicking on the “next” button) it will return the next “X” rows?
The thing is that if, for example, the complete result contains 100,000 rows, I need only 20 rows each time and don’t want MySQL to calculate the entire result at one time, in order to save time and space. Instead I want it to calculate only the first 20, then the next 20, and so on…

Thanks in advance,
Roy
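The paging Roy asks about is usually done with LIMIT/OFFSET: each “next” click runs a fresh query that fetches one page of rows, rather than materializing all 100,000 rows at once. A minimal sketch using SQLite for a self-contained demo (the schema and data are made up; the same LIMIT/OFFSET syntax works in MySQL):

```python
import sqlite3

# Toy table standing in for the real result set.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO posts (title) VALUES (?)",
                 [(f"post {i}",) for i in range(100)])

def fetch_page(page, page_size=20):
    """Fetch one page of rows; pages are numbered from 0."""
    # OFFSET skips the rows belonging to earlier pages.
    return conn.execute(
        "SELECT id, title FROM posts ORDER BY id LIMIT ? OFFSET ?",
        (page_size, page * page_size),
    ).fetchall()

second_page = fetch_page(1)  # ids 21..40
```

One caveat worth knowing: the server still has to generate and skip the OFFSET rows internally, so very deep pages get progressively slower – but for typical “next page” browsing this is the standard approach.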

Nicolae Namolovan

I’m very happy with Sphinx (Andrew Aksyonoff, thank you a lot for that!). It doesn’t take too much CPU, whereas MySQL’s full text search on MyISAM takes a lot – maybe there’s some bug in 5.1, I don’t know, but after migrating, MySQL uses half as much CPU…

I really recommend Sphinx to everyone – a very, very fast thing!

flicksty

Apachez, you are so irrelevant and stupid in this thread. Did you have gasoline for breakfast when you wrote your posts?

Ed

Is there something like bitmap indexes in MySQL? And if not, why haven’t they implemented such a great feature?

Haitha

Hi Peter,

How did you do the full scan search? I want to know how I can do a full scan in Sphinx.

Haitham

Can you give me an example? I tried, but it was returning 0 results 🙁

Akhil Bansal

Guys, I have used Sphinx for one of my projects. It is simply great!!!

Akhil Bansal

Is there an efficient way to get the top N results from a large (~10-15 million row) result set?