October 20, 2014

Sphinx: Going Beyond Full Text Search

I’ve already written a few times about various projects using Sphinx with MySQL for scalable full text search applications. For example, on BoardReader we’re using this combination to search over 1 billion forum posts totaling over 1.5TB of data, handling hundreds of thousands of search queries per day.

The number of forum posts, while large, is not the largest figure we deal with in this project – the number of links originating from forum posts is larger still.

The task we had for links was to support searching for links pointing to a given web site. The challenge is that we do not only want to match links to “mysql.com” but also links to “www.mysql.com” or “dev.mysql.com/download/”, as they are all considered to belong to the mysql.com domain, while a search for “dev.mysql.com/download/” should only match the dev.mysql.com domain and files within the /download/ directory.
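To make these matching rules concrete, here is a minimal sketch in Python (not our production code, which lives in MySQL and Sphinx): a stored link matches a search pattern when the pattern’s host equals, or is a parent domain of, the link’s host, and the pattern’s path, if present, is a directory prefix of the link’s path.

```python
def link_matches(link, pattern):
    # split "dev.mysql.com/download/x.html" into host and path
    link_host, _, link_path = link.partition("/")
    pat_host, _, pat_path = pattern.partition("/")
    # "mysql.com" should match "www.mysql.com" and "dev.mysql.com" too
    if link_host != pat_host and not link_host.endswith("." + pat_host):
        return False
    # a path in the pattern restricts matches to that directory subtree
    pat_path = pat_path.rstrip("/")
    return (not pat_path
            or link_path == pat_path
            or link_path.startswith(pat_path + "/"))
```

So a link to dev.mysql.com/download/mysql-5.1.html matches searches for “mysql.com”, “dev.mysql.com” and “dev.mysql.com/download/”, but a link on www.mysql.com does not match “dev.mysql.com/download/”.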

Initially we implemented this in MySQL, partitioning by the domain the link was pointing to, so “mysql.com” links were stored in one table group and “google.co.uk” links in another. We still had serious challenges, however. Because each link matches many search URLs – “dev.mysql.com/download/mysql-5.1.html” matches “mysql.com”, “dev.mysql.com”, “dev.mysql.com/download/” and “dev.mysql.com/download/mysql-5.1.html” – we could not use a link=const WHERE clause but had to use link LIKE “prefix%”, which means the index could not be used to fetch the 20 most recent links, and a filesort over the millions of links we had to youtube.com, wikipedia.org and other top domains was extremely slow. The same goes for counting the number of links (and the number of distinct forum sites) pointing to a given URL, or building graphs of the number of links per day. To fight this problem we had to restrict the number of days we cover based on the number of links to the domain… but for some top domains it was slow even with just 3 days’ worth of data.

You might point out that with a WHERE clause of the form link_date BETWEEN X AND Y AND link LIKE “prefix%” the index could not be used past the link_date keypart. That is true, so we had to use link_date IN (…) AND link LIKE “prefix%” instead, which allows both keyparts to be used – much better, but still not good enough.
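As an illustration (a sketch, not our actual code), the rewrite amounts to expanding the date range into an explicit list, so that each date becomes an equality lookup and the LIKE prefix can use the second keypart of a (link_date, link) index:

```python
import datetime

def date_range_to_in_list(start, end):
    # expand an inclusive date range into the explicit list of dates
    days = (end - start).days
    return [start + datetime.timedelta(days=i) for i in range(days + 1)]

def build_query(start, end, prefix):
    # prefix is assumed to be already escaped for use in a LIKE pattern
    dates = ", ".join("'%s'" % d.isoformat()
                      for d in date_range_to_in_list(start, end))
    return ("SELECT * FROM links WHERE link_date IN (%s) "
            "AND link LIKE '%s%%' ORDER BY link_date DESC LIMIT 20"
            % (dates, prefix))
```

Table and column names here are illustrative; the point is only the BETWEEN-to-IN transformation.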

Caching is not good enough in this case, as we do not want even a single user to wait for minutes, and the large variety of problematic search URLs rules out pre-caching – not to mention the general load such batch processing would put on the server.

The first alternative to this approach was to store duplicate data, recording a link to “dev.mysql.com/download/mysql-5.1.html” as links to the 4 URL prefixes mentioned above. Unfortunately this would blow up the stored data quite significantly, requiring an average of 6 rows per link, and it does not solve all the problems – result counting and counting distinct sites were still pretty slow, and we did not want to go down the road of building all this data as summary tables.

Instead we decided to use Sphinx for this task, which proved to be an extremely good idea. We convert each URL to search keywords, so those 6 rows become a single row in the Sphinx index with 6 “keywords” – specially crafted strings corresponding to the URL prefixes. Of course we do not store these in the table, but instead use a UDF to convert each URL to its list of “keywords” on the fly.
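A sketch of that conversion (the real one is a UDF, and the exact keyword format is an internal choice; the __URL__ marker below is just an illustrative way to keep these tokens from colliding with ordinary indexed words):

```python
def url_to_keywords(url):
    host, _, path = url.partition("/")
    labels = host.split(".")
    keywords = []
    # domain prefixes: dev.mysql.com, mysql.com, ...
    for i in range(len(labels) - 1):
        keywords.append(".".join(labels[i:]))
    # path prefixes: host/download, host/download/mysql-5.1.html, ...
    if path:
        parts = path.split("/")
        for i in range(1, len(parts) + 1):
            keywords.append(host + "/" + "/".join(parts[:i]))
    return ["__URL__" + k for k in keywords]
```

For “dev.mysql.com/download/mysql-5.1.html” this yields one keyword per prefix mentioned above, and searching for any of them is a plain keyword match in Sphinx.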

As a result, we can now pull up results even for youtube.com in fractions of a second, and we can show 3 months’ worth of data for any URL. (We could use a longer time span, but we did not have enough memory for Sphinx attribute storage.) There is still room for optimization: Sphinx stores word positions in the index, which we do not need in this case, as we’re doing a kind of “boolean full text search”. We could also build the index sorted by timestamp, which would save on the sorting that currently still happens.

Using Sphinx in such a non-traditional way required implementing some features more typical of SQL databases than of full text search applications. Group By was added to Sphinx so we could count the number of matches per day, or the number of matches per language.
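In Python terms this is just a group-by over match attributes; a hedged sketch of what Sphinx’s Group By computes for us (attribute names are illustrative):

```python
from collections import Counter

def matches_per_attribute(matches, attr):
    # matches: attribute values for each matching document,
    # e.g. {"day": "2007-07-23", "lang": "en"}
    return Counter(m[attr] for m in matches)
```

The same grouping over a “day” attribute produces the per-day link graphs, and over a “lang” attribute the per-language counts.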

For the Domain Profile we use even more of those features, such as counting the number of distinct sites pointing to a given URL or domain. Honestly, this is where we cheated a bit – the distinct count is slightly approximate for large numbers – but it still works really well for our needs.
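The article does not say how the approximate distinct count is implemented, so here is one standard technique with the behavior described – exact for small counts, slightly approximate for large ones – known as linear counting: hash each site into a fixed-size bitmap and estimate the distinct count from the fraction of zero bits.

```python
import math
import zlib

def approx_distinct(items, m=4096):
    # linear counting: set one bit per hashed item, then estimate
    # n ~= m * ln(m / zero_bits)
    bits = bytearray(m)
    for item in items:
        bits[zlib.crc32(item.encode()) % m] = 1
    zeros = m - sum(bits)
    if zeros == 0:
        return m  # bitmap saturated; a larger m would be needed
    return int(round(m * math.log(m / zeros)))
```

This is only a sketch of the general idea, not necessarily what BoardReader used.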

Sure, we could use summary tables for domains, but that would be a lot of work, would be rather inflexible if we wanted to add more features, and would take a lot of resources to update or periodically rebuild for millions of domains.

As this part worked pretty well, we also decided to use Sphinx for another part of the web site – the Forum Site Profile. It uses some pre-generated data, such as the total number of posts per forum or per thread, but most everything else is built with Sphinx. It also relies on a fair number of tricks, using fake full text searches to retrieve all posts from a given web site or forum out of the global Sphinx index.

So in general we find parallel processing with Sphinx a pretty good solution for many data crunching needs, especially when the lack of parallel query execution in MySQL makes it inconvenient, and pre-generating the data is impractical or impossible.

If you’re attending OSCON 2007 and would like to learn more about Sphinx, we have a BOF on Thursday to discuss the topic.

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. Parvesh says:

    Peter,

    Out of curiosity, how do the search results of Sphinx compare with the likes of, say, Lucene? We must agree that there are pros and cons to using an embedded search engine in MySQL. What’s your take on this?

  2. Pedro Melo says:

    Peter,

    did you test storing the URLs in reversed domain + path style and using LIKE?

    Just curious.

    Best regards,

  3. peter says:

    Parvesh,

    What do you mean by “how results compare” – search relevance, performance, or something else?
    Note that Sphinx is really an external full text search engine; there is simply a client that allows you to use it as a MySQL storage engine.

    We believe Sphinx is faster, especially if you’re working with very large data sets and use special features like post-filtering by attributes, grouping, etc. Lucene has more features, and relevance is always an arguable thing.

  4. peter says:

    Pedro Melo,

    Sure. I even mentioned it in the article. It does not help, however, because with something like
    path LIKE “download/%” ORDER BY published DESC you will get a filesort, which kills performance.

  5. Pedro Melo says:

    Ahhs… I misread the “prefix%” part… Shouldn’t MySQL be able to use an index in this case?

    If I have my links stored as “moc.lqsym.ved/download/mysql-5.1.html” we could match all of those queries:

    * moc.lqsym%
    * moc.lqsym.ved%
    * moc.lqsym.ved/download%
    * moc.lqsym.ved/download/mysql-5.1.html%

    Prefix match should be usable…

    Thanks in advance,

  6. Michael says:

    Peter,

    What do you think about using this?

    SphinxSE as a pluggable storage engine for mysql?

    Currently we have a forum with about 17 million posts, and a dedicated search server (a supercomputer, SGI Altix). It’s beginning to choke on that amount of data. I’m thinking of migrating vBulletin to use Sphinx for searching.

    http://sphinxsearch.com/doc.html#sphinxse

    Michael

  7. peter says:

    Pedro Melo,

    We used to use a similar approach. You’re right that in this case MySQL uses the index to retrieve the data; the problem, however, is that for some URLs there are millions of links which then have to be sorted with a “filesort”, which is too slow.

    If the goal were just to show the top 10 links, this approach would work just fine :)

  8. peter says:

    Michael,

    I think SphinxSE is a good way for MySQL users to start using Sphinx without leaving the MySQL environment.

    In our projects we typically use the native Sphinx API instead, which lets us keep a stock MySQL version and is more transparent in terms of understanding performance characteristics.

    It is actually quite easy – get the list of row IDs from Sphinx, retrieve the data from MySQL, and use Sphinx to do the filtering, ordering and grouping.
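    Sketched in Python, with sqlite3 standing in for MySQL and a stubbed search call in place of the real Sphinx client (whose exact API depends on your setup):

```python
import sqlite3

def sphinx_search(query, limit=20):
    # stub standing in for the Sphinx API client's query call;
    # the real client returns matching document IDs in rank order
    return [3, 1]

def fetch_posts(conn, ids):
    # second step: fetch the full rows for exactly those IDs
    placeholders = ", ".join("?" * len(ids))
    rows = conn.execute(
        "SELECT id, body FROM posts WHERE id IN (%s)" % placeholders,
        ids).fetchall()
    by_id = {row[0]: row for row in rows}
    return [by_id[i] for i in ids if i in by_id]  # preserve Sphinx's order

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO posts VALUES (?, ?)",
                 [(1, "first post"), (2, "second post"), (3, "third post")])
posts = fetch_posts(conn, sphinx_search("example query"))
```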

  9. maht says:

    You can break the URIs yourself into the components you want to search for.

    Here’s my PostgreSQL table layout, though it looks fairly standard SQL to me.

    CREATE TABLE domains (
    domain VARCHAR NOT NULL PRIMARY KEY
    );
    CREATE TABLE subdomains (
    sub VARCHAR NOT NULL,
    domain VARCHAR NOT NULL REFERENCES domains ON DELETE CASCADE,
    PRIMARY KEY (sub, domain)
    );
    CREATE TABLE paths (
    proto VARCHAR NOT NULL,
    path VARCHAR NOT NULL,
    port INT,
    domain VARCHAR NOT NULL REFERENCES domains ON DELETE CASCADE,
    sub VARCHAR REFERENCES subdomains ON DELETE CASCADE
    );
    CREATE INDEX paths_idx ON paths (sub, domain, port, path);

    SELECT proto || “://” ||
    coalesce(sub || “.”, “”) || domain ||
    coalesce(“:” || port, “”)|| path
    FROM paths
    WHERE domain=”mysql.com”
    AND path LIKE “/downloads/%”;

    fun with I/LIKE – some web servers are case insensitive!

  10. maht says:

    stupid blog software, these ‘double’ quotes are supposed to be single.

    What a lame way to sql escape.

  11. maht says:

    lol, they came from the pastebin I used first, not my day :)

  12. peter says:

    Maht,

    Your example does not solve the main problem I’m describing – sorting together with a prefix/range index.

    Regarding LIKE – you can have it case sensitive or case insensitive depending on the collation you’re using.

  13. maht says:

    peter: I’m sorry, I still don’t understand the problem in that case, even after re-reading.

    Is it just a matter of adding some extra columns to paths? Presumably not :)

    Full text search breaks words down into lexemes (as it does for search queries) so that animation, animator and animated are all returned for a search for animate.

    I’m still under the impression that this is not the behaviour you are exploiting, ergo there must be a better way of doing it.

  14. peter says:

    SELECT * FROM TBL WHERE KEYPART1=CONST and KEYPART2 LIKE “PREFIX%” ORDER BY KEYPART3 LIMIT 10;

    This uses filesort (an external sort), which is fatally slow for the millions of items we’re looking at.

    Regarding your comment about full text search, that is an oversimplified picture :) The process you’re describing is stemming, which can be configured; in this case we do not use stemming at all.

  15. Michael says:

    Peter, thanks for the answer on SphinxSE. We’ll probably implement native Sphinx. My new question: do you know if Sphinx allows the index to be read in reverse?

    Michael

  16. Hi Peter,

    I am facing an issue in Sphinx: I need all the columns of the table in the result set. I tried various methods and searched around, and some articles say it is not possible to fetch all the columns because Sphinx only returns a few fields.

    The second problem: I want to set up Sphinx on one server and use it from three other servers.

    Please help me out.

    Thanks in advance

  17. Siddharth says:

    Server: 4-core CPU with 8 GB RAM, running Windows Server 2008

    Config:
    dist_threads = 4
    max_children = 1000

    I have a multi-threaded client making simultaneous connections via SQL. Each query is a random string of 10 characters. Performance degrades sharply as the number of concurrent queries increases.

    Queries   Elapsed seconds
    1-3       0
    4-10      5
    11-20     15
    20+       over 40

    CPU time is close to zero and nearly independent of the number of queries. There is 2 GB of available physical RAM at all times (no swapping).

    With more than 30 threads, the connections time out after 10 seconds.

    Is this expected behavior? Is there a remedy?
