October 31, 2014

BoardReader – Forum Search Engine

One may have notice we were not blogging too much recently, this is because we were quite busy, mainly building BoardReader.com – Search Engine which indexes tens of thousands of forums from all over the world. This project was built by us as consulting project so too bad we do not own it completely but we’re still quite excited it is live now. We did not work on crawler in this project only on database Backend and full text search engine implementation. In this part it is standard LAMPS application. I guess you know what LAMP is and S Stands for Sphinx – Full Text Search Engine which we love to use where large scale search is needed. At this point we have over 300 millions of posts indexed with only 3 search servers and still counting. I guess we’ll have half a billion of forum posts soon.

To share few more technical details – it is implemented using pretty standard “manual partitioning” scheme with different forum sites mapped to different “table groups” with each server handling bunch of these. This would make it easier to re balance groups if needed as traffic growths as well as makes ALTER TABLE much less painful. The other technique which I covered in some of my presentations is using double data storage with different partitioning. In our case we wanted to track links between sites. It is easy for outgoing part as we already cluster by sites but It is hard for incoming links as they are scattered among many tables and servers. To target this problem we also store inbound links clustered by second level domain which allows to get inbound links pretty efficiently. It turns out however some domains still get way too many links and we’ll likely redesign it in the future to use sphinx instead (it can do extremely fast parallel group-by on many servers, in google style).

Few features which I would like to highlight – first you can use it to Search MySQL Forums Notice simple link structure – you can replace mysql.com in it with any other domain to search forums from that domain. For example you can use this link to search our MySQL Performance Forums

Second – note the graph which shows how many results were found matching this terms right from search results. It can show quite interesting data, for example searching Britney Divorce will show huge spike then news came out and quick calm down in about in week. You can click on the bar in the graph to get search results focused on that period. Can be quite fun.

Another nice feature is domain profile – by using it you can see how actively this domain is getting links, which pages are most frequently linked on domain as well as which pages and domains forum users tend to link to. So far reporting period is restricted by performance reasons – there is too much data to group and quite a hassle to build summary tables as we want to count uniques, but it should be fixed once we rewrite it using sphinx. From that page you can also get to inbound link report which allows you to see what recent links do you have from forums to whole web site or particular url

I also should mention couple of ratings we have implemented. Love for ratings probably comes from my SpyLOG background. At this point we have implemented rating of YouTube videos and rating of Domains In both cases we check how many links each of domains is getting and from how many unique sites. For domains we split domains which are getting normal links as well as domains which have images on them referenced.

There are still a lot things to do and quite probably quite a lot of bugs to kill. We would welcome any feedback such as suggestions or bug reports. Also if you know the forum which is not indexed please free to submit it.

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. Odin says:

    It’d be interesting to know the hardware setup of such a site. The site and graphing appears to be VERY responsive. Bravo!

  2. peter says:

    Thanks,

    This system uses bunch of Dell Poweredge 2900 for Database and Search nodes and Poweredge 1900 for Web. I think PowerEdge 2900 with couple of dual core CPUs 6 hard drives and 16GB of RAM is very good combination at this point. Going higher than that makes it much more expensive and MySQL may have problems scaling past 4 cores anyway.

    Generally for search nodes I would use something less expensive but in this case we wanted to make sure it is simple to maintain. If you’re using Google approach with huge amount of crapy servers you need to make sure you have staff and infrastructure to fix and recycle them.

  3. ryan says:

    hmm, does boardreader work with custom “home-brew” forum software?

  4. peter says:

    Ryan,

    Custom forum software may require some extra work with parsing to be added.
    But please feel free to submit it depending on the size of site crawl guys can get to it sooner or later.

  5. yokr says:

    You are now about 1,2 Billion of indexed forum posts ! I know that with Sphinx you can do distributed searching along the nodes!
    My questions are :

    – what is the size of sphinx index you use and what is their number per machine?
    – Does one index is fully allocated to one CPU? or a branch of them ?
    – What is the criteria to subdivide data into different machines and different index ?

    I appreciate your board.

    Thanks & BRAVO

  6. peter says:

    York,

    It is even more now :) We use 4 indexes per box each having dedicated hard drive. We do not use CPU affinity for the search nodes though we could try and see. There is probably 200Gb forth of index per node so some 50GB per CPU. The criteria is random distribution. Though indexes themselves are also date partitioned so when you search for last 3 months only you do not have to go over all data.

  7. York says:

    Peter,

    I know SPHINX very well and I want you explain me how do you deal with the problem of high availability !

    i.e : If 4 nodes are down, how you can imagine the situation to give to the last user the same results as if they are not down ?

    I tried on my own to imagine that there is a bench of old machines contains same indexes just for backup.
    And in sphinx configuration with the SetIndexWeights() function , give those a low weight and those on normal ones a high weight. And then Query all the nodes, backup machines included… what do you think ?

    I have an unrelated question :

    do you think does an Quad Xeon processor is better then a 2 * Dual core Xeon ? to implement sphinx with distributed searching !

  8. peter says:

    York,

    If you always need to have HA keep 2 boxes having same indexes. You can load balance between them and take one to take over full load if other fails.

    We use Dual QuadCore CPUs now.

  9. Peter,

    In High Performance MySQL 2nd Edition (congratulations for all the work done), you (I don’t forget the other contributors) say that there are 6 servers with 4 searchd instances per server, and that one of the four searchd aggregates the result from the other three. In that situation, what happens if this one fail ? Cause, in this case, if I always go to this searchd instance I’ll lose 1/6 of indexes and not just 1/24.

  10. peter says:

    The aggregator node is data less and just forwards data so we never seen it failing without all box going down. This also makes it easy to run multiple copies of it if you need it to. But technically yes if this one fails you lose 1/6 of the data.

  11. james says:

    Peter, would you please share with us which mysql storage engine is being used to store that much amount of data?

    thanks!

  12. Vadim says:

    James,

    It’s mainly InnoDB.
    MyISAM is used to store performance log statistics.

  13. Peter,
    I use InnoDB for my Mysql Engine. And I know it is safe for deals of bank trades. Does it get something relative with the cpu? Will it take much cpu? And how these are calculated?

  14. This post presents clear idea for the new viewers of
    blogging, that truly how to do blogging and site-building.

Speak Your Mind

*