You may have noticed we have not been blogging much recently. This is because we were quite busy, mainly building BoardReader.com, a search engine which indexes tens of thousands of forums from all over the world. We built it as a consulting project, so too bad we do not own it completely, but we’re still quite excited it is live now. We did not work on the crawler in this project, only on the database backend and the full text search engine implementation. In this part it is a standard LAMPS application. I guess you know what LAMP is, and the S stands for Sphinx, the full text search engine which we love to use where large scale search is needed. At this point we have over 300 million posts indexed with only 3 search servers, and still counting. I guess we’ll have half a billion forum posts soon.
To share a few more technical details: it is implemented using a pretty standard “manual partitioning” scheme, with different forum sites mapped to different “table groups” and each server handling a bunch of these. This makes it easier to rebalance groups if needed as traffic grows, and it also makes ALTER TABLE much less painful. The other technique, which I covered in some of my presentations, is double data storage with different partitioning. In our case we wanted to track links between sites. This is easy for outgoing links, as we already cluster by site, but hard for incoming links, as they are scattered among many tables and servers. To address this problem we also store inbound links clustered by second level domain, which allows us to get inbound links pretty efficiently. It turns out, however, that some domains still get way too many links, and we’ll likely redesign it in the future to use Sphinx instead (it can do extremely fast parallel group-by on many servers, Google style).
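To make the layout concrete, here is a minimal Python sketch of the two ideas above. Everything in it is assumed for illustration: the number of table groups, the server names, and the naive second level domain helper are hypothetical, not the actual BoardReader code. A real manual scheme would more likely keep an explicit site-to-group directory; a hash is used here only to keep the sketch short.

```python
import zlib
from collections import defaultdict
from urllib.parse import urlparse

NUM_GROUPS = 8  # hypothetical number of table groups
# Hypothetical group -> server map; rebalancing means editing this map
# and moving only the affected group's tables.
GROUP_TO_SERVER = {g: f"db{g % 3 + 1}" for g in range(NUM_GROUPS)}


def table_group(key: str) -> int:
    """Map a key (forum site or domain) to a table group, stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % NUM_GROUPS


def second_level_domain(url: str) -> str:
    """Naive second level domain: the last two labels of the host name.

    Assumption: ignores multi-part suffixes such as co.uk.
    """
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


# In-memory stand-ins for the two sets of tables.
outgoing = defaultdict(list)  # (server, group) -> [(source_site, target_url)]
inbound = defaultdict(list)   # (server, group) -> [(domain, source_site, target_url)]


def store_link(source_site: str, target_url: str) -> None:
    """Store a link twice, once per partitioning scheme."""
    g_out = table_group(source_site)
    outgoing[(GROUP_TO_SERVER[g_out], g_out)].append((source_site, target_url))

    domain = second_level_domain(target_url)
    g_in = table_group(domain)
    inbound[(GROUP_TO_SERVER[g_in], g_in)].append((domain, source_site, target_url))


def inbound_links(domain: str) -> list:
    """All inbound links for a domain live in a single group on one server."""
    g = table_group(domain)
    rows = inbound[(GROUP_TO_SERVER[g], g)]
    return [(src, url) for d, src, url in rows if d == domain]


store_link("forum-a.example", "http://dev.mysql.com/doc/")
store_link("forum-b.example", "http://forums.mysql.com/list.php")
print(inbound_links("mysql.com"))
```

The point of the second copy is that an inbound link lookup touches one group on one server instead of scanning every site's tables. The cost is that every link is written twice, and a very popular domain still concentrates all of its rows in one place.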
There are a few features I would like to highlight. First, you can use it to search MySQL Forums. Notice the simple link structure: you can replace mysql.com in it with any other domain to search forums from that domain. For example, you can use this link to search our MySQL Performance Forums.
Second, note the graph, shown right on the search results page, of how many results were found matching these terms over time. It can show quite interesting data. For example, searching for Britney Divorce will show a huge spike when the news came out and a quick calm down in about a week. You can click on a bar in the graph to get search results focused on that period. It can be quite fun.
Another nice feature is the domain profile. Using it you can see how actively a domain is getting links, which pages on the domain are most frequently linked, and which pages and domains forum users tend to link to. So far the reporting period is restricted for performance reasons: there is too much data to group, and it is quite a hassle to build summary tables as we want to count uniques, but it should be fixed once we rewrite it using Sphinx. From that page you can also get to the inbound link report, which lets you see which recent links you have from forums to the whole web site or to a particular URL.
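The trouble with summary tables for unique counts can be shown with a small Python sketch. The data and names below are made up for illustration. A plain link count can be summed from daily summaries, while the count of unique linking sites cannot, because the same site may appear on several days.

```python
# Hypothetical daily data: day -> sites that linked to one domain that day.
links_by_day = {
    "day 1": ["forum-a.example", "forum-b.example", "forum-a.example"],
    "day 2": ["forum-a.example", "forum-c.example"],
}

# Per-day summary rows, as a summary table would store them.
daily_summary = {
    day: {"links": len(sites), "unique_sites": len(set(sites))}
    for day, sites in links_by_day.items()
}

# Total links can be rolled up from the summaries.
total_links = sum(row["links"] for row in daily_summary.values())

# Summing daily unique counts over-counts sites seen on more than one day.
summed_uniques = sum(row["unique_sites"] for row in daily_summary.values())

# The correct figure needs the underlying rows (or per-day sets of sites).
true_uniques = len(set().union(*links_by_day.values()))

print(total_links, summed_uniques, true_uniques)  # 5 4 3
```

So a summary table that supports arbitrary reporting periods has to keep something close to the raw data, which is why grouping on the fly across many servers looks attractive here.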
I should also mention a couple of ratings we have implemented. The love for ratings probably comes from my SpyLOG background. At this point we have implemented a rating of YouTube videos and a rating of domains. In both cases we check how many links each one is getting and from how many unique sites. For domains, we separate domains which are getting normal links from domains which have images on them referenced.
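As a rough sketch of how such a rating could be computed, the Python below counts, for each linked domain, the total number of links and the number of unique linking sites. The sample rows and the choice to sort by unique sites first are assumptions for illustration; how the real rating is ordered is not stated here.

```python
from collections import defaultdict

# Hypothetical link rows: (linking forum site, linked domain).
links = [
    ("forum-a.example", "youtube.com"),
    ("forum-a.example", "youtube.com"),
    ("forum-b.example", "youtube.com"),
    ("forum-b.example", "mysql.com"),
    ("forum-c.example", "mysql.com"),
    ("forum-d.example", "mysql.com"),
]


def rate_domains(rows):
    """Return (domain, link_count, unique_site_count), best rated first."""
    link_count = defaultdict(int)
    sites = defaultdict(set)
    for site, domain in rows:
        link_count[domain] += 1
        sites[domain].add(site)
    rating = [(d, link_count[d], len(sites[d])) for d in link_count]
    # Assumed ordering: unique sites first, then total links.
    rating.sort(key=lambda r: (r[2], r[1]), reverse=True)
    return rating


for domain, n_links, n_sites in rate_domains(links):
    print(domain, n_links, n_sites)
```

Counting unique sites alongside raw links keeps one very chatty forum from dominating the rating on its own.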
There are still a lot of things to do and quite probably a lot of bugs to kill. We would welcome any feedback, such as suggestions or bug reports. Also, if you know a forum which is not indexed, please feel free to submit it.