Yesterday I ran into the article which sheds some light on FaceBook search implementation. As we’re recently a lot into search having implemented a bunch of search projects ourselves and helped number a of customers with their full text search needs I decided to post my thoughts on this matter.
First I was surprised article talks about just 1TB of data. I knew FaceBook has much more content than that. Though it seems the article only speaks about searching user profiles and similar global objects not full database of posts and comments. At FaceBook scale it is easy to fit 1TB in distributed memory which makes a lot of things easier.
Second this I was surprised about is about writing crawler for a site…. why would you do that if there is already information in the database which is much faster to extract. Even if data needs to be aggregated from several sources I’d use special data provider which gets the only data which needs to be indexed available for index process. With Sphinx which talks directly to the MySQL database we can get some 10.000 of documents indexed per second from single process, and at this stage it is not getting the documents which becomes bottleneck.
Working with database instead of with web documents has another benefit – you instantly get the structure and efficient object identifiers which you can use for filtering and other needs.
The simple part in profile search is rather simple full text search features are required. As ordering is not done by relevance what is required is simple boolean full text search. Also FaceBook search does not seems to support phrases search, so searching for “Bill Smith” gets you back a lot of Bills which manage to mention Smith in their profile, also Anna Nicole Smith is where which happen to mention bill in the profile – I think this is serious limitation. But anyway if you do boolean search without phrase matching and relevance ranking it is rather easy to do.
The other part which can be hard depending on how search is implemented is counting number of results matched. FaceBook does not give you exact count unless it is small, showing something like “500+ Names” and this OK because few people really need to know exact number if it is large – small numbers are however need to be exact to be able to draw pager properly.
The hard part in social network search is permissions and ranking – not all objects are public and so filtering out objects you can’t see can be expensive if permission check is inefficient.
Ranking also can be expensive – ie FaceBook does the ranking depending on “approximation of social graph distance” which is different for each users.
Two basic approaches one can take is either embedding it in the search engine – which can get things really efficient but can require significant coding and understanding of search engine externals. Simple filtering/sorting though is often supported by search engine directly and it is good idea to see if that can be used
The other one is post filtering and post sorting, meaning we simply use search engine to do “boolean search” (or other ranking) and than fetch first N results and filter and sort them appropriately. This is not absolutely “clean” solution but in practice it works well if used wisely. For example you lose exact counts for large data sets in this case (though you can
approximate them) you also can’t guaranty you get the best results.
Say there were 5.000.000 matches and you fetched 10.000 of them – there well may be some of best matches omitted. It is even worse with permission check – what if you do not have access to all these 10.000 objects fetched first but have access to some ? This though is unlikely to happen in practice.
Such approach often can be improved by using some wise ranking instead of boolean search on the first stage. For example if I would fetch 10.000 matching profiles sorted by their activity (or something else which gives me best profiles) I would likely get much better results from user opinion.
Looking at Features in FaceBook search I found number of features missing which would likely be great value for the users. This is of course content search, ability for more intelligent searches, at least search for phrases. More interesting filtering would also be good – searching within certain distance from the user in addition to the City and Zip code would be nice, sorting by freshness (great if you run some searches on regular basics) and possibly filtering by the date Though this mostly makes sense for content search which does not seems to be where.
I also was surprised advanced search only work for your friends, I’m not sure if this is done for privacy issues or for sake of performance but at least some search options would work great on wider scale. Using advanced search seems to be the only way to restrict search to search among your friends which is also not especially intuitive. Though again as we’re not searching for content you would rarely want to search only your friends.
When I tested it search function was also a bit buggy. Doing search for the “test” would show me only one match in “All Results” while 500+ in names and all other groups, at the same time giving all pages accessible, so clicking to page #2 results in “I’ve got nothing for you.” result
In general we’ll be adding features to the Sphinx to accommodate special needs of some applications. As you can take a look in Powered By list there is quite a few of users with not trivial full text search needs.
We’ve recently added support for multiple value attributes which can be used for storing tags or list of friends. We’re now adding support for search by distance and we’ll need to look at some internal relationships support which are required to avoid massive data duplication for some usages.
We also already make sure Sphinx can be used efficiently to perform queries going beyond full text search. Such as finding the goods from the given group or Finding all Males looking for Females within 50 miles from given zip code matching some complex set set of criteria.