MySQL Full Text Search in ActionPeter Zaitsev
Similarly to Craigslist GumTree is powered by MySQL but unlike craigslist it looks like they are shy in tuning their MySQL Full Text Search index configuration and setting proper ft_min_word_len value. Searching for Audi A6 will give you a lot of Audi with different model numbers. Searching just for A6 gives no results at all. At the same time we can find V40 which means ft_min_word_len was adjusted from default value of 4 which probably had even worse search quality to value 3.
Gumtree also seems to have another issue – as said on their web site when you post an item it takes up to 3-4 hours for items to become visible in search results. I’m wondering if that is replication lag or issues with full text search index updates as I can’t imagine this being user experience you would desire.
In general I think both of this sites have the same mistake (or shortcut) in MySQL Full Text Search configuration – relaying on minimum keyword length while their application has use cases when search for short keywords make sense.
It is much more reasonable to look at frequencies at keywords not their length and if you happen to have some very frequent keywords you can’t afford to index for performance or search quality reasons you better analyze word frequencies and build stop word list which matches your application. For example “A6” unlikely needs to be stop word even if it is short while “IN” surely needs to be. Besides general terms as in you may have your application specific stop words such as “com” or “html”.
In general performance should be the only thing which forces you to have stop word, as you can see Google for example has none, even searching for “a” gives some results (Though of course google takes other kinds of shortcuts) This however is frequently not the case with MySQL Full Text Search which both will skip to frequent keywords in natural language search and because it is using frequency ranking indexing too frequent keywords can add noise and actually get relevance worse.
Looking at alternative solution, which we’re often using with MySQL – Sphinx Search we still tend to use stop word list for performance reasons, but it is normally good enough to keep it very short. Plus as in new versions you can set the time limit on query execution it is rather safe to search a very complex query and just get partial results if it can’t complete in time. From Relevance standpoint “stop words” rarely cause problems with sphinx because it uses word position based ranking not just frequency based ranking by default which means if you’re searching for “a the” the documents which contain this strange phrase will be ranked on top rather than those which just contain a lot of “a” and a lot of “the” articles.