I was thinking today of how people tend to use MySQL in modern applications and it stroke me in many cases MySQL is not used to process the data, at least not on the large scale – instead it is used for data storage and light duty data retrieval. Even in this case however the cache (ie memcache) is often implemented to offload MySQL.
When it comes to large scale data processing other techniques are used. Google uses some bigtable based technologies for search and Google Analytics (which is quite typical log processing application), when you look at recommendation task as well as a lot of other personalized services they may not use MySQL (and SQL at all). In many cases you would see absolutely custom solutions are implemented, in other cases you would see Hadoop or Sphinx used to crunch data in parallel or it even could be special wrapper on top of MySQL do do the work
The reason for this is not only because many tasks are not well expressed in SQL or map to relational data structure to well, but also because of lack of build in parallel processing. When you need high data processing speed you simply can’t do it on single CPU especially since systems are getting wider more than CPU cores get faster, though even single node does not take you that far.
MySQL concept of Single Query=Single Thread works well when few rows needs to be analyzed which often forces us to do tricky optimizations because if query has to analyze 10.000.000 of rows you’re busted – it simply will take too long for user to wait for reply (especially in modern age of AJAX applications with sub second response requirements). Note you will have to solve this problem even if it is relatively rare case and in most cases it will take much shorter time. This is why we have to do careful indexing build summary/cache tables hunt queries which traverse a lot of data and kill them.
Now think what happens if you Search Google or engage in any other processing intensive activity. If result is not served from cache you will have many CPU seconds if not minutes used processing your request, but you get reply fast because hundreds and thousands of CPUs were handling data for you in parallel.
You may argue limited number of applications needs so serious data crunches – could be, however the number of such applications would likely grow in the future. Mobile Internet is to seriously increase number of global Internet users, plus we see more and more personalized services in terms of information provisioning, advertisement, social interactions etc which will need to employ more and more complex algorithms as they mature.
Interesting enough so far MySQL does not have anything which tackles this need, even on the roadmap.
Indeed MySQL has MySQL Cluster but it is quite different. MySQL Cluster at this point is a cluster in terms of Storage – only minor aspects such as data filtering can be offloaded and performed in parallel while a lot of other jobs such as sorting, unions/instersections have to be performed on the SQL node.
May be MySQL Cluster will evolve and data storage nodes will be able to perform all kinds of operations, in which case it will become much more helpful. Of course if it also drops a lot of its other limits and will be able to scale to thousands of nodes.