October 25, 2014

MySQL: Data Storage or Data Processing

I was thinking today about how people tend to use MySQL in modern applications, and it struck me that in many cases MySQL is not used to process the data, at least not on a large scale; instead it is used for data storage and light-duty data retrieval. Even in this case, however, a cache (e.g. memcached) is often implemented to offload MySQL.

When it comes to large-scale data processing, other techniques are used. Google uses Bigtable-based technologies for search and for Google Analytics (which is quite a typical log-processing application), and when you look at recommendation tasks as well as a lot of other personalized services, they may not use MySQL (or SQL at all). In many cases you will see completely custom solutions implemented; in other cases you will see Hadoop or Sphinx used to crunch data in parallel, or it could even be a special wrapper on top of MySQL doing the work.

The reason for this is not only that many tasks are not well expressed in SQL or do not map to relational data structures too well, but also the lack of built-in parallel processing. When you need high data-processing speed you simply can't do it on a single CPU, especially since systems are getting wider (more cores) rather than individual cores getting faster, and even a whole single node does not take you that far.

MySQL's concept of Single Query = Single Thread works well when only a few rows need to be analyzed, which often forces us into tricky optimizations: if a query has to analyze 10,000,000 rows you're busted, as it will simply take too long for the user to wait for a reply (especially in the modern age of AJAX applications with sub-second response requirements). Note that you have to solve this problem even if it is a relatively rare case and most queries take much less time. This is why we have to do careful indexing, build summary/cache tables, and hunt down queries which traverse a lot of data and kill them.
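As a minimal sketch of the summary-table approach (the orders table and column names here are hypothetical, not taken from any particular application): instead of re-scanning millions of rows on every page view, the aggregate is maintained incrementally and the user-facing query reads a single row.

    -- Summary table holding pre-aggregated per-day totals
    CREATE TABLE orders_daily_summary (
      day          DATE NOT NULL PRIMARY KEY,
      order_count  INT NOT NULL,
      total_amount DECIMAL(14,2) NOT NULL
    );

    -- Maintain the aggregate as each order is recorded
    -- (or from a periodic batch job)
    INSERT INTO orders_daily_summary (day, order_count, total_amount)
    VALUES ('2014-10-25', 1, 19.99)
    ON DUPLICATE KEY UPDATE
      order_count  = order_count + 1,
      total_amount = total_amount + VALUES(total_amount);

    -- The user-facing query now touches one row instead of millions
    SELECT order_count, total_amount
      FROM orders_daily_summary
     WHERE day = '2014-10-25';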

Now think about what happens when you search Google or engage in any other processing-intensive activity. If the result is not served from a cache, many CPU-seconds if not CPU-minutes are spent processing your request, but you get the reply fast because hundreds or thousands of CPUs were handling the data for you in parallel.

You may argue that only a limited number of applications need such serious data crunching. Could be, but the number of such applications is likely to grow in the future. The mobile Internet is set to seriously increase the number of global Internet users, plus we see more and more personalized services for information provisioning, advertising, social interactions, etc., which will need to employ more and more complex algorithms as they mature.

Interestingly enough, so far MySQL does not have anything which tackles this need, not even on the roadmap.

Indeed, MySQL has MySQL Cluster, but it is quite different. MySQL Cluster at this point is a cluster in terms of storage: only minor aspects such as data filtering can be offloaded and performed in parallel, while a lot of other work, such as sorting and unions/intersections, has to be performed on the SQL node.
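To illustrate that division of labor (a rough sketch with a hypothetical NDB table, not taken from any particular setup): the non-indexed WHERE filter can be pushed down to the data nodes and scanned in parallel, while the grouping and sorting still happen on the single SQL node that received the query.

    -- Hypothetical table stored in MySQL Cluster
    CREATE TABLE logs (
      id      BIGINT NOT NULL PRIMARY KEY,
      user_id INT NOT NULL,
      action  VARCHAR(32),
      created DATETIME
    ) ENGINE=NDBCLUSTER;

    -- The filter on `action` can be evaluated on the data nodes in
    -- parallel (condition pushdown); the GROUP BY and ORDER BY are
    -- still executed on the single mysqld handling this query.
    SELECT user_id, COUNT(*) AS cnt
      FROM logs
     WHERE action = 'login'
     GROUP BY user_id
     ORDER BY cnt DESC;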

Maybe MySQL Cluster will evolve so that data storage nodes are able to perform all kinds of operations, in which case it will become much more helpful, provided, of course, that it also drops a lot of its other limits and becomes able to scale to thousands of nodes.

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. Brian Aker says:

    Hi!

    You are probably not surprised to know that some of us are advocating this. I'll probably blog my thoughts on MySQL internals sometime soon :)

    A quote I heard recently: “Hadoop will replace the database vendors.” Good quote, aye?

    Cheers,
    -Brian

  2. peter says:

    Yeah, good quote, though I do not think so. I think the current technologies are not mature enough; maybe they will develop to lead the next wave, but it is also possible that some other leaders will pop up.

  3. AlexN says:

    There is, for example, Mondrian (http://mondrian.pentaho.org/), the OLAP tool that uses MySQL as a DB engine for ROLAP. About 50% of the processing work is done on the server. It is not very fast, however. In my experience, the main advantage of data processing on the server using MSSQL (not MySQL) comes from its cursors, which use row-level locks and do not interfere with data loading and retrieval. Ideal for background processes. In MySQL the cursor syntax is obscure, and I think that many people still are not using cursors at all.
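    For reference, here is a minimal sketch of that cursor syntax inside a MySQL stored procedure (the orders table and its columns are hypothetical):

        DELIMITER //
        CREATE PROCEDURE sum_pending_orders()
        BEGIN
          DECLARE done INT DEFAULT 0;
          DECLARE cur_amount DECIMAL(10,2);
          DECLARE total DECIMAL(14,2) DEFAULT 0;
          -- Cursor over a hypothetical orders table
          DECLARE cur CURSOR FOR
            SELECT amount FROM orders WHERE status = 'pending';
          DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;

          OPEN cur;
          read_loop: LOOP
            FETCH cur INTO cur_amount;
            IF done THEN
              LEAVE read_loop;
            END IF;
            SET total = total + cur_amount;
          END LOOP;
          CLOSE cur;

          SELECT total AS pending_total;
        END//
        DELIMITER ;

    Note that, as Peter points out below, this still processes rows one at a time on a single thread.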

  4. peter says:

    AlexN,

    This is not the thing I'm talking about. Proper cursors allow you to stream data easily, but they do not solve the main problem: they are still not parallel.

  5. It is not quite correct to compare “normal” databases with Google. Google is very special, with special needs. For example, a while ago I was searching Google with some keywords to test my homepage's ranking in the search results. I was quite surprised by the phenomenon that simultaneous searches from my home computer and my work computer (which are physically about 4 km apart) returned different results. Obviously they hit different Google sites, and one of them had not updated its index. So can you imagine a relational database that returns different results for the SUM of your company's salaries depending on which node you execute your query on? :)
    I have an Oracle background, and its RAC (Real Application Clusters) option is a step towards your proposed solution, but it uses shared storage and always gives back consistent results regardless of which node you run your statement on. That is both its strength and its weakness: a strength because all nodes return the same result, a weakness because all nodes have to get the most recent versions of database blocks, and with frequent changes this puts quite a big load on the interconnect and introduces waits.

  6. peter says:

    Well, Google is just an example. It could be Amazon, Facebook or Yahoo… and a lot of smaller, second-tier web sites would qualify as well.

    Web systems which host user-generated content have to deal with massive amounts of it.

    Speaking about Google: this is a deliberate choice of the search application. There are many services using the same underlying infrastructure, such as Bigtable, which do not have this problem.

    You do not really need shared storage to give consistent results; you can have consistent shared-nothing systems as well. There are of course various implications: if you have to deal with multiple data centers, you have to choose either synchronous propagation (meaning you can run into trouble if the connection between them has problems) or asynchronous propagation (meaning you may read a stale copy). Of course you can structure the application so that a stale copy is never used, but that normally comes with other performance challenges.

  7. Gaurav Singh says:

    How can I switch MySQL to use another partition of my hard disk to store data?
    I am new to MySQL; please assist me with this.

  8. Divyesh Bajwala says:

    If you are using InnoDB, you can use these two parameters:

    1. innodb_data_home_dir =
    2. innodb_data_file_path =

    Documentation for both is available in the MySQL manual.

    For MyISAM tables you can use symbolic links instead.
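    As a rough sketch, assuming the new partition is mounted at /data (a hypothetical mount point), the my.cnf settings for InnoDB could look like this after stopping mysqld and copying the existing files over:

        [mysqld]
        # Hypothetical paths; adjust to your own mount point
        innodb_data_home_dir  = /data/mysql
        innodb_data_file_path = ibdata1:10M:autoextend

    Make sure the mysql user owns the new directory and restart the server afterwards. For MyISAM, a symbolic link from the old table file locations to the new partition achieves the same effect.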
