November 22, 2014

Working with large data sets in MySQL

What does working with large data sets in MySQL teach you? Of course you have to learn a lot about query optimization, the art of building summary tables, and the tricks of executing queries exactly the way you want. I have already written about the development and configuration side of the problem, so I will not go into details again.

Two big things you’ve got to learn when working with large data sets in MySQL are patience and careful planning. Both relate to a single property of large data sets – they can take an awful lot of time to deal with. This may sound obvious if you have some large data set experience, but it is not the case for many people – I constantly run into customers who assume it will be quick to rearrange their database or even restore it from a backup.

You need Patience simply because things are going to take a lot of time. Think about a 500GB table, for example – ALTER TABLE may take days or even weeks depending on your storage engine and set of indexes, and batch jobs can take a similarly long time. Binary backup and restore will be faster, but it can still take hours, especially if the database is already loaded. So when operating with such large databases you need to be patient and learn to have a bunch of tasks running in the background while you’re doing something else.
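
For example, a minimal way to keep an eye on such a long-running task from another session is simply to watch the process list (the states mentioned in the comments are just what a MyISAM ALTER typically shows):

    -- From a separate connection, check what the long-running statement is doing.
    SHOW FULL PROCESSLIST;

    -- For a MyISAM ALTER TABLE, the State column typically walks through phases
    -- such as "copy to tmp table" and "Repair by sorting"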

You need Careful Planning because if you do not plan things properly you can easily get into trouble, and because you often can’t use simple “online” solutions and have to do more complicated things instead. You typically can’t simply run ALTER TABLE because the table would stay locked for too long; instead you need to go through the careful process of ALTERing the table on the slave and switching roles, or use some other technique. You can’t run many simple reporting queries because for MyISAM they will lock tables for a very long time, and InnoDB can accumulate too many old row versions to deal with, which can slow down some queries considerably. You also need to plan how you will handle crashed MyISAM tables after a power failure, as check and repair may take long hours (this is indeed one of the big reasons to use InnoDB even if you do not care about table locks or transactions).
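
A rough sketch of the “ALTER on the slave, then switch roles” approach (the table and index names are illustrative, and the actual failover steps are left out):

    -- On the slave: run the long ALTER directly. The replication SQL thread
    -- waits behind it and then catches up once the ALTER completes.
    ALTER TABLE orders ADD INDEX idx_created_at (created_at);

    -- Wait until the slave has caught up with the master again.
    SHOW SLAVE STATUS\G        -- Seconds_Behind_Master should return to 0

    -- Then switch the application over to the slave and repeat the same
    -- ALTER on the old master while it is out of production traffic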

Besides these various traps and gotchas, you simply need to plan carefully how you’re going to alter your database, because it takes a lot of time and may require waiting for a maintenance window or bringing the site down. If you have a tiny 1GB table you can pretty much use a trial-and-error approach even in production – find some bad queries, fix them by adding indexes, and come back to fix some more. For large data sets this does not work, and you really need a playground with a smaller data set to try out different schema designs and index structures. This, however, creates a challenge: the results you get for the small data set may not apply to the large one, so you need to re-test your “final design” again with the full-size data.
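
One way to build such a playground, sketched here with illustrative table and column names, is to copy the schema and load a sample of the production rows into it:

    -- Copy the table structure only.
    CREATE TABLE orders_sample LIKE orders;

    -- Load roughly 1% of the rows to experiment with schema and index changes.
    INSERT INTO orders_sample
    SELECT * FROM orders WHERE id % 100 = 0;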

One thing I often find people miscalculate is assuming data management operations will be proportional to the database size – say, if it takes 30 minutes to alter a 10GB table, it will take 5 hours to alter a 100GB one. It can be close to that if you’re lucky, but it can be much, much slower if you’re not. Many operations require a certain part of the table to fit in memory for decent performance – typically some portion of the index BTREE (even MyISAM, which builds “normal” indexes by sort, builds primary key and unique indexes through the key cache). If it does not fit, performance may drop by an order of magnitude.
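
This is one reason to check the relevant buffer settings before kicking off a large index build; a quick look, assuming MyISAM tables here:

    -- Key cache used when building primary and unique indexes on MyISAM.
    SHOW GLOBAL VARIABLES LIKE 'key_buffer_size';

    -- Buffer used for the "build ordinary indexes by sort" path.
    SHOW GLOBAL VARIABLES LIKE 'myisam_sort_buffer_size';

    -- Both can be raised for the duration of a big ALTER, for example:
    SET GLOBAL key_buffer_size = 1024 * 1024 * 1024;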

This is actually one of the reasons I try to keep data in smaller tables whenever possible. But this is something I’ve written about in another article.

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. howie says:

    I’ve been struggling with this problem of large data sets (well, large for me). I’m afraid of switching to 5.1 and using partitioning.

    Do you basically roll your own partitioning? Do you just choose something to partition on and then create a separate table for it? For example, a big table called invoices might be split into invoices_2007, invoices_2006, etc.

    I was thinking of doing this for one large table. I basically do two types of queries on the table, so I think I might need to mirror the data and partition on two separate fields. Using the invoices example, I would partition by year, and have another set of tables partitioned by vendor name (invoices_vendor_a, invoices_vendor_b, etc).

    Does this sound like a reasonable approach? In my case, I won’t have to join across partitions, so it seems like it should work.

  2. peter says:

    Yes this should work OK.

    Joins are the main showstopper when using a lot of tables.
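
    For reference, a minimal sketch of the per-year layout described above (the invoices schema here is purely illustrative):

    CREATE TABLE invoices_2006 (
      id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      vendor VARCHAR(100) NOT NULL,
      amount DECIMAL(10,2) NOT NULL,
      created DATE NOT NULL,
      KEY idx_vendor (vendor)
    );

    CREATE TABLE invoices_2007 LIKE invoices_2006;

    -- Querying across years means touching each table explicitly:
    SELECT * FROM invoices_2006 WHERE vendor = 'acme'
    UNION ALL
    SELECT * FROM invoices_2007 WHERE vendor = 'acme';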

  3. Ken says:

    What about cases where I’m returning a long list of users, sorted by their last login times? I’m using paging (limit) in my mysql, but I’ve noticed that when the result set is huge, this becomes a nightmare. Without the sorting things are fast, but with sorting, it slows down to a crawl, especially at the last pages.

    What do you suggest here? I’ve seen some sites like Myspace, that just return the first 3,000 results (not sure what they’ve done exactly).

  4. peter says:

    Ken,

    Do you have an index on last_updated?

    Not to mention you can cheat and cache pages for a few seconds.

  5. Ken says:

    Yes I do have an index there. The problem is when you start asking for the last few pages, like this:

    SELECT u.id from sys_user u WHERE u.account_status='Active' ORDER BY u.last_updated desc limit 188290,10;

    The problem is with the “desc limit 188290,10” it seems, because if I remove the “desc” or change 188290 to 1, the query time becomes much shorter.

  6. peter says:

    Right. You can’t use large limits because skipping rows becomes expensive.
    Also, DESC can be slower for MyISAM tables with key compression – you can set PACK_KEYS=0 for this table and check if it helps.

    Anyway this becomes rather offtopic for this post :)

  7. Ken says:

    I’m using InnoDB tables, so I guess I can’t use PACK_KEYS?

    Is there any other way to go about overcoming this problem?

    Could you perhaps do a blog on this very topic? I think many sites suffer this kind of dilemma.

  8. peter says:

    Right. For InnoDB it is strange that ordering by DESC is so much slower.
    But anyway, you should not be using a large LIMIT. I think I’ve blogged about a couple of workarounds, but they are mostly best for static data.
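
    One common workaround, sketched here using Ken’s sys_user table, is to remember the last last_updated value from the previous page and seek from it instead of using a large OFFSET:

    SELECT u.id, u.last_updated
    FROM sys_user u
    WHERE u.account_status = 'Active'
      AND u.last_updated < '2007-05-01 12:34:56'   -- last value seen on the previous page
    ORDER BY u.last_updated DESC
    LIMIT 10;

    This assumes last_updated values are close to unique (otherwise you would tie-break on id as well), and a composite index on (account_status, last_updated) keeps the seek cheap.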

  9. GAC says:

    I have a problem. I am using the MyISAM engine and am just trying to fill the database with data first. My data will fit in around 10 tables of 90GB each. After they are filled I only need to extract data, and maybe do some statistics – no transactions, no updates, no inserts anymore. So my problem is simply: how can I load this data fast? I use one primary key (over three columns) plus one separate index, because after the data is loaded I need to be able to run fast queries. So the question is: should I remove the indexes while filling the database, and would it be faster if I used a different engine? I have already played somewhat with my.ini, but I still see no improvement in loading performance. It takes 1 hour for each 200MB, so you can imagine the time it will take to load everything. I am uploading chunks of data using LOAD DATA INFILE, and it is very slow. I am also thinking of partitioning the table, but I read in other blogs that it might not be the best option. Although I know there is no magic solution, I hope someone can give me some advice so I don’t lose too much time testing every possible combination of solutions.

    Thanks in advance

  10. Sheyda says:

    I am running into issues selecting from an 11GB table and inserting the rows into an empty table – there are no joins, and it dies in the middle of the operation without giving any errors. I have tested this twice. Any idea??

  11. Sheyda says:

    My insert from a 43,000,000-row, 11GB table into an empty table fails without giving any errors. It dies halfway through – any ideas??

  12. peter says:

    Sheyda,

    Blog comments are not the proper place for questions. I encourage you to submit your question on the forums instead:
    http://forum.percona.com/
