September 22, 2014

Small things are better

Yesterday I had a fun time repairing a 1.5TB ext3 partition containing many millions of files. Of course it should never have happened – this was a decent PowerEdge 2850 box with a RAID volume, ECC memory and a reliable CentOS 4.4 distribution – but still it did. We got a “journal failed” message in the kernel log, and the filesystem needed to be checked and repaired even though it is a journaling filesystem, which should not need checks in normal use, even after power failures. Checking and repairing took many hours, especially as the automatic check on boot failed and had to be restarted manually.

The same may happen with InnoDB tables. They are designed never to crash, surviving power failures and even partial page writes, but they can still get corrupted because of MySQL bugs, OS bugs, hardware bugs, misconfiguration or failures.

Sometimes the corruption is mild, so an ALTER TABLE that rebuilds the table fixes it. Sometimes the table needs to be dropped and recovered from backup, and in certain cases you may need to reimport the whole database – if the corruption happens to be in the undo tablespace or the log files.

So do not forget to have a recovery plan for this kind of failure. This is one thing you had better have backups for. Backups, however, take time to restore, especially if you do point-in-time recovery using the binary log to get to the actual database state.

A good way to approach this kind of problem is first to have enough redundancy. I always assume any component, whether a piece of hardware or software, can fail – even if that piece of hardware has some internal redundancy of its own, such as RAID or SAN solutions.

If you can’t afford full redundancy for everything (and probably even if you can), a good idea is to keep your objects smaller, so that any maintenance on them takes less time. Smaller RAID volumes typically rebuild faster; a smaller database size per system (yet another reason to like mid-range commodity hardware) makes it faster to recover; and smaller tables allow per-table backup and recovery to happen faster.

With MySQL and its blocking ALTER TABLE there is yet another reason to keep tables small, so you do not have to use complicated scenarios to do simple things. Assume, for example, you need to add an extra column to a 500GB InnoDB table. It will probably take long hours or even days for ALTER TABLE to complete, and about 500GB of temporary space will be required, which you simply might not have. You can of course use MASTER-MASTER replication and run the statement on one server, switch roles and then do it on the other, but if the ALTER TABLE takes several days, can you really afford having no box to fall back to for such a long time?

On the other hand, if you had 500 tables of 1GB each it would be very easy – you can simply move small pieces of data offline for a minute and alter them live. The whole process will also be much faster this way, as the indexes of such small tables will fit well in memory.
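The per-shard approach above can be sketched as a simple loop that generates one small ALTER per table. This is a minimal sketch – the table naming scheme and the column being added are hypothetical, and a real script would execute each statement through a MySQL client library, pausing between shards:

```python
# Rolling schema change over many small shard tables instead of one
# huge ALTER TABLE. Only one small table is locked at a time, and each
# individual ALTER finishes quickly because the table fits in memory.

def rolling_alter(num_shards, base="data"):
    """Yield one ALTER TABLE statement per shard table (hypothetical names)."""
    for shard in range(num_shards):
        yield f"ALTER TABLE {base}_{shard} ADD COLUMN notes TEXT"

statements = list(rolling_alter(4))
# A real script would run each statement via a client connection,
# sleeping between shards to limit the impact on live traffic.
```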

Not to mention that splitting 500 tables across several servers will likely be easier than splitting one big one.

There are a bunch of complications with many tables, of course. It is not always easy to partition your data appropriately, and the code gets more complicated, but for many applications it is worth the trouble.

At NNSEEK, for example, we have the data split into 256 groups of tables. The current data size is small enough that even a single table would not be a big problem, but it is much easier to write your code to handle the split from the very beginning rather than try to add it later on, when there are 100 helper scripts written, etc.
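Routing a key to one of a fixed number of table groups can be done with a stable hash. This is an illustrative sketch only – the CRC32-based scheme and the table naming are assumptions, not necessarily what NNSEEK used:

```python
import zlib

NUM_GROUPS = 256  # matches the 256 table groups mentioned above

def table_for(key, base="messages"):
    """Map a key to one of NUM_GROUPS table groups via a stable hash.

    The same key always lands in the same group, so application code
    can compute the target table name without any lookup table.
    """
    group = zlib.crc32(key.encode("utf-8")) % NUM_GROUPS
    return f"{base}_{group}"
```

Because the hash is deterministic, the same routing function can later be reused to map groups onto multiple servers.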

For the same reason I would recommend setting up multiple virtual servers even if you work with a single physical one in the beginning. Different accounts with different permissions will be good enough. Doing so will ensure you will not have problems once you really need to scale to multiple servers.

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. What about the new partitioning feature of recent MySQL versions? Would that help? Because splitting the data as you describe is something you should have considered right from the beginning. If not, you end up with lots and lots of code to adapt to handle the new situation. Moreover, I suspect it might not be possible to split data into separate tables under all circumstances, as e.g. joins might get too complicated to maintain.

  2. peter says:

    Partitioning helps with some of the points mentioned but not with others. I’ve only been playing with it at this point, as 5.1 is not ready for production yet in my opinion.

    The biggest problem with partitioning is that it still will not get you out of a single box, while using multiple boxes effectively is one of the main ways to scale. Technically, FEDERATED with partitioning could work, but it is far from being able to work efficiently at this point.

  3. Mike says:

    I’m not exactly a MySQL expert, but I do use it on a daily basis and have had to evolve a backup strategy in the face of not-always-reliable hardware. What I’ve done is this: I have two “main” servers, with lots of RAM and processors, in a dual-master configuration. One acts as a primary and one as a backup. If one fails, I can switch over to the other seamlessly – I just move an IP alias from one machine to the other. But then the problem is, how do I restore the failed server if I’ve had a crash that corrupts the tables, or the filesystem? The solution is that I have a third and fourth replication slave, one on-site and one off-site via VPN. Once I repair whatever’s broken on the downed “main” server, I shut down MySQL on one of the backup slaves and synchronize the files with rsync. I could improve this situation even more with LVM or EVMS snapshots, and perhaps automated failover, but for now this approach has saved me on a few occasions, and my boss is happy with the near-zero downtime.

  4. peter says:

    Mike,

    This is a common solution, even though it requires some hardware overhead. In your case, in the worst case you have only one server available out of three, so even a single server should be able to handle all the load.

    With LVM, two servers would be enough, as you can make a clone backup without downtime.

    Technically, you always have to re-clone the master if you fail over to the slave: due to the asynchronous nature of MySQL replication, there could be events which were executed on the master but did not make it to the slave…

    I should also mention this is not the only type of failure you should watch out for – you may have a mistaken UPDATE statement which will be instantly replicated to all servers… so you need to be able to recover from backup.

  5. Peter Halliday says:

    You said:

    “On the other hand, if you had 500 tables of 1GB each it would be very easy – you can simply move small pieces of data offline for a minute and alter them live. The whole process will also be much faster this way, as the indexes of such small tables will fit well in memory.”

    How exactly would you recommend taking small tables offline? Partitioning manually in this way has always intrigued me, but it is maintenance like this that I didn’t understand.

  6. peter says:

    Peter,

    Taking offline is a broad term here. In some cases you can block a chunk of data at the application level; in others you can
    simply run ALTER TABLE as you need. Yes, it will lock some clients for a while, but only a few, and for a limited time.

  7. Piet says:

    How is performance affected if I had, e.g., private data for 500 customers in tables cust_1_data up to cust_500_data, instead of one big cust_data table with a customer_id column?

    Doesn’t having lots of tables give a performance penalty? And how does it compare to the performance gain from having smaller tables?

    A select for, e.g., customer 33 would then be ‘select * from cust_33_data’, instead of ‘select * from cust_data where customer_id = 33’.

    I’ve often wondered about this.

  8. peter says:

    Piet,

    The best way to check is to benchmark it. 500 tables is not a lot, so I would not expect a large performance penalty.
    You would get better data clustering with multiple tables – even with clustered InnoDB tables, you have secondary indexes which may be unclustered.
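    To make such a benchmark concrete, the two access patterns from the question can be sketched as follows. This is a hypothetical helper following Piet’s naming; note that a per-customer table name must be built from a validated integer, never from raw user input, while the shared-table variant should use a bound parameter:

```python
def query_per_customer_table(customer_id):
    """Per-customer table: the customer id is encoded in the table name."""
    if not isinstance(customer_id, int):
        # Table names cannot be bound parameters, so validate strictly
        # before interpolating to avoid SQL injection.
        raise ValueError("customer_id must be an integer")
    return f"SELECT * FROM cust_{customer_id}_data"

def query_shared_table(customer_id):
    """Single shared table: filter by customer_id with a bound parameter."""
    return "SELECT * FROM cust_data WHERE customer_id = %s", (customer_id,)
```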

  9. sean says:

    Peter,
    I have a question about ALTER TABLE and how it handles incoming writes. The documentation states: “ALTER TABLE works by making a temporary copy of the original table. The alteration is performed on the copy, and then the original table is deleted and the new one is renamed. While ALTER TABLE is executing, the original table is readable by other clients. Updates and writes to the table are stalled until the new table is ready, and then are automatically redirected to the new table without any failed updates.”

    I’m hoping you can elaborate on ‘…are stalled until the new table is ready…’. I assume MySQL is storing these writes somewhere, either in memory or perhaps on disk, but I’m not sure.

    Thanks.

  10. peter says:

    Sean,

    MySQL does not store the updates anywhere – the writing clients simply block until the ALTER TABLE completes. This is partially fixed for MySQL Cluster and hopefully will be fixed for MySQL in general, but for now plan on the table being read-only for practical purposes while ALTER TABLE is running.

  11. lotso says:

    peter,

    Coming back to your answer to Sean’s question: when I do an ALTER TABLE (to add a new column), I see that a new table is being populated at the same time. (BTW, why does it do this? My experience with MSSQL is that an ALTER TABLE takes merely seconds to add a new column.) I always presumed that it was doing this because it needs to:

    1. Maintain a read/write copy of the existing table
    2. Be able to still populate data
    3. Be able to switch over without any loss of data as soon as the task is finished

    Is this not true?

  12. peter says:

    lotso,

    In many databases, adding a column is instant because it only changes metadata; as old rows (which do not have that column) are read, the new column is returned as NULL or as the default value.

    MySQL does not have this, because many storage engines do not store column information with each row, so it just copies the data to a new table.

    MySQL could indeed keep read-write access to the existing table, but this is not done now, and it is not trivial to implement.

  13. lotso says:

    Hi Peter,

    After reading your post at http://www.mysqlperformanceblog.com/2006/06/09/why-mysql-could-be-slow-with-large-tables/ I was just very surprised to read in some of the comments that an ALTER TABLE on a (say) 500GB table could take up to 2 days!! I have my tables as separate files, one file per table (which I thought was a good idea; now I _know_ it’s a good idea).

    Actually, I don’t get the statement regarding column information NOT being stored with each row. (Can you point me to another resource to read up on? Or even a Google phrase?)

  14. peter says:

    lotso,

    Read some of the internals docs. For example, the MyISAM static row format is pretty much the column values packed one after another. So if you have 2 integers in the row (NOT NULL), each row takes 8 bytes, and physical offsets are pretty much used as row pointers.

    Now say you add a 3rd INT column: some rows would become 12 bytes in length while others stay 8 bytes… the MyISAM static row format can’t handle that…
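    The fixed-offset idea can be sketched with Python’s struct module. This models the concept only, not MyISAM’s actual on-disk format:

```python
import struct

# Model of a static (fixed-length) row format: two NOT NULL integers
# packed back to back, so every row is exactly 8 bytes and a row's
# physical offset is simply row_number * ROW_SIZE.
ROW_FMT = "<ii"                       # two 4-byte little-endian integers
ROW_SIZE = struct.calcsize(ROW_FMT)   # 8 bytes per row

rows = [(1, 10), (2, 20), (3, 30)]
data = b"".join(struct.pack(ROW_FMT, a, b) for a, b in rows)

def read_row(buf, n):
    """Fetch row n directly by offset.

    No per-row length or column metadata is stored, which is why
    changing the row size (e.g. adding a column) forces a full
    table rebuild rather than an in-place metadata change.
    """
    off = n * ROW_SIZE
    return struct.unpack(ROW_FMT, buf[off:off + ROW_SIZE])
```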
