What is often underestimated is the impact on MySQL performance of complex queries over large data sets (e.g. large aggregate queries) and batch jobs. It is not rare to see queries which normally take milliseconds stall for a few seconds, especially with certain OS configurations and on low-profile servers (e.g. ones with only a single disk drive), even if only one such query is running concurrently.
Let's talk a bit about why this happens and how to prevent it.
Cache Wiping This is the first reason for the stalls: a query which crunches a large data set wipes your normal working set out of the OS cache. Operating systems and MySQL itself employ various strategies to minimize this effect, but the truth is it still happens.
Disk Starvation As cache efficiency drops, more requests have to hit the disk, which may already be 100% busy running your batch job query. This is especially bad when you have only one drive for the database: in this case a single query really can keep it busy all the time. As the drive gets busy, requests have to wait to be scheduled. You might imagine the wait would be short because only one query is ripping the disk apart; in practice, however, a lot of queries can pile up, because with reduced cache efficiency many queries may start waiting on the disk at once. In some cases I've seen the average time a request spends in the queue exceed 1 second, which is much more than the few milliseconds you would expect disk IO to take.
Scheduling Issues OS disk IO schedulers may also "help" in this case. A number of them optimize for throughput rather than latency, meaning that if you have a query doing a full table scan (sequential reads), its requests will be prioritized because they do not require disk head movement, unlike the IO requests to random locations that other threads are issuing. Some older disk schedulers were especially poor here. I remember in Linux 2.4 times I could make a single full-table-scan query slow things down so badly that "ls" would take over 10 seconds on a directory with a couple of files. In recent Linux kernels the deadline or cfq IO schedulers should handle this better.
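As a quick illustration (the device name sda is an assumption; substitute your data drive), on recent Linux kernels you can check which IO scheduler a block device is using, and switch it, through sysfs:

```shell
# The scheduler shown in square brackets is the active one for this device
cat /sys/block/sda/queue/scheduler

# Switch the device to the deadline scheduler (needs root, takes effect immediately)
echo deadline > /sys/block/sda/queue/scheduler
```

The change only lasts until reboot, so if it helps you would normally make it permanent via a boot parameter or init script.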
So what can you do about it?
Do reporting on the slave Doing reporting on the slave is a great idea if you have one. Even if you use Innodb tables, and so do not suffer from table locks, reporting queries can affect web site performance dramatically. Sometimes instead of a slave you may use a database restored from backup, or an LVM snapshot of the current database, with the same results. In some cases you can't move the work fully to the slave, for example if you load data or build summary tables. In such cases you may still move some of the load off the master: you may read data from the slave and write it to the master (the selects often contribute most of the load), or you may prepare summary tables on the slave and then move them to the master using mysqldump, or even a careful file transfer if you're using MyISAM tables.
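Moving a summary table from slave to master can be as simple as piping mysqldump into the mysql client. A sketch, where the host names, user, and the mydb.daily_summary table are all hypothetical:

```shell
# Dump only the freshly built summary table from the slave and load it
# straight into the master (the dump drops and recreates the table there)
mysqldump --host=slave1 --user=app --password mydb daily_summary \
  | mysql --host=master1 --user=app --password mydb
```

Note that the load on the master is then mostly sequential inserts into one table, which is far gentler than rebuilding the summary there from the raw data.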
Chop it If you can't execute the work on a slave server, for example if it is an old-data purge, you can at least chop it up: do not run it as one big query, or as a set of queries going one right after another. Sleep between them so none of them takes too many resources for too long. If you delete things, do DELETE … LIMIT 1000 and sleep 10 seconds in between. By spreading the load this way you make sure a large portion of the cache will not be wiped out at once, and if a few queries pile up because one chunk took a lot of resources, they have time to resolve before the next chunk runs.
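A chopped-up purge can be sketched as a small shell loop. This is a sketch, assuming a hypothetical mydb.old_logs table on host master1 and a 30-day retention rule, with run_sql() wrapping the real mysql client:

```shell
# run_sql wraps the mysql client; host, database, table and retention
# period are all examples. Each call deletes up to 1000 old rows and
# prints ROW_COUNT(), i.e. how many rows the DELETE actually removed.
run_sql() {
  mysql -N -h master1 mydb -e "DELETE FROM old_logs
      WHERE created < NOW() - INTERVAL 30 DAY LIMIT 1000;
    SELECT ROW_COUNT();"
}

# Delete in 1000-row chunks with a pause in between, until nothing is left
purge_in_chunks() {
  total=0
  while rows=$(run_sql) && [ "$rows" -gt 0 ]; do
    total=$((total + rows))
    sleep "${1:-10}"   # let piled-up queries drain before the next chunk
  done
  echo "$total"
}

# purge_in_chunks 10   # purge with a 10-second pause between chunks
```

SELECT ROW_COUNT() in the same session reports how many rows the preceding DELETE removed, which is how the loop knows the purge is finished.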
Time it This is kind of obvious, but I'll still mention it: if you have to do something intense, do it during the lowest load. Do not just put it in a cron job for 4AM and forget about it; it may still be too heavy, and your web site users may be suffering while you don't notice it. Make sure the load is low enough that your complex query or batch job does not slow things down too badly.
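Rather than trusting a fixed 4AM slot, the cron script itself can check the system first. A minimal sketch, where the 2.0 load threshold and the batch script name are assumptions to tune for your hardware:

```shell
# Return success only if the 1-minute load average is below the threshold
load_is_low() {
  awk -v limit="${1:-2.0}" '{ exit !($1 + 0 < limit + 0) }' /proc/loadavg
}

if load_is_low 2.0; then
  echo "load is low, running batch"
  # ./run_batch_job.sh   # hypothetical batch script
else
  echo "load too high, skipping this run"
fi
```

If the load stays high you can skip the run, as here, or sleep and retry, rather than piling more IO onto an already-busy server.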
In some environments which do not run 24/7 this is a great solution, as there is no one around to suffer at certain times.