One of the most frequent kinds of MySQL performance problems is the one that shows up only every so often, or at certain times of day. Investigating these, we usually find that the cause is batch jobs, reports and other activities that are not response-time critical overloading the system and degrading the user experience.
The first thing you need to know is that this is not a MySQL problem, and it might not even be a problem with your MySQL configuration, queries or hardware, even though fixing these does help in many cases. However powerful and well tuned your system is, if you put too heavy a concurrent load on it, response times will increase and user experience will suffer.
So what can you do to prevent this problem from happening? The answer is easy: throttle the side load so it does not consume too many system resources. Here are some specific techniques to use.
Do not push concurrency too high
Many developers will test a script at multiple levels of concurrency and find that doing the work from 32 processes is faster than from just one. This is true if you have the system completely at your disposal. If the system also needs to serve other users, however, you typically need to reduce concurrency to the point where it does not overload the system. Unless it is a really time-critical process, I would not use more than 4 parallel processes writing heavily to the database.
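As a rough illustration of capping concurrency, here is a minimal Python sketch. The names `run_batch`, `chunks` and `process_chunk` are hypothetical; `process_chunk` stands for whatever function writes one slice of the batch to the database. The default of 4 workers simply mirrors the rule of thumb above and is not a universal limit.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 4  # rule of thumb from the text, adjust for your system


def run_batch(chunks, process_chunk, max_workers=MAX_WORKERS):
    """Apply process_chunk to every chunk with at most max_workers in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() drains the iterator so any worker exception is raised here.
        return list(pool.map(process_chunk, chunks))
```

The pool size is the only knob: the batch still finishes, just more slowly, and the database never sees more than `max_workers` connections from this job at once.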
Introduce throttling
Sometimes even a single process overloads the system too much. In this case, throttling by running relatively short queries and introducing "sleeps" between them can be a good idea. It also often helps to avoid monopolizing the replication thread. For example, if I need to delete old data, instead of DELETE FROM TBL WHERE ts<"2010-01-01" I'll run DELETE FROM TBL WHERE ts<"2010-01-01" LIMIT 1000 in a loop until no more rows need to be deleted. Then I can inject a sleep between iterations that is as long as the query execution took, so the longer the queries run (and the more loaded the system is), the more rest it gets. Alternatively, you can look at the "threads_running" status variable, which is a very good and simple indicator of the current load, and sleep based on its value. For example, you may choose to pause the script altogether if the load is too high and wait for threads_running to drop below a certain value.
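The loop described above could look roughly like the following Python sketch. Everything here is an assumption for illustration: `delete_batch` is a hypothetical callable that runs one `DELETE ... LIMIT 1000` and returns the number of rows affected, `threads_running` is a hypothetical callable that returns the current Threads_running value (for example from `SHOW GLOBAL STATUS`), and the threshold of 10 is arbitrary.

```python
import time


def throttled_delete(delete_batch, threads_running, max_running=10,
                     sleep=time.sleep, clock=time.monotonic):
    """Delete in small batches, resting after each one.

    delete_batch() runs one limited DELETE and returns the rows affected.
    threads_running() returns the server's current Threads_running value.
    """
    total = 0
    while True:
        # Pause entirely while the server is too busy.
        while threads_running() > max_running:
            sleep(1.0)
        started = clock()
        deleted = delete_batch()
        elapsed = clock() - started
        total += deleted
        if deleted == 0:
            return total  # nothing left to delete
        # Rest as long as the query ran: slower queries earn longer pauses.
        sleep(elapsed)
```

Passing the database calls in as functions keeps the sketch independent of any particular MySQL driver; in a real script they would wrap a cursor on an autocommit connection.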
Tuning cron
It also often helps to look into your cron or whatever other scheduling system you're using. Frequently, way too many scripts are started at once or very close to each other, so they overlap and produce the overload. Solutions include spacing them out and introducing some "job control" to ensure scripts do not run in parallel if they should not (and especially that you do not get many copies of the same script running at once). One simple solution: instead of scheduling a bunch of scripts to start at midnight, 1AM and 2AM, I can put them into nightly.sh one after another and schedule that to run at midnight. This way the scripts run one after another at their own pace.
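A simple form of job control can be sketched in Python as a wrapper that runs the jobs in sequence and refuses to start while a previous copy is still running. This is only one possible approach, assuming a Unix system (it relies on `fcntl.flock`); the function name `run_nightly` and the lock path are hypothetical.

```python
import fcntl
import subprocess


def run_nightly(commands, lock_path="/tmp/nightly.lock"):
    """Run commands one after another; return None if a copy is already running."""
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking exclusive lock: fails at once if another copy holds it.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None  # the previous run has not finished yet
        # Each job starts only after the previous one has exited.
        return [subprocess.run(cmd).returncode for cmd in commands]
```

Each entry in `commands` is an argument list such as `["/path/to/report.sh"]`. The lock is released automatically when the file is closed, including when the wrapper crashes, so a stale lock file does not block later runs.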
Dedicated slave
I remember listening to a talk by Cary Millsap in which he recommended moving load in time and space as an optimization technique. So far we have spoken about moving load in time, but we can also move it in space by putting it on a different system, which in the MySQL world is most commonly a dedicated slave. In a lot of environments, especially those where the operational and development discipline is too low to enforce the previous solutions, it can be a life saver. Of course it only works for read jobs, which is an important limitation. Dedicating slaves to batch jobs can help in other ways too: for example, it reduces competition for the buffer pool between different kinds of workloads.
innodb_old_blocks_time
Surprisingly simple but effective: setting innodb_old_blocks_time=1000 can often be very helpful in preventing batch jobs from washing away the buffer pool contents and so making normal user queries a lot more disk bound and slower. I wrote about it in more detail a few months ago.
Finally, let's touch upon the question of discovery. To manage load you need to understand whether the problem is happening in your environment at all (we want to catch it before users complain, right?) and, if it is, exactly which jobs cause the overload. In complex environments this might be a harder question than it looks. pt-stalk is a great tool for this purpose. Getting it running helps you collect the state of your system at the moments when it was overloaded with side load (or performing poorly for other reasons). The wealth of data it generates will most likely contain the answers you're looking for.