While resolving an extreme database overload for a customer recently, I found about 80 copies of the same cron job running, all hammering the database. This number is rather extreme; typically the effect is noticed and fixed well before that, but the problem of runaway cron jobs is far too frequent.
If the database server slows down, or a job simply takes longer to run, the job often cannot complete before it is time for it to run again. Unless prevented, a second copy will start and compete with the first for resources, giving both even less chance to finish. I leave aside the question of what effect running multiple copies of a cron job at the same time may have on its results.
Here are a few practices which should help you keep your cron jobs under control.
Prevent running multiple copies This is the most important one. I would suggest having a production requirement that no cron job is allowed unless it prevents itself from being started in multiple copies, or that you put a wrapper script around developer-written jobs when you deploy them to production.
This can be done very well using file locks (do not just create files: files left behind after a script crash can prevent it from starting again) or using the GET_LOCK() function in MySQL. The second option is good if you want to serialize jobs from multiple servers (for example, you deliberately put the script in cron on 2 servers for high availability purposes). It is also helpful if you want to limit concurrency for certain processes: say you have 50 web servers which run certain cron jobs, but you do not want more than 4 of them to run at once.
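For the file-lock approach, a minimal sketch is a wrapper around the util-linux flock(1) utility; the function name and lock path here are illustrative, not from the post:

```shell
#!/bin/sh
# Run a command under an exclusive, non-blocking file lock so that a
# second cron copy exits immediately instead of piling up.
# flock(1) comes from util-linux; the lock path is illustrative.
run_exclusive() {
    lockfile="/tmp/$1.lock"
    shift
    # -n: do not wait; exit with status 1 if the lock is already held
    flock -n "$lockfile" "$@"
}
```

In a crontab you can also call flock directly, e.g. `*/5 * * * * flock -n /tmp/myjob.lock /usr/local/bin/myjob.sh`. Note that flock releases the lock when the command exits, even after a crash, which avoids the stale-file problem mentioned above.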
Watch for errors Cron has a powerful feature of mailing you the output. If you make the script silent and only print error messages (best to stderr), you can catch problems as they start to happen, for example a job failing to run because another copy is running when that is not expected in your system. In large systems you may approach this problem differently, to avoid hundreds of cron error messages when you restart the database server and so on, but information about cron errors should find you.
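The pattern can be sketched as a job that prints nothing on success and writes only failures to stderr, so cron's mail contains exactly the errors; the job name and file paths below are made up for illustration:

```shell
#!/bin/sh
# Cron-friendly job pattern: silent on success, errors to stderr only.
# Cron mails any output it sees, so a clean run generates no mail at all
# and a failure lands in your inbox. Names here are illustrative.
backup_job() {
    src="$1"; dst="$2"
    if ! cp "$src" "$dst" 2>/dev/null; then
        # Only failures produce output, and only on stderr
        echo "backup_job: failed to copy $src at $(date)" >&2
        return 1
    fi
    # Success: print nothing at all
    return 0
}
```

With a `MAILTO=` line in the crontab (a standard cron feature), any line this job prints is delivered to that address.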
Store Historical Run Times In a lot of cases, when a cron job can no longer complete in time, I wonder whether that happened overnight or whether the job was gradually taking more and more time until it could not complete in time. Create a table in the database and store information about how long each cron run took. This can also be done by a wrapper script, but it is best done inside the job itself, as you can store other metrics as well. For example, you can record that the script took 40 seconds and processed 4000 images. In that case you can see whether the slowdown happens because the amount of "work" increases or because the system gets slower at processing that work. It is also good if you can hook monitoring up to this trending data. For example, you may get an alert if a cron job which runs every 5 minutes and normally completes in 30 seconds took more than 2 minutes to run.
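As a minimal sketch of the wrapper-script variant: the post suggests a database table, but the same idea can be shown with a flat CSV file (the log path and function name are assumptions, and a real job would also append its work count, such as images processed):

```shell
#!/bin/sh
# Record each run's start time, duration, and exit status for trending.
# The post suggests a database table; this sketch appends CSV lines to a
# flat file instead. Path is overridable via CRON_TIMES_LOG (illustrative).
timed_run() {
    job="$1"; shift
    log="${CRON_TIMES_LOG:-/var/log/cron-times.csv}"
    start=$(date +%s)
    "$@"; status=$?
    end=$(date +%s)
    # Format: jobname,start_timestamp,duration_seconds,exit_status
    echo "$job,$start,$((end - start)),$status" >> "$log"
    return $status
}
```

Pointing a monitoring check at this file (or table) gives you exactly the trend data needed for the "30 seconds became 2 minutes" alert described above.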
Finally, I would like to share the script I used for exclusion of running cron jobs in simple cases, when I did not want to add any extra code to the script itself.
<?php
if ($argc < 3)
    die("USE: exclusive.php <lockname> <command to run>\n");
$filename = "/tmp/" . $argv[1] . ".lock";
$fp = fopen($filename, "w+");
if (!$fp)
    die("Unable to create lock file\n");
/* WARNING: LOCK_NB behaves tricky: flock() still returns true when it would block, but $l is set to 1 */
$r = flock($fp, LOCK_EX | LOCK_NB, $l);
if ($r && !$l) { /* Lock successful */
    system(implode(" ", array_slice($argv, 2)));
    flock($fp, LOCK_UN); // release the lock
    /* Unlink file just in case so we do not have problems running under different users */
    unlink($filename);
} else {
    echo "Couldn't lock the file!\n";
}