I have run into a number of cases recently that all had a similar look and feel. In most of these cases, the symptoms were very complicated, but they boiled down to just a few problems that can be prevented in very simple ways.
If you are not doing any of the following three simple things, you probably should. These are essential practices for building successful applications, which everyone should be doing. And the best part is how easy it is to do them — read on for the details.
1. Archive or Purge Your Database
Data archiving is crucial. There is a difference between data that your application needs now, and data that your application used to need. Do not co-mingle the two. Plan from the outset to archive or purge data that is no longer needed. Everybody wants to know how to deal with an application that is growing. The simplest way is to prevent it from growing. Archiving and purging unnecessary data is probably one of the highest value activities that you could ever do.
At a more technical level, archiving reduces the size of your workload’s “working set” — the data that the server really needs to keep in memory to function well. That’s because unused rows can bloat and dilute your indexes and tables. Storage engines such as InnoDB also don’t have features such as midpoint insertion algorithms for cache control, so a single table scan can flush really interesting data out of your LRU list and replace it with a bunch of stuff that’s needed only once for the table scan. Keeping those indexes small enough to fit in memory is a big step towards better performance, and archiving can help do that.
2. Monitor Your Servers
Everybody should have an automated monitoring system that is watching key elements of their systems. This can be very simple, even just an external service such as Pingdom or PagerDuty. Or it can be your own in-house hosted Nagios installation. The point is, your users should not be telling you when the service is down. Be the first to know, and have enough information to help you find the root of the problem.
Setting up monitoring can seem to be a daunting task, in part because of the steep learning curve that comes with most tools worth using. Another problem is the plethora of well-meaning but misguided advice and plugins for such tools, which will lead you to believe that you should monitor every ratio and counter there is in your systems — a sure recipe for an inbox flooded with useless noise, a monitoring screen that looks like the scene of a crime, and ultimately people who are numb to the monitoring system and ignore it when something really does go wrong. Keep it simple and monitor things that matter to your business: does the website load in the expected time and contain the expected text? Does the database server respond okay to a sample query that your application actually uses? Is the disk getting full?
3. Capture and Store Historical Data
Likewise, it is vital to have something recording metrics over time. A lot of people use an RRDTool-based system, such as Cacti. Again, it doesn’t really matter which one you use. At some point in the future, you are going to be very grateful that you have historical metrics to understand changes in your application, or understand what happened just prior to a crash. Here’s an example from a customer: the server crashed, and we could see in the log when that happened, but not why. The Cacti graph for disk space had a sharp curve — it showed the disk space suddenly filling up at the rate of gigabytes per minute. We looked at the disk and found it’d filled up with files due to an infinite loop in some code. How long would it have taken us to notice this otherwise, while checking dozens of possible causes?
Unlike with monitoring, you should measure and store everything you can, for as long as you can. You will never stay awake at night kicking yourself for measuring too much about your application.
All of the above are quite simple and fast to do once you’ve gone through the process a few times. I’d say that a practiced consultant can do a pretty comprehensive job in most applications in 8 hours or less. If it’s your first time through, it will take longer, and some good books might help. Either way, it is an investment with a very high ROI.