Detecting faults in a system is an age-old problem with a large body of literature and practice. Complete failures are usually not hard to detect, but spotting a misbehaving component is often less straightforward. If the system's throughput dips, is it a fault in the system? Or is there decreased demand because another layer in the application stack is failing?
Traditional monitoring systems like Nagios encourage you to set a threshold for a metric's value, but thresholds are far too simplistic. Real systems fluctuate naturally throughout the course of the day and week, and sudden spikes of activity may be a sign of nothing other than increased demand. There is no right threshold. On the opposite end of the scale, CEP (complex event processing) systems and the fields of operations research and statistical process control can be like using a sledgehammer to crack a nut. Tools such as Holt-Winters forecasting and Shewhart control charts have their uses, but in many cases they are both too difficult to implement and, ironically, often inadequate for detecting real faults. The result is that many organizations have ineffective fault detection systems that flood them with noisy alerts and fail to detect real problems, and many other organizations have none at all.
In this talk I'll show how some relatively simple observations of a system's behavior, coupled with basic math (little more than addition and subtraction), can be used to build a fault detector that automatically adjusts to a system's fluctuating behavior, avoids false positives, and lets you distinguish whether the problem is inside or outside the system or component you're measuring. It's ultimately all common sense and heuristics, but it's more effective than you might suspect. Although the techniques are applicable to any system, I'll focus on MySQL.
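As a rough illustration of the kind of arithmetic involved (not the method presented in the talk, which this abstract does not spell out), the sketch below keeps a self-adjusting baseline for two metrics and compares new samples against it. Everything in it is an assumption made for illustration: the choice of metrics (throughput and the number of requests in flight), the names `Baseline` and `classify`, the smoothing factor, the cutoff `k`, and the heuristic that falling throughput with rising in-flight work points inside the system while falling throughput with falling in-flight work points to reduced demand from outside.

```python
from dataclasses import dataclass


@dataclass
class Baseline:
    """Exponentially weighted running mean and mean absolute deviation."""
    alpha: float = 0.1  # smoothing factor (assumed value, tune per system)
    mean: float = 0.0
    dev: float = 0.0
    seen: int = 0

    def update(self, x: float) -> None:
        # The first sample seeds the mean; later samples nudge both estimates.
        if self.seen == 0:
            self.mean = x
        else:
            self.dev += self.alpha * (abs(x - self.mean) - self.dev)
            self.mean += self.alpha * (x - self.mean)
        self.seen += 1

    def deviation(self, x: float, floor: float = 1e-9) -> float:
        # Signed distance from the baseline, in units of typical deviation.
        return (x - self.mean) / max(self.dev, floor)


def classify(throughput: float, in_flight: float,
             tput_base: Baseline, flight_base: Baseline,
             k: float = 4.0, warmup: int = 20) -> str:
    """Label one sample; only normal samples are folded into the baselines."""
    if tput_base.seen < warmup:
        verdict = "learning"
    elif tput_base.deviation(throughput) < -k:
        # Throughput fell well below its usual range.
        if flight_base.deviation(in_flight) > 0:
            verdict = "internal"  # work is piling up: suspect this component
        else:
            verdict = "external"  # less work is arriving: suspect upstream
    else:
        verdict = "ok"
    if verdict in ("learning", "ok"):
        tput_base.update(throughput)
        flight_base.update(in_flight)
    return verdict
```

Because the baselines follow the metrics as they drift, there is no fixed threshold to choose; the cutoff `k` is expressed relative to how much the metric normally varies. On a MySQL server, the two inputs could plausibly be derived from status counters such as `Questions` (as a rate) and `Threads_running`, though that mapping is also an assumption here.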
2 October 14:30 - 15:20 @
Principal Architect, The Rimm-Kaufman Group
Baron Schwartz is the Principal Architect at The Rimm-Kaufman Group. He develops tools that help users succeed with MySQL. He is the lead author of High Performance MySQL.