For months when speaking with customers I have been positioning Percona Cloud Tools (PCT) as a valuable tool for the DBA/Developer/SysAdmin but only recently have I truly been able to harness the data and make a technical recommendation to a customer that I feel would have been very difficult to accomplish otherwise.
Let me provide some background: I was tasked with performing a Performance Audit for one of our customers (Performance Audits are extremely popular as they allow you to have a MySQL Expert confirm or reveal challenges within your MySQL environment and make your database run faster!) and as part of our conversation we discussed and agreed to install Percona Cloud Tools. We let the site run for a few days, and then I started my audit. What I noticed was that at regular intervals there was often a CPU spike, along with a corresponding drop in Queries Per Second (QPS), but that lasted only for a few seconds. We decided that further investigation was warranted as the customer was concerned the spikes impacted their users’ experience with the application.
Here are the tasks that Percona Cloud Tools made easy while I worked to identify the source of the CPU spike and QPS drop:
- Per-second granularity data capture of PCT allowed me to identify how significant the spike and QPS actually were – if I was looking at the 1 minute or higher average values (such as Cacti would provide) I probably wouldn’t have been able to detect the spike or stall as clearly in the first place, it would have been lost in the average. In the case of PCT the current graphs group at the 1 minute range but you have the ability to view the min and max values during this 1 minute range since they are the true highest and lowest observed 1s intervals during the 1 minute group.
- Ability for all graphs to maintain the same resolution time allowed me to zero-in on the problematic time period and then quickly look across all graphs for corresponding deflections. This analysis led me to discover a significant spike in InnoDB disk reads.
- Ability to use the Query Analytics functionality to zoom-in again on the problematic query. By adjusting Query Analytics to an appropriate time period narrowed down the range of unique queries that could be considered the cause. This task in my opinion is the best part of using PCT.
- Query Analytics allowed me to view the Rows Examined in Total for each query based on just this shortened interval. I then tagged those that had higher than 10k Rows Examined (arbitrary but most queries for this customer seemed to fall below this) so that I could then review in real-time with the customer before making a decision on what to do next. We can only view this sort of information by leveraging the slow query log – this data is not available via Performance_Schema or via network sniffing.
Once we were able to identify the problematic queries then the rest was routine query optimization – 10 minutes work using Percona Cloud Tools for what might have been an hour using traditional methods!
For those of you wondering how else this can be done, assuming you detected the CPU spike / QPS drop (perhaps you are using Graphite or other tool that can deliver per-second resolution) then you’d also need to be capturing the slow query log at a good enough resolution level (I prefer long_query_time=0 to just get it all), and then be adept at leveraging pt-query-digest with –since and –until options to narrow down your range of queries. The significant drawback to this approach is that each time you want to tweak your time range you probably need to stream through a fairly large slow log file multiple times which can be both CPU and disk intensive operations, which means it can take some time (probably minutes, maybe hours) depending on the size of your log file. Certainly a workable approach but nowhere near as quick as reloading a page in your browser 🙂