Our customers or prospective customers often ask us how we do a performance audit (it’s our most popular service). I thought I should write a blog post that will both answer their question, so I can just reply “read all about it at this URL” and share our methodology with readers a little bit. This fits well with our philosophy of openness. It also shocks people sometimes — “you’re giving away the golden goose!” Not really. What you hire us for is our experience, not a recipe that anyone can follow.
A full performance audit is far more extensive than this article can cover, and might wander into Apache, networking config, caching layers, etc. Wherever the problem is, we’ll track it down. I won’t talk about that. That’s not because I want to keep secrets from you. To the contrary, I’d love to share it all with you. But that’s a huge job; it will take many pages, and I’m not going to write that much.
The kickoff call
There’s actually a step before the performance audit begins. We call this a kickoff call. We get on the phone with the key technical staff on the client’s side and discuss the application in general. If possible, we take a brief look at the server beforehand, so we can ask more intelligent questions and skip obvious things. This call is sometimes up to an hour long, if we’re discussing a lot of things like how to build for massive scalability, how to do read-write splitting without breaking the user experience, how to take online non-blocking backups, or things like that.
But in general it can be a lot shorter than clients expect, because
- time is money
- your application is not as unique as you think, and we already know a lot about it by browsing your site or hearing you say “we are a SaaS hosted application for data mining stock market transactions,” or whatever the case may be
- we’ve generally seen just about everything people throw at databases
Many people assume their application is difficult or somehow different, and that we need all kinds of schema diagrams and code listings, but the truth is people often have the same insights into particular problems, and therefore they try similar solutions. And besides, most applications have a lot of the same components. Tagging, friends, queues, click tracking, search, paginated displays — we’ve seen these and dozens of other common patterns done a hundred different ways.
The kickoff call has some other important goals too. We want to know what your concerns are at this stage. Is it the current performance, future performance, scalability, ability to recover from disasters? We also want to know what the operating parameters are. Can we suggest upgrades, rewriting queries, or whatnot? Sometimes there are rigid constraints on what types of solutions we can propose, and we need to know ahead of time so we don’t spend time on things you have no way to change. An example of this is a third-party product whose code you are unwilling or unable to change.
We also want to know the operating mode you expect us to work in. Maybe you want us to analyze and present recommendations for your consideration, and take a look-don’t-touch approach. Or, maybe you just want us to fix things and tell you later what we did. Either is a completely valid approach. You tell us what you want, and that’s what we’ll do (of course, we will also tell you what we think is best for you, which is our job).
After the kickoff call, we write some notes, and then get down to the audit itself. We usually have two people on the kickoff call so that we don’t have a single person knowing everything about you (only one is billable, though). The second person will be from the same team within Percona. However, the audit itself is generally a single person, with another person reading all the notes and generally keeping in touch with what the lead consultant does. We want two sets of eyes on things whenever possible. This also has value in case you decide to continue on to a longer relationship with Percona, which is reasonably common; in that case you will establish a solid relationship with the team, and group involvement early on is a good start to this.
What to look for during the audit
During the audit I look for outliers and anomalies in every aspect of the server’s configuration and performance and contents. I’m looking for characteristics that are much bigger or smaller than normal, or things I don’t usually see. Anything out of the ordinary or out of proportion. These things are often relatively easy to catch if you have enough experience. It’s a little hard to describe without making it sound like it’s some secret handshake, but it’s really not magical. You’re either familiar with what a server normally looks like, or you’re not. It’s just like a mechanic who listens to the car and says “I hear the valves knocking.”
In other cases there may be things that are NOT easy to catch, and may require a lot of experience and investigation. The work we do is often not simple at all. It’s impossible to predict. A customer may call and ask “how long to debug server lockups?” It could be ten minutes, or it could be a really major effort chasing down something like a bug in a hardware component, and you never know ahead of time.
In either case, the point to know here is that we can work with systems that are at any level of tuning, from completely untuned to a system that’s already had a lot of expert attention. I use “tuning” in a generic way here — we focus on far more than my.cnf files. In fact my.cnf files are usually one of the smallest levers we can pull for server optimization. Schema and query optimization can give much greater improvements, for example.
If you’re familiar with Malcolm Gladwell’s excellent book Blink, you know the term “thin-slicing.” That describes an experienced consultant’s approach to the problem very well.
The audit itself
OK, so here’s a very basic audit, whose scope I’ll keep within just a MySQL server. The first step is to log into the machine via SSH and open up a text editor (gedit, notepad, what have you.) I want to keep this open and paste everything I see into it. If there’s no record later that someone else can follow, I am not doing my job.
I would also note that the commands I’ll show here are the lowest common denominator. We often use tools to gather a lot of the data for us and speed the process, when possible, but the core commands I’ll show are what we can always fall back to if that’s not possible.
Gathering information about the server
I start off with these commands:
- uname -a
- cat /etc/*release
The first command is invaluable for later reference. The others will show exactly what kind of server I’m looking at, assuming I’m on a GNU/Linux box (I have to adapt to whatever environment I’m in — obviously a FreeBSD box is different). For example,
[percona@db1 ~]$ uname -a
Linux hostname.domain 2.6.18-92.el5 #1 SMP Tue Apr 29 13:16:15 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
[percona@db1 ~]$ cat /etc/*release
Red Hat Enterprise Linux Server release 5.2 (Tikanga)
From this, I can see a couple of important things. One thing is that I’m looking at a 64-bit OS, which is important to know. I also know I’m on RHEL and I know the kernel — knowing this may tell us a lot of useful things about the system.
Next I continue looking around the server. Holding with the assumption of GNU/Linux, I keep working to get a picture of the OS and hardware.
[percona@db1 ~]$ tail -n 25 /proc/cpuinfo
processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU L5335 @ 2.00GHz
stepping : 11
cpu MHz : 1995.022
cache size : 4096 KB
physical id : 1
siblings : 4
core id : 3
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx tm2 cx16 xtpr lahf_lm
bogomips : 3990.03
clflush size : 64
cache_alignment : 64
address sizes : 38 bits physical, 48 bits virtual
What I see here is that this is an 8-core machine, two quad-core processors at 2GHz. I also know some things about the CPU, such as the model number and the cache size, etc. What else is important is the ‘lm’ flag, which tells me this is a 64-bit CPU. From this and what I saw earlier I can see whether someone installed a 32-bit OS on 64-bit hardware, which is important if you want to allocate a lot of memory to mysqld.
Next I look at a couple of all-in-one commands. For example, ‘top -n 1’ and ‘uptime’. That shows me what processes are running, but it also shows me load average, memory sizes, and a bunch of other stuff. In this case I’m on a machine with 32GB of memory and there is very little load. I see that mysqld is using 10g of memory with 17g virtual size, and there’s practically nothing else running (this is a dedicated master server). The machine has a lot of swap, but basically none is used. This isn’t all that relevant, as you’ll see later, but in some cases it might be.
I continue my investigation into the machine itself with these commands:
- df -h
- pvs; vgs; lvs
These commands show me what filesystems are mounted, how much space they have on them, and what kind of I/O system is under the box in general. dmesg is useful for several reasons. For one thing, it has a lot of information on the box’s hardware, including RAID controllers, installed memory, and so on in the boot output. In addition, at the tail it has information on core dumps and so on. It’s possible that it has filled up with stuff and the boot output has been lost, but in many cases there is a lot of information here that’s hard to get elsewhere.
If I see an LSI MegaRAID card, which is pretty common, I can check whether MegaCli is installed, and use it to get information about the physical drives installed; then I can find out exactly what they are, both in terms of size and spindle speed. I need to know how many physical spindles are under the box, and how fast they are. I can also get information about the all-important battery-backed write cache, how it’s configured and what its status is. I can see the drive status, too. Sometimes a failed drive is present but no one has noticed it yet! (By the way, we can also help you install and configure monitoring and graphing systems, if you don’t have any. These will help catch such problems.) If MegaCLI isn’t installed there may be other ways to get this information, too.
The last three commands are for giving me information about LVM devices, volume groups, and logical volumes. At this point, unless I’ve seen something that I want to investigate further, I have a pretty good idea what hardware and operating system I’m working with.
Next I investigate what the system is doing, in terms of performance. There are two key commands for this on most operating systems (there are others, such as mpstat or dstat, that I may also elect to use in some cases):
- vmstat 5 5
- iostat -dx 5 5
If you don’t know how to read these, there’s a full explanation in our book, High Performance MySQL 2nd Edition (it’s not in the first edition). From this I get an idea where any possible problems might be: is the disk saturated? Is the machine swapping actively? (This is more important than whether it’s using swap.) There are a lot of combinations of possible things you can see here, so I will not try to explain it all. Basically you need to know everything there is to know about these two commands and their output.
Based on what I’ve seen so far, I may look at other things, such as ifconfig (which shows good stuff like dropped packets).
Starting to look at MySQL
Assuming I don’t see anything noteworthy here, I’ll move on to the MySQL server. I mean, let’s just assume the physical hardware and the machine setup is fine, and the client has asked me to figure out why the website is slow; assume also I’ve checked out Apache and network config, etc, and I don’t see anything wrong there. So at this point, it looks like MySQL is probably a good place to look for performance problems. Everything I’ve done up till now is just due diligence; I’ve gotten my bearings on the server as a whole, and now I’m ready to see what’s up with the database.
The first step is to find out what MySQL instances are on the machine. It doesn’t do to just assume there’s a single instance in /var/lib/mysql and it’s reading from /etc/my.cnf. I’ve seen a lot of cases where there’s a /var/lib/mysql and an /etc/my.cnf, and the server is installed in /customsoftware/mysql with a different my.cnf file. The way to find this out is to look at ps:
[percona@db1 ~]$ ps -eaf | grep mysqld
root 3137 1 0 Nov19 ? 00:00:00 /bin/sh /usr/bin/mysqld_safe --defaults-file=/etc/my.cnf --pid-file=/var/run/mysqld/mysqld.pid --log-error=/var/log/mysqld.log
mysql 3172 3137 78 Nov19 ? 1-11:17:30 /usr/libexec/mysqld --defaults-file=/etc/my.cnf --basedir=/usr --datadir=/var/db/mysql --user=mysql --pid-file=/var/run/mysqld/mysqld.pid --skip-external-locking --port=3306 --socket=/var/db/mysql/mysql.sock
consult 28852 26909 0 19:08 pts/2 00:00:00 grep mysqld
In this case, yes it’s all default. It’s listening on the usual port, etc etc. Nothing unusual. Next I check what the MySQL version is, and I check (with the ‘file’ command) whether /usr/libexec/mysqld is a 64-bit build, if I’m on a 64-bit OS.
Examining MySQL configuration
Then I’ll look at my.cnf quickly for anything weird:
[percona@db1 ~]$ grep -v '^#' /etc/my.cnf | grep .
My goal here is to look at the my.cnf without seeing all the comments, which a) are often a small novel, b) often say what the sysadmin thinks s/he’s doing, which might not be what is really being done. I want to look at it without polluting my brain with that. I’ll just scan this for anything odd. In contrast to what you might think, spending a lot of time here isn’t usually a good idea, because what’s left out is often more important than what’s included. For example, the absence of skip-name-resolve is easy to miss, as is the absence of innodb_log_file_size.
I won’t give you a full rundown over every option in the file. But again, you learn quickly if something strange is here.
This whole time I’ve been copying the output of the commands and saving them in a text file. I’ll attach this text file to the issue in our CRM system so I (or someone else) can look at it later, see what I saw, and follow my line of reasoning about any suggestions I make.
The next thing to do is grab ‘mysqladmin variables’ and put it in the text file. I don’t spend a lot of time looking at this. For one thing, all the values are in big units, so it’s hard to read. If the InnoDB buffer pool is set to 22GB, it’s much easier to see 22GB than 23622320128, especially since that number is buried next to a lot of other long numbers. They make my eyes hurt, and there are better ways to do this (I will run mysqlreport to get friendly numbers I can read easily). But what I will do is scan the output for something strange that might be caused by a syntax error in the my.cnf file. For an example, take a look at this bug report (not a bug).
Looking at MySQL status
Now we’re starting to get into the really interesting bits. Alas, these are the bits where there is the most variability, so I’ll have to be even less detailed and a little vague. The next command is this:
mysqladmin ext -ri10
I let that run for at least two iterations. The first iteration is the current values since the server was booted; the second and subsequent are incremental differences. I’ll usually capture two of these. It looks like this:
[percona@db1 ~]$ mysqladmin ext -ri10
| Variable_name | Value |
| Aborted_clients | 205174 |
| Aborted_connects | 29 |
| Binlog_cache_disk_use | 0 |
| Binlog_cache_use | 9630066 |
| Bytes_received | 38563413074 |
| Bytes_sent | 216162991863 |
| Com_admin_commands | 255868807 |
... snip ...
| Uptime | 162626 |
| Uptime_since_flush_status | 162626 |
| Variable_name | Value |
| Aborted_clients | 6 |
| Aborted_connects | 0 |
| Binlog_cache_disk_use | 0 |
| Binlog_cache_use | 468 |
| Bytes_received | 1708212 |
| Bytes_sent | 7796961 |
| Com_admin_commands | 11893 |
... snip ...
Then I format the second set of values up beside the first set for ease of reference (an easy task with an automated too, or a quick Vim command if I’m doing this the manual way.) The result looks like this:
| Variable_name | Value | Value |
| Aborted_clients | 205174 | 6 |
| Aborted_connects | 29 | 0 |
| Binlog_cache_disk_use | 0 | 0 |
| Binlog_cache_use | 9630066 | 468 |
| Bytes_received | 38563413074 | 1708212 |
| Bytes_sent | 216162991863 | 7796961 |
| Com_admin_commands | 255868807 | 11893 |
| Com_alter_db | 0 | 0 |
This output is very useful when writing the report to the client. Now I go through it line by line looking for things that look wrong. When I find something, I compare it against the server’s variables or whatever other data I need, and write something instructional to the client about it.
Preparing to report findings
I should mention here a bit about how I write the report to the client. I copy and paste the interesting snippets from what I’ve been gathering, save it in a text file, and explain what I’m seeing to the client. Anything I see that looks wrong should be explained in detail, because my goal is to teach the client, not just to awe them with my knowledge and get “I could never understand, O Almighty Percona, here’s your money” in reply. If the client doesn’t understand what I’m doing, I’ve done something wrong. Clients need to understand so that they know they’re getting good service. A Percona consultant never says “just trust me” unless the client really insists “I don’t want to know.”
That’s also why I’ll comment on things that seem good. Clients need to know that I’m being thorough. Once upon a time, a Percona consultant was called back in to do some further work about 6 months after Peter helped a client. Now, if Peter worked with the client 6 months ago, you have big shoes to fill, and getting further gains will probably not be easy. So this consultant looked through things, spent about 2 hours, and came back to the client with “there’s not a lot of performance left to gain here. You can do X, Y and Z, and that’s about all I see. Beyond that, we need to talk about making your overall architecture more scalable.” And the client said “that’s all? Didn’t you do any work? Peter’s report 6 months ago was a small encyclopedia!” The mistake here was that the consultant didn’t show all the things he’d checked, which was extensive indeed. After a quick explanation the customer understood that the consultant had really done a lot of work, but it’s better to show the client the full extent of the investigations in the first place, and not just show “here’s the 3 bad things I found in 2 hours.” We also make sure that we refer back to the previous issue’s findings, so that we don’t duplicate efforts from the earlier optimization.
Back to the output of mysqladmin. When I send the report to the client, it typically looks something like this. First, I explain the meaning of the output, so it’s clear what I’m showing them. Then, I go through it line by line, like this:
12345678910111213141516171819202122232425262728293031323334 +-----------------------------------+----------------------+----------------------+| Variable_name | Value | Value |+-----------------------------------+----------------------+----------------------+| Aborted_clients | 205174 | 6 |Here we can see that something is disconnecting ungracefully(without closing the connection properly). This is happening 6times every ten seconds. We need to figure out what is happeningbecause it can indicate something else is wrong. This is not abig deal for performance, it's just something to figure out.... snip ...| Com_admin_commands | 255868807 | 11893 |Why is something running an admin command constantly? Thisis generally a Ping command, which is useless. Something ispinging the server about 1200 times per second. Often this isfrom some ORM system, such as Class::DBI, which does $dbh->pingbefore each and every query.Why is Ping useless? Well, it is intended to tell you whether theserver is alive. The trouble is, if it succeeds, it tells you nothing.An instant later, when the ORM system actually runs the query,the server could have died. The ORM should just be running thequery, and if the query fails, then you know from the error codethat the server is dead. So a successful Ping is meaningless.On the other hand, pinging can be a serious performance problem.Not really for the server, since it's a cheap query. It adds hardlyany load. But for the application, it's another network round trip.It is a "garbage" query that just adds latency and makes the overalltime to complete an action longer. If your application runs very shortqueries, this can be a non-trivial part of the app's response time.
And so on, and so on. I’ll often refer to other things; for example, when I’m examining the status values for Created_tmp_tables and Created_disk_tmp_tables, I’ll paste in
mysql> show global variables like '%table_size%';
| Variable_name | Value |
| max_heap_table_size | 268435456 |
| tmp_table_size | 268435456 |
If these are not the same size, as often happens, I’ll explain that the minimum of the two is used to determine when an in-memory temporary table will be converted to an on-disk table, which is an expensive operation.
In this way, I work through the status and the variables, and explain to the client what I can deduce about the server from them.
I would point out that I don’t spend a lot of time on settings and status. It is not where the greatest gains are to be found. The real gold mine is yet to come. After you’ve done it a few times, you can go through the settings and status pretty quickly.
Analyzing MySQL’s workload and data
After I’m done with that, I’ll grab a few snapshots of SHOW FULL PROCESSLIST, and look for odd values in this. For example, if I see a lot of queries in odd statuses, like “statistics”, I know something weird is going on, and I look deeper. I also look at the queries themselves. If the client has long-running queries, it can be pretty easy to catch. Maybe they’re long-running because they’re in Locked status, which might be an indication that it’s time to convert MyISAM tables to InnoDB. (But then again, it might not be.) Or maybe the client is doing queries like “… WHERE client IN (SELECT id FROM clients)” which is a really bad query plan.
After this I’ll look at SHOW INNODB STATUS\G and see if there’s anything worth commenting on there. Depending on the workload, there might be a lot of substance here. There’s also a very thorough section on this in our book.
Next I look at the data in the server and see what I think about it. If the server is not heavily loaded, I may even do some INFORMATION_SCHEMA queries to help me find the biggest tables and so forth. If the server is heavily loaded or has a lot of data, touching the INFORMATION_SCHEMA can take a barely-running site and bring it down, so I do this with great care! If it’s not appropriate I may run ‘du’ on the data files, or simply skip this step. This information can help me determine what kinds of things I ought to look for later during query analysis, and also might show me things like huge tables that should be archived, tables that I should be aware of if I see them in queries, or what have you.
I will also look at the error log. The error log shows all sorts of interesting things. Basically, anything but silence in the error log is interesting and needs to be investigated. You might find that there are InnoDB tables whose .frm files exist but have been dropped from InnoDB, for example. (The INFORMATION_SCHEMA query will surely trigger errors about this.)
There are a bunch of other things I’ll look at, including checking for default users in the mysql.* tables, running mk-duplicate-key-checker to find redundant indexes, and so on. As with settings and status, these set the stage for bigger gains to come later, so I don’t spend all that much time here, although there is occasionally something really bad that can be fixed and bring big gains.
Schema, query, and index optimization
Analyzing and optimizing a server’s table and index structures, and the queries that run against them, is by far the most important thing to concentrate on. All the other work is a prerequisite to understanding the context within which these things operate. For example, if the workload is a star schema data warehouse, it is important to know a lot about the I/O subsystem. And by now I know that I should concentrate in certain areas — there may be hundreds of tables, but at this point I should have identified a handful of tables that really matter. That’s why I save this part of the analysis for last.
You cannot consider any one of these factors (schema, queries, indexing) in isolation, because they are tightly bound together, and tweaking one will often have effects on the others. Each decision is a consideration of the costs and benefits on the system as a whole.
The queries the system runs, and hence the slow query log, is one of the more fruitful ways to analyze the system. What I usually do is ask the client to enable it a day before the audit if possible, and set it to one second. That way there’s at least one “cycle” of queries in it. Then I run the following:
perl mysqlsla -lt slow /path/to/slow.log
By default, this outputs the top 10 most expensive queries, in total execution time. By the way, the slow query logs in the stock MySQL server are extremely limited, and for serious analysis it’s basically mandatory to use a server that has the Percona patches for microsecond logging and additional information in the slow query log output. The stock MySQL server’s limitation of one-second granularity makes it hide problem queries that are faster than one second (which in a high-performance system is virtually every query.) If it’s not possible to use a patched binary, we can use MySQL Proxy, packet sniffing, or other techniques to get more information than the slow log gives us; but by far the richest and most efficient source of analysis information is our custom server, which we have designed exactly for the purpose of giving us as much insight as possible into the server’s true workload. At the moment there is no other technique that approaches the amount of information a Percona build can give you.
Having found the desired information about the queries by any means necessary, I find out which are the most likely to give the greatest gains. I use a dual approach for this: I look for both the queries which cause the greatest load on the server (in aggregate) and which cause the greatest latency for the application. Again, it’s a process of finding outliers by some criteria. Both types of queries matter. Some may be extremely fast, but run very often; others may be run seldom but take a long time to finish. And it’s important to focus on ones that can give the greatest improvement for the customer. I can often tell at a glance whether a query is going to be possible to improve a lot, and a simple mental calculation can then tell me how much total gain I can get from it. Queries are not just “bad” or “good” — it’s a question of where they fall on the scale.
Now I analyze these one at a time. For each query, I run EXPLAIN if possible (rewriting non-SELECT queries if needed, and using careful judgment for queries with subqueries in the FROM clause, which will actually execute the inner subquery!), and examine the query plan. This is where you need to really know how to write queries and how EXPLAIN works. At Percona, we have peer training constantly on our internal IRC channel and mailing list. We share all sorts of dirty tricks and neat ideas with each other. Nothing is off the table. Queries can be rewritten. Indexes and data types and table structures can be changed. Queries can be broken into pieces, combined, or even eliminated entirely (we may suggest caching, or tell the customer to evaluate Sphinx, or something like that.)
In the common case, though, a query simply needs a rewrite or a new index or something. In this case we show the query and the information mysqlsla outputs about it (execution times, etc), the EXPLAIN plan, and the desired modifications to the query or the table. We explain how to interpret what we’re showing, and why the proposed modifications are better. If the client approves it, we may make a copy of the table, make the proposed modifications to it, and show the difference afterwards. Or we might have a test server to run on. It varies widely; some clients have a test server, some don’t; some have a formal QA process; some don’t. It is very customer-specific, and we work with what we have.
Sending the report to the client
After the slow query log analysis is done, I format all my analysis for sending, add in things I may have noticed along the way (comments on backups, for example) read over it once again and make sure I didn’t miss anything important or write something in a confusing way, and then send it to the client. Then I usually call to discuss, or just confirm that it was received and wait for a reply. In most cases the client has a lot of work to do — sometimes weeks of application changes. Afterwards there’s usually more to be gained by doing another pass through.
We may also want to have another call with the customer and talk about changes that can or should be made at a much higher level. If we see that the application’s overall architecture needs to be changed, that’s something to discuss. Customers usually want us to validate their application’s overall scalability and whether it’ll grow to meet their demands for some period of time. We also frequently bring the discussion to topics such as monitoring, alerting, backups, caching, reverse proxies, and high availability solutions, all of which we can frequently help customers set up much more efficiently and with better results, due to our knowledge of what works in the real world and where there’s quicksand to be avoided.
The changes Percona’s performance audits typically suggest can be really significant, and can completely change the performance profile and workload of a server, so it’s often worth another iteration, and sometimes even more after that. A really thorough audit can take up to ten hours! The initial round usually takes less than two, however. The ten-hour cases are usually for really complex applications, or apps that have a lot of stored procedures, which are more difficult to analyze because you have to pull out the statements and analyze them individually. The goal is always to develop X-Ray Vision and see through the symptoms to the underlying causes in every aspect of application performance.
In all cases, the customer’s guidance is our first rule. The customer determines the depth and complexity of the analysis we do. A lot of times, customers will tell us up front to limit the work to some number of hours, such as 3 hours. Knowing this ahead of time gives us a framework within which we have to deliver the maximum value to the customer. If we have less time, we fix the “sore thumbs” and make notes of places where more effort is likely to pay off, and mention those to the customer. Or we economize by giving the customer easy choices so they can decide which things to have us fix, which to fix themselves, and which to just ignore.
The above process may look like a static recipe that you can just mechanically execute, and in some cases that’s truly what we do. However, we don’t just go through the motions. The “recipe” above is only a framework within which we work. Knowing the rules of a performance audit, and why they are the way they are, means you can know when and how to break them for the client’s benefit. It’s like knowing openings in chess — the great chess players study openings exhaustively, but they don’t robotically play them in tournaments.
For example, although it may sound as though I go work for a while and then emerge with a report, it’s not that way a lot of times. Flexibility and exercising judgment about when it’s appropriate to interact with the customer are a hallmark of our approach. If we need to be interactive and keep in touch with the client early and often, we do that. A lot of times I’ll start out with a quick look and have a second brief call with the customer, or send a quick email and ask for feedback on the direction I’m going; or just get on an instant messaging chat and tell the customer “here’s the top thing you can start working on while I keep investigating.” We make our process as interactive and flexible as the situation calls for. Early, frequent feedback lets us make small adjustments to the direction we’re going, communicate to the client how much time we’re spending, and shortcut effort if it’s not useful, e.g. skipping analysis for a query the client’s developers are in the process of removing from the application completely.
That’s all! I hope this has been an informative and fun trip through the world of a typical Percona performance audit, although as you’ve seen there really isn’t any such thing. Post your questions in the comments, and I’ll try to answer them as time permits.