September 2, 2014

Color code your performance numbers

When analyzing how good or bad response time is it is not handy to look at the averages, min or max times – something what is easily computed using built in aggregate functions. We most likely would like to see some percentile numbers – 95 percentile or 99 percentile. The problem is computing these in SQL is ugly and resource consuming. There is however a simple way to get similar data, looking at it from the different point of view.

When we’re speaking about application we may not always care about exact value of response time but rather we want to see response time to be within certain range. For example if we define page feels good if response time is below 50ms it is not as important if response was 40ms or 44ms – it is much more important how frequently this goal was reached.

In fact I prefer to define 2 performance level. One what users would consider good performance, another what is acceptable, and basically the third level is unacceptable. We can also set classical color codes to them – green, yellow, red.

For example we can define good performance as response time for search pages within 1 second, acceptable as response within 3 seconds and anything over that will be considered unacceptable:

These are great numbers to look after. In this case we can say system has “two nines” in terms of performance – it responds with what we consider “good” performance in over 99% cases. I consider this a lot more meaningful number for business than speaking about 99% response time, especially as it also clearly clarifies how bad the problem is. If your goal is 99% response time is 1sec and you’re seeing it 2 seconds what you really do not know is how many people fall in this 1-sec to 2 sec range. It could be very small number and so your performance is not that bad or it could be majority of response times are withing 1 to 2 sec range.

Next thing I would add is the average response time for our good requests. This eliminates slow outliers which we just count by their number and allows to see how good our performance really is. I consider it secondary value as our major goal is eliminating variance as defined by Six Sigma practices. However given variance is the same better average response time for good requests helps us to track performance changes closely and act before we get large number of yellow or red requests:

Finally I find it helpful to track it among time period. For example in this case we’re looking for data grouped by 15 min interval though you may be looking at different intervals too:

Note response time does not always correlates with number of bad requests. For example we can see at one interval there were over 6% requests in yellow but average response time for good requests actually was better than in the other intervals.

The queries which produce this data may not be overly beautiful but they are pretty fast requiring simple scan of the data range.

P.S Gathering data this way also often gives you some idea about what your percentile response time numbers might be. In this case I know over 99% of requests are served in 1sec so I know my 99% response time is below 1 second.

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. What does the table structure of performance_log_100601 look like? What sort of data are you pushing into it? (I know it’s not necessary to the scope of this article, but you’ve peaked my curiosity).

  2. Peter, this reminds me about Apdex (what NewRelic uses to score):
    http://en.wikipedia.org/wiki/Apdex

    Meeting the response time objective gives you 1 point.
    Missing your response time objective, but being within x4 of it gives you 0.25 points.

    Your final score is based on:
    your_points / maximum_points_possible

  3. peter says:

    Eric,

    This table is very similar to described here
    http://code.google.com/p/instrumentation-for-php/
    (With example PHP code etc)

  4. peter says:

    Morgan,

    You mean like this:

    mysql> select (sum(wtime<1.0)+sum(wtime between 1.0 and 4.0)*0.25)/count(*) apdex from performance_log_100601 WHERE page_type=’search’;
    +———-+
    | apdex |
    +———-+
    | 0.997480 |
    +———-+
    1 row in set (24.30 sec)

    Re NewRelic I’m wondering how they can compute it if they do not know response time goals for different user interaction in the application, or do they ?

Speak Your Mind

*