October 20, 2014

Performance or Stability???

Sometimes the question is put as: are you looking for Performance OR Stability? I believe this is a strange way to put it. In real-life systems you care about both Performance AND Stability. I would even say Stability is not the best word here; in most cases what you really care about is your minimal performance.

If a system can handle 5,000 q/sec for one minute and then 20,000 q/sec for the next one, how much can I count on in terms of capacity planning? If this is a typical OLTP system, I will have to use the 5,000 q/sec number, as I need my system to always be able to meet its performance requirements. If the system is doing batch processing, though, maybe I can count on the average, which is 12,500 q/sec in this case.
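
To make the capacity-planning arithmetic concrete, here is a tiny Python sketch (the per-minute numbers are just the hypothetical ones from the example above): it picks the minimum for an interactive OLTP workload and the average for batch processing.

    # Hypothetical per-minute throughput samples (q/sec) from the example above.
    samples = [5000, 20000, 5000, 20000]

    # For OLTP, plan against the worst interval: the system must always keep up.
    oltp_number = min(samples)                  # 5,000 q/sec
    # For batch processing, the average over the whole run may be good enough.
    batch_number = sum(samples) / len(samples)  # 12,500 q/sec

    print("OLTP planning number: ", oltp_number, "q/sec")
    print("Batch planning number:", batch_number, "q/sec")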

The difference between stability and minimal performance is important, as I can be quite OK with “unstable” performance if it comes as performance bursts rather than stalls. For example, if my system performs at 7,000 q/sec and sometimes bursts to 15,000 q/sec, I will prefer it to a system which has stable 6,000 q/sec performance.

Finally, we have to understand the difference between benchmarks and real life. Most benchmarks perform “stress testing”: they throw load at the system and see how much it can handle. In the real world, however, you typically have a given load, which falls into a certain range; for example, you may say we have 300-500 queries/sec at our peak time. Because most systems see load based on “random arrivals” rather than a uniform pace of requests, the more finely you slice the time, the more variance you are going to see. For example, the same case could correspond to 20-100 queries in a 100ms period. In real applications you do not drive your system at the complete saturation point, so as to accommodate such micro spikes, and you care about response time a lot, as it is response time that users will use to judge whether your system is fast or slow.
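
If you want to see the “random arrivals” effect for yourself, here is a rough Python simulation (the 400 q/sec rate and 60-second duration are assumptions for illustration, not numbers from any real system). It generates Poisson-style arrivals and counts them per 1-second and per 100ms bucket; the finer buckets swing much more widely relative to their mean.

    import random

    random.seed(42)
    RATE = 400.0        # assumed average arrival rate, queries per second
    DURATION = 60.0     # seconds simulated

    # Generate arrival times with exponentially distributed gaps (Poisson process).
    arrivals, t = [], 0.0
    while True:
        t += random.expovariate(RATE)
        if t >= DURATION:
            break
        arrivals.append(t)

    def bucket_counts(times, width):
        n = int(round(DURATION / width))
        counts = [0] * n
        for x in times:
            counts[min(int(x / width), n - 1)] += 1
        return counts

    per_sec = bucket_counts(arrivals, 1.0)
    per_100ms = bucket_counts(arrivals, 0.1)

    print("per 1-second bucket, min/max:", min(per_sec), max(per_sec))      # mean ~400
    print("per 100ms bucket,   min/max:", min(per_100ms), max(per_100ms))   # mean ~40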

You will always see some response time distribution rather than all queries of the same type having the same response time, and this distribution will vary a lot. Typically, for the same average response time, the fewer outliers you have the better.
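
As a toy illustration of why the tail matters, here is a small sketch with made-up latency samples: two sets with the same average response time but very different outliers.

    import statistics

    stable = [10, 11, 10, 12, 10, 11, 10, 12, 10, 14]  # ms, no real outliers
    spiky  = [5, 6, 5, 6, 5, 6, 5, 6, 5, 61]           # ms, same average, one big outlier

    def p99(samples):
        # Simple nearest-rank percentile; good enough for a sketch.
        ordered = sorted(samples)
        return ordered[min(len(ordered) - 1, int(0.99 * len(ordered)))]

    for name, s in (("stable", stable), ("spiky", spiky)):
        print(name, "average:", statistics.mean(s), "ms  p99:", p99(s), "ms")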

The relationship between throughput and response time is complicated, and we can’t always say better throughput comes with better response times, but it still makes an important point. If I know my system dips to 1,000 q/sec for 10 seconds and I have to serve traffic of 2,000 q/sec, I can’t do that, as a lot of queries will have to be queued for at least 10 seconds until performance recovers, and this means their response time will be at least 10 seconds, which is likely not acceptable to me.
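
Here is the back-of-the-envelope version of that example in Python; the 3,000 q/sec recovery rate is an assumption added for illustration, everything else comes from the numbers above.

    arrival_rate = 2000   # q/sec the application sends
    dip_rate     = 1000   # q/sec the system sustains during the dip
    dip_seconds  = 10

    backlog = (arrival_rate - dip_rate) * dip_seconds   # 10,000 queries queued up
    print("queries queued during the dip:", backlog)

    # Assume the system recovers to 3,000 q/sec: it drains only 1,000 q/sec of
    # backlog on top of the incoming 2,000 q/sec.
    recovered_rate = 3000
    drain_seconds = backlog / (recovered_rate - arrival_rate)
    print("seconds to drain the backlog after recovery:", drain_seconds)
    # The queue keeps growing until performance recovers, so requests issued
    # during the dip see multi-second waits, and under this assumption the
    # backlog does not fully clear until ~10 seconds after the dip ends.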

The micro stalls, though, can be acceptable. If my system serves 5,000 q/sec on average, but there are some 10ms intervals in which it stalls and processes 0 queries, or processes just 1,000 q/sec, and my query inflow rate is 2,000 q/sec with a required response time within 50ms, it may well be acceptable. Note that if you drill down to small enough time intervals, you will find these micro stalls in basically any system.
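
A rough check of why that can be fine, using the inflow and budget numbers above plus the 5,000 q/sec rate as the assumed steady-state capacity:

    inflow   = 2000   # q/sec arriving
    stall_ms = 10     # the system processes nothing for 10ms
    capacity = 5000   # q/sec once processing resumes (assumed steady rate)

    queued        = inflow * stall_ms / 1000.0             # ~20 queries pile up
    drain_ms      = 1000.0 * queued / (capacity - inflow)  # ~6.7ms to drain them
    worst_case_ms = stall_ms + drain_ms                    # ~16.7ms of added latency

    print("worst-case added latency: %.1f ms (budget: 50 ms)" % worst_case_ms)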

As a summary: most likely you do care about your performance, the minimal performance to be precise. The interval over which you should measure this minimal performance depends on the response time your application can tolerate. In the MySQL benchmarks Vadim posted we see there are a lot of stalls lasting one minute or more, which are not acceptable for any interactive application.
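
One way to make “minimal performance over an interval” concrete: take per-second throughput samples and look at the worst sliding window at the interval your application cares about. The Python sketch below uses invented numbers purely for illustration.

    # Per-second throughput samples: one minute with a brief stall near the end.
    qps = [5000] * 55 + [0, 0, 0, 500, 500]

    def worst_window(samples, window):
        """Lowest average throughput over any window of the given size in seconds."""
        return min(sum(samples[i:i + window]) / window
                   for i in range(len(samples) - window + 1))

    for w in (1, 5, 60):
        print("worst %2d-second window: %.0f q/sec" % (w, worst_window(qps, w)))
    # The 1-second view exposes a complete stall; the 60-second average hides it.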

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. Peter Zaitsev says:

    Patrick,

    Yes. There are a lot of assumptions out there. I surely assume you want to get the best result for your money, which means you want a system which can handle the maximal amount of load all the time. In your case the answer is easy, as 110-120 q/sec has both a higher minimum and lower variance compared to 90-1000 q/sec. What if it were 120-1000 q/sec instead? That has a higher minimum but higher variance.

    If you care about getting the most for your money, such behavior for the same price is better, as it can handle an application which requires 120 q/sec in peaks. However, I understand some people would rather spend extra and get a system with lower performance but lower variance for the peace of mind. A system which has lower variance is a lot easier to monitor for exceptional events.

  2. Peter Zaitsev says:

    Dmitri,

    Indeed, too bad you’re not coming to California this year. Though it is good you at least make it to Collaborate. There are going to be a bunch of Percona guys there, though, to keep you company :)

    Speaking about your description – yes… in the end computers are state machines, and if you dig deep enough you will find that the same conditions lead to a stable result. My point is that at a certain level it is too complicated to deal with in a production environment. We love to have an explanation for everything and we can find one in almost any case… though in many cases people would rather have just the solution if the explanation costs too much :)

  3. Peter Zaitsev says:

    Yzmir,

    Capacity is a good word, though it is rather loosely understood by people. Scalability and Performance are not much better, though – you can find tons of definitions.

  4. Yzmir Ramirez says:

    What about Capacity, for lack of a better word? If you take Stability, Performance, and Capacity and put them on a triangle, that pretty much sums up your choices between the three.

    As you increase Capacity, both Stability and Performance degrade, and vice versa. Would you agree?

    [img]http://img269.imageshack.us/img269/7130/servertriangleperforman.png[/img]

  5. Patrick Casey says:

    I think I share the consensus here in that I’d prefer consistent performance over throughput, provided that my performance was “good enough”.

    Put numerically, if my app requires 100 q/s to function, I’d rather have a database that delivers between 110-120 q/s over one that delivered anywhere between 90-1000 q/s.

    Of course, nobody sets out to build database infrastructure that can’t keep up with minimum requirements; usually if I’m looking at something it’s because the lower bound has been breached and suddenly we have a performance problem.

    Like Dimitri was saying though, there’s a mess of factors at work here, one of which is clearly going to be price. For *enough* money I can design to almost any practical performance level, but I’ve never, ever worked on a project where the design criterion was “make it as fast as you can, I don’t care about the price”. Usually the criteria are “it must have X performance, Y uptime, and should be as cheap as you can manage it”.

    My mental model of the mysql user base (which could be totally bonkers), probably separates it into three big tranches.

    You have a lot of small users backing cookbook sites, probably running myisam on old single disk servers or virtuals. This group probably values convenience above all else.

    You have a middle group of users running on hardware designed to run one or more “real” database servers, almost definitely innodb on RAID, big memory boxes, but with serious budget constraints. This group probably values performance because they’re trying to eke out the maximum throughput without buying another server.

    You have a small group at the top running enterprise grade sites on farms of dedicated database boxes. This group values stability above all else; they can solve a performance problem with more hardware and the cost of a few dozen extra boxes is nothing compared to the business cost of going dark or degrading to the point where there is customer impact.

    So I guess the answer to the first question is: it depends who you are :) .

  6. Dimitri says:

    Peter,

    Well, there are many reasons why performance may be unstable (and as you see, in all your examples you already have an explanation ;-) ).. – and most of the HW issues can be easily monitored and explained too (for ex. on Solaris you may scan for the ratio between sequential and random I/O operations, and every storage array has its own stats too, etc.)..

    Then, once you observe performance issues on your production system, the first thing you’re trying to do is to align all “random” events with what you’re observing within a given period and look if there is any dependency, etc.. Then, if you really want to find the root of the problem, you start by reducing the “random” perimeter: you make sure your MySQL instance is alone in using its storage, its system, etc. Then you try to analyze more in depth and finally come up with a test case which gives you a way to reproduce your problem *every* time.. – And even then you still may split your problem into several test cases, and analyze each one alone.. – like you’re using bricks to build a wall.. :-)

    A “stable workload” test case is one such brick. And according to what kind of workload you’re running, you know what kind of result you’re expecting, no?.. – and if it’s not so, then you analyze the problem in depth and look for solution(s) to fix it.. Once you’re sure you’re obtaining “stable performance” on a “stable workload”, you may try to involve other “random” events – to see what the impact will be of a batch started in the background, or a heavy report query scanning half of the tables, etc..

    So before you build a wall you first have to be sure about your bricks..
    Still don’t agree?.. ;-)

    Such a discussion may be very long and take weeks or months (especially if the beer is good :-) ). It’s a pity I’m coming this year only for the Collaborate-11 conference and will miss the UC; otherwise it would have been a good occasion to spend a night in the “perf bar” :-))

    Rgds,
    -Dimitri

  7. Peter Zaitsev says:

    Dmitri,

    The thing is, a stable workload mostly exists in labs; almost any real system has a workload which is unstable – the number of incoming requests varies, there are batch jobs, etc. The “hardware” performance is also generally not uniform, with competing processes such as backups running on the system; in many installations, such as virtualized environments or SAN/NAS-based databases, there are many tenants competing for the same hardware resources.

    In practical cases there is a danger in a system which bursts in performance every so often: if the burst is long enough, you may think it is the normal performance and use the wrong number for capacity planning.

    One more thing – bursts may happen in seemingly uniform systems, and you would need to make an extra effort to prevent them. Consider, for example, flushing pages. Even if you design the system to flush an even 100 pages to disk every second, you may end up with cases where these pages are spread over the whole disk, or located on close track(s), or even merged into the same sequential write. These will take different amounts of resources and hence will leave more resources for reads to happen, likely increasing throughput.

  8. Dimitri says:

    Personally I’d prefer the “stable performance” term, but everything is relative :-)
    Yes, testing is testing, and life is life.. What I’d expect from a database: if the testing workload is “stable” itself, there is not really any reason to see “unstable” database performance (if the OS, HW and I/O level are working as expected and don’t have any particular performance issues..) – so testing helps to reproduce the problem (when possible), as no customer will agree to play with their production :-))

    Then, any observed TPS “drops” should be analyzed and fixed. And there is still a lot of work yet to be done with MySQL.. ;-)

    Rgds,
    -Dimitri

