September 21, 2014

Intel Nehalem vs AMD Opteron shootout in sysbench workload

Having two big boxes in our lab, one based on Intel Nehalem (Cisco UCS C250) and the second on AMD Opteron (Dell PowerEdge R815), I decided to run a simple sysbench benchmark to compare how both CPUs perform and what kind of scalability we can expect.

It is hard to make an apples-to-apples comparison, but I think it is still interesting.
The Cisco UCS C250 has a total of 12 cores / 24 threads of Intel Nehalem X5670, and the Dell PowerEdge R815 has 48 cores of AMD Opteron 6168.
One of the biggest differences is that the Cisco is running CentOS 5.5 and the Dell R815 is running RedHat EL 6. I will probably need to rerun the benchmark after upgrading the Cisco to CentOS 6 (will it even be released, or is it dead?).
For the benchmark I took sysbench oltp (both read-only and read-write) and MySQL 5.6.2, as it seems to be the most scalable system at the moment. All data fits into memory, so it is a fully CPU-bound benchmark.
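
For readers who want to reproduce a sweep like this, here is a minimal sketch of how the sysbench command lines could be generated. It is not the script used for the published numbers (that one is on the Benchmark Wiki). The flags follow sysbench 0.4 syntax, and the thread list, table size, run time, user and schema name are all assumptions picked for illustration.

```python
# Sketch only: the real script lives on the Benchmark Wiki and may differ.
# Flags follow sysbench 0.4 syntax; sizes, durations and the thread list
# below are assumed values, not the ones used for the published numbers.

THREAD_COUNTS = [1, 2, 4, 8, 16, 24, 32, 48, 64, 128, 256]  # assumed sweep
TABLE_SIZE = 10_000_000   # assumed; small enough to stay in the buffer pool
RUN_SECONDS = 180         # assumed duration per data point


def build_sysbench_cmd(threads, read_only):
    """Return the sysbench argv for one data point of the thread sweep."""
    return [
        "sysbench",
        "--test=oltp",
        "--oltp-table-size=%d" % TABLE_SIZE,
        "--oltp-read-only=%s" % ("on" if read_only else "off"),
        "--num-threads=%d" % threads,
        "--max-time=%d" % RUN_SECONDS,
        "--max-requests=0",       # run for the full time, not a request count
        "--mysql-user=root",      # assumed credentials
        "--mysql-db=sbtest",      # assumed schema name
        "run",
    ]


for read_only in (True, False):
    for threads in THREAD_COUNTS:
        # Print instead of executing; pass the list to subprocess.run() to run it.
        print(" ".join(build_sysbench_cmd(threads, read_only)))
```

Printing the commands rather than running them keeps the sketch safe to try on a machine without sysbench or MySQL installed.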

The full numbers, script and config are on our Benchmark Wiki, along with some graphs.

OK, despite claims I have heard that AMD Opteron is much slower than Intel Nehalem, I do not see it in the results.
Both systems scale pretty decently up to 32 threads, with AMD a little bit slower.

After 32 threads, the Intel-based system handles 48-256 threads pretty decently, but on the R815 / AMD something goes wrong. We do not see a good improvement from 32 to 48 threads (though I did expect one, as we have 48 cores), and after 48 threads throughput drops significantly. I actually suspect this is rather a MySQL problem, and 32 cores may be the limit of scaling for MySQL 5.6.
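
One common way to reason about a throughput curve that rises, flattens and then drops is a contention model such as the Universal Scalability Law. The sketch below is only an illustration of that model, not a fit to these results: the `sigma` (serialization) and `kappa` (coherency cost) values are arbitrary picks, and the single-thread throughput is a placeholder.

```python
import math


def usl_throughput(n, single_thread_tps, sigma, kappa):
    """Universal Scalability Law: modeled throughput at n concurrent threads.

    sigma models time spent serialized (e.g. waiting on a mutex);
    kappa models the pairwise cost of keeping shared data coherent.
    """
    return single_thread_tps * n / (1.0 + sigma * (n - 1) + kappa * n * (n - 1))


def usl_peak_concurrency(sigma, kappa):
    """Concurrency at which the modeled throughput is highest."""
    return math.sqrt((1.0 - sigma) / kappa)


# Arbitrary illustrative parameters, NOT measured on either box.
SIGMA = 0.02
KAPPA = 0.0005
BASE_TPS = 100.0  # placeholder single-thread throughput

print("modeled peak at about %.0f threads" % usl_peak_concurrency(SIGMA, KAPPA))
for n in (1, 8, 16, 32, 48, 64, 128, 256):
    print(n, round(usl_throughput(n, BASE_TPS, SIGMA, KAPPA), 1))
```

With a non-zero `kappa` the modeled throughput falls once the thread count passes the peak, which is the shape you would expect if a shared lock or cache-coherency traffic is the limit. Fitting `sigma` and `kappa` to the measured points would be the next step.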

Anyway, we have PERFORMANCE_SCHEMA, and in the next run I will try to get more information on which “wait” event is most used and keeps MySQL from scaling.
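
As a rough idea of what that could look like, here is a sketch that ranks wait events by total time waited. It assumes the `performance_schema.events_waits_summary_global_by_event_name` summary table, whose `SUM_TIMER_WAIT` column is reported in picoseconds. The helper name `summarize_waits` is made up for this example, and how the rows are fetched (which connector, which credentials) is left open.

```python
# Assumes the performance_schema summary table
# events_waits_summary_global_by_event_name; SUM_TIMER_WAIT is in picoseconds.
TOP_WAITS_SQL = """
SELECT EVENT_NAME, COUNT_STAR, SUM_TIMER_WAIT
FROM performance_schema.events_waits_summary_global_by_event_name
WHERE COUNT_STAR > 0
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10
"""

PICOSECONDS_PER_SECOND = 10 ** 12


def summarize_waits(rows):
    """Turn (event_name, count, sum_timer_wait_ps) rows into readable dicts,
    sorted by total wait time, largest first."""
    ranked = sorted(rows, key=lambda r: r[2], reverse=True)
    return [
        {
            "event": name,
            "count": count,
            "total_wait_s": total_ps / PICOSECONDS_PER_SECOND,
        }
        for name, count, total_ps in ranked
    ]
```

The rows would come from running `TOP_WAITS_SQL` through any Python DB-API cursor (`cursor.execute(TOP_WAITS_SQL)` followed by `cursor.fetchall()`). Comparing the ranking at 32 threads with the one at 64 or more should show which wait grows when throughput drops.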

About Vadim Tkachenko

Vadim leads Percona's development group, which produces Percona Cloud Tools, Percona Server, Percona XtraDB Cluster and Percona XtraBackup. He is an expert in solid-state storage, and has helped many hardware and software providers succeed in the MySQL market.

Comments

  1. Vadim Tkachenko says:

    @whatever,

    The disk system does not matter as all data is stored in memory; this test was 100% CPU burning.
    But I expected that objection. On the R815 the main storage was a Virident tachIOn card. But again, it does not matter at all in this test.

  2. whatever says:

    Hi, Vadim Tkachenko :

    According to the hardware specs, you are comparing the Cisco C250 with FusionIO SSD + 384GB ram against R815 with 6x 7200rpm western digital RAID10 setup+ 160GB ram? Then this test is an IO test, not CPU comparison?

  3. Ketil says:

    I can’t help but notice that the Intel system has a lot more memory than the AMD system, and I’d be inclined to suspect this is the reason for the performance dropoff on AMD.

    Anyway, the interesting question is how much performance do I need, and how much do I have to pay for it. I just checked Dell, and a 32 core R910 Intel system (2.26GHz) is almost twice the price of an R815 with 48 2.3GHz cores – both with 256GB memory. So although the Intel system might still be faster, I think the AMD system is a much better deal.

  4. Peter Zaitsev says:

    Laurynas,

    It might be an interesting test indeed. Though I remember some 6 years ago we played a lot with different compiler settings in an attempt to get gains. We could get a couple of percent tops at that time, which was not worth it. Maybe architectures have diverged now and compilers have improved, so the numbers are different.

  5. Laurynas says:

    Vadim -

    I do not know how exactly the generic binary was built, but in the best case it could have used the -mtune=generic option (one size fits all, or both CPUs slightly penalized, perhaps by differing amounts). Since this is a CPU/memory-bound benchmark, it’d be interesting to see if building for each CPU with -march=native --param l1-cache-size=x --param l1-cache-line-size=y --param l2-cache-size=z changes the results. Although I’d expect any differences to come up with smaller numbers of threads only.

  6. Vadim Tkachenko says:

    Laurynas,

    Nope, it was generic Linux binary tar.gz from dev.mysql.com

  7. Laurynas says:

    Vadim -

    Was the mysqld binary built optimized for each CPU?

  8. Denis M says:

    Vadim,
    I think it’s interesting to measure what portion of performance hyper-threading contributes.

  9. Patrick Casey says:

    I dunno, my rule of thumb is that I’d almost always prefer to have my MIPS on a small number of big cores rather than on a large number of small ones, since that’ll work well under various loads. I’m sure there are other workloads where having a huge core count beats having fast cores, but most of the ones I run into tend to be the opposite.

    If you imagine 1 big CPU that does 16 “units of work/time” vs 16 little ones that do 1 “units/time” then they both have the same theoretical throughput, but the single big CPU is much more likely to achieve that in practice. Only case having a family of little CPUs would help out would be if you have a near perfectly parallelized problem.

    I’ve been burned pretty hard with a couple of clients, for example, who bought Sun Niagara chips because the sun salesman convinced them it was the cheapest way to get lots of clock cycles, which was true for a perfectly parallelized compute problem, but grossly untrue for the kind of database workloads the customer had (they were much better off on dual core intels).

  10. Jeffrey Gilbert says:

    Now that I think about it a little while, it’s not THAT surprising that Intel’s faster chips have a leg up on the scalability side under MySQL workloads. The AMD chips are far better suited for true multi-threaded applications (e.g. apache), where there are no lock conditions. What’s happening, from my understanding, is that once contention for transactions becomes a factor of waiting for a process to finish executing, intel will win because it can execute that process faster and release the lock. On AMD, the individual cores can only execute instructions so fast, so their speed bottlenecks will show up in higher tps loads.

    Under apache or something where resource locking isn’t an issue, i think the AMD chips would have a better leg up because they could spawn a process, throw it to a core, and forget about it. Just a guess based on how I understand both softwares to work.

  11. EBob says:

    You are all confusing the issue with the cores:

    Intel: 12 cores + hyperthreading.
    AMD: 48 cores.

    Forget about “threads” and “virtual cores”, it’s not a number you should consider, just think 5-10% faster than without hyperthreading.

    This seems to prove definitively that fewer faster cores are better than more slower cores for MySQL. I’m shocked that a 48 core system did not obliterate the 12 core. Simply shocked.

  12. Peter Zaitsev says:

    Jeffrey,

    Vadim has corrected it: it is 12 cores / 24 threads here. It is 2-socket Intel vs 4-socket AMD.

  13. Jeffrey Gilbert says:

    From wiki:

    http://en.wikipedia.org/wiki/Nehalem_(microarchitecture)

    6 cores (12 threads) @ 2.93 GHz
    6 x 4 = 24 (so 4 sockets)

    And the Opterons of course have slower clocked cores but more per socket, so

    12 cores (12 threads) @ 1.9GHz
    4 x 12 = 48 (still 4 sockets)

    That’s my understanding of how this benchmark is being run and why it’s so even between the two. It is interesting to see that the chips with more physical cores cannot end up scaling past the chips with the faster clock, but for energy consumption during average use, it’s probably closer to AMD than Intel.

  14. Peter Zaitsev says:

    Vadim,

    So are you saying the performance is basically the same for a single core, or is it just how the graph makes it look?

    If this is the case it is interesting to see Opteron cores are not slower. Overall however we get about same performance from 2 socket Intel as 4 socket Opteron which should be more expensive and take more power… which means 2 socket Intel is a way to go these days.

  15. Vadim Tkachenko says:

    Yes, my bad, it is 24 cores on Cisco / Intel Nehalem

  16. George says:

    Yeah Xeon X5670 is dual hexa-core = 12 cores + 12 virtual = 24 cores.

  17. Jeffrey Gilbert says:

    You sure that’s the right chip model? That’s a six core chip. 32 / 6 = 5.3333

  18. Patrick Casey says:

    Could it be choking on memory access? With that many cores thrashing the registers at once, you’d expect the instruction caches on the chips to be pretty saturated.

    I’ve yet to run a benchmark myself where memory access time mattered all that much, but you might have found the use case :) .

  19. George says:

    Surprising results. I would have thought a Nehalem-based processor running at a much higher clock speed would have a more substantial lead.

    How would innodb_buffer_pool_instances come into play, would it make a difference? I noticed the Intel and AMD servers have different vm.dirty_ratio values of 40% and 20%.

  20. Raine says:

    Nahalem? Did you mean Nehalem?

    :)

  21. tobi says:

    “I actually suspect this is rather MySQL problem, and 32 cores may be limit of scaling for MySQL 5.6.” But MySQL is a constant factor in both benchmarks. The CPUs and OS are the only differences(?). Maybe the Opteron CPUs have lower bandwidth for coherency traffic? Investigating this could reveal powerful insights.

  22. Baron Schwartz says:

    You could bind mysqld to 32 cores on the AMD box too, and see if it scales better after that.
