Intel Nehalem vs AMD Opteron shootout in sysbench workload

Having two big boxes in our lab, one based on Intel Nehalem (Cisco UCS C250) and the second on AMD Opteron (Dell PowerEdge R815), I decided to run a simple sysbench benchmark to compare how both CPUs perform and what kind of scalability we can expect.

It is hard to make an apples-to-apples comparison, but I think it is still interesting.
The Cisco UCS C250 has a total of 12 cores / 24 threads of Intel Nehalem X5670, and the Dell PowerEdge R815 has 48 cores of AMD Opteron 6168.
One of the biggest differences is that the Cisco box runs CentOS 5.5 while the Dell R815 runs RedHat EL 6. I will probably need to rerun the benchmark after upgrading the Cisco box to CentOS 6 (will it even be released, or is the project dead?).
For the benchmark I took sysbench OLTP (both read-only and read-write) and MySQL 5.6.2, as it seems to be the most scalable release at the moment. All data fits in memory, so it is a fully CPU-bound benchmark.

The full numbers, the script, and the config are on our Benchmark Wiki, along with some graphs.
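
For a rough idea of the shape of each run, the invocations look something like the following (a sketch only; the exact table size, thread counts, and connection options are in the script on the wiki, and the flag names assume sysbench 0.4):

    # prepare one OLTP table small enough to fit entirely in memory
    sysbench --test=oltp --oltp-table-size=10000000 \
             --mysql-socket=/var/lib/mysql/mysql.sock \
             --mysql-user=root --mysql-db=sbtest prepare

    # read-only run at a given concurrency; drop --oltp-read-only=on for the
    # read-write variant and repeat for 1..256 threads to get the scaling curve
    sysbench --test=oltp --oltp-table-size=10000000 --oltp-read-only=on \
             --num-threads=32 --max-time=60 --max-requests=0 \
             --mysql-socket=/var/lib/mysql/mysql.sock \
             --mysql-user=root --mysql-db=sbtest run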

OK, despite claims I have heard that AMD Opteron is much slower than Intel Nehalem, I do not see it in the results.
Both systems scale pretty decently up to 32 threads, with AMD a little bit slower.

Beyond 32 threads, the Intel-based system handles 48-256 threads pretty decently, but on the R815 / AMD something goes wrong. We do not see a good improvement from 32 to 48 threads (though I did expect one, as we have 48 cores), and after 48 threads throughput drops significantly. I actually suspect this is a MySQL problem, and 32 cores may be the scaling limit for MySQL 5.6.

Anyway, we have PERFORMANCE_SCHEMA, and in the next run I will try to get more information about which "wait" event is hit the most and keeps MySQL from scaling.
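
For example, a query along these lines against the 5.6 summary tables should show where the time goes (a sketch; it assumes the relevant wait instruments and consumers are enabled in setup_instruments and setup_consumers):

    mysql -e "
        SELECT EVENT_NAME, COUNT_STAR, SUM_TIMER_WAIT
          FROM performance_schema.events_waits_summary_global_by_event_name
         WHERE COUNT_STAR > 0
         ORDER BY SUM_TIMER_WAIT DESC
         LIMIT 10;"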


Comments (22)

  • Vadim Tkachenko Reply

    @whatever,

    The disk system does not matter, as all data is stored in memory; this test was 100% CPU-bound.
    But I expected that objection. On the R815 the main storage was a Virident tachIOn card, but again, it does not matter at all in this test.

    April 25, 2011 at 12:00 am
  • whatever Reply

    Hi, Vadim Tkachenko:

    According to the hardware specs, you are comparing the Cisco C250 with a FusionIO SSD + 384GB RAM against the R815 with a 6x 7200rpm Western Digital RAID10 setup + 160GB RAM? Then this test is an I/O test, not a CPU comparison?

    April 25, 2011 at 12:00 am
  • Ketil Reply

    I can't help noticing that the Intel system has a lot more memory than the AMD system, and I'd be inclined to suspect this is the reason for the AMD performance drop-off.

    Anyway, the interesting question is how much performance do I need, and how much do I have to pay for it. I just checked Dell, and a 32-core R910 Intel system (2.26GHz) is almost twice the price of an R815 with 48 2.3GHz cores, both with 256GB of memory. So although the Intel system might still be faster, I think the AMD system is a much better deal.

    April 25, 2011 at 12:00 am
  • Baron Schwartz Reply

    You could bind mysqld to 32 cores on the AMD box too, and see if it scales better after that.

    April 25, 2011 at 11:13 am
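
    A minimal way to try that on Linux would be to start mysqld under a restricted affinity mask (a sketch; the core range and defaults file below are placeholders, and numactl --physcpubind would be an alternative):

        # child threads inherit the mask, so all of mysqld stays on cores 0-31
        taskset -c 0-31 mysqld_safe --defaults-file=/etc/my.cnf &
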
  • tobi Reply

    “I actually suspect this is a MySQL problem, and 32 cores may be the scaling limit for MySQL 5.6.” But MySQL is a constant factor in both benchmarks. The CPUs and OS are the only differences(?). Maybe the Opteron CPUs have lower bandwidth for coherency traffic? Investigating this could reveal powerful insights.

    April 25, 2011 at 11:33 am
  • Raine Reply

    Nahalem? Did you mean Nehalem?

    🙂

    April 25, 2011 at 11:51 am
  • George Reply

    Surprising results; I would have thought a Nehalem-based processor running at a much higher clock speed would have a more substantial lead.

    How would innodb_buffer_pool_instances come into play? Would it make a difference? I noticed the Intel and AMD servers have different vm.dirty_ratio settings of 40% and 20%.

    April 25, 2011 at 12:33 pm
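
    To take those two settings out of the equation, one could equalize them on both boxes before re-running (a sketch; the values are only examples, and vm.dirty_ratio should barely matter for a purely in-memory run):

        # align the kernel writeback setting on both servers
        sysctl -w vm.dirty_ratio=20

        # and, under [mysqld] in my.cnf, split the buffer pool into several
        # instances to reduce mutex contention (available since MySQL 5.5):
        #   innodb_buffer_pool_instances = 8
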
  • Patrick Casey Reply

    Could it be choking on memory access? With that many cores thrashing the registers at once, you’d expect the instruction caches on the chips to be pretty saturated.

    I’ve yet to run a benchmark myself where memory access time mattered all that much, but you might have found the use case :).

    April 25, 2011 at 12:45 pm
  • Jeffrey Gilbert Reply

    You sure that’s the right chip model? That’s a six core chip. 32 / 6 = 5.3333

    April 25, 2011 at 1:11 pm
  • George Reply

    Yeah Xeon X5670 is dual hexa-core = 12 cores + 12 virtual = 24 cores.

    April 25, 2011 at 1:27 pm
  • Vadim Tkachenko Reply

    Yes, my bad, it is 24 cores on Cisco / Intel Nehalem

    April 25, 2011 at 2:10 pm
  • Peter Zaitsev Reply

    Vadim,

    So are you saying the performance is basically the same for a single core, or is that just how the graph makes it look?

    If that is the case, it is interesting to see the Opteron cores are not slower. Overall, however, we get about the same performance from the 2-socket Intel as from the 4-socket Opteron, which should be more expensive and take more power… which means 2-socket Intel is the way to go these days.

    April 25, 2011 at 2:45 pm
  • Jeffrey Gilbert Reply

    From wiki:

    http://en.wikipedia.org/wiki/Nehalem_(microarchitecture)

    6 cores (12 threads) @ 2.93 GHz
    6 x 4 = 24 (so 4 sockets)

    And the Opterons of course have slower clocked cores but more per socket, so

    12 cores (12 threads) @ 1.9GHz
    4 x 12 = 48 (still 4 sockets)

    That’s my understanding of how this benchmark is being run and why it’s so even between the two. It is interesting to see that the chips with more physical cores cannot end up scaling past the chips with the faster clock, but for energy consumption during average use, it’s probably closer to AMD than Intel.

    April 25, 2011 at 4:27 pm
  • Peter Zaitsev Reply

    Jeffrey,

    Vadim has corrected it: it is 12 cores / 24 threads here. It is 2-socket Intel vs. 4-socket AMD.

    April 25, 2011 at 7:34 pm
  • EBob Reply

    You are all confusing the issue with the cores:

    Intel: 12 cores + hyperthreading.
    AMD: 48 cores.

    Forget about “threads” and “virtual cores”; they are not numbers you should consider. Just think of hyperthreading as making things 5-10% faster than without it.

    This seems to prove definitively that fewer, faster cores are better than more, slower cores for MySQL. I'm shocked that the 48-core system did not obliterate the 12-core one. Simply shocked.

    April 25, 2011 at 7:40 pm
  • Jeffrey Gilbert Reply

    Now that I think about it a little, it's not THAT surprising that Intel's faster chips have a leg up on the scalability side under MySQL workloads. The AMD chips are far better suited for truly multi-threaded applications (e.g. Apache), where there are no lock conditions. What's happening, from my understanding, is that once contention for transactions becomes a matter of waiting for a process to finish executing, Intel will win because it can execute that process faster and release the lock. On AMD, the individual cores can only execute instructions so fast, so their speed bottlenecks will show up at higher tps loads.

    Under Apache or something else where resource locking isn't an issue, I think the AMD chips would have a better leg up because they could spawn a process, throw it to a core, and forget about it. Just a guess based on how I understand both pieces of software to work.

    April 26, 2011 at 6:41 am
  • Patrick Casey Reply

    I dunno; my rule of thumb is that I'd almost always prefer to have my MIPS concentrated in a small number of fast cores rather than spread across a big number of slow ones, since that'll work well under various loads. I'm sure there are workloads where having a huge core count beats having fast cores, but most of the ones I run into tend to be the opposite.

    If you imagine 1 big CPU that does 16 "units of work/time" vs 16 little ones that do 1 "unit/time", they both have the same theoretical throughput, but the single big CPU is much more likely to achieve it in practice. The only case where having a family of little CPUs would help out is if you have a near perfectly parallelized problem.

    I've been burned pretty hard with a couple of clients, for example, who bought Sun Niagara chips because the Sun salesman convinced them it was the cheapest way to get lots of clock cycles, which was true for a perfectly parallelized compute problem, but grossly untrue for the kind of database workloads the customer had (they were much better off on dual-core Intels).

    April 26, 2011 at 12:45 pm
  • Denis M Reply

    Vadim,
    I think it’s interesting to measure what portion of performance hyper-threading contributes.

    April 27, 2011 at 6:57 am
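
    A rough way to measure that would be to take the second hardware thread of each physical core offline and re-run the benchmark (a sketch; it assumes Linux sysfs CPU hotplug and requires root):

        for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
            # thread_siblings_list looks like "0,12"; keep only the first
            # listed sibling of each physical core online
            siblings=$(cat "$cpu/topology/thread_siblings_list")
            first=${siblings%%[,-]*}
            [ "${cpu##*/cpu}" != "$first" ] && echo 0 > "$cpu/online"
        done
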
  • Laurynas Reply

    Vadim –

    Was the mysqld binary built optimized for each CPU?

    April 27, 2011 at 8:02 am
  • Vadim Tkachenko Reply

    Laurynas,

    Nope, it was the generic Linux binary tar.gz from dev.mysql.com.

    April 27, 2011 at 8:03 am
  • Laurynas Reply

    Vadim –

    I do not know exactly how the generic binary was built, but in the best case it could have used the -mtune=generic option (one size fits all, with both CPUs slightly penalized, perhaps by differing amounts). Since this is a CPU/memory-bound benchmark, it'd be interesting to see if building for each CPU with -march=native --param l1-cache-size=x --param l1-cache-line-size=y --param l2-cache-size=z changes the results. Although I'd expect any differences to show up at smaller numbers of threads only.

    April 27, 2011 at 8:13 am
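
    For reference, rebuilding 5.6 tuned for the host CPU would look roughly like this (a sketch; 5.6 builds with CMake, and the --param cache-size values mentioned above would be appended to the same flag strings):

        cd mysql-5.6.2
        cmake . -DCMAKE_C_FLAGS="-O3 -march=native" \
                -DCMAKE_CXX_FLAGS="-O3 -march=native"
        make -j"$(getconf _NPROCESSORS_ONLN)"
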
  • Peter Zaitsev Reply

    Laurynas,

    It might be an interesting test indeed. Though I remember that some 6 years ago we played a lot with different compiler settings in an attempt to get gains. We could get a couple of percent at most at that time, which was not worth it. Maybe architectures have diverged and compilers have improved enough now that the numbers are different.

    April 27, 2011 at 9:36 am
