Hyper-threading – How does it double CPU throughput?

The other day a customer asked me to do capacity planning for their web server farm. I was looking at the CPU graph for one of the web servers that had Hyper-threading switched ON and thought to myself: “This must be quite a misleading graph – it shows 30% CPU usage. Surely it can’t be that this server could really handle 3 times more work?”

Or can it?

I decided to do what we usually do in such cases – test it and find out the truth. Turns out, there’s more to it than meets the eye.

How Intel Hyper-Threading works

Before we get to my benchmark results, let’s talk a little bit about hyper-threading. According to Intel, Intel® Hyper-Threading Technology (Intel® HT Technology) uses processor resources more efficiently, enabling multiple threads to run on each core. As a performance feature, Intel HT Technology also increases processor throughput, improving overall performance on threaded software.

Sounds almost like magic, but in reality (and correct me if I’m wrong), what HT essentially does is present one physical CPU core to the operating system as two logical CPUs (threads, rather), which allows you to offload some of the task scheduling from the kernel to the CPU.

So for example, if you just had one physical CPU core and two tasks with the same priority running in parallel, the kernel would have to constantly switch context so that both tasks get a fair amount of CPU time. If, however, the CPU is presented to you as two CPUs, the kernel can give each task its own CPU and take a vacation.

At the hardware level, it is still one core doing the same amount of work, but there may be some optimization in how that work gets executed.
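As a quick aside – this is purely an illustration, not something from the original setup – on Linux you can tell whether HT is enabled by comparing the “siblings” (logical CPUs) and “cpu cores” (physical cores) fields in /proc/cpuinfo. A minimal PHP sketch, assuming an x86 box that exposes both fields:

    <?php
    // Illustration only: compare logical CPUs vs physical cores per package.
    // If siblings > cpu cores, Hyper-threading is enabled.
    $info = file_get_contents('/proc/cpuinfo');
    preg_match('/^siblings\s*:\s*(\d+)/m', $info, $logical);   // logical CPUs per package
    preg_match('/^cpu cores\s*:\s*(\d+)/m', $info, $physical); // physical cores per package

    if ($logical && $physical) {
        printf("logical CPUs: %d, physical cores: %d (per package)\n",
               $logical[1], $physical[1]);
        echo ($logical[1] > $physical[1])
            ? "Hyper-threading appears to be ON\n"
            : "Hyper-threading appears to be OFF\n";
    }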

My hypothesis

Here’s the problem that was driving me nuts: if HT does NOT actually give you twice the processing power, and yet the system reports statistics for each CPU thread separately, then at 50% CPU utilization (as reported by mpstat on Linux) the CPU should actually be maxed out.

So if I tried to model the scalability of that web server – a 12-core system with HT enabled (presented to the system as 24 CPUs) – assuming perfect linear scalability, here’s how it should look:

In the example above, a single CPU thread could process a request in 1.2s, which is why throughput maxes out at 9-10 requests/sec (12 cores / 1.2s per request).
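To make the arithmetic explicit, here is that naive model as a tiny sketch (the 12 cores and 1.2s per request come from above; the script itself is just an illustration):

    <?php
    // Naive "perfect linear scalability" model: 12 physical cores,
    // each request burning 1.2s of CPU time.
    $cores      = 12;
    $cpuSeconds = 1.2;

    for ($concurrency = 1; $concurrency <= 24; $concurrency++) {
        // Throughput grows linearly until all physical cores are busy, then flattens.
        $throughput = min($concurrency, $cores) / $cpuSeconds; // requests/sec
        printf("concurrency %2d: ~%4.1f req/s\n", $concurrency, $throughput);
    }
    // At 12 concurrent requests the physical CPU is already saturated (10 req/s),
    // even though 12 busy threads out of 24 logical CPUs read as "50%" in mpstat.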

From the user perspective, this limitation would hit VERY unexpectedly, as one would expect 50% utilization to be… well, exactly that – 50% utilization.

In fact, the CPU utilization graph would look even more frustrating. For example, if I were increasing the number of parallel requests linearly from 1 to 24, here’s how that relationship should look:

Hence, CPU utilization would skyrocket right at 12 concurrent requests – from 50% to 100% – because at that point the system CPU would in fact be 100% utilized.

What happens in reality

Naturally, I decided to run a benchmark and see if my assumptions were correct. The benchmark was pretty basic – I wrote a CPU-intensive PHP script that took 1.2s to execute on the CPU I was testing on, and hammered it over HTTP (Apache) with ab at increasing concurrency. Here’s the result:

[Chart: requests per second at increasing concurrency]

Raw data can be found here.
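For reference, the script was nothing fancy – something along these lines (a minimal sketch: the original script isn’t published, so the loop body and the file name below are my assumptions):

    <?php
    // burn.php – a stand-in for the CPU-intensive script used in the benchmark:
    // pure CPU work, no I/O, tuned so one request takes roughly 1.2s on one core.
    $x = 0;
    for ($i = 0; $i < 40000000; $i++) { // adjust the iteration count for your machine
        $x += sqrt($i);
    }
    echo $x;

It was then hit over HTTP with ab at increasing concurrency – something along the lines of ab -n 200 -c 8 http://localhost/burn.php, with -c going from 1 up to 32 (the URL and request count here are made up; only the tool and the rising concurrency are from the actual test).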

If this does not blow your mind, please go over the facts again and then look at the graph once more.

Still not sure why I find this interesting? Let me explain. If you look carefully, initially – at a concurrency of 1 through 8 – it scales perfectly. So if you only had the data for threads 1-8 (and you knew the processes don’t incur coherency delays due to shared data structures), you would probably predict that it would keep scaling linearly until it reached ~10 requests/sec at 12 cores, at which point adding more parallel requests would bring no further benefit, as the CPU would be saturated.

What happens in reality, though, is that past 8 parallel threads (that is, past 33% virtual CPU utilization), execution time starts to increase, and maximum performance is only reached at 24-32 concurrent requests. It looks like at the 33% mark some kind of “throttling” kicks in.

In other words, to avoid a sharp performance hit past 50% CPU utilization, the system starts slowing execution down already at 33% virtual thread utilization (8 of 24 threads busy, i.e. 66% of the physical cores) – it gives the illusion of a gradual performance limit, so that the saturation point is only reached at 24 threads (visually, at 100% CPU utilization).

Naturally, the question then is: does it still make sense to run hyper-threading on a multi-core system? I see at least two drawbacks:

1. You don’t see the real picture of how utilized your system really is – if the CPU graph shows 30% utilization, your system may well be 60% utilized already (a rough conversion is sketched right after this list).
2. Past 60% physical utilization, the execution speed of your requests is effectively throttled in order to deliver higher overall system throughput.
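To make point 1 concrete, here is the rough conversion I have in mind – a sketch only, and it assumes the scheduler fills physical cores before their HT siblings:

    <?php
    // Rough rule of thumb (an assumption, not an exact formula): with HT on,
    // reported utilization below 50% roughly doubles in physical-core terms.
    function approxPhysicalUtilization($reportedPercent)
    {
        return min($reportedPercent * 2, 100);
    }

    printf("reported 30%% -> ~%d%% of physical capacity\n", approxPhysicalUtilization(30));
    printf("reported 50%% -> ~%d%% of physical capacity\n", approxPhysicalUtilization(50));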

So if you are optimizing for higher throughput, that may be fine. But if you are optimizing for response time, you may want to consider running with HT turned off.

Did I miss something?


Comments (22)

  • Nils

    It may also be interesting to test this with and without the SMT-aware scheduler (the SCHED_SMT kernel option). Also of note is that the new POWER8 CPUs offer even more threads per CPU.

    January 15, 2015 at 4:21 am
  • pplu

    I always understood that HT, more than offloading scheduling from the kernel, lies to the kernel, presenting more CPUs than it really has. The effect is that it tricks the kernel into scheduling more than one process at the same time. HT is able to use different zones of the CPU in parallel (since CPUs have different zones that are not necessarily all needed to execute one instruction). That increases the chance that one thread wants to utilize a part of the CPU that the other thread isn’t needing, making it possible to parallelize some operations – but if, by chance, the two threads need the same part of the CPU, one has to wait.

    I think this would explain why, until more or less 50% capacity (100% of “one CPU”), you have linear scaling, and after that you have non-linearity, because of the “chance” of the load being able to execute in parallel or not. Could you retry the test without hyper-threading? I think the result would show you maxing out at 4 to 6 requests per second, instead of 9.

    January 15, 2015 at 5:39 am
  • Uli Stärk

    You missed that the CPU is overclocking itself depending on the number of cores active (and other stuff). And you do not understand how a CPU breaks an instruction down and dispatches it to its execution units. That’s where hyper-threading kicks in and is able to utilize the available resources better. Depending on the workload, hyper-threading achieves about a 10-15% performance increase in places where the CPU would otherwise have waited.

    January 15, 2015 at 9:05 am
  • Aurimas Mikalauskas

    Pplu, Uli – I really appreciate your feedback, guys. This is exactly the kind of feedback I was looking for with this post – it will help me understand the problem better and maybe even redo some of the benchmarks incorporating your input.

    By the way, this is not meant to be a rant against HT, as I haven’t been benchmarking performance with and without it (although I am glad it sparked some hot reactions). My point is, when looking at graphs plotting CPU utilisation, one should always keep in mind whether the CPU has HT enabled or not.

    Thanks,
    Aurimas

    January 15, 2015 at 9:27 am
  • gggeek

    I think you might be missing a big chunk of the picture if you are not measuring response time (any of: average, median, 90th percentile – or all of the above).

    The hint is: what you are seeing is that the average number of served requests per second increases just a bit, while the incoming requests per second increase linearly.

    And the answer is… going above 8 concurrent requests, your response times are most likely increasing exponentially.
    Which would make for very unhappy customers, and poor websites.

    January 15, 2015 at 9:53 am
  • Aurimas Mikalauskas

    Gggeek – that’s actually exactly what my biggest concern was. Unfortunately I don’t have this data now, as I did the benchmark quite a while ago, but now I feel obliged to do some benchmarking with and without HT just to see how different the behaviour is in these two cases, considering the feedback from the other fellows. I will make sure to include the response time graphs as well.

    January 15, 2015 at 10:08 am
  • Peter Zaitsev

    Aurimas,

    CPU architectures these days are very complicated, so I think it is very hard to make assumptions along the lines of “I’m showing 40% CPU usage – how much more capacity do I have?”. As the guys mentioned, response time distribution, which starts to suffer at high utilization, is one thing, but there is also another: if you measure the CPU time the execution of your PHP script took, it is likely to change with concurrency for many reasons. Some of them, in addition to hyper-threading, are:

    – Turbo Boost. When you have only 1 or 2 cores operating, they are quite likely to run above the nominal clock speed, which means the performance gain from loading all cores is smaller even in the best-case scenario.

    – Cache. Cache misses can grow with a lot of parallel execution, which can drastically impact things.

    – Synchronization. This is very broad, but depending on the code there can be a lot of synchronization needed between cores, even if you are not using explicit mutexes in the code, and that limits scalability.

    – Memory bus, PCI-E, etc. These can be the bottleneck, causing stalls and limiting scalability.

    Now, if you’re using advanced profiling tools, you should be able to see those various events and understand why the CPUs are stalling, which is great for low-level code optimization work.

    From the ops side, though, I think it is best to assume we do not know very well how the system will scale, and to be very cautious with guesses. If we’re to place exact bets, it is much better to do some benchmarks so we can say: OK, this system can handle 2.5x more load than it currently handles while still keeping good response times.

    January 15, 2015 at 10:36 am
  • Uli Stärk

    That’s true. But regarding HT: it is usually not bad for web-based loads (LAMP stack) or virtualisation.

    The key to additional performance is the OS process scheduler. It must not use the virtual cores until there are too many running processes. And as Peter mentioned, at high load the cache management is important. Take a look at the network stack and how they realised that at 10Gbit network traffic even interrupts (which are bound to cores) matter!

    January 15, 2015 at 12:14 pm
  • Karel Čížek

    The deal with Hyper-Threading and SMT is that it makes sure there are enough independent instructions ready to be executed out of order.
    Modern CPUs are able to execute a lot of things in parallel – not only via simultaneous threads, but also by executing multiple instructions from a single thread in parallel (so-called instruction-level parallelism, or ILP). It’s made possible by the out-of-order (OOO) execution mode, where the CPU fetches and decodes multiple instructions every cycle. Those instructions are then put into a reorder buffer, which is basically a queue of operations waiting to be executed. The out-of-order engine then schedules those instructions to run on particular ports (the latest Intels have 8 of those) and tries to pick as many instructions as possible to run at the same time. If instructions are independent, they can be executed in parallel, which is good when those instructions are loads that can stall the CPU for up to a few hundred cycles until the data arrives from memory. If one instruction depends on the result of another, it has to wait in the queue. Modern CPUs go to extreme lengths to make sure there are always some instructions ready in the queue: they prefetch data, predict branches, speculate.

    But there are limits to how much ILP there is in single-threaded code. Modern x86 CPUs are able to fetch, decode and execute up to 4 instructions every single cycle, but in most cases there isn’t that much ILP available in general code. So Hyper-Threading comes into play. It adds one extra frontend to your CPU (the bit that fetches and decodes instructions) that supplies an extra instruction stream to the out-of-order machinery, which is then hopefully able to find and dispatch more independent instructions to run on the available hardware.

    The main problem with HT is that the two threads are competing for the same amount of cache. And if you have code that runs at 4 IPC (like typical math-heavy code), HT does nothing.

    Surprisingly, your results could be caused by cache effects. As you run more threads in parallel, each of them gets a smaller portion of the cache, and that might cause slowdowns. Run the code through perf and you will see.

    January 15, 2015 at 3:25 pm
  • Twirrim

    As others have done I’d encourage you to look at the cache stats. Back when HT first came out people started doing significant benchmarking and real-world experimenting with it and found decidedly mixed results. In some cases it resulted in a nice performance boost, in others it could be crippling, actually proving to be slower than not having it on at all.

    One of the ‘problems’ with HT is that the virtual core (for want of a better word) shares the exact same cache as the primary core. If those cores are working on completely different tasks, that can result in the cache being stomped all over and repeatedly purged, which can be highly inefficient.

    January 15, 2015 at 4:11 pm
  • Peter (Stig) Edwards

    turbostat from the cpupowerutils package can be run as root to report processor frequency and idle power-state statistics. Running:

    seq 1 24 | xargs -t -I {} turbostat ./sysbench --test=cpu --num-threads={} run | grep 'total time:'

    will show the turbostat reports for the sysbench CPU test for 1..24 threads; below I have the reports for 1, 6, 12 and 24 threads.
    Notice how the GHz (the average clock rate while the CPU was in the C0 state) for the busy CPUs changes, and how the %time shifts from %c6 (the idle state) to %c0