This post follows up on CPU Performance: Why is that one core so heavily utilized? where we explored why one core was much more heavily utilized than others, even with an evenly distributed workload.
In this post we will look at the other strange behaviour I saw during that load testing work.
Background
If you have already read the previous post, you can skip the background section, as it contains the same information.
Last year, I was conducting load testing on a system at work and noticed a couple of intriguing behaviours.
I was working with a product that streams MPEG-TS video between processes. This is commonly achieved by using UDP over localhost, as it's what most programs support. But even over localhost there are cases where packets might arrive out of order, specifically when the sending process migrates between CPU cores.
The simplest way of mitigating this issue is to pin the various threads to specific CPU cores to ensure each thread only ever runs on one core.
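As a rough illustration of what such pinning can look like, here is a minimal Python sketch. The post does not say which tool or API was actually used, so the helper names `round_robin_plan` and `pin_current_process` are hypothetical, and the counts (200 processes, 20 CPUs) are simply the ones from the load test described next. `os.sched_setaffinity` is Linux-only.

```python
import os

def round_robin_plan(n_procs, cpus):
    """Map each process index to one CPU, cycling through `cpus` in order."""
    return {i: cpus[i % len(cpus)] for i in range(n_procs)}

def pin_current_process(cpu):
    """Restrict the caller to a single CPU (Linux only; pid 0 means 'myself')."""
    os.sched_setaffinity(0, {cpu})

# Plan only: 200 processes spread evenly over CPUs 0..19.
plan = round_robin_plan(200, list(range(20)))

# Count how many processes land on each CPU.
per_cpu = {}
for cpu in plan.values():
    per_cpu[cpu] = per_cpu.get(cpu, 0) + 1
```

Each worker would call `pin_current_process(plan[index])` at startup. The plan treats every CPU as equal, which is exactly the assumption the rest of this post questions.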
During load testing I pinned 200 processes to 20 cores, with 10 threads per core. Despite each CPU performing the same amount of work the load distribution was uneven:

When I first saw this, two questions immediately came to mind:
- What is going on with cpu8?
- What is causing cpu12–cpu19 to have a higher load than cpu0–cpu11?
This post will only address the second question; check out the previous post for a deep dive into the first.
After solving the first issue the CPU load looked like this:

What is causing cpu12-cpu19 to have a higher load than cpu0-cpu11?
In three words: heterogeneous CPU topology.
Many consumer Intel CPUs use something they call P-cores and E-cores, for Performance cores and Efficiency cores. Arm’s implementation is known as big.LITTLE.
The system I tested used an Intel Core i5-13500E, which has 6 P-cores and 8 E-cores. The P-cores use Intel’s Hyper-threading, so each appears as two logical processors to the operating system.
Thus the first 12 logical CPUs in the screenshot (cpu0–cpu11) belong to the 6 hyper-threaded P-cores, and the last 8 (cpu12–cpu19) are the E-cores. Evidently the E-cores are less performant than the P-cores, so the same work results in higher utilization.
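One way to check this mapping on Linux, sketched under assumptions: the kernel exposes each logical CPU's hyper-threading siblings in `topology/thread_siblings_list`, so on a CPU like this one, where only the P-cores use Hyper-threading, a logical CPU with more than one sibling can be taken to be a P-core thread. The function names below are hypothetical, and the "has siblings means P-core" rule is an assumption that holds for this kind of part, not in general.

```python
from pathlib import Path

def parse_cpu_list(text):
    """Parse the kernel's cpulist format, e.g. "0-3,8" -> [0, 1, 2, 3, 8]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus

def smt_siblings(cpu, root="/sys/devices/system/cpu"):
    """Return the logical CPUs that share a physical core with `cpu`."""
    path = Path(root) / f"cpu{cpu}" / "topology" / "thread_siblings_list"
    return parse_cpu_list(path.read_text())

def looks_like_p_core(cpu):
    """Assumption: only P-cores are hyper-threaded, so they have >1 sibling."""
    return len(smt_siblings(cpu)) > 1
```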
Why is that a problem?
For most consumer devices — like desktops, laptops and phones — this isn’t a problem. The system uses the power-efficient E-cores for lighter tasks and the high-performance P-cores for heavier work, providing both efficiency and speed.
However, in my case I’d like to utilize the computer as much as possible to get the most value out of it. If I naively assign the same number of tasks to every core, as in the screenshot above, I can only put as many tasks on the computer as the least performant cores allow, leaving resources on the table on the more performant cores.
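The bottleneck can be made concrete with a few lines of Python. The per-CPU limits below (16 and 12 tasks) are invented for illustration and are not measurements; they were chosen only so that the totals line up with the process counts quoted in the Result section. The real per-core limits are not stated in this post.

```python
def uniform_capacity(per_cpu_limit):
    """Every CPU gets the same task count, so the weakest CPU sets the cap."""
    return len(per_cpu_limit) * min(per_cpu_limit)

def weighted_capacity(per_cpu_limit):
    """Each CPU is filled up to its own limit."""
    return sum(per_cpu_limit)

# Hypothetical limits: 12 faster logical CPUs followed by 8 slower ones.
limits = [16] * 12 + [12] * 8

print(uniform_capacity(limits))   # 240
print(weighted_capacity(limits))  # 288
```

The gap between the two numbers is the capacity left unused when every core is treated as identical.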
The solution
There are many different CPU topologies, designs and architectures, and there is no simple way of taking them all into account, especially since the right choice also depends heavily on the workload.
There is a program called lstopo(1) that I really like. It visualizes your system’s CPU topology, showing how cores and threads are distributed across the system. It’s a simple but powerful tool and I highly recommend you try it out yourself.
lstopo(1) has an excellent page called “The Best of lstopo”, which illustrates very well just how many different CPU topologies and architectures exist out there.
My solution was to simply use the maximum clock frequency of each CPU as a rough proxy for its relative performance. I read the maximum frequency from /sys/devices/system/cpu/cpu[0-9]*/cpufreq/cpuinfo_max_freq and used the ratio between the core frequencies to balance task distribution.
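A minimal sketch of this idea in Python follows. It is not the code used in the load test: the function names are hypothetical, and the largest-remainder rounding is my own choice for turning fractional shares into whole task counts, since the post does not describe how rounding was handled. The example frequencies are made-up values in kHz (the unit sysfs uses), not the real figures for the i5-13500E.

```python
from pathlib import Path

def read_max_freqs(root="/sys/devices/system/cpu"):
    """Return {cpu_number: max frequency in kHz} for CPUs exposing cpufreq."""
    freqs = {}
    for path in Path(root).glob("cpu[0-9]*/cpufreq/cpuinfo_max_freq"):
        cpu = int(path.parent.parent.name[3:])  # directory "cpu12" -> 12
        freqs[cpu] = int(path.read_text())
    return freqs

def distribute(n_tasks, freqs):
    """Split n_tasks across CPUs in proportion to their max frequency.

    Integer arithmetic plus largest-remainder rounding guarantees that
    the per-CPU counts always sum to exactly n_tasks.
    """
    total = sum(freqs.values())
    counts = {cpu: n_tasks * f // total for cpu, f in freqs.items()}
    remainders = {cpu: n_tasks * f % total for cpu, f in freqs.items()}
    leftover = n_tasks - sum(counts.values())
    for cpu in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        counts[cpu] += 1
    return counts

# Hypothetical frequencies: 12 faster CPUs and 8 slower ones.
example_freqs = {cpu: 4_000_000 for cpu in range(12)}
example_freqs.update({cpu: 3_000_000 for cpu in range(12, 20)})
example_counts = distribute(200, example_freqs)
```

On a real machine, `distribute(200, read_max_freqs())` would give the per-CPU task counts to use when pinning. Maximum frequency ignores differences in instructions per clock and the cost of sharing a core between two hyper-threads, so it is only a first approximation of relative performance.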
Result
The result is a more evenly distributed load across the CPU cores. The system is still running 200 processes, but they are now better balanced across the cores:
 
   
As a result, a fully loaded system can handle about 20% more processes (288 instead of 240).