CPU performance: Why is that one core so heavily utilized?
A deep dive into solving CPU load imbalance


Posted on 2025-01-02

Background

Last year, I was conducting load testing on a system at work and noticed a couple of intriguing behaviours.

I was working with a product that streams MPEG-TS video between processes. This is commonly achieved by using UDP over localhost, as that is what most programs support. But even over localhost there are cases where packets might arrive out of order, specifically when the sending process migrates between CPU cores.
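
As an aside: if you want to verify that a process really is migrating between cores, perf(1) can count CPU migrations for it. Something along these lines should work (the PID is just a placeholder):

$ perf stat -e cpu-migrations -p 1234 sleep 10

This attaches to PID 1234 for ten seconds and reports how many times the kernel moved it to a different core.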

The simplest way of mitigating this issue is to pin the various threads to specific CPU cores to ensure each thread only ever runs on one core.
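
For a standalone process this can be done with taskset(1); the core number, PID and binary name below are only examples:

$ taskset -c 3 ./sender      # start a process pinned to core 3
$ taskset -cp 3 1234         # pin an already-running process by PID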

During load testing I pinned 200 processes to 20 cores, with 10 threads per core. Despite each CPU performing the same amount of work, the load distribution was uneven:

Uneven CPU load

When I first saw this, two questions immediately came to mind:

  • What is going on with cpu8?
  • What is causing cpu12–cpu19 to have a higher load than cpu0–cpu11?

This post will only address the first question; the second is coming in a later post.

What is going on with cpu8?

Trying to understand what was happening, I turned to mpstat(1), which showed that cpu8 was doing much more %soft than the other CPU cores:

mpstat

%soft refers to software interrupts (softirqs), a mechanism used by the kernel to handle I/O such as networking more efficiently.
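
The exact invocations aren't shown above, but per-CPU utilisation (including %soft) can be sampled with mpstat, and the raw softirq counters live in /proc/softirqs:

$ mpstat -P ALL 2 1     # per-CPU utilisation over a two second sample
$ cat /proc/softirqs    # per-CPU softirq counters, NET_RX is the interesting row for networking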

/proc/interrupts showed roughly the same thing as mpstat(1) (I have not included a screenshot as it is a very wide table). It also indicated that while all cores were processing interrupts for the network interface, cpu8 handled many more than the rest.
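
To make that table easier to work with, the counts can be filtered down to just the interface in question (eno3 on this system):

$ grep eno3 /proc/interrupts
$ watch -d -n1 'grep eno3 /proc/interrupts'   # -d highlights which counters are increasing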

This led me to investigate the algorithm the network driver uses to distribute packets across CPU cores. ethtool(8) is a tool to query and control network driver and hardware settings; among other things, it can both show and change that algorithm if the driver supports it. We are looking for “network flow classification”, specifically rx-flow-hash:

$ sudo ethtool --show-nfc eno3 rx-flow-hash udp4
UDP over IPV4 flows use these fields for computing Hash flow key:
IP SA
IP DA

The ethtool(8) manpage shows what the different letters mean:

Letter  Description
s       Hash on the IP source address of the rx packet.
d       Hash on the IP destination address of the rx packet.

As we can see from the output above, only the source and destination IP were used for hashing.

Interestingly, this behaviour differed from what I observed on other systems. I’m not sure whether it was related to the network driver (ixgbe), the operating system (Ubuntu 22.04), or some other aspect of the system. The algorithm also differs from the one used for tcp4:

$ sudo ethtool --show-nfc eno3 rx-flow-hash tcp4
TCP over IPV4 flows use these fields for computing Hash flow key:
IP SA
IP DA
L4 bytes 0 & 1 [TCP/UDP src port]
L4 bytes 2 & 3 [TCP/UDP dst port]

Since the load test only used one load generator and one device under test, each with a single IP configured, the source and destination IP were the same for every packet. This caused all packets for the load test to hash to the same value, and therefore to be processed on the same CPU core: cpu8.
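
One way to see this is to dump the RSS indirection table, which maps the computed hash to a receive queue; when every packet hashes to the same value they all end up in the same queue, and its interrupt is serviced by a single core. If the driver supports it:

$ sudo ethtool -x eno3      # same as --show-rxfh-indir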

Solution

Once we understood the problem, the solution was simple: we only needed to configure the network driver to also use the source and destination port for hashing.

Because hashing occurs on the raw IP packet before the protocol (TCP, UDP, etc.) is parsed, it is not possible to directly tell the driver to use the source and destination port. Instead, we have to point it at the right bytes of the Layer 4 header.

Let’s look at the UDP packet header:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |        Destination Port       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|             Length            |            Checksum           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

And compare that with the table from the ethtool(8) manpage:

Letter  Description
s       Hash on the IP source address of the rx packet.
d       Hash on the IP destination address of the rx packet.
f       Hash on bytes 0 and 1 of the Layer 4 header of the rx packet.
n       Hash on bytes 2 and 3 of the Layer 4 header of the rx packet.

We need to hash bytes 0-3 of the UDP header, that is, we want the fields fn in addition to sd. Thus, to ensure the packets are evenly distributed across cores, we run:

$ sudo ethtool --config-nfc eno3 rx-flow-hash udp4 sdfn
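
Re-running the show command from earlier should now list the same four fields for udp4 as it did for tcp4. Note that settings changed with ethtool(8) generally do not survive a reboot, so the command has to be reapplied at boot time, for example from your network configuration or a small systemd unit.

$ sudo ethtool --show-nfc eno3 rx-flow-hash udp4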

Result

The result was a much more balanced load distribution, with a slight increase in load on all other cores:

More even CPU load

The difference between the two images is highlighted by fading between them:

Before After