Background
Last year, I was conducting load testing on a system at work and noticed a couple of intriguing behaviours.
I was working with a product that streams MPEG-TS video between processes. This is commonly achieved by using UDP over localhost, as it’s what most programs support. But even over localhost there are cases where packets might arrive out of order, specifically when the sending process migrates between CPU cores.
The simplest way of mitigating this issue is to pin the various threads to specific CPU cores to ensure each thread only ever runs on one core.
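On Linux this can be done without code changes using taskset(1); for example (the process name, PID and core number below are just placeholders):
$ taskset -c 3 ./sender        # start a new process pinned to core 3
$ sudo taskset -cp 3 12345     # pin an already-running PID to core 3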
During load testing I pinned 200 processes to 20 cores, 10 processes per core. Despite each CPU being assigned the same amount of work, the load distribution was uneven:
When I first saw this, two questions immediately came to mind:
- What is going on with cpu8?
- What is causing cpu12–cpu19 to have a higher load than cpu0–cpu11?
This post will only address the first question; the second will be covered in a later post.
What is going on with cpu8?
Trying to understand what was happening, I turned to mpstat(1), which showed that cpu8 was doing much more %soft than the other CPU cores:

%soft refers to software interrupts (softirqs), a mechanism used by the kernel to handle I/O such as networking more efficiently.
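For reference, the per-core numbers (including the %soft column) come from running mpstat with per-CPU reporting, something like:
$ mpstat -P ALL 1    # one report per second, one row per CPU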
/proc/interrupts showed about the same thing as mpstat(1) (I have not included a screenshot as it is a very wide table). It also indicated that while all cores were processing interrupts for the network interface, cpu8 handled many more than the rest.
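For the curious, the relevant counters can be filtered out directly; assuming the ixgbe driver names its RX/TX queue interrupts after the interface, something like:
$ grep eno3 /proc/interrupts      # per-CPU interrupt counts for each NIC queue
$ grep NET_RX /proc/softirqs      # per-CPU counts of the networking softirq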
This led me to investigate the algorithm the network driver uses to distribute
packets across CPU cores.
ethtool(8) is a tool to query and control network driver and hardware settings. In particular, it can both show and change that algorithm if the network driver supports it. We are looking for “network flow classification”, specifically rx-flow-hash:
$ sudo ethtool --show-nfc eno3 rx-flow-hash udp4
UDP over IPV4 flows use these fields for computing Hash flow key:
IP SA
IP DA
The ethtool(8) manpage explains the letters used to refer to these fields:
Letter | Description |
---|---|
s | Hash on the IP source address of the rx packet. |
d | Hash on the IP destination address of the rx packet. |
As we can see from the output above, only the source and destination IP addresses were used for hashing.
Interestingly, this behaviour differed from what I observed on other systems. I’m not sure whether it was related to the network driver (ixgbe), the operating system (Ubuntu 22.04), or some other aspect of the system. The algorithm also differs from the one used for tcp4:
$ sudo ethtool --show-nfc eno3 rx-flow-hash tcp4
TCP over IPV4 flows use these fields for computing Hash flow key:
IP SA
IP DA
L4 bytes 0 & 1 [TCP/UDP src port]
L4 bytes 2 & 3 [TCP/UDP dst port]
Since the load test only used one load generator and one device under test, each with a single IP configured, the source and destination IP were the same for every packet. As a result, all packets for the load test were processed on the same CPU core, namely cpu8.
Solution
Once we understood the problem, the solution was simple. We only needed to configure the network driver to also use the source and destination port for hashing.
Because hashing occurs on the raw IP packet, before the protocol (TCP, UDP, etc.) is parsed, it is not possible to directly tell the driver to use the source and destination ports.
Let’s look at the UDP packet header:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Source Port | Destination Port |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Length | Checksum |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
And compare that with the table from the ethtool(8) manpage:
Letter | Description |
---|---|
s | Hash on the IP source address of the rx packet. |
d | Hash on the IP destination address of the rx packet. |
f | Hash on bytes 0 and 1 of the Layer 4 header of the rx packet. |
n | Hash on bytes 2 and 3 of the Layer 4 header of the rx packet. |
We need to hash bytes 0-3 of the UDP header, that is, we want the fields fn in addition to sd. Thus, to ensure the packets are evenly distributed across cores, we run:
$ sudo ethtool --config-nfc enp15s0 rx-flow-hash udp4 sdfn
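If the driver accepted the change, re-running the earlier --show-nfc query should now report the port bytes as well; assuming the output follows the same format as the tcp4 example above, it would look something like:
$ sudo ethtool --show-nfc enp15s0 rx-flow-hash udp4
UDP over IPV4 flows use these fields for computing Hash flow key:
IP SA
IP DA
L4 bytes 0 & 1 [TCP/UDP src port]
L4 bytes 2 & 3 [TCP/UDP dst port]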
Result
The result was a much more balanced load distribution, with a slight increase in load on all other cores:
The difference between the two images is highlighted by fading between them: