NUMA topology with Edera
This guide provides an overview of how Edera manages CPU-memory topology. As servers grew larger, maintaining fast connections between a server’s distant parts became more complicated. Modern machines address this with an interconnect such as HyperTransport or Intel’s UltraPath Interconnect. These introduce the concepts of locality and non-uniform memory access (NUMA): an individual CPU core has very fast access to its own ‘local’ memory, but to reach the rest of system memory it may have to make a transaction across the interconnect. These transactions can take several times longer than a local request, and performance can suffer greatly if you cross the interconnect needlessly. The quest for memory performance therefore becomes a challenge of keeping CPU cores close to the memory they operate on.
Terminology and Putting It Together
A server typically holds two or more physical packages, each plugged into a socket. Each package contains a number of CPU cores and some amount of cache memory. The CPU cores in a package are connected to an internal interconnect that lets them share data among themselves. This internal interconnect also has direct access to some amount of RAM. A further interconnect then connects the sockets to each other.
A NUMA node describes a set of memory and cores that are local to each other. At the very least, in a two-socket machine, each package in its socket would usually be its own NUMA node. In this case, migrating data between two cores in the same package would be cheap and fast, but migrating data between two cores in different sockets would require a trip over the interconnect and take more time to complete.
The count of cores per package has been increasing, while the number of sockets has generally stayed the same. CPU packages started including their own internal interconnects, splitting each package into a further pair of NUMA nodes. Today, a typical two-socket server might have 128 cores total across two packages. Each package would have 64 cores broken up into two NUMA nodes of 32 cores each.
If this server had 1 TB of memory, we would expect it to be divided first by socket, with each socket getting 512 GB of local memory, and then subdivided within each package into two pools of 256 GB each. Any core can talk to any memory, but the cost depends on locality. Memory that is ‘local’ to your NUMA node is fast. If you have to cross from your local NUMA node to the other NUMA node in the same package, expect around 120-130% of local latency. If you have to cross out of the local node, then across sockets, and finally into a remote NUMA node, the access takes much longer, possibly more than 300% of local latency. To complicate matters further, you may have to wait additional time if the interconnect is saturated.
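To make the arithmetic concrete, here is a minimal sketch that applies the rough figures from the paragraph above to a local latency. The 100 ns baseline, the tier names, and the function name are assumptions chosen for illustration; they are not measurements of any particular machine.

```python
# Rough relative-latency multipliers taken from the paragraph above.
# These are ballpark figures, not measurements.
LOCALITY_FACTOR = {
    "local": 1.00,         # memory on the same NUMA node
    "same_package": 1.25,  # other NUMA node in the same socket (about 120-130%)
    "cross_socket": 3.00,  # NUMA node on the other socket (possibly 300% or more)
}


def estimated_latency_ns(local_ns: float, locality: str) -> float:
    """Scale a (hypothetical) local latency by the multiplier for a locality tier."""
    return local_ns * LOCALITY_FACTOR[locality]


if __name__ == "__main__":
    base_ns = 100.0  # assumed local latency, for illustration only
    for tier in LOCALITY_FACTOR:
        print(f"{tier:>12}: {estimated_latency_ns(base_ns, tier):.0f} ns")
```

Interconnect saturation is not modeled here; under contention the remote figures would be higher still.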
How to Discover Your Server’s Topology
The numactl -H command (short for numactl --hardware) displays the number of NUMA nodes available, which CPUs and how much RAM are attached to each node, and the relative distance of the nodes from each other. For instance, this output from a simpler machine shows a total of two nodes, with ten cores on each node. See the examples below for how to read this output.
$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9
node 0 size: 46963 MB
node 0 free: 22655 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19
node 1 size: 48360 MB
node 1 free: 37787 MB
node distances:
node 0 1
0: 10 21
1: 21 10
What this means for Edera
NUMA-aware placement requires the PVH virtualization backend. edera-check will report the PVH capabilities of the host, and the dev.edera/virt-backend: "pvh" annotation can be used to force PVH on.
At its simplest, you tell Edera how many cores you wish to use via the dev.edera/cpu annotation. Edera will use a compact strategy that tries to use the fewest NUMA nodes needed to cover the requested number of CPU cores. If it needs more than one NUMA node to cover the request, it will attempt to split those cores evenly across the nodes. In other words, Edera keeps the CPU cores within the smallest possible number of NUMA nodes, and spreads them evenly across the nodes it selects.
If you request fewer cores than the size of your NUMA nodes
All your cores will be local to each other, and they will be local to their memory. All your cores will share that node’s one interconnect.
If you request more cores than the size of your NUMA nodes
Edera will pick the minimum number of NUMA nodes sufficient to cover your request, then it will spread your cores evenly across the nodes.
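The two cases above reduce to a small calculation. The following is a minimal sketch, not Edera’s implementation: the function name is hypothetical, and which node receives the leftover vCPU when the split is uneven is an assumption.

```python
import math


def compact_layout(requested_cpus: int, cpus_per_node: int, total_nodes: int) -> list[int]:
    """Return the vCPU count placed on each selected node.

    Uses the fewest nodes that can cover the request, then splits the
    vCPUs as evenly as possible across those nodes.
    """
    if requested_cpus < 1:
        raise ValueError("requested_cpus must be at least 1")
    nodes = math.ceil(requested_cpus / cpus_per_node)
    if nodes > total_nodes:
        raise ValueError("request exceeds the host's CPU capacity")
    base, extra = divmod(requested_cpus, nodes)
    # 'extra' nodes carry one additional vCPU so shares differ by at most one.
    # Listing the larger shares last is an arbitrary choice in this sketch.
    return [base] * (nodes - extra) + [base + 1] * extra


if __name__ == "__main__":
    for request in (8, 24, 40, 80):
        print(request, compact_layout(request, cpus_per_node=16, total_nodes=8))
```

On a host with 16 cores per node, a request for 8 cores stays on one node, 24 cores become 12 / 12 across two nodes, and 40 cores become 13 / 13 / 14 across three.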
PCI adds a special case
If you are using PCI passthrough to access special hardware, such as GPUs and Ethernet cards, Edera will always try to keep that hardware close to the memory and CPUs that you are using.
Memory size does not change the math
Edera picks the node count from the vCPU count, then requests the zone’s memory node-local on each chosen node. If the memory does not fit node-local, the host places the remainder on other nodes; the zone still receives all the memory it asked for, but accesses to the spilled portion run at remote distance.
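As a rough way to reason about spill, the sketch below estimates how much of a zone’s memory would not fit node-local. It assumes the memory is requested in equal shares on each chosen node and that the free memory per node is known (for example from numactl --hardware); both the equal-share rule and the function name are assumptions for illustration, not documented Edera behavior.

```python
def memory_spill_mb(zone_memory_mb: int, free_mb_per_node: list[int]) -> int:
    """Estimate how many MB of a zone's memory cannot be placed node-local.

    free_mb_per_node lists the free memory on each node chosen for the zone.
    Assumes equal per-node shares, which is an illustrative simplification.
    """
    if not free_mb_per_node:
        raise ValueError("at least one node is required")
    share, extra = divmod(zone_memory_mb, len(free_mb_per_node))
    spill = 0
    for index, free_mb in enumerate(free_mb_per_node):
        # The first 'extra' nodes take one more MB so the shares sum exactly.
        wanted = share + (1 if index < extra else 0)
        spill += max(0, wanted - free_mb)
    return spill


if __name__ == "__main__":
    # Two nodes with 8902 MB and 5145 MB free, zone asking for 16000 MB.
    print(memory_spill_mb(16000, [8902, 5145]))
```

In that example each node is asked for 8000 MB, the second node is 2855 MB short, and that remainder would be served from other nodes at remote distance.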
All the knobs
While we think our defaults are the best for most cases, you might have a special case and need to adjust how NUMA is handled. Here is the full list of flags and annotations that can impact placement and what they default to.
| Knob | What it controls | protect flag | Kubernetes annotation | Default |
|---|---|---|---|---|
| Virtualization backend | PV vs PVH; NUMA awareness requires PVH | --virt-backend | dev.edera/virt-backend | auto |
| vCPU count | how many vCPUs, which drives how many nodes the zone spans | --target-cpus, --max-cpus | dev.edera/cpu | 2 |
| Memory size | the zone’s memory, and whether it fits node-local | --target-memory, --max-memory | pod limits.memory, dev.edera/initial-memory-request | pod limit |
| NUMA node count | force an exact number of vNUMA nodes | --numa-nodes | dev.edera/numa-nodes | auto |
| NUMA expansion strategy | how extra nodes are chosen: nearest (compact) or furthest (scatter) | --numa-strategy | dev.edera/numa-strategy | compact |
| Memory ballooning policy | whether ballooned memory stays on its nodes (static) or floats (dynamic) | --resource-adjustment-policy | dev.edera/resource-policy | dynamic |
Examples
Our example machine is a more complex host with 8 NUMA nodes, each with 16 cores.
$ numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 10611 MB
node 0 free: 8902 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 size: 10917 MB
node 1 free: 8989 MB
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 2 size: 10923 MB
node 2 free: 5145 MB
node 3 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 3 size: 10923 MB
node 3 free: 7677 MB
node 4 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
node 4 size: 10923 MB
node 4 free: 6723 MB
node 5 cpus: 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 5 size: 10908 MB
node 5 free: 8515 MB
node 6 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111
node 6 size: 10923 MB
node 6 free: 8801 MB
node 7 cpus: 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 7 size: 10923 MB
node 7 free: 7729 MB
node distances:
node 0 1 2 3 4 5 6 7
0: 10 12 12 12 32 32 32 32
1: 12 10 12 12 32 32 32 32
2: 12 12 10 12 32 32 32 32
3: 12 12 12 10 32 32 32 32
4: 32 32 32 32 10 12 12 12
5: 32 32 32 32 12 10 12 12
6: 32 32 32 32 12 12 10 12
7: 32 32 32 32 12 12 12 10
The node distances table shows how the distances vary between different nodes depending on topology. This unitless ‘distance’ number roughly correlates with the latency cost of crossing between nodes. Consider CPUs 9 and 14, which are both in node 0: crossing from CPU 9 to CPU 14 takes place within NUMA node 0 and costs the minimum 10 units.
Each node has three neighboring nodes that are 12 distance units away. These are separate NUMA nodes within the same socket. Crossing from CPU 9 to CPU 40 costs 12 units, because it crosses from node 0 to node 2.
The remaining four nodes show the worst case of crossing between sockets. Crossing from CPU 9 to CPU 102 has a cost of 32 units to cross from node 0 to node 6.
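The same lookup can be automated. Below is a minimal sketch that parses numactl --hardware output and reports the distance between two CPUs. The function names are hypothetical, the parser only handles the line shapes shown in this guide, and the sample text is the two-node output from earlier, kept short for readability.

```python
SAMPLE = """\
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9
node 0 size: 46963 MB
node 0 free: 22655 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19
node 1 size: 48360 MB
node 1 free: 37787 MB
node distances:
node 0 1
0: 10 21
1: 21 10
"""


def parse_numactl(text: str):
    """Return (cpu_to_node, distances) parsed from `numactl --hardware` output."""
    cpu_to_node: dict[int, int] = {}
    distances: dict[int, dict[int, int]] = {}
    columns: list[int] = []
    in_distances = False
    for raw in text.splitlines():
        parts = raw.split()
        if not parts:
            continue
        if parts[:2] == ["node", "distances:"]:
            in_distances = True
        elif in_distances and parts[0] == "node":
            # Header row of the matrix: the node id for each column.
            columns = [int(p) for p in parts[1:]]
        elif in_distances and parts[0].endswith(":"):
            row = int(parts[0].rstrip(":"))
            distances[row] = dict(zip(columns, (int(p) for p in parts[1:])))
        elif parts[0] == "node" and len(parts) > 2 and parts[2] == "cpus:":
            for cpu in parts[3:]:
                cpu_to_node[int(cpu)] = int(parts[1])
    return cpu_to_node, distances


def cpu_distance(cpu_a: int, cpu_b: int, cpu_to_node, distances) -> int:
    """Look up the NUMA distance between the nodes that own two CPUs."""
    return distances[cpu_to_node[cpu_a]][cpu_to_node[cpu_b]]


if __name__ == "__main__":
    cpus, dist = parse_numactl(SAMPLE)
    print(cpu_distance(3, 4, cpus, dist))   # both CPUs are on node 0
    print(cpu_distance(3, 12, cpus, dist))  # node 0 to node 1
```

To use it on a real host, the output of numactl --hardware could be captured to a string and passed in place of SAMPLE.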
Here are the layouts that would appear on this machine depending on the number of cores requested:
| Request | Selected nodes | vCPU split |
|---|---|---|
| 8 cores | 1 node | 8 |
| 16 cores | 1 node | 16 |
| 24 cores | 2 nodes (same socket) | 12 / 12 |
| 32 cores | 2 nodes (same socket) | 16 / 16 |
| 40 cores | 3 nodes (same socket) | 13 / 13 / 14 |
| 48 cores | 3 nodes (same socket) | 16 / 16 / 16 |
| 64 cores | 4 nodes (same socket) | 16 / 16 / 16 / 16 |
| 80 cores | 5 nodes | 16 / 16 / 16 / 16 / 16 |
| 128 cores | 8 nodes | 16 / 16 / 16 / 16 / 16 / 16 / 16 / 16 |
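The "same socket" entries in the table follow from choosing the nearest nodes first. The sketch below shows one way that could work: a greedy pick that always adds the node with the smallest total distance to the nodes already chosen. The greedy rule, the starting node, the tie-breaking, and the function names are all assumptions for illustration; Edera’s actual selection logic may differ.

```python
def example_distances() -> list[list[int]]:
    """Rebuild the 8-node distance matrix from the example host above."""
    matrix = []
    for a in range(8):
        row = []
        for b in range(8):
            if a == b:
                row.append(10)   # same node
            elif a // 4 == b // 4:
                row.append(12)   # different node, same socket
            else:
                row.append(32)   # different socket
        matrix.append(row)
    return matrix


def pick_nodes(distances: list[list[int]], count: int, start: int = 0) -> list[int]:
    """Greedily pick `count` nodes, nearest first, beginning at `start`."""
    if not 1 <= count <= len(distances):
        raise ValueError("count must be between 1 and the number of nodes")
    chosen = [start]
    while len(chosen) < count:
        candidates = [n for n in range(len(distances)) if n not in chosen]
        # Smallest total distance to the chosen set wins; ties go to the lowest id.
        best = min(candidates, key=lambda n: (sum(distances[c][n] for c in chosen), n))
        chosen.append(best)
    return chosen


if __name__ == "__main__":
    matrix = example_distances()
    for count in (1, 2, 4, 5):
        print(count, pick_nodes(matrix, count))
```

With this rule, requests needing up to four nodes stay inside one socket, and only the fifth node crosses to the other socket, which is the pattern the table shows between 64 and 80 cores.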