Xen on NUMA Machines

From Xen
Revision as of 08:17, 21 February 2014 by Dariof (talk | contribs) (NUMA and vCPU pinning)

What is NUMA ?

Non-Uniform Memory Access (NUMA) means that memory accessing times of a program running on a CPU depends on the relative distance between that CPU and that memory. In fact, most of the NUMA systems are built in such a way that processors have their local memory, on which they can operate very fast. On the other hand, getting and storing data from and on remote memory (that is, memory local to some other processor) is quite more complex and slow. NUMA machines are becoming more and more common. Some examples of NUMA boxes/architectures are:

While on Linux (bare metal, so, no Xen), a pretty easy way to check whether your box is a NUMA one, and find out all its characteristics is the following:

# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 8 9 10 11
node 0 size: 6055 MB
node 0 free: 5968 MB
node 1 cpus: 4 5 6 7 12 13 14 15
node 1 size: 5977 MB
node 1 free: 5873 MB
node distances:
node   0   1 
  0:  10  20 
  1:  20  10

NUMA Performances Impact

NUMA awareness becomes very important as soon as many domains start running memory-intensive workloads on a shared host. In fact, the cost of accessing non node-local memory locations is very high, and the performance degradation is likely to be noticeable. For example, let's look at a memory-intensive benchmark with a bunch of competing VMs running concurrently on an host with 2 NUMA nodes.

The benchmark being run is SpecJBB2005. Host is a 16 pCPUs, 2-NUMA nodes Xeon, with 12GB RAM (2GB of which reserved for Dom0). Linux kernel for dom0 was 3.2, Xen was xen-unstable at the time of the benchmarking (i.e., more or less Xen 4.2). Guests have 4 vCPUs and 1GB of RAM each. Numbers come from running the benchmark on an increasing (1 to 8 ) number of Xen PV-guests at the same time, and repeating each run 5 times for each of the VMs configurations below:

  • default is what was happening by default prior to Xen 4.2, i.e., no vCPU pinning at all;
  • pinned means VM#1 was pinned on NODE#0 after being created. This implies its memory was striped on both the nodes, but it can only run on the fist one;
  • all memory local is the best possible case, i.e., VM#1 was created on NODE#0 and kept there. That implies all its memory accesses are local;
  • all memory remote is the worst possible case, i.e., VM#1 was created on NODE#0 and then moved (by explicitly pinning its pCPUs) on NODE#1. That implies all its memory accesses are remote.

In all the experiments, it is only VM#1 that was pinned/moved. All the other VMs have their memory "striped" between the two nodes and are free to run everywhere. The final score achieved by SpecJBB on VM#1 is reported below. As SpecJBB output is in terms of "business transactions per second (bops)", higher values correspond to better results.

kernbench_avgstd2.png

First of all, notice how small the standard deviation is for all the runs: this confirms SpecJBB is a good benchmark. The most interesting lines to look at and compare are the purple and the blue ones. In fact, that clearly shows how much getting NUMA right matters, even in such a small box! To have an even better idea of this, let's look also at the percent increase in performance of each run with respect to the worst case (all memory remote):

kernbench_avg2.png

In this second graph, we see hat NUMA placement is accountable for a ~10% to 20% (depending on the load) impact on performance. Also, it appears that just pinning the vCPUs to some pCPUs, although it can help in keeping performance consistent, is not able to get performances that close to the best possible situation.

The full set of results, with plots about all the statistical properties of the data set, can be found here).

NUMA and Xen

Xen is already capable of running on NUMA machines. First of all, while on Xen (well, on Dom0), figuring out the NUMA characteristics of the box goes by running xl info -n. Its output, on a 2 NUMA nodes machine, with 8 cores each (pCPUs 0-7 ==> Node #0, pCPUs 8-15 ==> Node #1) is as follows:

# xl info -n
host                   : Zhaman
release                : 3.10.0
version                : #1 SMP Mon Jul 8 23:16:42 UTC 2013
machine                : x86_64
nr_cpus                : 16
max_cpu_id             : 63
nr_nodes               : 2
cores_per_socket       : 4
threads_per_core       : 2
cpu_mhz                : 2394
hw_caps                : bfebfbff:2c100800:00000000:00003f00:029ee3ff:00000000:00000001:00000000
virt_caps              : hvm hvm_directio
total_memory           : 12285
free_memory            : 11604
...
cpu_topology           :
cpu:    core    socket     node
  0:       0        1        0
  1:       0        1        0
  2:       1        1        0
  3:       1        1        0
  4:       9        1        0
  5:       9        1        0
  6:      10        1        0
  7:      10        1        0
  8:       0        0        1
  9:       0        0        1
 10:       1        0        1
 11:       1        0        1
 12:       9        0        1
 13:       9        0        1
 14:      10        0        1
 15:      10        0        1
numa_info              :
node:    memsize    memfree    distances
   0:      6144       5754      10,20
   1:      6720       5850      20,10
...

Xen deals with NUMA by assigning each domain a node affinity, which is the set of nodes of the host from which memory for that domain is allocated (in equal parts). This node affinity is liked to the subset of physical CPUs in the host the domain can or prefers to run. Acting like this, in fact, maximizes the probability of local memory accesses.

NUMA and vCPU pinning

At domain creation time, Xen constructs a domain's node affinity basing on what nodes the pCPUs to which the domain's vCPUs are pinned belong to. If no vCPU to pCPU pinning is specified at creation time (e.g., via the xl config file), what happens is described in the following section. To specify a vCPU pinning during domain creation, one should use the cpus="..." config option. For example, the excerpt below would create a 4 vCPU guest, with 1024 Mb of RAM, and all the 4 vCPUs will be pinned to pCPUs 0-3.

...
vcpus =  '4'
memory =  '1024'
cpus = "0-3"
...

From a NUMA perspective, if pCPUs 0 to 3 belong all to the same NUMA node (say NUMA node 0), that means the node affinity of the domain will be set to node 0, and all its memory will be allocated on there.

NUMA and cpupools

Cpupools can also come handy in a NUMA system, especially if very large. In fact, they guarantee an even more strict partitioning and isolation than vCPU pinning. When a domain is created inside a cpupool, its node affinity is set to only the node(s) to which the pCPUs in the pool belong to. Have a look at how cpupool-numa-split works.

Automatic NUMA Placement

What happens by default, when a new domain is started, with respect to NUMA, varies (unfortunately) with the actual Xen version.XXX If no cpus="..." option is specified in the config file, Xen tries to figure out on its own on which node(s) the domain could fit best. It is worthwhile noting that optimally fitting a set of VMs on the NUMA nodes of an host is an incarnation of the Bin Packing Problem. As such problem is known to be NP-hard, we will be using some heuristics.

So, among the node (or set of nodes) that have enough free memory and enough physical CPUs to accommodate the domain, the one with the smallest number of vCPUs already running there is chosen. The domain is the placed there, just by pinning all its vCPUs to all the pCPUs belonging to the node itself.

In some more details, if using libxl (and, of course, xl), and Xen >= 4.2, the NUMA automatic placement works as follows. First of all the nodes (or the sets of nodes) that have enough free memory and enough pCPUs are found. The idea is to find a spot for the domain with at least as much free memory as it has configured to have, and as much pCPUs as it has vCPUs. After that, the actual decision on which node to pick happens accordingly to the following heuristics:

  • in case more than one node is necessary, solutions involving fewer nodes are considered better. In case two (or more) suitable solutions span the same number of nodes,
  • the node(s) with a smaller number of vCPUs runnable on them (due to previous placement and/or plain vCPU pinning) are considered better. In case the same number of vCPUs can run on two (or more) node(s),
  • the node(s) with with the greatest amount of free memory is considered to be the best one.

As of now (Xen 4.2) there is no way to interact with the placement mechanism, for example for modifying the heuristic it is driven by. Of course, if vCPU pinning or cpupools are manually setup (look at the Tuning page), no automatic placement will happen, and the user's requests are honored.

Automatic NUMA Placement and Migration

For xl, VM migration a very special form of VM creation (if looking at the migration target). That's the reason why, when the target VM gets created on the target host, automatic placement may take place, exactly as it happens with new VMs. So, unless one override the relevant options in the config file during migration itself, the target VM will be placed on one (some) node(s) of the target host, and it's node-affinity will be set accordingly.

However, there is right now no guarantee for the final decision from the placing algorithm on the target machine to be "compatible" with the one made on the source machine at initial VM creation time. For instance, if the VM fits in just one node and is placed there on the source host, it can well end up being split on two or more nodes when migrated on the target host (and, of course, the vice-versa).

Currently this is considered acceptable, especially as there is not way to expose any virtual NUMA topology to the VM itself. As soon as some support for that will be introduced, this behavior might change accordingly.

Querying Memory Distribution

Up to Xen 4.4, there is no easy way to figure out how much memory from each domain has been allocated on each NUMA node in the host. A workaround to that, is exploiting one of the Xen debug keys, by issueing the following commands:

# xl debug-keys u
# xl dmesg | tail -n40

A possible output is as follows:

(XEN) 'u' pressed -> dumping numa info (now-0x3F5:9821EF14)
(XEN) idx0 -> NODE0 start->1720320 size->1572864 free->1473152
(XEN) phys_to_nid(00000001a4001000) -> 0 should be 0
(XEN) idx1 -> NODE1 start->0 size->1720320 free->1235505
(XEN) phys_to_nid(0000000000001000) -> 1 should be 1
(XEN) CPU0 -> NODE0
(XEN) CPU1 -> NODE0
(XEN) CPU2 -> NODE0
(XEN) CPU3 -> NODE0
(XEN) CPU4 -> NODE0
(XEN) CPU5 -> NODE0
(XEN) CPU6 -> NODE0
(XEN) CPU7 -> NODE0
(XEN) CPU8 -> NODE1
(XEN) CPU9 -> NODE1
(XEN) CPU10 -> NODE1
(XEN) CPU11 -> NODE1
(XEN) CPU12 -> NODE1
(XEN) CPU13 -> NODE1
(XEN) CPU14 -> NODE1
(XEN) CPU15 -> NODE1
(XEN) Memory location of each domain:
(XEN) Domain 0 (total: 130998):
(XEN)     Node 0: 62804
(XEN)     Node 1: 68194
(XEN) Domain 1 (total: 262144):
(XEN)     Node 0: 0
(XEN)     Node 1: 262144

In this case, Domain 0 has 62804 pages allocated on NUMA node 0, and 68194 on node 1. On the other hand, Domain 1 has all its 262144 pages allocated out of node 1. A more straightforward mechanism for querying on what node(s) domains' memory is allocated is under development, and will (probably) be available in Xen 4.5.

Future Developments

Preliminary patches introducing support for automatic placement in xl and NUMA-aware scheduling were posted to the xen-devel mailing list a while back. The results of some (preliminary as well) benchmarks have been discussed in this and this Blog posts (as well as in this Wiki article).

Xen 4.2 contains a slightly modified version of that placement algorithm (described above), more specifically, the one implemented by this patch series. Keeping improving it, as well as continuing adding NUMA related features will happen throughout the Xen 4.3 development cycle (so keep yourself posted!).

This Wiki also hosts a sort of Xen NUMA Roadmap or, better, the list of items that will be getting some Xen developers' attentions in the upcoming months... Feel free to contribute if thinking something could be missing or wrong.