__TOC__
With ''NUMA placement'', we refer to the decision about which NUMA nodes of a host the memory of a newly created VM should be allocated from. Unfortunately, optimally fitting VMs on a NUMA host is an incarnation of the [http://en.wikipedia.org/wiki/Bin_packing_problem Bin Packing Problem], which means it is [http://en.wikipedia.org/wiki/NP-hard NP-hard], so a heuristic approach is the only reasonable way to go.

This document provides information about exploratory work on NUMA placement, as well as a description of what was included in <em>Xen 4.2</em>. You can find other articles about NUMA in the [[:Category:NUMA|NUMA category]].
= Preliminary/Exploratory Work =

The first attempt at moving NUMA placement out of [[XEND]]'s python code (to, in that case, libxc) dates back to 2010 (the link can no longer be found).

More recently, this [http://lists.xen.org/archives/html/xen-devel/2012-04/msg00740.html patch series] was released as an RFC in April 2012. In patches [http://lists.xen.org/archives/html/xen-devel/2012-04/msg00749.html 8/10] and [http://lists.xen.org/archives/html/xen-devel/2012-04/msg00748.html 9/10], it implemented three possible ''placement policies'':

* ''greedy'', which scans the host's nodes and puts the new VM on the first one that is found to have enough free memory;
* ''packed'', which puts the new VM on the node that has the smallest amount of free memory (although still enough for the VM to fit there);
* ''spread'', which puts the new VM on the node that has the biggest amount of free memory.
 
The names come from the intrinsic characteristics of the three algorithms: greedy just grabs the first suitable node, packed tends to keep nodes as full as possible, while spread tries to keep them as free/empty as possible. Notice that keeping the nodes full or empty should be intended memory-wise here, but it also implies the following:

* the greedy and packed policies are both inclined to put as many VMs as possible on one node before moving on to try others (which one does that most depends on the VMs' characteristics and creation order);
* the spread policy is inclined to distribute the VMs across the various nodes (although, again, that depends somewhat on the VMs' characteristics).

"Scientifically" speaking, greedy is based on the [http://www.youtube.com/watch?v=4QBdzVD3AbE First Fit] algorithm, packed is based on [http://www.youtube.com/watch?v=QSXB693Hrls Best Fit] and spread on [http://www.youtube.com/watch?v=OwUIiSf0c-U Worst Fit].
   
Some benchmarks of all the three policies were performed (as explained in detail [http://lists.xen.org/archives/html/xen-devel/2012-04/msg01060.html here]). One of the resulting graphs is reported below. It shows the aggregate result of the SpecJBB2005 benchmark, run concurrently inside multiple VMs.

http://xenbits.xen.org/people/dariof/images/blog/NUMA_2/kernbench_avgstd2.png

That served as a quite effective confirmation that the spread policy (i.e., the one based on the Worst Fit algorithm) was the absolute best, and thus, in the continuation of the work on automatic placement, the other two could be neglected.
 
 
 
There was also a blog post about this, which is still available [http://blog.xen.org/index.php/2012/04/26/numa-and-xen-part-1-introduction/ here].

= The Actual Solution in Xen 4.2 =

Keeping on experimenting and benchmarking (the full history is available [[Xen_NUMA_Roadmap#Automatic_VM_placement|here]]), it was proved that proper VM placement can improve performance significantly.
 
As in [[Xen_on_NUMA_Machines#NUMA_Performance_Impact|here]], the curves in the graph below have the following meaning:

* ''default'' is what was happening by default prior to Xen 4.2, i.e., no vCPU pinning at all;
* ''pinned'' means VM#1 was pinned on NODE#0 after being created. This implies its memory was striped over both nodes, but it can only run on the first one;
* ''all memory local'' is the best possible case, i.e., VM#1 was created on NODE#0 and kept there, which implies all its memory accesses are local.

What is shown is the increase in performance, for an increasing (1 to 8) number of VMs, with respect to the worst possible case (all memory remote).

http://xenbits.xen.org/people/dariof/kernbench_avg2.png
This makes it evident that NUMA placement accounts for a ~10% to ~20% (depending on the load) performance impact. It also appears that just pinning the vCPUs to some pCPUs, although it can help in keeping the performance consistent, does not get that close to the best possible situation.

That led to the implementation of a proper solution for NUMA placement, which became the default [[XL]] behavior (as described [[Xen_on_NUMA_Machines#Automatic_NUMA_Placement|here]]) starting from Xen 4.2. In some more detail (with reference to the libxl implementation), placement works as follows (see the sketch after this list):

* all the nodes (or sets of nodes) that have enough free memory and enough pCPUs (at least as many as the domain's vCPUs) are found, and considered ''placement candidates'';
* if there is more than one candidate:
** in case more than one node is necessary, solutions involving fewer nodes are considered better. In case two (or more) candidates span the same number of nodes,
** the candidate with fewer vCPUs already runnable on it (due to previous placement and/or plain vCPU pinning) is considered better. In case the same number of vCPUs can run on two (or more) candidates,
** the candidate with the greatest amount of free memory is considered to be the best one.
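Purely as an illustration, here is a minimal Python sketch of this selection logic; the ''Candidate'' fields and helper names are made up for the example and do not reflect libxl's actual (C) data structures:

<pre>
# Sketch of picking the best placement candidate: fewer nodes wins, then
# fewer vCPUs already runnable on the candidate, then more free memory.
from dataclasses import dataclass

@dataclass
class Candidate:
    nodes: frozenset   # NUMA node IDs the candidate spans
    nr_pcpus: int      # pCPUs belonging to those nodes
    nr_vcpus: int      # vCPUs (of existing domains) already runnable there
    free_mem: int      # free memory on those nodes, e.g. in KB

def pick_best(candidates, vm_mem, vm_vcpus):
    # Keep only candidates with enough free memory and enough pCPUs...
    suitable = [c for c in candidates
                if c.free_mem >= vm_mem and c.nr_pcpus >= vm_vcpus]
    if not suitable:
        return None
    # ...then apply the heuristics above (free memory is negated so that
    # "more free memory" ranks better among otherwise equal candidates).
    return min(suitable,
               key=lambda c: (len(c.nodes), c.nr_vcpus, -c.free_mem))

# Example: two single-node candidates; the one with more free memory wins.
cands = [Candidate(frozenset({0}), 8, 4, 4 << 20),
         Candidate(frozenset({1}), 8, 4, 6 << 20)]
print(pick_best(cands, vm_mem=2 << 20, vm_vcpus=2).nodes)  # frozenset({1})
</pre>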
   
To actually ''place'' the domain on a candidate (node or set of nodes), this is what happens (see the sketch below):

* in [[Xen_4.2_Feature_List|Xen 4.2]], all the domain's vCPUs are statically pinned to the pCPUs of the node(s);
* starting from [[Xen_4.3_Feature_List|Xen 4.3]], which supports NUMA aware scheduling (at least for the credit scheduler), and only with [[XL]], there is no pinning involved: only the node affinity is set to the node(s) in question. That means the vCPUs are free to run everywhere, but they will prefer the pCPUs of the selected node(s).
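The following Python fragment is only meant to illustrate the difference between the two behaviors; none of the classes or helpers below are the libxl API, they are made up for the example:

<pre>
# Xen 4.2: hard-pin every vCPU to the candidate's pCPUs.
# Xen 4.3+ (with xl): only set node affinity; vCPUs can still run anywhere.

class DummyDomain:
    def __init__(self, nr_vcpus):
        self.vcpu_pinning = [None] * nr_vcpus   # hard pinning, per vCPU
        self.node_affinity = None               # preferred NUMA node(s)

def pcpus_of(nodes, pcpus_per_node=4):
    # Assume, for the example, 4 consecutive pCPUs per node.
    return {n * pcpus_per_node + i for n in nodes for i in range(pcpus_per_node)}

def apply_placement(domain, candidate_nodes, xen_version):
    if xen_version < (4, 3):
        # Xen 4.2 behavior: static pinning to the candidate's pCPUs.
        pcpus = pcpus_of(candidate_nodes)
        domain.vcpu_pinning = [pcpus] * len(domain.vcpu_pinning)
    else:
        # Xen 4.3+ behavior: node affinity only; the credit scheduler will
        # prefer, but not require, the pCPUs of these nodes.
        domain.node_affinity = set(candidate_nodes)

d = DummyDomain(nr_vcpus=2)
apply_placement(d, {0, 1}, xen_version=(4, 2))
print(d.vcpu_pinning, d.node_affinity)
</pre>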
   
Due to the fact that libxl and [[XL]] were just a technical preview at the time, in Xen 4.1 the default behavior for [[XL]] is just "nothing happens": if neither '''cpus="..."''' nor '''pool="..."''' is specified in the config file, neither pinning nor node affinity is touched, and the domain will be able to run on every pCPU and will have its memory spread across all nodes. Conversely, if [[XEND|XenD]] is used, the behavior is the same as the one described above for [[Xen_4.2_Feature_List|Xen 4.2]].
In all versions and for both toolstacks, if any vCPU pinning and/or [[Xen_4.2:_cpupools|cpupool]] assignment is manually set up (see the [[Tuning]] page), no automatic placement happens at all, and the user's requests are honored.
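For example, a domain config fragment along these lines (the name and values are made up) would suppress automatic placement, since the vCPUs are pinned explicitly:

<pre>
# Any explicit pinning (or cpupool assignment) in the domain config file
# disables automatic placement: the user's request is honored as-is.
name   = "vm1"
memory = 2048
vcpus  = 2
cpus   = "0-3"      # pin the vCPUs to pCPUs 0-3
# pool = "mypool"   # assigning the domain to a cpupool has the same effect
</pre>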
The [http://xenbits.xen.org/docs/unstable/misc/xl-numa-placement.html in-tree documentation] has all the details and is guaranteed to be kept up to date. Some "history", and the roadmap for future improvements to automatic NUMA placement, is available [[Xen_NUMA_Roadmap#Automatic_VM_placement|here]].
[[Category:Performance]]
[[Category:Developers]]
[[Category:NUMA]]
[[Category:Resource Management]]