Xen-netback NAPI + kThread V3 performance testing

Test Description

  • The VM template Split contains Debian Wheezy (7.0) 64-bit with a modified xen-netfront that supports split event channels. This template was cloned 8 times into Split-$N.
  • Tests were performed using iperf, with traffic flowing from dom0 to N domUs and from N domUs to dom0 (see the individual sections below for results).
  • iperf sessions ran for 120 seconds each and were repeated three times for each combination of 1 to N domU instances and 1 to 4 iperf threads. Results were stored as a serialised Python list for later analysis (a rough sketch of the measurement loop appears after this list).
  • Analysis scripts automatically produced the plots below from the stored results.
  • The domU instances were shut down between test repeats (but not when only the number of iperf threads was changed) to ensure that the results are comparable and uncorrelated.
  • No attempt was made to modify the dom0 configuration to improve performance beyond the default (XenServer) settings.
  • In the first instance, dom0 was running a 32-bit build of linux-next as of commit 404f7e793, plus a small XenServer-specific patch queue (mostly blktap changes).
  • The second run added the relevant NAPI + kthread-per-VIF patches to the above kernel.
  • The host has 12 cores on two sockets, across two nodes, with Hyperthreading enabled. The CPUs are Intel Xeon X5650 (2.67 GHz).
  • dom0 has 2147483648 bytes (2048 MB) of memory.
  • Each VM has 536870912 bytes (512 MB) of memory.
  • Each VM has two VCPUs; dom0 has six VCPUs.
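
As a rough illustration of the measurement loop described above, the Python sketch below runs the iperf sessions concurrently against each guest and pickles the results as a serialised list. It is not the harness actually used for these tests: the guest IP addresses, the results file name and the run_iperf/throughput_bps helpers are assumptions, and it relies on iperf (v2) CSV output (-y C).

  import pickle
  import subprocess

  DURATION = 120      # seconds per iperf run, as in the test description
  REPEATS = 3         # each combination was repeated three times
  # Assumed guest addressing: Split-1 .. Split-8 on consecutive IPs.
  GUEST_IPS = {n: "192.168.0.%d" % (100 + n) for n in range(1, 9)}

  def run_iperf(server_ip, threads):
      """Start one iperf (v2) client in CSV mode and return the process."""
      return subprocess.Popen(
          ["iperf", "-c", server_ip, "-t", str(DURATION),
           "-P", str(threads), "-y", "C"],
          stdout=subprocess.PIPE)

  def throughput_bps(proc):
      """Parse aggregate throughput (bits/s) from iperf's CSV output.

      With -P > 1 the last CSV row is the [SUM] line; its final field
      is the bandwidth in bits per second."""
      out, _ = proc.communicate()
      return float(out.decode().strip().splitlines()[-1].split(",")[8])

  results = []   # serialised Python list: one tuple per measurement
  for threads in range(1, 5):
      for n_vms in range(1, 9):
          for repeat in range(REPEATS):
              # The domUs would be restarted here between repeats.
              procs = {vm: run_iperf(GUEST_IPS[vm], threads)
                       for vm in range(1, n_vms + 1)}
              for vm, proc in procs.items():
                  results.append((n_vms, threads, repeat, vm,
                                  throughput_bps(proc)))

  with open("results.pickle", "wb") as f:
      pickle.dump(results, f)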

Interpreting the plots

On the plots below, red and orange lines refer to the plain (unpatched) kernel tests, while blue and purple lines refer to the patched kernel. The dotted green lines show CPU usage, which is more or less identical for both kernels.

The x-axis is number of VMs, from 0 to 8 (data starts at N=1). The left-hand y-axis is throughput in Gbit/s, corresponding to the red, orange, blue and purple lines. The right-hand y-axis is CPU usage (as a percentage of one dom0 VCPU), corresponding to the green lines. Error bars are ± 1 standard deviation.

For each test, four plots are shown, corresponding to tests run with between 1 and 4 iperf threads (1 iperf thread = 1 TCP stream).
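
The layout of these plots can be pictured with the short matplotlib sketch below. It uses placeholder data and is not the analysis script that produced the figures; the array names and values are purely illustrative.

  import numpy as np
  import matplotlib.pyplot as plt

  vms = np.arange(1, 9)               # number of active VMs (data starts at N=1)
  # Placeholder data; the real values come from the pickled results.
  plain_mean, plain_std = np.linspace(9, 6, 8), np.full(8, 0.4)
  patched_mean, patched_std = np.linspace(9, 6.5, 8), np.full(8, 0.4)
  cpu_usage = np.linspace(40, 95, 8)  # % of one dom0 VCPU

  fig, ax_tput = plt.subplots()
  ax_cpu = ax_tput.twinx()            # right-hand axis for CPU usage

  # Throughput on the left axis; error bars are +/- 1 standard deviation.
  ax_tput.errorbar(vms, plain_mean, yerr=plain_std, color="red",
                   label="plain kernel")
  ax_tput.errorbar(vms, patched_mean, yerr=patched_std, color="blue",
                   label="patched kernel")
  ax_tput.set_xlim(0, 8)
  ax_tput.set_xlabel("Number of VMs")
  ax_tput.set_ylabel("Throughput (Gbit/s)")

  # CPU usage on the right axis as a dotted green line.
  ax_cpu.plot(vms, cpu_usage, "g:", label="CPU usage")
  ax_cpu.set_ylabel("CPU usage (% of one dom0 VCPU)")

  ax_tput.legend(loc="upper left")
  plt.show()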

dom0 to Debian VM (Split event channels)

Plots: V3-split-1.png, V3-split-2.png, V3-split-3.png, V3-split-4.png (1 to 4 iperf threads)

Analysis

There is no appreciable difference resulting from the patches when the VM is _receiving_ traffic. This is expected, since the patches change the dom0 (backend) mechanism for receiving packets from a virtual network interface, not the mechanism for transmitting over one.

Debian VM (Split event channels) to dom0

Plots: V3-reverse-split-1.png, V3-reverse-split-2.png, V3-reverse-split-3.png, V3-reverse-split-4.png (1 to 4 iperf threads)

Analysis

  • For one and two TCP streams, the results show no significant change in throughput while up to four guests are transmitting, but the unpatched kernel performs better beyond that point.
  • For three and four TCP streams, the results indicate no significant change in throughput resulting from these patches.
  • Both of these scenarios differ from the V1 patches (results at xen-netback NAPI + kThread V1 performance testing), where the patched kernel outperformed the original for low numbers of TCP streams. Note, however, that the base kernel against which these patches were tested has also changed; the first run of tests was performed against a 3.6.11 kernel, while this set was performed against linux-next. The throughput of the unpatched kernel appears to have improved between 3.6.11 and linux-next, which may reduce the apparent benefit (in terms of throughput) of these patches.


Conclusions

The V3 patches have minimal performance impact, but should improve fairness on the dom0-to-VM pathway thanks to the kthread per VIF, and they also simplify the code for the VM-to-dom0 pathway. It may, however, be worth investigating the drop in VM-to-dom0 throughput above 4 active VMs with one or two TCP streams per VM.