Performance of Xen VCPU Scheduling

<under construction>

Matthew Portas

Marcus Granado

Introduction

This page evaluates the performance of the Xen VCPU schedulers with different parameters/patches and under different guest loads. The motivation was the observation that pinning the dom0 vcpus to a set of pcpus in the host using dom0_vcpus_pin, while preventing the guests from running on those dom0 pcpus (a configuration here called "exclusively-pinned dom0 vcpus", or xpin), increased the general performance of the guests on hosts with around 24 pcpus or more. We wanted to understand why this improved behaviour was not present in the default non-pinned state, and whether it would be possible to obtain this extra performance in the default state by changing some parameter or patching the Xen VCPU scheduler.

T1. Check that pinning/unpinning issue is present

T2. Experiment with scheduler parameter

T3. Understand if there is a latency problem

[Figure: Accumulated login time for the VMLogin events with xentrace logs taken at the 20th and 40th VM start. Exclusive pinning (blue) vs. No pinning (yellow). All the VMs are executing a cpu loop in order to maximize pCPU usage.]

  • At the 20th VM we have more vCPUs than pCPUs on the host.
  • With no pinning the Xen scheduler tends to group the new VM's vCPU running events into larger chunks on the same pCPU.
  • With exclusive pinning the Xen scheduler tends to interleave the new VM's vCPU running events with other VMs' vCPUs' running events.
  • With exclusive pinning the new VM's vCPU (black triangle) is believed to be doing IO at boot time, and yields its vCPU to Xen in order to handle this IO request after about 10us. Xen then schedules a different VM's vCPU (green square), which then runs (in the cpu loop) for a timeslice of about 1ms. Only then is the new VM given its vCPU back to handle its next IO request.

(ToDo: Run the same test but with a maximum Xen timeslice of less than 1ms, so that the new VM's IO requests are blocked for a shorter period of time.)

  • With no pinning, when the new VM's vCPU is yielded Xen doesn't schedule another VM's vCPU on the same pCPU. Therefore once the IO request has been handled, the VM's vCPU can be rescheduled and can process this IO request immediately.
  • This highlights a preference of the scheduler for vCPUs which are running the CPU loops, but only in the exclusively-pinned case.

[Figure: Accumulated login time for the VMStart events with xentrace logs taken at the 20th and 40th VM start. Exclusive pinning (blue) vs. No pinning (yellow). All the VMs are executing a cpu loop in order to maximize pCPU usage.]

sched_ratelimit_us: The Xen parameter sched_ratelimit_us sets the minimum amount of time for which a VM is allowed to run without being preempted. The default value is 1000 (1ms). Setting sched_ratelimit_us=0 on the Xen boot command line disables the rate limit entirely. The effects observed were:

  • For VMLogin with cpu_loop as the VM load, the boot time decreased dramatically for exclusive pinning, resulting in a time similar to that of no pinning. http://perf/?t=577
  • For VMStart with cpu_loop as the VM load, the boot time decreased dramatically for both exclusive pinning and no pinning. http://perf/?t=580
  • However, when performing the LoginVSI tests no improvement in score was achieved; instead the score was slightly worse. http://perf/?t=583
  • When no load is running in the VM there is barely any difference when using VMStart. http://perf/?t=585
  • However, when using VMLogin with no load in the VM, it performs better with this parameter if pinning is not used. If exclusive pinning is used then it actually performs much worse. http://perf/?t=593

Conclusion

  • The results show that sched_ratelimit_us can have a big effect on bootstorm performance. However, the effect is very dependent on the load that is being run in the VMs.
  • If Xen were able to categorise the kind of work a VM is doing, then the sched_ratelimit_us parameter could be adjusted automatically.
  • A prototype of this would be a suitable improvement to implement as part of this TDP.


T4. Understand what is causing latency

A Scale of Measure (SOM) to measure latency between dom0 and a guest was created, using a minimal vmping-pong round-trip protocol over an event channel. In this measurement, (t1) dom0 notifies the guest via an event channel; the guest receives this notification and immediately notifies back dom0, which receives it (t2). The latency reported in this SOM is (t2 - t1). These two values are comparable because both are obtained from dom0's clock.
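The sources for the vmping/vmpong tools are linked under T4.1 below. As a rough sketch of the mechanism only (not the actual tool), the dom0 side could be implemented with the libxenctrl event-channel API of that era along the following lines; the guest's port number is assumed to be exchanged out of band (e.g. via xenstore), and error handling is omitted.

/*
 * vmping sketch (dom0 side): measure event-channel round-trip latency.
 * Assumes the guest runs a pong loop that immediately notifies back on
 * the same channel when it receives our notification.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <xenctrl.h>

static double now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

int main(int argc, char **argv)
{
    int domid = atoi(argv[1]);        /* guest domain id */
    int rport = atoi(argv[2]);        /* guest's unbound event channel port */
    xc_evtchn *xce = xc_evtchn_open(NULL, 0);
    evtchn_port_t port = xc_evtchn_bind_interdomain(xce, domid, rport);

    double t1 = now_us();             /* (t1): dom0 notifies the guest */
    xc_evtchn_notify(xce, port);
    xc_evtchn_pending(xce);           /* blocks until the guest's pong arrives */
    double t2 = now_us();             /* (t2): dom0 receives the notification */
    xc_evtchn_unmask(xce, port);

    printf("event channel round trip: %.1f us\n", t2 - t1);
    xc_evtchn_close(xce);
    return 0;
}

Both timestamps are taken from dom0's clock, so (t2 - t1) is directly comparable across samples, as described above.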

T4.1. Run experiments for this SOM

The typical values observed are between 20us and 40us. Interestingly, in the nopin case only, there is a second cluster of measurements between 100us and 2000us when there are several VMs running. This second cluster is not visible in the xpin case.

No pinning event channel ping time, 60 1-vcpu win7 vms starting and running a tight cpuloop burning load on a 24 pcpu host. Green dots at the left show when the vms started. Green dots at the right show when the vms shut down.

[Figure: 50vms-with-cpuloop-load-nopin.png]

Exclusive pinning event channel ping time, 60 1-vcpu win7 vms starting and running a tight cpuloop burning load on a 24 pcpu host. Green dots at the left show when the vms started. Green dots at the right show when the vms shut down.

[Figure: 50vms-with-cpuloop-load-xpin.png]

  • source for vmping(dom0)
  • source for vmpong(linux guest)/xenpong(windows guest)


T5.1. Prototype patches for Xen to improve scheduling

The main theme for these patches is to make the Xen credit1 scheduler with nopin behave more similarly to xpin regarding the improved bootstorm times and VM density numbers, while avoiding the xpin issue seen when measuring vmlogin for many VMs running a cpuloop load (T3).

These guidelines for the approach detail the idea a bit more:

"We know that xpinning the dom0 vcpus makes bootstorms and vm density more efficient than when the dom0 vcpus are not xpinned. Interesting patches to the xen scheduler are therefore those that try to make the nopinn case closer to the xpin case. Then verify if the benefits of both xpin (larger density of vms) and nopin (more pcpus available for vms) can be obtained.

xpin-like strategy for nopin vcpu scheduling:

  • 1) hard? to patch: dom0 vcpus must not compete for the same pcpu(s).
  • 2) easy to patch: a guest vcpu (boosted or not) must not preempt the rate limit of a dom0 vcpu (regardless if it is boosted or not).
  • 3) easy to patch: a dom0 vcpu runs for a long time (e.g. rate limit 100ms) on the same pcpu without being preempted, unless it blocks.

beyond the xpin strategy:

  • 4) easy to patch: a dom0 vcpu is always in the boosted state.
  • 5) easy to patch: a dom0 vcpu never runs out of credit.

"

Allow Domain0 vCPUs to preempt other vCPUs before they complete their ratelimited timeslice

  • Change sched_credit.c function csched_schedule()
    • allow-dom0-to-preempt-before-ratelimit-complete.patch
# HG changeset patch
# Parent 628da3faccb7a1b716a75f96ae2144ff4a5d62ed

diff -r 628da3faccb7 xen/common/sched_credit.c
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -1455,7 +1455,8 @@ csched_schedule(
          && prv->ratelimit_us
          && vcpu_runnable(current)
          && !is_idle_vcpu(current)
-         && runtime < MICROSECS(prv->ratelimit_us) )
+         && runtime < MICROSECS(prv->ratelimit_us)
+         && __runq_elem(runq->next)->vcpu->domain->domain_id != 0 )
     {
         snext = scurr;
         snext->start_time += now;

CPU Loop Load

  • Test: Waiting for VMLogin events, while starting 1 VM at a time: http://perf/?t=644 http://perf/?t=715
    • Exclusive pinning: minimal change.
    • No pinning: minimal change.
  • Test: Waiting for VMStart events, while starting 1 VM at a time: http://perf/?t=642 http://perf/?t=716
    • Exclusive pinning: no change.
    • No pinning: small regression before CPU contention, but large improvement after. (36%)
  • Test: Waiting for VMStart events, while starting 25 VMs at a time: http://perf/?t=641 http://perf/?t=717
    • Exclusive pinning: minimal change.
    • No pinning: large regression. (30%)

Observation

Minimal changes are observed when exclusive pinning is used. This is as expected as only dom0 vCPUs are queued on the exclusive pCPUs.

Allow blocked vCPUs to be migrated to other pCPUs before the migration limit is reached

  • If a vCPU yields itself (i.e. enters the blocked state), allow it to be migrated to other pCPUs regardless of the time left before the migration limit is reached.
    • The next vCPU on that pCPU may hold it for up to 1ms (the ratelimit time), so if the request is handled by dom0 before then, the blocked vCPU can't run even if there is a free pCPU elsewhere.
  • Change sched_credit.c function __csched_vcpu_is_migrateable()
    • allow-migrate-before-limit-if-block.patch
# HG changeset patch
# Parent d125cfd1125347ab8ace76402b36621c22dc8747

diff -r d125cfd11253 xen/common/sched_credit.c
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -456,7 +456,7 @@ __csched_vcpu_is_migrateable(struct vcpu
      * peer PCPU. Only pick up work that's allowed to run on our CPU.
      */
     return !vc->is_running &&
-           !__csched_vcpu_is_cache_hot(vc) &&
+           (!__csched_vcpu_is_cache_hot(vc) || vc->runstate.state == RUNSTATE_offline) &&
            cpumask_test_cpu(dest_cpu, vc->cpu_affinity);
 }

Increase cache hits for dom0 vCPUs in the no pinning case

  • Try to reduce the amount that dom0 vCPUs move around between pCPUs, so that dom0's behaviour is closer to what happens when xpinning its vcpus. This should increase cache hits and potentially increase the performance of dom0.
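No patch is attached to this idea yet. As a minimal sketch only, assuming the simplest possible policy (never let the load balancer pull a dom0 vCPU to another pCPU), the migratability test shown earlier could be extended like this:

/*
 * Sketch only: keep dom0 vCPUs where they are so their caches stay warm,
 * mimicking the stable placement seen with xpin.  Extends the existing
 * __csched_vcpu_is_migrateable() test in xen/common/sched_credit.c.
 */
static inline int
__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu)
{
    /* Never pull a dom0 vCPU over to another pCPU. */
    if ( vc->domain->domain_id == 0 )
        return 0;

    return !vc->is_running &&
           !__csched_vcpu_is_cache_hot(vc) &&
           cpumask_test_cpu(dest_cpu, vc->cpu_affinity);
}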

Schedule a vm vcpu to run on the same pcpu runqueue as the dom0 vcpu that woke the vm vcpu

(George suggested on 03/Jul/2013)

  • This could reduce the impact of a cold cache when accessing memory.
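No patch exists for this one either. A hypothetical sketch of where the placement decision could go (the helper name below is made up; in practice this logic would sit inside csched_vcpu_wake() in xen/common/sched_credit.c):

/*
 * Sketch only: when a dom0 vCPU wakes a guest vCPU, queue the guest vCPU
 * on the pCPU the waking dom0 vCPU is running on, in the hope that the
 * data it is about to consume is still in that pCPU's cache.
 */
static void
csched_vcpu_wake_place(struct vcpu *vc)
{
    /* 'current' is the vCPU performing the wakeup, e.g. a dom0 vCPU
     * that has just completed an I/O request on behalf of the guest. */
    if ( !is_idle_vcpu(current) &&
         current->domain->domain_id == 0 &&
         cpumask_test_cpu(smp_processor_id(), vc->cpu_affinity) )
        vc->processor = smp_processor_id();
}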

Allow CSCHED_PRI_TS_BOOST vCPUs to preempt other vCPUs

  • If a vCPU is woken up because its request has been handled, its priority is set to CSCHED_PRI_TS_BOOST. However, if another vCPU is using the pCPU then the woken vCPU has to wait up to the rate limit before it can process the request.
    • allow-BOOST-vcpus-to-preempt-ratelimit.patch
# HG changeset patch
# Parent 42a7bd2f4cd2d9223e7b550912d4e938965841e3
diff -r 42a7bd2f4cd2 -r af469897377e xen/common/sched_credit.c
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -1455,7 +1455,8 @@ csched_schedule(
          && prv->ratelimit_us
          && vcpu_runnable(current)
          && !is_idle_vcpu(current)
-         && runtime < MICROSECS(prv->ratelimit_us) )
+         && runtime < MICROSECS(prv->ratelimit_us)
+         && __runq_elem(runq->next)->pri != CSCHED_PRI_TS_BOOST )
     {
         snext = scurr;
         snext->start_time += now;

Version 2

  • Only allow a vCPU with priority CSCHED_PRI_TS_BOOST to preempt the currently running vCPU if the running vCPU isn't itself in the CSCHED_PRI_TS_BOOST state.
  • Hopefully this will reduce some context switching between two boosted vCPUs on the same pCPU.
    • boost-preempt-if-curr-not-boost.patch
# HG changeset patch
# Parent 42a7bd2f4cd2d9223e7b550912d4e938965841e3

diff -r 42a7bd2f4cd2 xen/common/sched_credit.c
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -1455,7 +1455,8 @@ csched_schedule(
          && prv->ratelimit_us
          && vcpu_runnable(current)
          && !is_idle_vcpu(current)
-         && runtime < MICROSECS(prv->ratelimit_us) )
+         && runtime < MICROSECS(prv->ratelimit_us)
+         && !(__runq_elem(runq->next)->pri == CSCHED_PRI_TS_BOOST && scurr->pri != CSCHED_PRI_TS_BOOST) )
     {
         snext = scurr;
         snext->start_time += now;

Allow a vCPU to preempt the ratelimit if it was woken up after an iommu completion

(George suggested on 03/Jul/2013)

  • If a vCPU is taken offline due to an iommu request it will be blocked until this request is processed. Once the request has been completed the vCPU has to wait to be scheduled again. During this time further requests can't be queued, so they can't be processed in a batch after it is woken up.
  • Therefore, if an HVM vCPU is woken up from an iommu request completion, allow it to ignore the ratelimit of other vCPUs.
  • This is likely to occur during the boot process, so it may improve bootstorm performance and hopefully reduce the impact of other workloads.
    • ignore-ratelimit-if-woke-from-iommu.patch
# HG changeset patch
# Parent 42a7bd2f4cd2d9223e7b550912d4e938965841e3

diff -r 42a7bd2f4cd2 xen/arch/x86/hvm/hvm.c
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -905,6 +905,8 @@ static int hvm_load_cpu_ctxt(struct doma
     /* Auxiliary processors should be woken immediately. */
     v->is_initialised = 1;
     clear_bit(_VPF_down, &v->pause_flags);
+    /* Notify scheduler that the domain was blocked due to an iommu request */
+    v->is_iommu_request = 1;
     vcpu_wake(v);
 
     return 0;
diff -r 42a7bd2f4cd2 xen/common/sched_credit.c
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -1455,7 +1455,8 @@ csched_schedule(
          && prv->ratelimit_us
          && vcpu_runnable(current)
          && !is_idle_vcpu(current)
-         && runtime < MICROSECS(prv->ratelimit_us) )
+         && runtime < MICROSECS(prv->ratelimit_us)
+         && !__runq_elem(runq->next)->vcpu->is_iommu_request )
     {
         snext = scurr;
         snext->start_time += now;
@@ -1490,6 +1491,11 @@ csched_schedule(
     clear_bit(CSCHED_FLAG_VCPU_YIELD, &scurr->flags);
 
     /*
+     * Clear the is_iommu_request flag of snext if it was set
+     */
+    snext->vcpu->is_iommu_request = 0;
+
+    /*
      * SMP Load balance:
      *
      * If the next highest priority local runnable VCPU has already eaten
diff -r 42a7bd2f4cd2 xen/include/xen/sched.h
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -125,6 +125,8 @@ struct vcpu
     bool_t           is_running;
     /* VCPU should wake fast (do not deep sleep the CPU). */
     bool_t           is_urgent;
+    /* VCPU was woken from an iommu completion request */
+    bool_t           is_iommu_request;
 
 #ifdef VCPU_TRAP_LAST
 #define VCPU_TRAP_NONE    0

Attempt to dynamically adjust the ratelimit of VMs

  • When a VM is booting it performs lots of I/O operations, blocking itself frequently. Once a request has been handled, the VM needs to run again for only a short amount of time.
  • After boot, if the VM is running a CPU loop, it performs very few I/O operations. It also isn't latency sensitive, so it doesn't matter whether it runs for many small timeslices or for one long timeslice.
  • The stats collected by the scheduler might be usable to adjust the ratelimit of VMs automatically.
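As a purely illustrative sketch of such a heuristic (struct csched_dom has no such fields in Xen 4.2, and blocks_per_sec would have to be derived from the scheduler's existing accounting):

/*
 * Sketch only: lower the ratelimit for domains that block often (I/O
 * bound, e.g. while booting) and raise it for domains that rarely
 * block (CPU bound).  All fields below are hypothetical.
 */
static void
csched_dom_adjust_ratelimit(struct csched_dom *sdom)
{
    if ( sdom->blocks_per_sec > 100 )   /* I/O bound: keep wakeup latency low */
        sdom->ratelimit_us = 100;
    else                                /* CPU bound: long timeslices are fine */
        sdom->ratelimit_us = 1000;
}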

Don't clear dom0 BOOST priority

  • Don't clear the CSCHED_PRI_TS_BOOST priority of dom0 vCPUs; this should allow dom0 vCPUs to be scheduled more often.
    • dont-clean-dom0-boost-priotiry.patch
# HG changeset patch
# Parent 55fec41cd8ddaa721cdb0910cd7e05f74c293748
diff -r 55fec41cd8dd -r 723eb176d874 xen/common/sched_credit.c
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -690,7 +690,8 @@ csched_vcpu_acct(struct csched_private *
      * If the VCPU is found here, then it's consuming a non-negligeable
      * amount of CPU resources and should no longer be boosted.
      */
-    if ( svc->pri == CSCHED_PRI_TS_BOOST )
+    if ( svc->vcpu->domain->domain_id != 0 &&
+         svc->pri == CSCHED_PRI_TS_BOOST )
         svc->pri = CSCHED_PRI_TS_UNDER;
 
     /*

Increase timeslice of vCPUs if they are the only vCPU running on a pCPU

  • If the pCPU only has one vCPU to run (not including the idle domain), increase the timeslice in order to decrease the number of ticks.
  • This should reduce the number of times the vCPU is paused and then rescheduled, therefore reducing context switching.
  • Maybe it is even possible to become tickless if there is only a single vCPU.
  • (The HAS_RUNQ_ONE_DOM macro in the patch below checks that the runqueue holds exactly two entries, one of which is the idle vCPU, i.e. only one real vCPU is queued on the pCPU.)
    • increase-tslice-if-only-dom.patch
# HG changeset patch
# Parent 42a7bd2f4cd2d9223e7b550912d4e938965841e3

diff -r 42a7bd2f4cd2 xen/common/sched_credit.c
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -1402,6 +1402,10 @@ csched_load_balance(struct csched_privat
     return snext;
 }
 
+
+#define HAS_RUNQ_ONE_DOM(_head)    (  (((_head->next)->next)->next == _head) &&  \
+                           ( is_idle_vcpu(__runq_elem(_head->next)->vcpu) || is_idle_vcpu(__runq_elem(_head->prev)->vcpu) ) )
+
 /*
  * This function is in the critical path. It is designed to be simple and
  * fast for the common case.
@@ -1489,6 +1493,10 @@ csched_schedule(
      */
     clear_bit(CSCHED_FLAG_VCPU_YIELD, &scurr->flags);
 
+    if ( HAS_RUNQ_ONE_DOM(runq) ) {
+        tslice = MILLISECS(prv->tslice_ms * 10);
+    }
+
     /*
      * SMP Load balance:
      *
  • Version 2


Don't migrate dom0 vCPUs to other pCPUs which have another dom0 vCPU in their queue

  • Don't make dom0 vCPUs compete against each other for resources. Keep them spread out across the system and with only one on any given pCPU.
    • Patch
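The patch has not been written yet. A sketch of the check it would need (runq_has_dom0() is a hypothetical helper; __csched_vcpu_is_migrateable() could refuse to migrate a dom0 vCPU to any pCPU for which it returns true):

/*
 * Sketch only: returns nonzero if the given pCPU runqueue already
 * holds a dom0 vCPU.
 */
static int
runq_has_dom0(struct list_head *runq)
{
    struct list_head *iter;

    list_for_each ( iter, runq )
    {
        struct csched_vcpu *svc = __runq_elem(iter);

        if ( !is_idle_vcpu(svc->vcpu) && svc->vcpu->domain->domain_id == 0 )
            return 1;
    }
    return 0;
}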

Place Domain0 vCPUs at the front of the queue where they are scheduled

  • Potential fix: add the dom0 vcpu to the front of a pcpu runq whenever it is scheduled; see:

http://lists.xen.org/archives/html/xen-devel/2013-05/msg03153.html

    • Doesn't work as expected; increases dom0 scheduling latency to 10ms.
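For reference, a sketch of the idea from that thread (a hypothetical variant of __runq_insert() in xen/common/sched_credit.c, not the actual posted patch):

/*
 * Sketch only: put a dom0 vCPU at the head of the pCPU runqueue instead
 * of inserting it in priority order.  As noted above, this did not work
 * as expected: dom0 scheduling latency grew to around 10ms.
 */
static inline void
__runq_insert_dom0_first(unsigned int cpu, struct csched_vcpu *svc)
{
    struct list_head * const runq = RUNQ(cpu);

    if ( svc->vcpu->domain->domain_id == 0 )
    {
        list_add(&svc->runq_elem, runq);   /* dom0 jumps the queue */
        return;
    }
    /* ... fall back to the existing priority-ordered insertion ... */
}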

Unpatched Xen 4.2

Used as the baseline to compare against all the patched versions of Xen 4.2.

  • CPU Loop Load
  • LoginVSI Load
  • No Load