Performance of Xen VCPU Scheduling

Revision as of 10:56, 10 July 2013

<under construction>

Matthew Portas

Marcus Granado

Introduction

This page evaluates the performance of the Xen VCPU schedulers with different parameters/patches and under different guest loads. The motivation was an observation: when the dom0 vcpus were pinned to a set of pcpus in the host using dom0_vcpus_pin and the guests were prevented from running on the dom0 pcpus (here called "exclusively-pinned dom0 vcpus"), the general performance of the guests increased in a host with around 24 pcpus or more. The aim was to understand why this improved behaviour was not present in the default non-pinned state, and whether the extra performance could be obtained in the default state by changing some parameter or patching the Xen VCPU scheduler.

T1. Check that pinning/unpinning issue is present

T2. Experiment with scheduler parameter

T3. Understand if there is a latency problem

Accumulated login time for the VMLogin events with xentrace logs taken at 20th and 40th VM start. Exclusive pinning (blue) vs. No pinning (yellow). All the VMs are executing a cpu loop in order to maximize pCPU usage.

  • At the 20th VM we have more vCPUs than pCPUs on the host.
  • With no pinning, the Xen scheduler tends to group the new VM's vCPU running events into larger chunks on the same pCPU.
  • With exclusive pinning, the Xen scheduler tends to interleave the new VM's vCPU running events with other VMs' vCPU running events.
  • With exclusive pinning, the new VM's vCPU (black triangle) is believed to be doing IO at boot time and yields its vCPU to Xen after about 10us so that this IO request can be handled. Xen then schedules a different VM's vCPU (green square), which runs (in the cpu loop) for a timeslice of about 1ms. Only then is the new VM given its vCPU back to handle its next IO request.

(ToDo: Run the same test but with a maximum Xen timeslice of less than 1ms so that the new VM's IO requests are blocked for a shorter period of time.)

  • With no pinning, when the new VM's vCPU yields, Xen doesn't schedule another VM's vCPU on the same pCPU. Therefore, once the IO request has been handled, the VM's vCPU can be rescheduled and process this IO request immediately.
  • This highlights a preference of the scheduler for vCPUs which are running the CPU loops, but only in the exclusively pinned case.

Accumulated login time for the VMStart events with xentrace logs taken at 20th and 40th VM start. Exclusive pinning (blue) vs. No pinning (yellow). All the VMs are executing a cpu loop in order to maximize pCPU usage.

sched_ratelimit_us: The Xen parameter sched_ratelimit_us sets the minimum amount of time for which a VM is allowed to run without being preempted. The default value is 1000 (1ms). Results of setting sched_ratelimit_us=0:

  • For VMLogin with cpu_loop as the VM load the boot time decreased dramatically for exclusive pinning, resulting in a time similar to that of no pinning. http://perf/?t=577
  • For VMStart with cpu_loop as the VM load the boot time decreased dramatically for both exclusive pinning and no pinning. http://perf/?t=580
  • However, when performing the LoginVSI tests no improvement in score was achieved and instead a slightly worse score was achieved. http://perf/?t=583
  • When no load is running in the VM there is barely any difference when using VMStart http://perf/?t=585.
  • However, when using VMLogin with no load in the VM, it performs better with this parameter if pinning is not used. If exclusive pinning is used then it actually performs much worse. http://perf/?t=593
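
The effect of the parameter on the trace described earlier can be put into numbers with a minimal sketch. The helper below is hypothetical and is not taken from sched_credit.c; it only models the rule stated above, that a running vCPU is not preempted until it has run for sched_ratelimit_us, and that a value of 0 disables the limit.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not Xen code): how long, in microseconds, a newly
 * woken vCPU has to wait before the vCPU currently on the pCPU may be
 * preempted, given how long that vCPU has already run.
 */
static uint64_t wait_before_preempt_us(uint64_t ran_for_us,
                                       uint64_t ratelimit_us)
{
    /* A ratelimit of 0 disables rate limiting; otherwise the limit is
     * satisfied once the running vCPU has used up its minimum time. */
    if (ratelimit_us == 0 || ran_for_us >= ratelimit_us)
        return 0;
    return ratelimit_us - ran_for_us;
}
```

With the default of 1000, a vCPU whose IO completes 10us after a cpu-loop vCPU was scheduled waits another 990us; with sched_ratelimit_us=0 it waits 0us, which matches the direction of the boot time changes listed above.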

Conclusion

  • The results show that sched_ratelimit_us can have a big effect on the bootstorm performance. However, it is very dependent on the load that is being run in the VMs.
  • If Xen was able to categorise the type of work a VM is doing then the sched_ratelimit_us parameter could be adjusted automatically.
  • A prototype of this would be a suitable improvement to implement as part of this TDP.


T5.1. Prototype patches for Xen to improve scheduling

Allow Domain0 vCPUs to preempt other vCPUs before they complete their ratelimited timeslice

  • Change sched_credit.c function csched_schedule()
    • allow-dom0-to-preempt-before-ratelimit-complete.patch
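
The idea behind this patch can be sketched as follows. This is not the contents of the patch or of csched_schedule(); the struct, the field names and the function are assumptions made for illustration, and only the rule itself (dom0 is exempt from the ratelimit of the running vCPU) comes from the description above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of a vCPU; not Xen's struct vcpu. */
struct sched_vcpu_sketch {
    int      domain_id;   /* 0 means Domain0 */
    uint64_t ran_for_us;  /* time spent running in the current slice */
};

/*
 * Returns true if the ratelimit still protects `curr` from being
 * preempted by `next`. The dom0 test is the proposed change.
 */
static bool ratelimit_protects_current(const struct sched_vcpu_sketch *curr,
                                       const struct sched_vcpu_sketch *next,
                                       uint64_t ratelimit_us)
{
    if (ratelimit_us == 0)
        return false;                   /* rate limiting disabled */
    if (curr->ran_for_us >= ratelimit_us)
        return false;                   /* minimum run time already served */
    if (next != NULL && next->domain_id == 0)
        return false;                   /* proposed: dom0 may preempt early */
    return true;
}
```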

CPU Loop Load

  • Test: Waiting for VMLogin events, while starting 1 VM at a time: http://perf/?t=644 http://perf/?t=715
    • Exclusive pinning: minimal change.
    • No pinning: minimal change.
  • Test: Waiting for VMStart events, while starting 1 VM at a time: http://perf/?t=642 http://perf/?t=716
    • Exclusive pinning: no change.
    • No pinning: small regression before CPU contention, but large improvement after. (36%)
  • Test: Waiting for VMStart events, while starting 25 VMs at a time: http://perf/?t=641 http://perf/?t=717
    • Exclusive pinning: minimal change.
    • No pinning: large regression. (30%)

Observation

Minimal changes are observed when exclusive pinning is used. This is as expected as only dom0 vCPUs are queued on the exclusive pCPUs.

Allow blocked vCPUs to be migrated to other pCPUs before the migration limit is reached

  • If a vCPU yields itself (i.e. enters the blocked state), allow it to be migrated to other pCPUs regardless of the time left before the migration limit is reached.
    • The next vCPU on that pCPU may hold it for up to 1ms (the ratelimit time). Thus, if the request is handled by dom0 before then, the blocked vCPU can't run even if there is a free pCPU elsewhere.
  • Change sched_credit.c function __csched_vcpu_is_migrateable()
    • allow-migrate-before-limit-if-block.patch
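
A minimal sketch of the proposed check, under stated assumptions: the struct, its fields and the migration_delay_us parameter are hypothetical stand-ins, and the function is not the real __csched_vcpu_is_migrateable(). It only shows where the "blocked" exception would sit relative to the existing migration limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified vCPU state; not Xen's data structures. */
struct migr_vcpu_sketch {
    bool     running;              /* currently on a pCPU */
    bool     blocked_voluntarily;  /* last gave up its pCPU by blocking */
    uint64_t last_ran_us;          /* timestamp of its last run */
};

static bool is_migrateable_sketch(const struct migr_vcpu_sketch *v,
                                  uint64_t now_us,
                                  uint64_t migration_delay_us)
{
    assert(now_us >= v->last_ran_us);

    if (v->running)
        return false;   /* never move a vCPU that is on a pCPU */
    if (v->blocked_voluntarily)
        return true;    /* proposed: ignore the migration limit */
    /* Otherwise, only migrate once the migration limit has elapsed. */
    return now_us - v->last_ran_us >= migration_delay_us;
}
```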


Increase cache hits for dom0 vCPUs in the no pinning case

  • Try to reduce the amount that dom0 vCPUs move about, so that the dom0 moving behaviour is closer to what happens when exclusively pinning its vcpus. This should increase the cache hits and potentially increase the performance of dom0.

Schedule a vm vcpu to run on the same pcpu runqueue as the dom0 vcpu that woke the vm vcpu

(George suggested on 03/Jul/2013)

  • This could reduce impact of cold cache when accessing memory.

Allow CSCHED_PRI_TS_BOOST vCPUs to preempt other vCPUs

  • If a vCPU is woken up because its request has been handled, its priority is set to CSCHED_PRI_TS_BOOST. However, if another vCPU is using the pCPU then the woken vCPU has to wait up to the rate limit before it can process the request.
    • allow-BOOST-vcpus-to-preempt-ratelimit.patch

Version 2

  • Only allow vCPUs with priority CSCHED_PRI_TS_BOOST to preempt the currently running vCPU if that vCPU isn't itself in the CSCHED_PRI_TS_BOOST state.
  • Hopefully this will reduce some context switching of two boosted vCPUs on the same pCPU.
    • boost-preempt-if-curr-not-boost.patch
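
The difference between the two versions can be sketched as two predicates. The enum below is a hypothetical stand-in for the credit scheduler priorities (only the BOOST level corresponds to CSCHED_PRI_TS_BOOST named above; the numeric values are illustrative), and neither function is taken from the patches.

```c
#include <stdbool.h>

/* Hypothetical priority levels; values are illustrative only. */
enum pri_sketch {
    PRI_SKETCH_OVER  = -2,
    PRI_SKETCH_UNDER = -1,
    PRI_SKETCH_BOOST = 0   /* stands in for CSCHED_PRI_TS_BOOST */
};

/* Version 1: any boosted vCPU may ignore the ratelimit of the
 * currently running vCPU. */
static bool boost_may_preempt_v1(enum pri_sketch woken)
{
    return woken == PRI_SKETCH_BOOST;
}

/* Version 2: a boosted vCPU may only do so if the running vCPU is not
 * boosted itself, so two boosted vCPUs do not keep preempting each other. */
static bool boost_may_preempt_v2(enum pri_sketch woken, enum pri_sketch curr)
{
    return woken == PRI_SKETCH_BOOST && curr != PRI_SKETCH_BOOST;
}
```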


Allow a vCPU to ignore the ratelimit if it was woken up after an iommu completion

(George suggested on 03/Jul/2013)

  • If a vCPU is taken offline due to an iommu request, it will be blocked until this request is processed. Once the request has been completed, the vCPU has to wait to be scheduled again. During this time further requests can't be queued, so they cannot be processed in a batch after it is woken up.
  • Therefore, if an HVM vCPU is woken up by an iommu request completion, allow it to ignore the ratelimit of other vCPUs.
  • This is likely to occur during the boot process and may improve the bootstorm performance and hopefully reduce the impact of other workloads.
    • ignore-ratelimit-if-woke-from-iommu.patch

Attempt to dynamically adjust the ratelimit of VMs

  • When a VM is booting, it performs lots of I/O operations, thus blocking itself frequently. Once a request has been handled, the VM will need to run again for only a short amount of time.
  • After boot, if the VM is running a CPU loop, then it performs very few I/O operations. It also isn't latency sensitive, and it doesn't matter if it runs for many small timeslices or for one long timeslice.
  • The stats collected by the scheduler might be usable to adjust the ratelimit of VMs.
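
One way the two behaviours above might be told apart is by the average time a VM runs before it blocks. The sketch below is purely illustrative: the statistics struct, the threshold parameter and the classification rule are all assumptions, not statistics that the scheduler is known to collect, and how a classification would map to a ratelimit value is left open.

```c
#include <stdint.h>

/* Hypothetical per-VM counters accumulated over some sampling window. */
struct vm_stats_sketch {
    uint64_t total_run_us;  /* total time the VM's vCPUs spent running */
    uint64_t block_count;   /* number of times they blocked voluntarily */
};

enum workload_sketch { WORKLOAD_IO_BOUND, WORKLOAD_CPU_BOUND };

/*
 * Classify a VM by its average run burst between blocks: short bursts
 * suggest boot-time style I/O, long or unbounded bursts a CPU loop.
 */
static enum workload_sketch classify_workload(const struct vm_stats_sketch *s,
                                              uint64_t burst_threshold_us)
{
    if (s->block_count == 0)
        return WORKLOAD_CPU_BOUND;  /* never blocked in this window */
    if (s->total_run_us / s->block_count < burst_threshold_us)
        return WORKLOAD_IO_BOUND;
    return WORKLOAD_CPU_BOUND;
}
```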

Don't clear dom0 BOOST priority

  • Don't clear the CSCHED_PRI_TS_BOOST priority of dom0 vCPUs, which should allow dom0 vCPUs to be scheduled more often.
    • dont-clean-dom0-boost-priotiry.patch

Increase timeslice of vCPUs if they are the only vCPU running on a pCPU

  • If the pCPU only has one vCPU to run (not including the idle domain), increase the timeslice in order to decrease the number of ticks.
  • This should reduce the number of times the vCPU is paused and then rescheduled, and therefore reduce context switching.
  • Maybe it is possible to become tickless if there is only a single vCPU.
    • increase-tslice-if-only-dom.patch
  • Version 2
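
A minimal sketch of the timeslice rule described above. The function, its parameters and the multiplication factor are hypothetical and are not taken from increase-tslice-if-only-dom.patch; the count of runnable vCPUs is assumed to exclude the idle domain.

```c
#include <stdint.h>

/*
 * Hypothetical helper: pick the timeslice for the vCPU about to run.
 * `runnable_vcpus` is the number of non-idle vCPUs queued on this pCPU,
 * including the one being scheduled.
 */
static uint64_t pick_timeslice_us(unsigned int runnable_vcpus,
                                  uint64_t base_tslice_us,
                                  unsigned int lone_vcpu_factor)
{
    /* With nobody else waiting, a longer slice costs no latency and
     * saves scheduler invocations. */
    if (runnable_vcpus <= 1)
        return base_tslice_us * lone_vcpu_factor;
    return base_tslice_us;
}
```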


Don't migrate dom0 vCPUs to other pCPUs which have another dom0 vCPU in their queue

  • Don't make dom0 vCPUs compete against each other for resources. Keep them spread out across the system and with only one on any given pCPU.
    • Patch
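
The placement rule could look roughly like the sketch below. The array of flags and the function are assumptions for illustration (a real implementation would inspect the per-pCPU runqueues); only the policy of keeping at most one dom0 vCPU per pCPU comes from the description above.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: choose a pCPU for a dom0 vCPU.
 * has_other_dom0[i] is true if pCPU i already has a different dom0 vCPU
 * in its queue. Stays on `current` when that pCPU is not shared with
 * another dom0 vCPU, or when no dom0-free pCPU exists.
 */
static int pick_pcpu_for_dom0(const bool *has_other_dom0, size_t npcpus,
                              int current)
{
    size_t i;

    if (!has_other_dom0[current])
        return current;             /* already alone: do not move */
    for (i = 0; i < npcpus; i++) {
        if (!has_other_dom0[i])
            return (int)i;          /* first pCPU without a dom0 vCPU */
    }
    return current;                 /* every pCPU is taken: stay put */
}
```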

Place Domain0 vCPUs at the front of the queue where they are scheduled

  • Potential fix: add the dom0 vcpu to the front of a pcpu runq whenever it is scheduled: http://lists.xen.org/archives/html/xen-devel/2013-05/msg03153.html
    • Doesn't work as expected; increases dom0 scheduling latency to 10ms.

Unpatched Xen 4.2

Used as the baseline to compare against all the patched versions of Xen 4.2.

  • CPU Loop Load
  • LoginVSI Load
  • No Load