Credit Scheduler


Overview

Credit is a (weighted) proportional fair share virtual CPU scheduler. It was the first Xen scheduler designed from the start to be fully work-conserving on SMP hosts. Each virtual machine is assigned a weight and a cap. A cap of 0 puts the VM in work-conserving mode. A non-zero cap means the vCPUs of the VM will not receive more than a fixed amount of CPU time, even if the system is otherwise idle (non-work-conserving mode).

It is quantum based, and the timeslice is 30 ms. This (roughly) means that a vCPU can run for up to 30 ms before being preempted by another vCPU. That is nowadays a rather long interval of time, but:

  • it was not that long at the time when Credit was designed and implemented,
  • less frequent preemption is good for the throughput of CPU-bound workloads.

Credit tries to compensate for its long timeslice by giving I/O-intensive vCPUs a priority boost. Roughly speaking, a vCPU that wakes up after waiting for I/O will likely get to run immediately.

Credit is the default scheduler of Xen and provides satisfactory performance for many workloads. If applications that need low latency (some classes of networking applications, audio, etc.) suffer, a potential mitigation is to change the timeslice (see below).

Global Scheduler Parameters

Timeslice

The timeslice (also known, in other contexts, as the scheduling quantum) is how long a vCPU can run before the scheduler steps in and decides whether a preemption should occur. If so, some other vCPU is put into execution.

A long timeslice is usually good for achieving high throughput of CPU-intensive workloads, as it prevents context switches from happening too frequently, which may lead to thrashing of the CPU cache(s). The best timeslice value, though, is highly workload dependent. Credit has, by default, a timeslice of 30ms, which can be considered fairly long.

In Xen 4.2, we introduced the tslice_ms scheduler parameter. This can be set either using the Xen command-line option, sched_credit_tslice_ms, or, at run time, with xl sched-credit:

# xl sched-credit -t [n]

Possibly good values are 10ms, 5ms, and 1ms. Smaller values are expected to suit latency-sensitive workloads better, but at the cost of increased context-switch overhead and reduced CPU cache effectiveness.
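As a rough illustration of the overhead side of this trade-off, the sketch below computes an upper bound on preemptions per second per physical CPU caused by timeslice expiry alone. The function name is hypothetical, and the figure ignores context switches caused by vCPUs blocking, yielding, or waking up.

```python
def max_timeslice_preemptions_per_sec(tslice_ms: float) -> float:
    """Upper bound on timeslice-expiry preemptions per second on one pCPU.

    Simplification: assumes every vCPU always uses its whole timeslice.
    """
    if tslice_ms <= 0:
        raise ValueError("timeslice must be positive")
    return 1000.0 / tslice_ms


if __name__ == "__main__":
    for tslice in (30, 10, 5, 1):
        rate = max_timeslice_preemptions_per_sec(tslice)
        print(f"tslice={tslice}ms -> at most {rate:.0f} preemptions/s")
```

Going from 30ms to 1ms raises this bound by a factor of 30, which is why the effect on a given workload should be measured rather than assumed.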

The default value of 30ms is widely regarded as anachronistically high. There has been an attempt to change it to something smaller, but, unfortunately, because of the intrinsic characteristics of the Credit algorithm, changing the timeslice has side effects that are not easily predictable, so the change was pushed back.

Therefore, using a different (likely smaller) timeslice value may be beneficial for a particular workload, but that can only be assessed by experimentation.

Context-Switch Rate Limiting

There may be cases where interrupt-intensive workloads (e.g., an interrupt wakes up a VM, which does a few microseconds of work and goes back to sleep), coupled with the boost Credit gives to vCPUs doing I/O, cause thousands of scheduler invocations per second. Measurements done by Intel on the SpecVirt benchmark found up to 15,000 schedules per second.

We therefore introduced context-switch rate limiting, configured via the ratelimit_us parameter. If non-zero, the ratelimit value (expressed in microseconds) is the minimum amount of time for which a VM is allowed to run without being preempted. The default value is 1000 (1ms). So if a VM starts running, even if another VM with higher priority wakes up, there will be no preemption until the first VM has run for 1ms. According to the measurements above, this significantly increased SpecVirt performance.
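The rule described above can be sketched as a small decision function. This is a simplified model, not the hypervisor's code: the function and parameter names are hypothetical, and it only covers the case of a waking vCPU trying to preempt the one currently running.

```python
def may_preempt(ran_us: int, ratelimit_us: int, waker_has_higher_prio: bool) -> bool:
    """Decide whether a waking vCPU may preempt the running one.

    ran_us: how long the current vCPU has been running, in microseconds.
    ratelimit_us: minimum uninterrupted run time; 0 disables rate limiting.
    """
    if not waker_has_higher_prio:
        return False  # nothing to preempt for
    if ratelimit_us == 0:
        return True   # rate limiting disabled: preempt immediately
    return ran_us >= ratelimit_us


if __name__ == "__main__":
    # With the default of 1000us, a vCPU that has run for 200us is protected.
    print(may_preempt(200, 1000, True))   # False
    print(may_preempt(1000, 1000, True))  # True
    print(may_preempt(200, 0, True))      # True
```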

This feature can be disabled by setting the ratelimit to 0.

The value of context switching rate-limiting can be set either from the Xen command-line, using sched_ratelimit_us, or from the xl command-line:

# xl sched-credit -r [n]

Current values can be viewed with:

# xl sched-credit
Cpupool Pool-0: tslice=30ms ratelimit=1000us
Name                                ID Weight  Cap
Domain-0                             0    256    0

VM Scheduling Parameters

Each domain (including Domain0) is assigned a weight and a cap.

Weight

A domain with a weight of 512 will get twice as much CPU as a domain with a weight of 256 on a contended host. Legal weights range from 1 to 65535 and the default is 256.
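To make the proportionality concrete, the following sketch computes the share of CPU each domain would receive from its weight. It assumes a fully contended host where every domain is always runnable and no caps are set; the function name and domain names are hypothetical.

```python
def cpu_shares(weights: dict) -> dict:
    """Map each domain to its fraction of CPU time on a contended host.

    Assumes all domains are CPU-bound and uncapped (a simplification).
    """
    for name, w in weights.items():
        if not 1 <= w <= 65535:
            raise ValueError(f"weight for {name} out of range 1..65535: {w}")
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


if __name__ == "__main__":
    shares = cpu_shares({"web": 512, "batch": 256})
    for name, share in shares.items():
        print(f"{name}: {share:.1%}")
```

With weights 512 and 256, the first domain gets two thirds of the CPU time and the second gets one third, matching the 2:1 ratio described above.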

Cap

The cap, if set, fixes the maximum amount of CPU a domain will be able to consume, even if the host has idle CPU cycles. The cap is expressed as a percentage of one physical CPU: 100 is 1 physical CPU, 50 is half a CPU, 400 is 4 CPUs, etc. The default, 0, means there is no cap.

Interactions with Power Management

Many systems have features that scale down the computing power of a CPU that is not 100% utilized (or shut down hardware blocks of the CPU itself, as with Intel's C-states). This can happen in the operating system, in the hypervisor, or even below (e.g., in the BIOS). If you set a cap such that individual cores are running at less than 100%, this may have an impact on the performance of your workload over and above the impact of the cap.

For example, if your processor runs at 2GHz, and you cap a VM at 50%, the power management system may also reduce the clock speed to 1GHz; the effect will be that your VM gets 50% of a 1GHz CPU, which is 25% (not 50%!) of the 2GHz you expected. If you are not getting the performance you expect, look at the CPUfreq and/or C-state options in your operating system, hypervisor and BIOS.
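The arithmetic of that example can be written out as follows. The sketch assumes that delivered performance scales linearly with clock frequency, which is a simplification; the function name is hypothetical.

```python
def effective_cpu_fraction(cap_percent: float, nominal_ghz: float, actual_ghz: float) -> float:
    """Fraction of one full-speed physical CPU that a capped VM really gets.

    Assumes performance is proportional to clock frequency (a simplification).
    """
    return (cap_percent / 100.0) * (actual_ghz / nominal_ghz)


if __name__ == "__main__":
    # Cap of 50% on a 2GHz CPU that power management has scaled down to 1GHz.
    print(effective_cpu_fraction(50, 2.0, 1.0))  # 0.25
```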

Usage

The xl sched-credit command is used to tune a VM's scheduler parameters:

xl sched-credit -d [domain]
xl sched-credit -d [domain] -w [weight]
xl sched-credit -d [domain] -c [cap]

Technical Details

Algorithm

Each physical CPU manages a local run queue of runnable virtual CPUs. This queue is sorted by vCPU priority. A vCPU's priority can be one of two values, OVER or UNDER, representing whether this vCPU has or hasn't yet exceeded its fair share of CPU resource in the ongoing accounting period. When inserting a vCPU in a run queue, it is put after all other vCPUs of the same priority.
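The insertion rule can be sketched with a plain list standing in for the run queue. This is an illustrative model only: the tuple layout, constant values, and function name are assumptions, not the hypervisor's data structures.

```python
UNDER, OVER = 0, 1  # lower value sorts closer to the head of the queue


def runq_insert(runq: list, vcpu: tuple) -> None:
    """Insert (name, priority) after all queued vCPUs of the same priority."""
    _, new_prio = vcpu
    idx = len(runq)  # default: append at the tail
    for i, (_, prio) in enumerate(runq):
        if prio > new_prio:  # first entry with strictly lower precedence
            idx = i
            break
    runq.insert(idx, vcpu)


if __name__ == "__main__":
    q = []
    for v in [("a", OVER), ("b", UNDER), ("c", UNDER), ("d", OVER)]:
        runq_insert(q, v)
    print([name for name, _ in q])  # ['b', 'c', 'a', 'd']
```

UNDER vCPUs always sit ahead of OVER ones, and within a priority the queue is first-in, first-out.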

As a vCPU runs, it consumes credits. Every so often, a system-wide accounting thread recomputes how many credits each active VM has earned and bumps the credits. Negative credits imply a priority of OVER. Until a vCPU consumes its allotted credits, its priority is UNDER.
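A toy model of this credit cycle is shown below. It assumes one vCPU per VM and hands out a fixed, hypothetical number of credits per accounting period in proportion to weight; the constant, class, and function names are illustrative and not taken from Xen.

```python
from dataclasses import dataclass

CREDITS_PER_PERIOD = 300  # hypothetical total handed out per accounting period


@dataclass
class VCpu:
    name: str
    weight: int
    credits: int = 0

    @property
    def priority(self) -> str:
        # Negative credits mean the vCPU has exceeded its fair share.
        return "OVER" if self.credits < 0 else "UNDER"


def burn(vcpu: VCpu, amount: int) -> None:
    """Charge a vCPU for the CPU time it has consumed."""
    vcpu.credits -= amount


def account(vcpus: list, total: int = CREDITS_PER_PERIOD) -> None:
    """Periodic accounting: distribute credits in proportion to weight."""
    total_weight = sum(v.weight for v in vcpus)
    for v in vcpus:
        v.credits += total * v.weight // total_weight


if __name__ == "__main__":
    a, b = VCpu("a", 512), VCpu("b", 256)
    account([a, b])
    burn(b, 150)               # b runs past its allotment
    print(a.priority, b.priority)
    account([a, b])            # the next period tops b up again
    print(b.credits, b.priority)
```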

On each CPU, at every scheduling decision (when a vCPU blocks, yields, completes its time slice, or is woken up), the next vCPU to run is picked off the head of the run queue. Originally, there was no accounting done in this code path (for the sake of keeping it quick). However, this was found to be a security problem and a vector for Denial-of-Service attacks (see "Scheduler Vulnerabilities and Coordinated Attacks in Cloud Computing"). Therefore, since the commit "Accurate accounting for credit scheduler", accounting is done precisely, with nanosecond-granularity timestamps.

The Credit scheduler uses 30ms time slices for CPU allocation. A vCPU receives 30 ms before being preempted to run another vCPU. Once every 30ms, the priorities (credits) of all runnable VMs are recalculated.

SMP load balancing

The Credit scheduler automatically load balances guest vCPUs across all available physical CPUs on an SMP host. The administrator does not need to manually pin vCPUs to load balance the system. However, she can restrict which CPUs a particular vCPU may run on using the generic vcpu-pin interface.

When a CPU doesn't find a vCPU of priority UNDER on its local run queue, it will look on other CPUs for one. This helps ensure that each VM receives its fair share of CPU time. Before a CPU goes idle, it will look on other CPUs to find any runnable vCPU. This guarantees that the scheduler acts as a work-conserving one.
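The two rules above can be sketched as a single selection function. This is a simplified model under stated assumptions: run queues are plain lists of (name, priority) tuples already sorted with UNDER first, pinning and cache affinity are ignored, and the function name is hypothetical.

```python
def pick_next(local: list, remotes: list):
    """Pick the next vCPU for this pCPU, stealing from other pCPUs if needed.

    local: this pCPU's run queue; remotes: the other pCPUs' run queues.
    Returns a (name, priority) tuple, or None if the pCPU should go idle.
    """
    # 1. Prefer a local vCPU that has not used up its fair share.
    if local and local[0][1] == "UNDER":
        return local.pop(0)
    # 2. Otherwise look for an UNDER vCPU on another pCPU.
    for rq in remotes:
        for i, (_, prio) in enumerate(rq):
            if prio == "UNDER":
                return rq.pop(i)
    # 3. Fall back to local OVER work.
    if local:
        return local.pop(0)
    # 4. Before idling, take any runnable vCPU from elsewhere.
    for rq in remotes:
        if rq:
            return rq.pop(0)
    return None  # nothing runnable anywhere: go idle


if __name__ == "__main__":
    local = [("a", "OVER")]
    remotes = [[("b", "OVER")], [("c", "UNDER")]]
    print(pick_next(local, remotes))  # ('c', 'UNDER')
```

Step 4 is what makes the model work-conserving: a pCPU only idles when no run queue in the system holds a runnable vCPU.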