Credit2 Scheduler

From Xen

Introduction

Design Goals

Credit2 is a general purpose scheduler for Xen, designed with particular focus on fairness, responsiveness and scalability. In some more details, what this means is:

  • fairness: each VM should get its fair share of physical CPU capacity. At first approximation, the amount of CPU time each VM (each vCPU, actually) will receive depends on how many VMs (vCPUs) there are in the system. Total fairness is reached if everyone gets the same, which is also equal to the total capacity divided by the number of entities competing for it. VMs can be given scheduling parameters, though, and the absolute and relative (to the values of the same parameters for the other VMs) will affect the actual amount of CPU capacity that each VM receives.

Such fair share should depend on the VM's scheduling parameters, such as weight, cap, and reservation.

  • responsiveness: interactive and latency sensitive workloads inside VMs should work "well enough".
  • scalability: the overhead on large servers with lots of physical CPUs should be small.

In fact, from a more high level point of view, the end goal was (is!) to create a scheduler which can work well in use cases, such as:

  • server consolidation,
  • virtual desktop server solutions,
  • client virtualization.

Using Credit2

Credit2 is not in use by default. In order to use it as the Xen scheduler, sched=credit2 be passed to the hypervisor at boottime.

Once the system is live, for creating a cpupool with Credit2 as its scheduler, either fill-in a cpupool configuration file, as described in docs/man/xlcpupool.cfg.pod.5 (and as exemplified in tools/examples/cpupool), or use just xl directly:

~# xl cpupool-create name=\"pool1\" sched=\"credit2\" cpus=[1,2]
~# xl sched-credit2
Cpupool pool1: ratelimit=1000us

Two kind of interactions with the scheduler are possible:

  1. checking or changing the global parameters, via, e.g.:
    1. xl sched-credit2 -s
    2. xl sched-credit2 -s -p pool1
    3. xl sched-credit2 -s -r 100
  2. checking or changing a VM scheduling parameters, via, e.g.:
    1. xl sched-credit2 -d vm1
    2. xl sched-credit2 -d vm1 -w 1024

For instance, after 1.3 and 2.2, you would get:

~# xl sched-credit2 -s -p pool
Cpupool pool: ratelimit=100us
Name                                ID Weight
vm1                                  1   1024

Contributing

Thoughts, suggestions, questions and patches via email to xen-devel. See the list info page for subscription information and the archives.

If asking questions, please refer to Asking Developer Questions. If sending patches, please see Submitting Xen Patches.

For non-developers, even just testing your favorite workload, and observing and reporting any regressions, improvements, and so on, would be much appreciated. Remember that a scheduler's job is only hard when there's more work to do than there is CPU to do it, so make sure you're focusing on multiple competing workloads.

Status

Credit2 is (together with Credit) one of the two general purpose supported Xen schedulers. Some more details about the current status are:

  • it is considered stable and reliable, as it is being tested in our CI (OSSTest) since quite some time (although, not in all tests);
  • benchmarks have shown improved fairness, with respect to Credit, and better handling of latency sensitive workloads;
  • we're eagerly chasing performance numbers from all possible and diverse combinations of workloads.

Missing Features

  • CPU Topology:
    • Soft scheduling affinity: support for soft scheduling affinity (useful to implemente, e.g., NUMA aware scheduling). Patches are under review in the xen-devel mailing list.
  • Algorithm:
    • Caps: as it is available in Credit. Patches are under review in the xen-devel mailing list.
    • Reservation: new feature or Credit2, not available in Credit.

Open Issues

Here they are some open questions. Open problems that, although not critical, looks worth investigating:

  • How to deal well with mixed workload, e.g., situations where someone is listening to iTunes and there's a flash ad in background burning 100% of the CPU? Can we allow the audio to play well while still giving the VM only its fair share?
  • There exists a lot of CPU and/or IO throughput benchmarks, automatically outputting tons of numbers for one to look at. But as far as, e.g., audio and video are concerned, what are ways of automatically and programmatically measuring quality?

Milestones

  • 2016 2 November: Credit2 declared supported in (Xen 4.8)
  • 2010 24 December: Simple load balancer checked into xen-unstable; Credit2 can now handle multiple sockets.
  • 2010 December: Credit2 integrated into the OSSTest automated test suite
  • 2010 14 April: Credit2 checked into xen-unstable
  • Prototype Credit2 scheduler designed for latency-sensitive workloads and very large systems included in Xen 4.1

Internals

Icon Ambox.png This section of the page is a work-in-progress, both from the form and from the content point of views. It will be updated frequently, so come back regularly to check it out, if interested.


Xen Scheduling Framework

Xen comes with a really flexible and extensible scheduler interface. Basically, some code is generic enough to be common to all schedulers and scheduling algorithms.

The details of a particular scheduling algorithm can be implemented in a dedicated source file. The interaction between generic and specific algorithm code is mandated to happen by means of a well defined interface.

Generic scheduling code is in xen/common/schedule.c.

The interface available for the various schedulers, in the form of a set of functions pointers (often called hooks and/or callbacks), is declared at the bottom of xen/include/sched-if.h (look for struct scheduler). Credit2 implementation is to be found in xen/common/sched_credit2.c.

Runqueues

One of the most important data structures in a scheduler is the one used for representing the runqueue. In credit2, that is struct csched2_runqueue_data.

Runqueues to pCPUs mapping

In Credit2, a runqueue can be associated to a specific set of pCPUs during scheduler initialization, which is different, e.g., from what happens in Credit1, where there always be one runqueue for each pCPU.

Mapping from pCPUs and runqueues occurs at pCPU initialization time. In fact as a part of the pCPU bringup process, init_pcpu() is called, for every pCPU (except than for pCPU #0, which is dealt with in a special way, it being the so called "boot CPU"). This happens by means of the CPU notifiers mechanism, more specifically by means of the CPU_STARTING callback (see cpu_credit2_callback(), struct notifier_block cpu_credit2_nfb and csched2_global_init()). The fact that Credit2 needs CPU notifiers is, by the way, the reason why for this scheduler we need the `global_init` hook to be defined (to csched2_global_init()), while that is not the case nor Credit1, neither in any other scheduler.

At the time of writing this document, one runqueue per each [physical socket] the host has is created, and pCPUs are assigned to runqueues accordingly (i.e., pCPUs on socket 1 to runqueue 1, pCPUs of socket 2 to runqueue 2, etc.). It would be interesting to try changing this so that we can have, for instance, one runqueue per each [physical core].In fact, it has been proven that moving vCPUs around a socket too much, has negative effects on the system performance (that is why the "migration resistance" mechanism was introduced). Having one runqueue per core would avoid this, and there are good chances that it will improve substantially the performances of the scheduler in [Hyperthreaded] systems.

A further extension to this would be to make it the pCPUs to runqueue mapping really dynamic, e.g., making it possible for it to be decided at boot time (with a Xen boot parameter) and/or when a Credit2 cpupool is created. We can support a bunch of possible configuration, i.e., something like this:

  • `opt_credit2_runquque=pcpu`: we create one runqueue per-each pCPU. That means, on hyperthreaded system, one runqueue per-each hyperthread, as it happens for Credit1;
  • `opt_credit2_runqueue=core`: we create one runqueue per each physical core, as described above;
  • `opt_credit2_runqueue=socket`: we create one runqueue per each physical socket, as it is right now in the code;
  • `opt_credit2_runqueue=node`: we create one runqueue per each NUMA node;
  • `opt_credit2_runqueue=host`: we create only one runqueue, accommodating all the pCPUs of the host.

This will give users the chance to try and pick up the best configuration for their workloads. It is very likely that each of the above configuration will need a different value of the migration resistance threshold (unless `opt_migrate_resist` is specified explicitly at boot time). Experiments must be run to determine what meaningful values are in each case.

Serialization (a.k.a. Locking)

Runqueue Locks

Modifications to the runqueues must be serialized, i.e., protected from concurrent access from multiple parties (typically Xen code running on multiple pCPUs at the same time). Not doing so, leaves chances open for the runqueue data structures to enter inconsistent states. In credit2 (similarly to what happens in all the other Xen schedulers) runqueue locking is per-runqueue. In fact, each runqueue has a `spinlock_t lock` field, and such a lock must be held before starting manipulating any other field of `struct csched2_runqueue_data`.

Taking a runqueue lock happens via calls like these:

  • `lock = vcpu_schedule_lock_irq(vc);`
  • `old_lock = pcpu_schedule_lock(cpu);`
  • `lock = spin_lock(per_cpu(schedule_data, cpu).schedule_lock);`

To see how functions like `vcpu_schedule_lock()` and `pcpu_schedule_lock()` (and their `_irq`, `_irqsave`, etc., variants) are defined, look in `xen/include/xen/sched-if.h`. The doc comment above `struct schedule_data` does a pretty good job at explaining how things work. In credit2, the `schedule_lock` pointer from `struct schedule_data` is actually remapped (as explained in the said doc comment). The re-mapping happens at pCPU initialization time, withing credit2 code, i.e., in `init_pcpu()`, together with all the other operations necessary to map a runqueue to a set of pCPUs.

Calling to the above functions mainly happens from:

  • `xen/common/schedule.c`,
  • `xen/common/sched_credit2.c` (at least as far as credit2 is concerned);

this mainly mean that, even if one does not see any such call around in `sched_credit2.c`, the runqueue lock may have been taken already by the caller, which may be some function in `schedule.c`.

Scheduler Lock

Credit2 makes use of a scheduler-wide data structure, `struct csched2_private`, for accommodating runqueue independent information. That includes being able to tell what runqueues are active (which means they have at least one pCPUs associated to them), and about the mapping of pCPUs to runqueues. It also has fields used for load tracking and a list of domain, for debugging purposes.

Serialization is required here too, and that happens by means of a dedicated spinlock variable, embedded in the structure itself. This lock is often referred to as (both in this document and in the code) the "global scheduler lock" or the "private scheduler lock".

In order to avoid deadlock situations, if both it is necessary to take and hold this scheduler-wide lock and one (or more) runqueue locks, the private scheduler lock must be acquired *fist*. In other words, the runqueue locks _nests_ inside the private scheduler lock. In fact, as this is how things happens already in a few places in credit2 code, inverting the order of the operations even just in one spot leaves room to introducing a deadlock.

vCPUs and Domains

For a Xen scheduler, the actual schedulable entities are the various domains' vCPUs. In Xen, each domain is associated with a `struct domain`, and for each each vCPU there is a `struct vcpu`. Credit2 maintains its own data structures for both domains and vCPUs, which are `struct csched2_dom` and `struct csched2_vcpu`. These hold per-domain and per-vCPU information which are relevant to the Credit2 scheduler. They are not completely independent from their broader counterpart, and in fact there are up-pointers linking a `struct csched2_dom` to a `struct domain` (`csched2_dom.dom`), and a `struct csched2_vcpu` to a `struct vcpu` (`csched2_vcpu.vcpu`).

csched2_vcpu

Algorithm-wise, the most important fields in `struct csched2_vcpu` are `credit`, `start_time` and `flags`. In fact, `start_time` and `credit` are used to figure out how much time a vCPU is allowed to stay on a pCPU, `flags` is required for tracking the status of the vCPU itself.

The `weight` and `residual` fields are also quite important to that effect, actually.

The various fields of `struct list_head` type are what allows to *queue* the vCPU in various places. In details:

  • `rqd_elem` is used to queue the vCPU in the list of vCPU "associated" to a runqueue (e.g., for load tracking purposes), maintained in `csched2_runqueue_data.svc`;
  • `sdom_elem` is used to queue the vCPU in the list of vCPUs of a domain, maintained in `csched2_dom.vcpu`;
  • `runq_elem` is used to queue the vCPU in the actual runqueue, maintained in `csched2_runqueue_data.runq`.

The `rqd` pointer is what allows us, at any given time, to reach out to the runqueue where a vCPU belongs, should we need it (and we indeed do!).

What remains, is useful for tracking the load present on each runqueue.

csched2_dom

`weight` is the only relevant bit, for the Credit2 algorithm. The `list_head`-s are:

  • `vcpu` is the list of vCPUs of the domain;
  • `sdom_elem` is used to queue the domain in the list of all domain an instance of Credit2 is handling.

Scheduler Management Operations

A scheduler needs to be initialized. This usually happens at boot time, inside `scheduler_init()` in `schedule.c` (look for the call to `SCHED_OP(&ops, init)`). It is also possible to de-initialize a scheduler, as well as initialize it at a different time than boot time. That is required to support cpupools (if interesting, look at `scheduler_alloc()` and `scheduler_free()` still in `schedule.c`).

Credit2's init and deinit functions are `csched2_init()` and `csched2_deinit()`. Nothing fancy about them, though, they just allocate (and initialize to a known state) and free, respectively, all the relevant data structure.

Credit2 is special, with respect to the other schedulers, in that it needs a "global" initialization step (the reason why that is necessary has been explained already). That is achieved by means of `csched2_global_init()`, called via the `global_init` scheduler hook, and that happens only at boot time, if Credit2 is selected as the default Xen scheduler.

pCPU Management Operation

In Xen, a scheduler's job is to decide what vCPU(s) should run on what pCPU(s), for how long, etc. In order to do that, the scheduler needs information about both vCPUs and pCPUs. For instance, in Credit1, where there is one runqueue per each pCPU, it exists an internal scheduler data structure, called `csched_pcpu`, to hold such information. Credit2 doesn't do that, and what it focuses on are runqueues. This means, for Credit2, "initializing" a pCPU means properly account for it when building/modifying the scheduler's runqueues.

In principle, it works as follows. For each pCPU the scheduler is entitled to handle:

  • check whether it has been initialized already;
  • if not, check on what runqueue such pCPU should belong;
  • put it there. If it was the first pCPU of that runqueue, mark the runqueue as active;
  • (re-)arrange runqueue spinlock mappings so that `pcpu_schedule_lock()` works as expected.

This is exactly what happens in `init_pcpu()`, in `sched_credit2.c`. Wether or not a pCPU is initialized already, is tracked in the `csched2_private.initialized` cpumap. pCPU to runqueue assignment is led by `cpu_to_socket()`, which is what gives us (bugs aside, see http://bugs.xenproject.org/xen/bug/36) the "one runqueue per socket" setup, as the code stands right now. Runqueue activation, if necessary, is done by `activate_runqueue()`, which really just inits the proper element of the array (of `csched2_runqueue_data` type) `csched2_private.rqd`. Note that `csched2_private.active_queues`, despite being of `cpumask_t` type, is actually a map of *runqueues*, i.e., the _i-eth_ bit is set iff runqueue _i_ has been activated. (That may look a bit misleading at first, but it certainly was not worth defining a new data type only for this!) Finally, lock remapping is done by assigning `per_cpu(schedule_data, cpu).schedule_lock`.

After `init_pcpu(i)` has run, `runq_map[i]`, in the `csched2_private` instance for the scheduler, contains the ID of the runqueue where pCPU _i_ has been put (say _r_), and the _i-eth_ bit of `rqd[r].active` and `rqd[r].idle` are set, to signify that pCPU _i_ "insists" on runqueue _r_ and is idle, i.e., no vCPU is scheduled on it.

As anticipated before, the way this `init_pcpu()` function is called is a bit more involved than how one may expect. In Xen, as soon as a pCPU is ready to be considered by the scheduler `cpu_schedule_up()` is called, via the CPU notifiers mechanism (check `cpu_schedule_callback()` in `xen/common/schedule.c` and, for the notifiers mechanism, in `xen/common/cpu.c`). `cpu_schedule_up()` calls the `alloc_pdata` scheduler hook which, in all Xen schedulers but Credit2, does the scheduler related pCPU initialization directly. In Credit2, the implementation of the `alloc_pdata` hook, i.e., `csched2_alloc_pdata()` only does that for pCPU #0. For all the other pCPUs, a Credit2 specific notifier is used: `csched2_cpu_starting()`. That is called by `cpu_credit2_callback()`, in case the pCPU is being actually started (`CPU_STARTING` as an action). The notifier itself is registered during Credit2 global initialization, in `csched2_global_init()`.

The reason for all this is that we want some of the pCPU topology information --such as what is returned by `cpu_to_socket()`, that we want to use to setup the runqueues-- to be there already, when looking at the pCPUs from Credit2 code, and that would not be the case without this one more level of indirection. pCPU #0 is handled differently because, since that is the pCPU where Xen itself runs during boot time (e.g. from where the other pCPUs are configured and their notifiers registered), it does not get any `CPU_STARTING` notifier callback, which would mean it would remain uninitialized, if it were not special cased!

`csched2_free_pdata()` is a lot more boring than its alloc counterpart. It really just reset things and clear bits, undoing what `csched2_alloc_pdata` did. The only thing that is worth mentioning is that the spinlock remapping for the pCPU being deallocated from a Credit2 instance is also undone. If the pCPU was the last one on a particular runqueue, the runqueue itself is deactivated and made unusable.

vCPU and Domain Management Operations

vCPU and Domain Allocation

In order to be scheduled, vCPUs need to be allocated and initialized. Also, since domains come and go, vCPUs will, at some point, disappear, and hence the data structure related to them needs to be deallocated.

This is the scope of the following functions:

  • `csched2_dom_init()`
  • `csched2_alloc_vdata()`
  • `csched2_dom_destroy()`
  • `csched2_free_vdata()`

All the first twos do is calling `csched2_alloc_domdata()` and `csched2_free_domdata()`, respectively. Similarly to what happens for scheduler initialization, there is nothing fancy going on in `csched2_[alloc|free]_domdata()`: all the field of `struct cched2_dom` that are dynamic are allocated and everything is set to a known state. The scheduling weight of the domain is set to `CSCHED2_DEFAULT_WEIGHT`.

Finally, the new domain (`sdom`) is queued among the domains managed by the specific instance of the Credit2 scheduler. Such list is available via `SCHED2_PRIV(ops)->sdom`, and `sdom->sdom_elem` is used to put the domain there. Note that this operation occurs while holding the global scheduler lock.

`cshced2_dom_init()` and `csched2_dom_destroy()` are called from `schedule.c`, via their scheduler hooks, `init_domain` and `destroy_domain` (look for `sched_init_domain()` and `sched_destroy_domain()` in `schedule.c`).

`csched2_alloc_domdata()` and `csched2_free_domdata()`, also have their own hooks, and in fact they are called from `schedule.c()`, inside `sched_move_domain()`, and this is again because of cpupools.

`csched2_alloc_vdata()` (and free) also performs just a bunch of pretty trivial allocation and initialization of embedded list queuing elements and up pointers. The scheduling weight for the vCPU is set to the scheduling weight of the domain it belongs to. The vCPU contribution to system load, is initialized to 50%. The "natural" caller of `csched2_alloc_vdata()`, via the `alloc_vdata` scheduling hook, is `sched_init_vcpu()`, in `schedule.c`, as the last step of the more general vCPU initialization procedure. Other occurrences are due to cpupools.

Inserting vCPUs on Runqueue for First Time

After the scheduler is aware a vCPU exists, having allocated a data structure for it, it needs to "link" it to a runqueue, e.g., for load tracking purposes. Note that, in Credit2, this is not what makes the vCPU actually runnable and able to be picked up and executed on a pCPU yet, it is really just a (changeable) association between a vCPU and a runqueue.

This is the purpose of `csched2_vcpu_insert()`, called via the `insert_vcpu` scheduler hook, called, (leaving cpupools aside) from `sched_init_vcpu()` (in `schedule.c`), right after `alloc_vdata`. This does 3 things:

  • sets the vCPU's `rqd` pointer to point to the runqueue where the vCPU is being inserted;
  • add the vCPU to the list of vCPUs of its own domain (and update the `nr_vcpus` counter accordingly);
  • calls `runq_assign()`. `runq_assign()` (by calling `__runq_assign()`) adds the vCPU to the list of vCPUs that will live in the runqueue, and tries to update the maximum weight, if necessary, and the average load.

As one can easily imagine, `csched2_vcpu_remove()` just undo all that have been described above (by calling `runq_deassign()`).

Scheduling Operations

vCPUs Going to Sleep

If a vCPU is running and, wants or has to (for reasons independent from the scheduler itself) stop doing so, `csched2_vcpu_sleep()` is called. As it can be seen in `vcpu_sleep_nosync()` --the sole caller of the `sleep` scheduler hook-- when `csched2_vcpu_sleep()` is entered, the proper runqueue lock has been already taken!

Taking care of a vCPU going to sleep is, luckily enough, pretty easy. Basically, all we want is to make sure that some other vCPU get a chance to be picked and executed on the pCPU where the vCPU was running. To do that, we check whether the vCPU in question was the `curr` vCPU on its pCPU (`v->processor`) and, if it was, we send a remote request to go through the scheduler to such pCPU (via `cpu_raise_softirq()`).

OTOH, if the vCPU was not running, we just remove it from the runqueue, and update the runqueue load accordingly. Removing from runqueue happens via `__runq_remove()`, which is basically just a wrapper to a `list_del_init()`, i.e., the function from removing an element from a list.

Finally, if we find the `__CSFLAG_delayed_runq_add` bit set in the vCPU flags, it means that the vCPU woke up, but could not be put back on any runqueue, and so such requeuing operation is pending, and will be performed later. However, since the vCPU is going to sleep again, no point in putting it back on a runqueue, so we really want to clear that flag.

vCPU Waking Up

If a vCPU comes back from being inactive to wanting to run again, the scheduler has to take some action. Credit2 does so in `csched2_vcpu_wake()`, the sole caller of which (via the `wake` scheduler hook) is `vcpu_wake()`, in `schedule.c`.

First of all, there are chances that, when `csched2_vcpu_wake()` is called, the vCPU it is called on may still be dealing with the tail of a scheduling operation. This needs to be checked, by testing the presence of the `__CSFLAG_scheduled` flag and, if that is the case, we can't do anything and we should ask someone else to take care of the re-queuing. We do this by raising the `__CSFLAG_delayed_runq_add`.

Again, quite simple, we put the vCPU back on the queue with `runq_insert()` (which calls `__runq_insert()`), and we go seeing whether the pCPUs associated to the runqueue where we're putting it need to go through the scheduler again.

Inserting (`__runq_insert()`) is a bit more complex than just calling `list_add_tail()`, and involves scanning at least part of the runqueue, because in Credit2 the runqueues are sorted by credits.

(Possibly) Requesting rescheduling is also more complex than in the sleep case. In fact, this time we want to check which one pCPU we better request to go through scheduling. In both Credit1 and Credit2, this operation is called "tickling". Once tickled, a pCPU will pick up the newly woken task from the runqueue by itself, and as soon as it can.

The best candidate for running the waking up task, is the pCPU where it was executing when he went to sleep (which is still in its `struct vcpu.processor` field). And this is exactly what `runq_tickle()` does first, look whether the vCPU running on the old pCPU of the waking vCPU has less credit, and if yes, we're done. A reason for doing this is that, if the sleep has been reasonably small, the waking vCPU may well find some of the caches still hot with some of its data.

Next attempt is seeing if there are pCPUs that are idle, because if yes, we can tickle one of them! For making this easy, Credit2 maintains a map of pCPUs that are currently idle in `csched2_runqueue_data.idle`. However, what about idle pCPUs that may have been tickled already? Tickling a pCPU more than once, before it has a chance to actually go through the scheduler again, is useless, and we risk leaving one or more vCPUs waiting in their runqueue! For this reason, Credit2 maintains, in `csched2_runqueue_data.tickled` a map of those pCPUs that have been tickled and have not yet rescheduled. So, what we really want is to tickle a pCPU which is part of the intersection between the set of idle pCPUs, and the set of pCPUs that have not been tickled... And that is exactly what is attempted next in `runq_tickle()`.

Last stand, we want to try tickling a pCPU which is not the pCPU where the waking up task was running, and that is not idle. Among all of these, we really want the one which is running the vCPU with lowest credits. That is what happens in the loop, in the second half of `runq_tickle()`.

Notice that, in both places where comparing credits is necessary, we call `burn_credits()` before actually performing the comparison. This is necessary because we can't know when the last credit accounting (i.e., burning) event happened for that particular vCPU, and so we need to do it ourselves, in order for the comparison with the amount of credit of the waking up task to be fair.

Finally, to avoid moving vCPU around too light-heartedly, we ask for the newly waking task to beat the vCPU running on the pCPU we picked by more than `CSCHED2_MIGRATE_RESIST` credits. This value defaults to 500 microseconds, but can be changed at boot time using the `migrate_resist` Xen command line argument.

Picking the Best pCPU for a vCPU

There are cases when a vCPU needs to be pushed away from a pCPU as soon as possible. Look, for instance, at `cpu_disable_scheduler()` or `vcpu_set_affinity()`, in `schedule.c`. In both these situations, although for different reasons, to (potentially, in the `vcpu_set_affinity()` case) move a bunch of vCPUs away from where they are running. And where should we put them? Well, we should ask the scheduler! That is actually what happens in `vcpu_migrate()`, when the `pick_cpu` scheduler hook is invoked.

In Credit2, that reflects to a call to `choose_cpu()`. In that function, after trying to honour an explicit request of migrating the vCPU somewhere specific (via `__CSFLAG_runq_migrate_request`), we try to find the best possible runqueue, taking instantaneous load into account.

Icon todo.png To Do:

put down more about this.


Scheduling

The real core of the scheduler: `csched2_schedule()`. Called via the `do_schedule` hook which, for all Xen schedulers, is called only from `schedule()`, in `schedule.c`. On its turn, `schedule()` is called under very specific circumstances. This mainly happens when `SCHEDULE_SOFTIRQ` is raised, and that can be explicitly requested via calls like `cpu_raise_softirq(ipid, SCHEDULE_SOFTIRQ);` (like it happens in the tickling mechanism), or be triggered by the firing of timers, or other (few) events.

`schedule()` is the function that has the responsibility of deciding what vCPU should run next. Well, actually, `schedule()` really only takes care of some generic stuff, like handling tasklets and timers, but it actually relies on whatever the `do_schedule` hook maps to, for actual scheduling decisions. One thing that `schedule()` actually does, before invoking `do_schedule`, is taking the runqueue lock (for the runqueue of the pCPU that `schedule()` itself is invoked upon), via `pcpu_schedule_lock_irq(cpu)`, so such lock is already taken while inside `csched2_schedule()`.

So, about `csched2_schedule()`. First thing: accounting. We want to call `burn_credits()` on the currently running vCPU, so that its amount of credit is updated, taking into account the time it executed since the last scheduling event/accounting instant.

Tasklet: the `tasklet_work_scheduled` "flag" comes directly from `schedule.c`'s `schedule()`. If it is set, it is wise to give the pending tasklet work a chance to run, since tasklets usually perform rather urgent tasks, which needs to execute as soon as possible. That happens by just selecting the idle vCPU as the next vCPU to be executed on the pCPU, and it will take care of everything.

At this point, what vCPU should run must be decided. To do so, we traverse the runqueue and check whether we find a vCPU there with more credit than the currently running task in it. If we find one, then that is the vCPU we want to run next. If we don't, and the currently running vCPU is still runnable, then we want to continue running it. If neither of the above is the case, the pCPU will just go idle. In case a vCPU with more credit than current is found, but on another pCPU (among the ones associated to the same runqueue, of course!), migration resistance should be considered. This logic is implemented in `runq_candidate()`.

At this point, what remains is:

  • remove the vCPU that is leaving the vCPU (if that is actually happening) from the runqueue, with `__runq_remove()`, and mark the vCPU that is entering the pCPU as `__CSFLAG_scheduled`, to avoid races with sleep and wakeup code paths;
  • handle reset condition, if necessary (i.e., if the amount of credits of the vCPU to be scheduled next is below the `CSCHED2_CREDIT_RESET` threshold);
  • update the map of idle pCPUs, if necessary;
  • deal with migration compensation, if the next to be executed vCPU actually comes from a different pCPU.
Icon todo.png To Do:

More about this


Credit Burning and Accounting

Icon todo.png To Do:

put down something about this


Load Tracking and Load Balancing

Icon todo.png To Do:

put down something about this


Changing Scheduling Parameters

In Credit2, it is possible to dynamically change the scheduling parameters of a domain during runtime. That basically means chaging the weight associated to the domain itself, which in its turn reflects in changing the weight of each vCPU of such domain. `csched2_dom_cntl()`, called via the `adjust` scheduler hook, takes care of that.

The interesting part of all this is that, when updating a weight (i.e., the `XEN_DOMCTL_SCHEDOP_putinfo` branch of the `if` in `csched2_dom_cntl()`) we need to loop across all the vCPUs of the domain and call `update_max_weight()` for each of them.

This is because, for each runqueue, we keep track of the vCPU with the highest way, and if a weight changes, that may require changing accordingly, which is exactly what `update_max_weight()` does. Note that, while doing such operation, we really try as hard as possible to avoid traversing all the runqueue, which may be quite costly.

One interesting thing to notice is locking. for manipulating domains, we need the global scheduler lock. However, since, if changing the parameters, we also need to manipulate one or more runqueue, the runqueue lock(s) (for the runqueue where each vCPU lives) is necessary too. Both the locks are taken within the function, and the order is the global lock fist, and the runqueue lock afterwards, as described in the proper subsection above.

Dumping Status and Params

Very quickly, `csched2_dump()` and `csched2_dump_pcpu()` are the Credit2 implementation of the scheduler hooks invoked to print on Xen's console some information. This is used mostly for developing and debugging purposes, and the functions are just a collection of `printk()` calls.

`csched2_dump()` prints information about the whole scheduler, while `csched2_dump_pcpu()` dumps data related to a specific pCPU. Both require dumping information about a set of vCPUs, which is achieved by means of the `csched2_dump_vcpu()` internal function.

Both are called, via their respective hooks, `schedule_dump()`, in `schedule.c`, which is called within `dump_runq()`, which is the keyhandler for the 'r' debug key (look for `dump_runq_keyhandler` in `xen/common/keyhandler.c`). For more on using a serial console and debugging Xen through it, see http://wiki.xen.org/wiki/Xen_Serial_Console.

Tracing

Tracing is something really important, when doing scheduling development. In fact, there is pretty much no other way to figure out what went on, while running a particular workload, without disturbing the workload itself. Of course, tracing has some overhead too, but it's a lot lighter even than just adding a few `printk()` around.

Tracing in Xen happens with a tool called `xentrace`, and another tool, called `xenalize` is usually a lot helpful too. The events related to Credit2 which it is possible to intercept and trace are defined in `sched_credit2.c`, i.e., they are the ones starting with `TRC_CSCHED2_*`. For an example on how to use them check, for instance, the `runq_insert()` function. The relevant part is the call to `trace_var()`, and the few lines above that (for preparing and formatting the data).

For a few more information and some examples of using `xentrace` and `xenalize`, see this blog post https://blog.xenproject.org/2012/09/27/tracing-with-xentrace-and-xenalyze/.

References

  • XenSummit 2015: "Scheduling in Xen: Past, Present and Future" (slides, video)
  • mailing list discussion relevant to Credit2 design and characteristics: here,
  • XenSummit Asia 2009: a brief summary of the design target, goals, interface, credit design, and load balancing design. (text, slides, video)
  • "In Perfect Xen, a Performance Study of the Emerging Xen Scheduler" (Dec 2013) text