Virtualization Spectrum

Note: A new article, Understanding the Virtualization Spectrum, summarizes this article and incorporates a great article by Brendan Gregg which cleans up the terminology. The article on this page has more detail, but the new one is more focused on explaining the general concepts behind the different modes.


At XenSummit 2012 in San Diego, Mukesh Rathor from Oracle presented his work on a new virtualization mode, called “PVH”. With this mode added, there is now a rather dizzying array of different terms thrown about: “HVM”, “PV”, “PVHVM”, “PVH”. What do they all mean? And why do we have so many?

The reason we have all these terms is that virtualization is no longer binary; there is a spectrum of virtualization, and the different terms are different points along the spectrum. Part of the reason the terminology is a little unclear is historical: terminology evolves over time in response to changing circumstances, and changing it after the fact, once certain usages have become common, is difficult.

So in this article I will introduce just enough history to understand how the current situation came about, and (hopefully) introduce a consistent set of terminology which may help clear things up, while balancing this against the fact that people will still continue to use existing terminology.

This article will give a general introduction to virtualization, and to paravirtualization, Xen’s unique contribution to the field, as well as the advent of hardware virtualization extensions (HVM). It will also introduce the idea of adding paravirtualized drivers for disk and network, and cover the motivation and technical descriptions of two more modes which further mix elements of full virtualization and paravirtualization.

Full virtualization

In the early days of virtualization (at least in the x86 world), the assumption was that you needed your hypervisor to provide a virtual machine that was functionally nearly identical to a real machine. This included the following aspects:

  • Disk and network devices
  • Interrupts and timers
  • Emulated platform: motherboard, device buses, BIOS
  • “Legacy” boot: i.e., starting in 16-bit mode and bootstrapping up to 64-bit mode
  • Privileged instructions
  • Pagetables (memory access)

At that time, all of this needed to be virtualized: disk and network devices needed to be emulated, as did interrupts and timers, the motherboard and PCI buses, and so on. Guests needed to start in 16-bit mode and run a BIOS which loaded the guest kernel, which (again) ran in 16-bit mode, then bootstrapped its way up to 32-bit mode, and possibly then to 64-bit mode. All privileged instructions executed by the guest kernel needed to be emulated somehow, and the pagetables needed to be emulated in software.

This mode, in which all aspects of the virtual machine must be functionally identical to real hardware, is what I will call fully virtualized mode.

Xen and paravirtualization

Unfortunately, particularly for x86, virtualizing privileged instructions is very complicated. Many instructions for x86 behave differently in kernel and user mode without generating a trap, meaning that your options for running kernel code were to do full software emulation (incredibly slow) or binary translation (incredibly complicated, and still very slow).

The key question of the original Xen research project at Cambridge University was, “What if instead of trying to fool the guest kernel into thinking it’s running on real hardware, you just let the guest know that it was running in a virtual machine, and changed the interface you provide to make it easier to implement?” To answer that question, they designed a new interface for virtualization from the ground up. Working together with researchers at both the Intel and Microsoft labs, they took both Linux and Windows XP, and ripped out anything that was slow or difficult to virtualize, replacing it with calls into the hypervisor (hypercalls) or other virtualization-friendly techniques. (The Windows XP port to Xen 1.0, as you might imagine, never left Microsoft Research; but it was benchmarked in the original paper.)

The result was impressive — by getting rid of all the difficult legacy interfaces, they were able to make a fast, very lightweight hypervisor in under 70,000 lines of code.

This technique of changing the interface to make it easy to virtualize they called paravirtualization. In a paravirtualized VM, guests run with fully paravirtualized disk and network interfaces; interrupts and timers are paravirtualized; there is no emulated motherboard or device bus; guests boot directly into the kernel in the mode the kernel wishes to run in (32-bit or 64-bit), without needing to start in 16-bit mode or go through a BIOS; all privileged instructions are replaced with paravirtualized equivalents (hypercalls); and access to the page tables is paravirtualized as well.

Xen and full virtualization

In early versions of Xen, paravirtualization was the only mode available. Although Windows XP had been ported to the Xen platform, it was pretty clear that such a port was never going to see the light of day outside Microsoft Research. This meant, essentially, that only open-source operating systems were going to be able to run on Xen.

At the same time the Xen team was coming up with paravirtualization, the engineers at Intel and AMD were working to try to make full virtualization easier. The result was something we now call HVM — which stands for “hardware virtual machine”. Rather than needing to do software emulation or binary translation, the HVM extensions do what might be called “hardware emulation”.

Technically speaking, HVM refers to a set of extensions that make it much simpler to virtualize one component: the processor. To run a fully virtualized guest, many other components still need to be virtualized. To accomplish this, the Xen project integrated qemu to emulate disk, network, motherboard, and PCI devices; wrote the shadow code, to virtualize the pagetables; wrote emulated interrupt controllers in Xen; and integrated ROMBIOS to provide a virtual BIOS to the guest.

Even though the HVM extensions are only one component of making a fully virtualized VM, the “fully virtualized” mode in the hypervisor was called HVM mode, distinguishing it from PV mode. This usage spread throughout the toolstack and into the user interface; to this day, users generally speak of running a VM in “PV mode” or in “HVM mode”.

Once you have a fully-virtualized system, the first thing you notice is that the interface you provide for network and disks — that is, emulating a full PCI card with MMIO registers and so on — is really unnecessarily complicated. Because nearly all modern kernels have ways to load third-party device drivers, it’s a fairly obvious step to create disk and network drivers that can use the paravirtualized interfaces. Running in this mode can be called fully virtualized with PV drivers.

From Poles to a Spectrum

I have introduced the concepts of full virtualization and paravirtualization (PV), as well as the hardware virtualization (HVM) feature used by Xen (among other things) to implement full virtualization. I have also introduced the concept of installing paravirtualized drivers on a fully virtualized system.

This small step, from full virtualization towards paravirtualization, begins to hint at the idea of a spectrum of paravirtualization. I will continue with the historical reasons for the development of PVHVM, and finally of the newest mode, PVH.

Problems with paravirtualization: AMD and x86-64

It comes as a surprise to many people that while 32-bit paravirtualized guests in Xen are faster than 32-bit fully virtualized guests, 64-bit paravirtualized guests can sometimes be slower than 64-bit fully virtualized guests. This is due to some changes AMD made when designing the 64-bit x86 architecture, which simplified things for them but made things more difficult for Xen.

Most modern operating systems need just two levels of protection: user mode and kernel mode. Kernel mode memory is protected from user mode via the pagetable “supervisor mode” bit.

When running a virtual machine, you need at least three levels of protection: user mode, guest kernel, and hypervisor. The hypervisor memory needs to be protected from the guest kernel, and the guest kernel memory needs to be protected from user mode. The pagetable protections only provide two levels of protection, so Xen uses another processor feature, called a segmentation limit, to provide the third level of protection. Segmentation limits are a processor feature that was in common use before paging was available. But since paging has been available, segmentation limits have basically not been used; so Xen was able to commandeer them to provide the extra level of necessary protection. The pagetable protections protect both the guest kernel and Xen from userspace; the segmentation limits protect Xen from the guest kernel.
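The layering described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the address boundaries, the `Context` type, and the `allowed` function are hypothetical names chosen for illustration, and the real x86 segmentation and paging checks are considerably more involved.

```python
from dataclasses import dataclass

# Hypothetical layout of a 32-bit address space, low to high:
#   [0, KERNEL_BASE)         guest user space
#   [KERNEL_BASE, XEN_BASE)  guest kernel (pages marked supervisor-only)
#   [XEN_BASE, TOP)          Xen (pages marked supervisor-only)
KERNEL_BASE = 0xC000_0000
XEN_BASE = 0xFC00_0000
TOP = 0x1_0000_0000


@dataclass(frozen=True)
class Context:
    name: str
    supervisor: bool     # passes the pagetable "supervisor mode" check?
    segment_limit: int   # first address NOT reachable through its segments


# In this model the guest kernel passes the pagetable check, but its
# segments are truncated so that they end just below Xen.
USER = Context("guest user", supervisor=False, segment_limit=XEN_BASE)
GUEST_KERNEL = Context("guest kernel", supervisor=True, segment_limit=XEN_BASE)
XEN = Context("xen", supervisor=True, segment_limit=TOP)


def page_is_supervisor_only(addr: int) -> bool:
    """Guest kernel and Xen pages carry the supervisor bit in this model."""
    return addr >= KERNEL_BASE


def allowed(ctx: Context, addr: int) -> bool:
    """Apply the segment-limit check, then the pagetable check."""
    if addr >= ctx.segment_limit:
        return False  # segmentation limit protects Xen from the guest kernel
    if page_is_supervisor_only(addr) and not ctx.supervisor:
        return False  # pagetable bit protects kernel and Xen from user mode
    return True
```

In this model, `allowed(GUEST_KERNEL, XEN_BASE)` is rejected by the segment limit alone, which is the check that disappears once segmentation limits are no longer available.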

Unfortunately, at the time that the Xen team was developing clever new uses for this little-used feature, AMD was designing their 64-bit extensions to the x86 architecture. Every processor feature, even a rarely used one, makes hardware more complicated to design, reason about, and verify. Since basically no operating systems used segmentation limits, AMD decided to get rid of them in 64-bit mode.

This may have greatly simplified the architecture for AMD, but it made it impossible for Xen to squeeze three levels of protection into the same address space. Instead, for 64-bit PV guests, both guest kernel and guest user-space need to run in ring 3, each with their own address space. Every time a guest process needs to make a system call, it has to bounce up into Xen, which will context-switch to the guest kernel. This not only takes more time for each system call, but requires flushing one of the key CPU caches, the translation lookaside buffer (TLB). Frequent flushing of the TLB causes all execution to run more slowly for some time afterwards, as the TLB is filled up again.
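The difference in system call paths can be made concrete with a small model. This is a sketch rather than a measurement: the step lists and the `cost` helper are hypothetical, and the assumption that the return to user mode also passes through Xen is mine, not stated in the text above.

```python
# Each step is (from, to, switches_address_space).
# Model of a system call in a 64-bit PV guest: kernel and user space both
# run in ring 3 with separate address spaces, so Xen mediates each way.
PV64_SYSCALL = [
    ("guest user", "xen", False),    # the system call lands in Xen
    ("xen", "guest kernel", True),   # Xen switches to the kernel's address space
    ("guest kernel", "xen", False),  # kernel asks Xen to return to user mode
    ("xen", "guest user", True),     # Xen switches back to the process
]

# Model of the same system call in a 64-bit HVM guest: the hardware
# extensions let the guest kernel handle it directly, as on real hardware.
HVM64_SYSCALL = [
    ("guest user", "guest kernel", False),
    ("guest kernel", "guest user", False),
]


def cost(path):
    """Count privilege transitions and address-space switches in a path."""
    return {
        "transitions": len(path),
        "address_space_switches": sum(1 for _, _, switch in path if switch),
    }
```

Each address-space switch in the PV path is where the TLB flush described above would occur; the HVM path has none.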

In 64-bit HVM mode, the problem doesn’t occur. The HVM extensions make it easy to have three different protection levels without needing to play clever tricks with little-used processor features. So making system calls in 64-bit HVM mode is just as fast as on real hardware. For this reason, a lot of people began running 64-bit Linux in fully virtualized mode.

Paravirtualizing little by little: PVHVM mode

But fully virtualized mode, even with PV drivers, has a number of things that are unnecessarily inefficient. One example is the interrupt controllers: fully virtualized mode provides the guest kernel with emulated interrupt controllers (APICs and IOAPICs). Each instruction that interacts with the APIC requires a trip up into Xen and a software instruction decode; and each interrupt delivered requires several of these emulations.

As it turns out, many of the paravirtualized interfaces for interrupts, timers, and so on are actually available for guests running in HVM mode; they just need to be turned on and used. The paravirtualized interfaces use memory pages shared with Xen, and are streamlined to minimize traps into the hypervisor.
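The contrast between the two approaches can be sketched with a toy model. Everything here is hypothetical and chosen for illustration (`SharedPage`, `Hypervisor`, the handler functions, and the port numbers); it is not Xen's actual interface. It only shows why a pending bitmap in shared memory avoids the per-access traps of an emulated controller.

```python
class SharedPage:
    """Stand-in for a memory page mapped by both Xen and the guest."""

    def __init__(self):
        self.pending = 0  # one bit per event port


class Hypervisor:
    def __init__(self, page):
        self.page = page
        self.traps = 0  # number of times the guest exited into the hypervisor

    def raise_event(self, port):
        # Marking an event pending is a plain write to shared memory.
        self.page.pending |= 1 << port

    def emulated_register_access(self):
        # Every access to an emulated interrupt controller register traps
        # and must be decoded and emulated in software.
        self.traps += 1


def guest_handle_pv(page):
    """Drain pending events using only reads and writes to shared memory."""
    handled = []
    while page.pending:
        lowest = page.pending & -page.pending  # isolate the lowest set bit
        port = lowest.bit_length() - 1
        page.pending &= ~lowest                # clear it, no trap needed
        handled.append(port)
    return handled


def guest_handle_emulated(hv, accesses_per_interrupt):
    """Handle one interrupt via an emulated controller.

    accesses_per_interrupt is a parameter because the text only says
    "several" register accesses are needed per interrupt.
    """
    for _ in range(accesses_per_interrupt):
        hv.emulated_register_access()
```

In this model, draining any number of events through the shared page leaves `hv.traps` at zero, while the emulated path adds one trap per register access.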

So Stefano Stabellini wrote some patches for the Linux kernel that allowed Linux, when it detects that it’s running in HVM mode under Xen, to switch from using the emulated interrupt controllers and timers to the paravirtualized interrupts and timers. This mode he called PVHVM mode, because although it runs in HVM mode, it uses the PV interfaces extensively.

(“PVHVM” mode should not be confused with “PV-on-HVM” mode, which is a term sometimes used in the past for “fully virtualized with PV drivers”.)

With the introduction of PVHVM mode, we can start to see paravirtualization not as binary on or off, but as a spectrum. In PVHVM mode, the disk and network are paravirtualized, as are interrupts and timers. But the guest still boots with an emulated motherboard, PCI bus, and so on. It also goes through a legacy boot, starting with a BIOS and then booting into 16-bit mode. Privileged instructions are virtualized using the HVM extensions, and pagetables are fully virtualized, using either shadow pagetables, or the hardware assisted paging (HAP) available on more recent AMD and Intel processors.

Problems with paravirtualization: Linux and the PV MMU

PVHVM mode allows 64-bit guests to run at near native speed, taking advantage of both the hardware virtualization extensions and the paravirtualized interfaces of Xen. But it still leaves something to be desired. First, it still requires the overhead of an emulated BIOS and legacy boot. Second, it requires the extra memory overhead of a qemu instance to emulate the motherboard and PCI devices. For this reason, memory-conscious or security-conscious users may opt to use 64-bit PV anyway, even if it is somewhat slower.

But there is one PV guest that can never be run in PVHVM mode, and that is domain 0. Because having a domain 0 with the current Linux drivers will always be necessary, it will always be necessary to have a PV mode in the Linux kernel.

But what’s the problem, you ask? Weren’t all of the features necessary to run Linux as a dom0 upstreamed in Linux 3.0?

Yes, they were; but they are still occasionally the source of some irritation. The core changes required to paravirtualize the page tables (also known as the “PV MMU”) are straightforward and work well once the system is up and running. However, while the kernel is booting, before the normal MMU is up and running, the story is a bit different. The changes required for the early MMU are fragile, and are often inadvertently broken when making seemingly innocent changes. This makes both the x86 maintainers and the pvops maintainers unhappy, consuming time and emotional energy that could be used for other purposes.

Almost fully PV: PVH mode

A lot of the choices Xen made when designing a PV interface were made before HVM extensions were available. Nearly all hardware now has HVM extensions available, and nearly all also include hardware-assisted pagetable virtualization. What if we could run a fully PV guest — one that had no emulated motherboard, BIOS, or anything like that — but used the HVM extensions to make the PV MMU unnecessary, as well as to speed up system calls in 64-bit mode?

This is exactly what Mukesh’s PVH mode is. It’s a fully PV kernel mode, running with paravirtualized disk and network, paravirtualized interrupts and timers, no emulated devices of any kind (and thus no qemu), no BIOS or legacy boot — but instead of requiring PV MMU, it uses the HVM hardware extensions to virtualize the pagetables, as well as system calls and other privileged operations.

We fully expect PVH to have the best characteristics of all the modes — a simple, fast, secure interface, low memory overhead, while taking full advantage of the hardware. If HVM had been available at the time the Xen hypervisor was designed, PVH is probably the mode we would have chosen to use. In fact, in the new ARM Xen port, it is the primary mode that guests will operate in.

Once PVH is well-established (perhaps five years or so after it’s introduced), we will probably consider removing non-PVH support from the Linux kernel, making maintenance of Xen support for Linux much simpler. The Xen hypervisor will probably continue to support older kernels for some time after that. However, rest assured that none of this will be done without consideration of the community.

Given the number of other things in the fully virtualized – paravirtualized spectrum, finding a descriptive name has been difficult. The developers have more or less settled on “PVH” (mainly PV, but with a little bit of HVM), but it has in the past been called other things, including “PV in an HVM container” (or just “HVM containers”), and “Hybrid mode”.

What about KVM?

At this point, some people may be wondering how KVM’s virtualization fits into this spectrum.

Strictly speaking, KVM is just a set of kernel extensions designed to help processes implement virtualization. When most people speak of using KVM, they mean “qemu-kvm”, that is, qemu configured to use the KVM extensions. (There are other projects, such as the Native Linux KVM tool, which also use the KVM extensions.) When I say “KVM” here, I mean qemu-kvm.

KVM supports both “legacy boot”, starting in 16-bit mode with a BIOS (or EFI) to load the kernel bootloader, and booting directly into a kernel passed on the qemu command-line. It also provides an emulated motherboard, PCI bus, and so on. It can provide both emulated disk and network cards; and thus it is capable of supporting guests running in fully virtualized mode.
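The two boot paths can be illustrated by the qemu command lines that select them. The sketch below only builds argument lists; the helper name `qemu_kvm_argv`, the memory size, and all file names are hypothetical, and the exact options a given qemu version expects may differ.

```python
def qemu_kvm_argv(disk_image, kernel=None, initrd=None, cmdline=None):
    """Build a qemu command line for a KVM guest with an emulated IDE disk.

    Without `kernel`, the guest goes through the firmware and the
    bootloader on the disk (legacy boot). With `kernel`, qemu loads the
    kernel image directly.
    """
    argv = [
        "qemu-system-x86_64",
        "-enable-kvm",                 # use the KVM kernel extensions
        "-m", "1024",                  # guest memory in MiB (arbitrary choice)
        "-drive", f"file={disk_image},format=raw,if=ide",
    ]
    if kernel is not None:
        argv += ["-kernel", kernel]
        if initrd is not None:
            argv += ["-initrd", initrd]
        if cmdline is not None:
            argv += ["-append", cmdline]
    return argv


# Hypothetical file names, for illustration only.
legacy_boot = qemu_kvm_argv("guest.img")
direct_boot = qemu_kvm_argv("guest.img", kernel="vmlinuz",
                            initrd="initrd.img", cmdline="root=/dev/sda1")
```

Either list could be passed to `subprocess.run`; in both cases the guest still sees an emulated motherboard and PCI bus.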

KVM also provides virtio devices, which can be considered paravirtualized, as well as a PV clock, for operating systems that can be modified to support them. KVM’s typical method of paravirtualization is somewhat different than Xen’s. Virtio devices expose a normal device interface, with MMIO control paths and so on, and could in theory be implemented by real hardware. Xen’s PV interfaces are based on shared memory and lockless synchronization. The kinds of actions that need an MMIO context switch for virtio devices probably correspond pretty closely to actions that need hypercalls for Xen PV devices; but in Xen no instruction emulation needs to be done.

KVM does not have a paravirtualized interface for timers or interrupts; instead (if I understand correctly) it uses an emulated local APIC. Handling a full interrupt cycle for an emulated local APIC typically requires several MMIO accesses, each of which requires a context switch and an instruction emulation. The Xen PV interrupt interface is based on memory shared with the hypervisor, supplemented by hypercalls when necessary; so most operations can be done without context switches, and those that do require only a single context switch (and no instruction emulation). This was one of the major reasons for introducing PVHVM mode for Xen guests.

So KVM has paravirtualized devices and a paravirtualized clock, but not paravirtualized interrupts; placing KVM on the spectrum, it would be one step more paravirtualized than “FV with PV drivers”, but not as paravirtualized as PVHVM.

The paravirtualization spectrum

So to summarize: There are a number of things that can be either virtualized or paravirtualized when creating a VM; these include:

  • Disk and network devices
  • Interrupts and timers
  • Emulated platform: motherboard, device buses, BIOS, legacy boot
  • Privileged instructions and pagetables (memory access)

Each of these can be fully virtualized or paravirtualized independently. This leads to a spectrum of virtualization modes, summarized in the table below:

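[Table image: pv-spectrum-grid.png, showing for each mode how each of the components listed above is virtualized or paravirtualized]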

The first three of these are all classified as “HVM mode”, and the last two as “PV mode”, for historical reasons. PVH is the new mode, which we expect to be a sweet spot between full virtualization and paravirtualization: it combines the advantages of Xen’s PV mode with full utilization of hardware support.
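The spectrum can also be written down as a small data structure. The grid below is reconstructed from the descriptions in this article rather than copied from the original image, so the cell values and the names `COMPONENTS`, `SPECTRUM`, `pv_count`, and `xen_label` should be read as an illustrative sketch.

```python
COMPONENTS = (
    "disk and network",
    "interrupts and timers",
    "emulated platform and legacy boot",
    "privileged instructions and pagetables",
)

# Cell values:
#   "VS" = virtualized in software (emulated)
#   "VH" = virtualized using the HVM hardware extensions
#   "P"  = paravirtualized
# Rows run from fully virtualized (as implemented by Xen's HVM mode)
# to fully paravirtualized; dicts preserve this insertion order.
SPECTRUM = {
    "fully virtualized":  ("VS", "VS", "VS", "VH"),
    "FV with PV drivers": ("P",  "VS", "VS", "VH"),
    "PVHVM":              ("P",  "P",  "VS", "VH"),
    "PVH":                ("P",  "P",  "P",  "VH"),
    "PV":                 ("P",  "P",  "P",  "P"),
}


def pv_count(mode):
    """Number of components that are paravirtualized in the given mode."""
    return SPECTRUM[mode].count("P")


def xen_label(mode):
    """Historical label: the first three rows are "HVM mode", the rest "PV mode"."""
    position = list(SPECTRUM).index(mode)
    return "HVM mode" if position < 3 else "PV mode"


if __name__ == "__main__":
    for mode, cells in SPECTRUM.items():
        print(f"{mode:20} {' '.join(f'{c:3}' for c in cells)} {xen_label(mode)}")
```

Sorting the modes by `pv_count` reproduces the order of the rows, which is one way to see why the modes form a spectrum rather than two poles.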

Hopefully this has given you an insight into what the various modes are, how they came about, and what the advantages and disadvantages of each are.