X86 Paravirtualised Memory Management


Introduction

One of the original innovations of the Xen hypervisor was the paravirtualisation of the memory management unit (MMU). Compared to contemporary techniques, this allowed fast and efficient virtualisation of Operating Systems which use paging.

In this article we will describe the functionality of the PV MMU for X86 Xen guests. A familiarity with X86 paging and related concepts will be assumed.

Other guest types, such as HVM or PVH guests on X86 or guests on ARM, achieve virtualisation of the MMU using other techniques, such as hardware-assisted or shadow paging.

Direct Paging

In order to virtualise the memory subsystem all hypervisors introduce an additional level of abstraction between what the guest sees as physical memory (often called pseudo-physical in Xen) and the underlying memory of the machine (machine addresses in Xen). This is usually done through the introduction of a Physical to Machine (P2M) mapping. Typically this would be maintained within the hypervisor and hidden from the guest Operating System through techniques such as the use of Shadow Page Tables.

The Xen paravirtualised MMU model instead requires that the guest be aware of the P2M mapping and be modified such that, instead of writing page table entries mapping virtual addresses to the (pseudo-)physical address space, it writes entries mapping virtual addresses directly to the machine address space, performing the translation from pseudo-physical to machine addresses itself using the P2M as it writes its page tables. This technique is known in Xen as direct paging.

Page Types and Invariants

In order to ensure that the guest cannot subvert the system Xen requires that certain invariants are maintained and therefore that all updates to the page tables are vetted by Xen. Xen achieves this by requiring that page tables are always updated using a hypercall.

Xen defines a number of page types and maintains the invariant that any given page has exactly one type at any given time. The type of a page is reference counted and can only be changed when the type count is zero.

The basic types are:

None
No special uses.
LN Page table page
Pages used as a page table at level N. There are separate types for each of the 4 levels on 64-bit and 3 levels on 32-bit PAE guests.
Segment descriptor page
Pages used as part of the Global or Local Descriptor tables (GDT/LDT).
Writeable
Page is writable.

Xen enforces the invariant that only pages with the Writeable type have a writable mapping in the page tables. Likewise it ensures that no writable mapping exists of a page with any other type. It also enforces other invariants, such as requiring that no page table page can make a non-privileged mapping of the hypervisor's virtual address space. By doing this it can ensure that the guest OS is not able to directly modify any critical data structures and therefore subvert the safety of the system, for example to map machine addresses which do not belong to it.

Whenever a set of page tables is loaded into the hardware page table base register (cr3) the hypervisor must take an appropriate type reference on the root page table (that is, an L4 type reference on 64-bit or an L3 type reference on 32-bit). If the page is not already of the required type then, in order to take the initial reference, it must first have a type count of zero (remember, a page's type can only be changed while its type count is zero) and must be validated to ensure that it respects the invariants. For a page with a page table type to be valid, any page referenced by a present page table entry in that page must have the type of the next level down. So any page referenced by a page with type L4 Page Table must itself have type L3 Page Table. This invariant is applied recursively down to the L1 page table layer. At L1 the invariant is that any data page mapped by a writable page table entry must have the Writeable type.

By applying these invariants Xen ensures that the set of page tables as a whole is safe to load into the page table base register.

Similar requirements are placed on other special page types, which must likewise be validated and have a reference of the appropriate type taken before they can be passed to the hardware.

In order to maintain the invariants Xen must be involved in all updates to the page tables, as well as various other privileged operations. These are covered in the following sections.

Guest Kernel Privilege Level

In order to prevent guest operating systems from subverting these mechanisms it is also necessary for guest kernels to run without the normal privileges associated with running in processor ring-0. For this reason Xen PV guest kernels usually run in either ring-1 (32-bit guests) or ring-3 (64-bit guests).

Updating Page Tables

Since the page tables are not writable by the guest, Xen provides several mechanisms by which the guest can update a page table entry.

mmu_update hypercall

The first mechanism provided by Xen is the HYPERVISOR_mmu_update hypercall. This hypercall has the prototype:

 struct mmu_update {
     uint64_t ptr;       /* Machine address of PTE. */
     uint64_t val;       /* New contents of PTE.    */
 };
 
 long HYPERVISOR_mmu_update(const struct mmu_update reqs[],
                            unsigned count, unsigned *done_out,
                            unsigned foreigndom)

The operation takes an array of count requests in reqs. The done_out parameter returns the number of successful operations. foreigndom can be used by a suitably privileged domain to access memory belonging to other domains (this usage is not covered here).

Each request is a (ptr, val) pair. The ptr field is further divided into ptr[1:0], which contains the type of update to perform, and ptr[:2], which contains the address to update.

The valid values for ptr[1:0] are:

MMU_NORMAL_PT_UPDATE
A normal page table update. ptr[:2] contains the machine address of the entry to update while val is the Page Table Entry to write. This effectively implements *ptr = val with checks to ensure that Xen's invariants are preserved.
MMU_MACHPHYS_UPDATE
Update the machine to physical address mapping. This is covered below.
MMU_PT_UPDATE_PRESERVE_AD
As per MMU_NORMAL_PT_UPDATE but preserving the Accessed and Dirty bits in the page table entry.
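
As an illustration, here is a minimal sketch of using this hypercall to write a single page table entry. The wrapper name write_one_pte is an assumption, as is how the machine address and the new entry are obtained; the constant values are taken from xen/include/public/xen.h:

 /* A minimal sketch, assuming the struct and prototype above are in
  * scope. Constant values are from xen/include/public/xen.h. */
 #define MMU_NORMAL_PT_UPDATE 0
 #define DOMID_SELF 0x7FF0      /* "my own domain" */
 
 /* Write one PTE. pte_maddr is the machine address of the entry to
  * update; new_pte is the new entry, already containing a machine
  * frame number looked up in the guest's P2M. */
 static int write_one_pte(uint64_t pte_maddr, uint64_t new_pte)
 {
     struct mmu_update req = {
         .ptr = pte_maddr | MMU_NORMAL_PT_UPDATE, /* type in ptr[1:0] */
         .val = new_pte,
     };
     unsigned done = 0;
 
     return HYPERVISOR_mmu_update(&req, 1, &done, DOMID_SELF);
 }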

update_va_mapping hypercall

The second mechanism provided by Xen is the HYPERVISOR_update_va_mapping hypercall. This hypercall has the prototype:

 long
 HYPERVISOR_update_va_mapping(unsigned long va, u64 val,
                              enum update_va_mapping_flags flags)

This operation simply updates the leaf (L1) PTE entry that maps the virtual address va with the given value val, while of course performing the expected checks to ensure that the invariants are maintained. This can be thought of as updating the PTE using a linear mapping.

The flags parameter can be used to request that Xen flush the TLB entries associated with the update. See the hypercall's documentation for the valid flags.
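
For example, a guest might remap a single page and flush the stale TLB entry in one call. This sketch assumes the UVMF_INVLPG flag value from xen/include/public/xen.h:

 #define UVMF_INVLPG 2  /* flush the TLB entry for va (from xen.h) */
 
 /* Point va at a new machine frame and invalidate the old translation. */
 static int remap_page(unsigned long va, uint64_t new_pte)
 {
     return HYPERVISOR_update_va_mapping(va, new_pte, UVMF_INVLPG);
 }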

Trapping and emulating page table writes

In addition to the above, Xen can also trap and emulate updates to leaf (L1) page table entries only. This trapping and emulating is relatively expensive and is best avoided, but for little-used code paths it can provide a reasonable trade-off against the requirement to modify the call site in the guest OS.

Other privileged operations

As well as moderating page table updates in order to maintain the necessary invariants, Xen must also be involved in certain other privileged operations, such as setting a new page table base. Because the guest kernel no longer runs in ring-0, certain other privileged operations must also be done by the hypervisor, such as flushing the TLB.

These operations are performed via the HYPERVISOR_mmuext_op hypercall. This hypercall has the following prototype:

 struct mmuext_op {
     unsigned int cmd; /* => enum mmuext_cmd */
     union {
         /* [UN]PIN_TABLE, NEW_BASEPTR, NEW_USER_BASEPTR
          * CLEAR_PAGE, COPY_PAGE, [UN]MARK_SUPER */
         xen_pfn_t     mfn;
         /* INVLPG_LOCAL, INVLPG_ALL, SET_LDT */
         unsigned long linear_addr;
     } arg1;
     union {
         /* SET_LDT */
         unsigned int nr_ents;
         /* TLB_FLUSH_MULTI, INVLPG_MULTI */
         const void *vcpumask;
         /* COPY_PAGE */
         xen_pfn_t src_mfn;
     } arg2;
 };
 
 long
 HYPERVISOR_mmuext_op(struct mmuext_op uops[],
                      unsigned int count,
                      unsigned int *pdone,
                      unsigned int foreigndom)

The hypercall takes an array of count operations, each specified by the mmuext_op struct. This hypercall allows access to various operations which must be performed via the hypervisor, either because the guest kernel is no longer privileged or because the hypervisor must be involved in order to maintain safety. In general each available command corresponds to a low-level processor function. These include MMUEXT_NEW_BASEPTR (i.e. write to cr3), various types of TLB and cache flush, and loading descriptor table base addresses (see below). For more information on the available operations please see the hypercall documentation.
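
As a sketch, loading a new page table base might look as follows (the command value is taken from xen/include/public/xen.h; DOMID_SELF as defined earlier):

 #define MMUEXT_NEW_BASEPTR 5  /* from xen/include/public/xen.h */
 
 /* Switch to a new address space whose root table lives in the machine
  * frame l4_mfn. Xen validates the table (or finds it already
  * validated) before writing cr3. */
 static int switch_page_table_base(xen_pfn_t l4_mfn)
 {
     struct mmuext_op op = {
         .cmd = MMUEXT_NEW_BASEPTR,
         .arg1.mfn = l4_mfn,
     };
     unsigned int done = 0;
 
     return HYPERVISOR_mmuext_op(&op, 1, &done, DOMID_SELF);
 }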

Pinning Page Tables

As discussed above Xen ensures that various invariants are met concerning whether certain pages are mapped writable or not. This in turn means that Xen needs to validate the page tables whenever they are loaded into the page table base register. However this is a potentially expensive operation since Xen needs to walk the complete set of page-tables and validate each one recursively.

In order to avoid this expense every time the page table base changes (i.e. on every context switch), Xen allows a page to be explicitly pinned to a given type. This effectively means taking an extra reference of the relevant page table type, thereby forcing Xen to validate the page table up front and to maintain the invariants for as long as the pin remains in place, even when the page is not referenced by the current page table base. By doing this the guest ensures that when a new page table base is loaded the referenced page already has the appropriate type (L4 or L3) and therefore the type count can simply be incremented without the need to validate.

For maximum performance a guest OS kernel will usually want to perform a pin operation as late as possible during the setup of a new set of page tables, so as to be able to construct them using normal writable mappings before blessing them as a set of page tables. Likewise on page table teardown a guest OS will usually want to unpin the pages as soon as possible so that it can tear down the page tables without the use of hypercalls. These operations are usually referred to as late pin and early unpin.
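
A sketch of the pin and unpin operations for a 64-bit guest follows; the command values are from xen/include/public/xen.h and the helper names are illustrative:

 #define MMUEXT_PIN_L4_TABLE 3  /* from xen/include/public/xen.h */
 #define MMUEXT_UNPIN_TABLE  4
 
 static int mmuext_one_op(unsigned int cmd, xen_pfn_t mfn)
 {
     struct mmuext_op op = { .cmd = cmd, .arg1.mfn = mfn };
     unsigned int done = 0;
     return HYPERVISOR_mmuext_op(&op, 1, &done, DOMID_SELF);
 }
 
 /* Late pin: validate a freshly built root table once, up front. */
 static int pin_root(xen_pfn_t l4_mfn)
 {
     return mmuext_one_op(MMUEXT_PIN_L4_TABLE, l4_mfn);
 }
 
 /* Early unpin: let the pages revert to ordinary writable pages so
  * teardown needs no further hypercalls. */
 static int unpin_root(xen_pfn_t l4_mfn)
 {
     return mmuext_one_op(MMUEXT_UNPIN_TABLE, l4_mfn);
 }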

The Physical-to-machine and machine-to-physical mapping tables

Direct paging requires that the guest Operating System be aware of the mapping between (pseudo-)physical and machine addresses (the P2M table). In addition, in order to be able to read page table entries (which contain machine addresses) and convert them back into (pseudo-)physical addresses, a translation from machine to (pseudo-)physical addresses is required; this is provided by the M2P table.

Both the P2M and M2P tables are a simple array of frame numbers, indexed by either physical or machine frames and looking up the other.

Since the P2M is sized according to the guest's pseudo-physical address space, it is left entirely up to the guest to provide and maintain it in its own pages.

However the M2P must be sized according to the total amount of RAM in the host and therefore could be of considerable size compared to the amount of RAM available to the guest, not to mention sparse from the guest's point of view since the majority of machine pages will not belong to it.

For this reason Xen exposes a read-only M2P of the entire host to the guest and allows guests to update this table using the MMU_MACHPHYS_UPDATE sub-op of the HYPERVISOR_mmu_update hypercall.
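
Conceptually the two lookups are plain array indexing. In this sketch the table symbols are assumptions; a real guest builds and names its own P2M, while the M2P lives at a fixed, Xen-provided virtual address:

 /* Assumed symbols: p2m_table is allocated and filled by the guest;
  * m2p_table is the read-only, host-sized table exposed by Xen. */
 extern unsigned long *p2m_table;
 extern const unsigned long *m2p_table;
 
 static inline unsigned long pfn_to_mfn(unsigned long pfn)
 {
     return p2m_table[pfn];   /* pseudo-physical frame -> machine frame */
 }
 
 static inline unsigned long mfn_to_pfn(unsigned long mfn)
 {
     return m2p_table[mfn];   /* machine frame -> pseudo-physical frame */
 }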

Descriptor Tables

As well as protecting page tables from being writable by the guest, Xen also requires that various descriptor tables not be writable by a guest.

Interrupt Descriptor Table

A Xen guest cannot access the Interrupt Descriptor Table (IDT) directly. Instead Xen maintains the IDT used by the physical hardware and provides guests with a completely virtual IDT. A guest writes entries to its virtual IDT using the HYPERVISOR_set_trap_table hypercall. This has the following prototype:

 struct trap_info {
     uint8_t       vector;  /* exception vector                              */
     uint8_t       flags;   /* 0-3: privilege level; 4: clear event enable?  */
     uint16_t      cs;      /* code selector                                 */
     unsigned long address; /* code offset                                   */
 };
 
 long HYPERVISOR_set_trap_table(const struct trap_info traps[]);

The entries of the trap_info struct correspond to the fields of a native IDT entry and each will be validated by Xen before it is used. The hypercall takes an array of traps terminated by an entry where address is zero.
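
For instance, a guest might install its handlers as below. The selector KERNEL_CS and the handler symbols are stand-ins for the guest's own definitions, not part of the interface:

 /* Illustrative only: KERNEL_CS, page_fault_handler and int3_handler
  * are placeholders for the guest's own selectors and entry points. */
 static void install_traps(void)
 {
     struct trap_info traps[] = {
         { .vector = 14, .flags = 0,    /* page fault, ring 0 only */
           .cs = KERNEL_CS,
           .address = (unsigned long)page_fault_handler },
         { .vector = 3,  .flags = 3,    /* int3, callable from ring 3 */
           .cs = KERNEL_CS,
           .address = (unsigned long)int3_handler },
         { .address = 0 },              /* terminating entry */
     };
 
     HYPERVISOR_set_trap_table(traps);
 }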

Global/Local Descriptor Tables

A Xen guest is not able to access the Global or Local Descriptor Table (GDT/LDT) directly. Pages which are in use as part of either table are given their own distinct type and must therefore be mapped read-only in the guest.

The guest is also not privileged to update the descriptor base registers and must therefore do so using a hypercall. The hypercall to update the GDT is HYPERVISOR_set_gdt which has the following prototype:

 long HYPERVISOR_set_gdt(const xen_pfn_t frames[], unsigned int entries);

This takes an array of machine frame numbers which are validated and loaded into the GDTR whenever the guest is running. Note that, unlike native X86, these are machine frames and not virtual addresses. These frames will be mapped by Xen into the virtual address range which it reserves for this purpose.

The LDT is set using the MMUEXT_SET_LDT sub-op of the HYPERVISOR_mmuext_op hypercall.

Finally, since the pages cannot be mapped writable by the guest, the HYPERVISOR_update_descriptor hypercall is provided:

 long HYPERVISOR_update_descriptor(u64 pa, u64 desc);

This hypercall takes the machine address of the descriptor entry to update and the requested contents of the descriptor itself, in the same format as native descriptors.
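
A minimal sketch, assuming the caller has already translated the descriptor slot's address into a machine address:

 /* Update one 8-byte GDT/LDT entry. desc_maddr is the machine address
  * of the slot; desc is a native-format segment descriptor, which Xen
  * validates before writing it into the read-only descriptor page. */
 static int set_descriptor(uint64_t desc_maddr, uint64_t desc)
 {
     return HYPERVISOR_update_descriptor(desc_maddr, desc);
 }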

Start Of Day

The initial boot time environment of a Xen PV guest is somewhat different to the normal initial mode of an X86 processor. Rather than starting out in 16-bit mode with paging disabled, a PV guest is started in either 32- or 64-bit mode with paging enabled, running on an initial set of page tables provided by the hypervisor. These pages will be set up so as to meet the required invariants and will be loaded into the page table base register, but will not be explicitly pinned.

The initial virtual and pseudo-physical layout of a new guest is described in the references.


Before the guest is started, the kernel image is read and the ELF PT_NOTE program header is parsed. The hypervisor looks in the .note sections for the 'Xen' notes. The description fields are Xen-specific and contain the required information: where the kernel expects its virtual base address, what type of hypervisor it can work with, which features the kernel image supports, the location of the hypercall page, and so on. There are two variants of this:

a) A “.note.Xen” ELF section conforming to the ELF PT_NOTE format.

The PT_NOTE header is described in [1] and [2].

The fields (Name, Desc, Type) are defined by the ELF specification; the specific type values and the contents of the description are defined by Xen.

Each note is a 4-byte aligned structure. It starts with a numerical type key (aligned to 4 bytes), followed by either a string or a numerical value (again, aligned to 4 bytes). String values can be of any length; numerical values are 64 bits (8 bytes) long.

For example, here is XEN_ELFNOTE_XEN_VERSION (5) with the value "xen-3.0":

 04000000 08000000 05000000 58656e00 78656e2d 332e3000 ........Xen.xen-3.0

readelf would print this out as an eight-byte value with type 5:

 Xen                  0x00000008	Unknown note type: (0x00000005)

Please see the ELF references above for a fuller understanding.
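
As a sketch, the note shown above could be emitted from C as follows. Real kernels typically use an assembler ELFNOTE macro instead; the struct layout here simply follows the format described above:

 #include <stdint.h>
 
 /* One .note.Xen entry in the standard ELF note layout: namesz and
  * descsz count the NUL terminators, and each field is 4-byte aligned. */
 struct xen_version_note {
     uint32_t namesz;   /* sizeof("Xen") == 4 */
     uint32_t descsz;   /* sizeof("xen-3.0") == 8 */
     uint32_t type;     /* XEN_ELFNOTE_XEN_VERSION == 5 */
     char     name[4];  /* "Xen" */
     char     desc[8];  /* "xen-3.0" */
 };
 
 static const struct xen_version_note note
     __attribute__((section(".note.Xen"), aligned(4), used)) = {
     .namesz = 4,
     .descsz = 8,
     .type   = 5,
     .name   = "Xen",
     .desc   = "xen-3.0",
 };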

b) The legacy ASCIIZ string with all of the keys concatenated. Each key is a string, followed by an equals sign and the value, which is also a string (numeric values are given as hexadecimal strings). The delimiter is a comma. The key can be up to 32 characters and the value up to 128 characters. For example:

GUEST_OS=Mini-OS,XEN_VER=xen-3.0,VIRT_BASE=0x0,ELF_PADDR_OFFSET=0x0,HYPERCALL_PAGE=0x2,LOADER=generic


The legacy format should not be used: it is frozen and limits the values that can be expressed.

The parameters and their purposes, as well as the XEN_ELFNOTE_FEATURES values, are explained in the hypervisor's ELF note documentation.

For example, the ELF notes might be set as follows:

Name of Xen ELF entry              Contents
XEN_ELFNOTE_GUEST_OS (6)           linux
XEN_ELFNOTE_GUEST_VERSION (7)      2.6
XEN_ELFNOTE_XEN_VERSION (5)        xen-3.0
XEN_ELFNOTE_VIRT_BASE (3)          0xffffffff80000000
XEN_ELFNOTE_ENTRY (1)              0xffffffff81899200
XEN_ELFNOTE_HYPERCALL_PAGE (2)     0xffffffff81001000
XEN_ELFNOTE_FEATURES (10)          pae_pgdir_above_4gb
XEN_ELFNOTE_PAE_MODE (9)           yes
XEN_ELFNOTE_LOADER (8)             generic
XEN_ELFNOTE_SUSPEND_CANCEL (14)    1
XEN_ELFNOTE_HV_START_LOW (12)      0xffff800000000000
XEN_ELFNOTE_PADDR_OFFSET (4)       0

With that setup, the hypervisor constructs an initial page table that spans the region from virtual start address (0xffffffff80000000) up to the end of the p2m map.

Using that ELF program header information, the hypervisor (or toolstack) constructs the domain with the appropriately located data. This ELF data is used to construct a guest which is laid out as enumerated in this header: [3]

NOTE: This is an example of a 64-bit guest and not part of the ABI.

Page Frame (PFN)  Virtual Address     Contents
0x0               0xffffffff80000000  location of struct shared_info. The start_info_t structure contains the machine address of this structure.
0x1000            0xffffffff81000000  location of the kernel
0x1001            0xffffffff81001000  location of the hypercall page within the kernel
0x1E3E            0xffffffff81e3e000  ramdisk (NOTE: this is an example; the kernel and ramdisk sizes will differ)
0xFC69            0xffffffff8fc69000  phys2mach (P2M): an array of machine frame numbers. The total size of this array depends on nr_pages in struct start_info and on the architecture of the guest (each entry is four bytes under 32-bit kernels and eight bytes under 64-bit kernels). (NOTE: this is an example; the location depends on the size of the kernel and ramdisk and will differ.)
0xFCE9            0xffffffff8fce9000  location of the start_info structure
0xFCEA            0xffffffff8fcea000  location of the XenStore structure. Also refer to http://xenbits.xen.org/docs/unstable/misc/xenstore.txt
0xFCEB            0xffffffff8fceb000  depending on the type of guest (initial domain or subsequent domains), either the dom0_vga_console_info structure or the XenConsole structure. The parameters defining this location are in the start_info structure.
0xFCEC            0xffffffff8fcfc000  bootstrap page tables. These page tables are loaded in the guest at startup and cover from 0x0 up to 0xFD6F (the bootstrap stack).
0xFD6F            0xffffffff8fd6f000  bootstrap stack

When the guest is launched, as explained in [4], the register %esi contains the virtual address of the start_info_t (0xffffffff8fce9000), %cr3 points to the beginning of the bootstrap page tables (0xffffffff8fcfc000), and %esp points to the bootstrap stack (0xffffffff8fd6f000).

Virtual Address Space

Xen enforces certain restrictions on the virtual addresses which are available to PV guests. These are enforced as part of the machinery for validating writes to page table pages.

Xen uses this to reserve certain addresses for its own use. Certain areas are also read-only for guests and contain shared data structures such as the machine-to-physical table.

For a 64-bit guest the virtual address space is as follows:

0x0000000000000000-0x00007fffffffffff
Fully available to guests
0x0000800000000000-0xffff7fffffffffff
Inaccessible (addresses are 48-bit sign extended)
0xffff800000000000-0xffff807fffffffff
Read only to guests.
0xffff808000000000-0xffff87ffffffffff
Reserved for Xen use
0xffff880000000000-0xffffffffffffffff
Fully Available to guests

For 32-bit guests running on a 64-bit hypervisor, the virtual address space under 4G (which is all that such guests can access) is:

0x00000000-0xf57fffff
Fully available to guests
0xf5800000-0xffffffff
Read only to guests.

For more information see Memory Layout in the relevant header file.

Batching Hypercalls

For some memory management operations the overhead of making many hypercalls can become prohibitively expensive. For this reason many of the hypercalls described above take a list of operations to perform in order to amortise the overhead of making a hypercall.

In addition to this Xen provides the concept of a multicall which can allow hypercalls of different types to be batched together. HYPERVISOR_multicall has this prototype:

 struct multicall_entry {
     unsigned long op, result;
     unsigned long args[6];
 };
 
 long HYPERVISOR_multicall(multicall_entry_t call_list[],
                           unsigned int nr_calls);

Each entry in the array represents a hypercall and its associated arguments in the (hopefully) obvious way.
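
For example, a page table update and an MMU extended operation could be issued as one batch. This sketch uses the hypercall numbers from xen/include/public/xen.h and omits error handling; DOMID_SELF as defined earlier:

 #define __HYPERVISOR_mmu_update 1   /* from xen/include/public/xen.h */
 #define __HYPERVISOR_mmuext_op  26
 
 /* Batch an mmu_update and an mmuext_op into a single entry into Xen. */
 static void batch_update_and_flush(struct mmu_update *reqs, unsigned nr,
                                    struct mmuext_op *ext_ops, unsigned nr_ext)
 {
     struct multicall_entry calls[2] = {
         { .op = __HYPERVISOR_mmu_update,
           .args = { (unsigned long)reqs, nr, 0 /* done_out */, DOMID_SELF } },
         { .op = __HYPERVISOR_mmuext_op,
           .args = { (unsigned long)ext_ops, nr_ext, 0, DOMID_SELF } },
     };
 
     HYPERVISOR_multicall(calls, 2);
     /* On return, each entry's result field holds that call's return value. */
 }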

Guest Specific Details

Linux paravirt_ops

This section describes details of the Xen support in mainline Linux, which uses the paravirt_ops infrastructure in order to allow Xen support to be dynamically selected at boot time.

General PV MMU operation

The paravirt_ops infrastructure provides a mechanism by which the low-level MMU operations are abstracted into function pointers, allowing the native operations to be overridden with Xen versions where necessary.

From the point of view of MMU operations the main entry point is struct pv_mmu_ops. This contains entry points for low-level operations such as the following (a simplified sketch appears after the list):

  • Allocating/freeing page table entries. These allow the kernel to mark the pages read-only and read-write as the pages are reused.
  • Creating, writing and reading PTE entries. These allow the kernel to make the necessary translations between pseudo-physical and machine addressing as well as using hypercalls instead of direct writes.
  • Reading and writing of control registers, e.g. cr3, to allow hypercalls to be inserted.
  • Various TLB flush operations, again to allow their replacement by hypercalls.
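
A much simplified sketch of this pattern follows; the real struct pv_mmu_ops has many more hooks and its exact layout varies between kernel versions, and virt_to_machine is an assumed helper (the constants are as in the earlier sketches):

 /* Simplified sketch: a table of MMU hooks which boot code points at
  * either the native or the Xen implementations. */
 struct mmu_hooks {
     void (*set_pte)(uint64_t *ptep, uint64_t pteval);
     void (*write_cr3)(unsigned long val);
 };
 
 static void native_set_pte(uint64_t *ptep, uint64_t pteval)
 {
     *ptep = pteval;                  /* plain store on bare metal */
 }
 
 static void xen_set_pte(uint64_t *ptep, uint64_t pteval)
 {
     /* virt_to_machine() is an assumed helper returning the machine
      * address of the PTE; the update becomes a vetted hypercall. */
     struct mmu_update req = {
         .ptr = virt_to_machine(ptep) | MMU_NORMAL_PT_UPDATE,
         .val = pteval,
     };
     unsigned done = 0;
     HYPERVISOR_mmu_update(&req, 1, &done, DOMID_SELF);
 }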

As well as these, the interface includes some higher-level operations which allow for more efficient batching of compound operations such as duplicating (forking) a memory map. This is achieved by using the lazy_mmu_ops hooks to implement buffering of operations (using multicalls) into larger batches.

The Xen paravirt_ops backend uses an additional page flag, PG_pinned, to track whether a page has been pinned or not, and implements the late-pin, early-unpin scheme described above.

Start of Day issues

TBD

References