Introduction to Xen 3.x

To Do:

This document contains information that would also be valuable for Xen 4.0. It would need significant review and refactoring, or better, a clone of this document should be made for Xen 4.x.

Introduction

All of the following text refers to the x86 platform of Xen-unstable, unless explicitly stated otherwise. We deal only with Xen on Linux 2.6; we do not deal at all with Xen on Linux 2.4 (and as far as we know, domain 0 is intended to be based ONLY on 2.6 in the future). Moreover, the Linux 2.4 tree has currently been removed from the Xen tree (changeset 7263:f1abe953e401 from 8.10.05), though it may return once some problems are fixed.

This document deals only with Xen version 3.0 unless explicitly stated otherwise.

This is not intended to be a full and detailed documentation of the Xen project, but we hope it will be a starting point for anyone who is interested in Xen and wants to learn more.

The Xen Project team is permitted to take part or all of this document and integrate it with the official Xen documentation or put it as a standalone document in the Xen Web Site if they wish, without any further notice.

This is not a complete, detailed document nor a full walkthrough, and many important issues are omitted. Any feedback is welcome: Rami Rosen, ramirose@gmail.com

Xen and IA32 Protection Modes

In the classical protection model of IA-32, there are four privilege levels. The most privileged is ring 0, where the kernel runs (this level is also called Supervisor Mode). The least privileged is ring 3, where user applications run (this level is also called User Mode). Issuing certain instructions, called "privileged instructions", from any ring other than ring 0 causes a General Protection Fault.

Rings 1 and 2 went largely unused over the years (except in the case of OS/2). When running Xen, the hypervisor runs in ring 0 and the guest OS in ring 1. Applications run unmodified in ring 3.

BTW, there are of course architectures with different privilege models; for example, on PPC both domain 0 and the unprivileged domains run in supervisor mode.

Diagram: Xen and IA32 Protection Modes

rings.png

The Xend daemon:

The Xend daemon handles requests issued from domain 0; such requests can be, for example, creating a new domain ("xm create"), listing the domains ("xm list"), or destroying a domain ("xm destroy"). Running "xm help" shows all the possibilities.

You start the Xend daemon by running "xend start" after booting into domain 0. "xend start" creates two daemons: xenstored and xenconsoled (see tools/misc/xend). It also creates an instance of the Python SrvDaemon class and calls its start() method (see tools/python/xen/xend/server/SrvDaemon.py).

The SrvDaemon start() method is in fact the xend main program.

In the past, the start() method of SrvDaemon eventually opened an HTTP socket (port 8000) on which it listened for HTTP requests. It no longer opens an HTTP socket on port 8000.

Note: There is an alternative to the management layer of Xen, called libvirt; see http://libvirt.org. This is a free API (LGPL).

The Xen Store:

The Xen Store Daemon provides a simple tree-like database from which we can read values and to which we can write them. The Xen Store code is mainly under tools/xenstore.

It replaces the XCS, which was a daemon handling control messages.

The physical xenstore resides in one file: /var/lib/xenstored/tdb. (Previously it was scattered across several files; the change to a single file, named "tdb", was probably made to increase performance.)

Both user space ("tools" in Xen terminology) and kernel code can write to the XenStore. The kernel code writes to the XenStore by using XenBus.

The Python scripts (under tools/python) use lowlevel/xs.c to read from and write to the XenStore.

The Xen Store Daemon is started in xenstored_core.c. It creates a device file ("/dev/xen/evtchn") if such a device file does not already exist, and opens it (see domain_init(), file tools/xenstore/xenstored_domain.c).

It opens two UNIX domain sockets. One of these sockets is a read-only socket, and it resides under /var/run/xenstored/socket_ro. The second is /var/run/xenstored/socket.

Connections on these sockets are represented by the connection struct.

A connection can be in one of three states:


        BLOCKED (blocked by a transaction)
        BUSY    (doing some action)
        OK      (completed its transaction)

struct connection is declared in xenstore/xenstored_core.h. When a socket is read-only, its "can_write" member is false.

Then we start an endless loop in which we can get input/output from three sources: the two sockets and the event channel mentioned above.

Events which are received on the event channel are handled by the handle_event() method (file xenstored_domain.c).

There are six executables under tools/xenstore, five of which are in fact built from the same module, xenstore_client.c, each time with a different DEFINE passed (see the Makefile). The sixth tool is built from xsls.c.

These executables are: xenstore-exists, xenstore-list, xenstore-read, xenstore-rm, xenstore-write and xsls.

You can use these executables to access the xenstore. For example, to view the list of fields of domain 0, which has the path "local/domain/0", you run:


xenstore-list /local/domain/0

and a typical result can be the following list:


cpu
memory
name
console
vm
domid
backend

The xsls command is very useful: it recursively shows the contents of a specified XenStore path. Essentially it does a xenstore-list and then a xenstore-read for each returned field, displaying the fields and their values, and then repeats this recursively on each sub-path. For example, to view information about all VIF backends hosted in domain 0, you may use the following command:


xsls /local/domain/0/backend/vif

and a typical result may be:


14 = ""
 0 = ""
  bridge = "xenbr0"
  domain = "vm1"
  handle = "0"
  script = "/etc/xen/scripts/vif-bridge"
  state = "4"
  frontend = "/local/domain/14/device/vif/0"
  mac = "aa:00:00:22:fe:9f"
  frontend-id = "14"
  hotplug-status = "connected"
15 = ""
 0 = ""
  mac = "aa:00:00:6e:d8:46"
  state = "4"
  handle = "0"
  script = "/etc/xen/scripts/vif-bridge"
  frontend-id = "15"
  domain = "vm2"
  frontend = "/local/domain/15/device/vif/0"
  hotplug-status = "connected"

(xenstored must be running for these six executables to work; if xenstored is not running, these executables will usually hang. The Xend daemon, on the other hand, can be stopped.)
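The recursive walk that xsls performs can be sketched in Python over an in-memory tree. This is only an illustration: the nested `store` tuple and the helpers `find`, `xs_list`, `xs_read` and `xsls` are hypothetical stand-ins for the real xenstore-list and xenstore-read tools, not the actual tool code.

```python
# Hypothetical in-memory stand-in for a XenStore subtree:
# each node is a (value, children) pair.
store = ("", {
    "14": ("", {
        "0": ("", {
            "bridge": ("xenbr0", {}),
            "state": ("4", {}),
        }),
    }),
})


def find(root, path):
    """Walk from the root to the node named by a '/'-separated path."""
    node = root
    for part in [p for p in path.split("/") if p]:
        node = node[1][part]
    return node


def xs_list(root, path):
    """Stand-in for xenstore-list: names of the children of a path."""
    return list(find(root, path)[1])


def xs_read(root, path):
    """Stand-in for xenstore-read: the value stored at a path."""
    return find(root, path)[0]


def xsls(root, path="", depth=0):
    """List the children, read each value, then recurse into each child."""
    lines = []
    for name in xs_list(root, path):
        child = path + "/" + name
        lines.append('%s%s = "%s"' % (" " * depth, name, xs_read(root, child)))
        lines.extend(xsls(root, child, depth + 1))
    return lines


print("\n".join(xsls(store)))
```

Each level of recursion adds one space of indentation, which gives the same nested layout as the xsls output shown above.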

An instance of struct node is the elementary unit of the XenStore (struct node is defined in xenstored_core.h). The actual writing to the XenStore is done by the write_node() method of xenstored_core.c.

The xen_start_info structure has a member named store_evtchn (declared in public/xen.h as u16). This is the event channel for store communication.

VT-x (Virtualization Technology) processors - support in Xen

Note: following text refers only to IA-32 unless explicitly said otherwise.

Intel announced the Pentium® 4 672 and 662 processors with virtualization support in November 2005 (see, for example: http://www.physorg.com/news8160.html).

How does Xen support the Intel Virtualization Technology?

The VT extensions support in the Xen 3 code is mostly in xen/arch/x86/hvm/vmx*.c, xen/include/asm-x86/vmx*.h and xen/arch/x86/x86_32/entry.S.

arch_vcpu structure (file xen/include/asm-x86/domain.h) contains a member which is called arch_vmx and is an instance of arch_vmx_struct. This member is also important to understand the VT-x mechanism.

But the most important structure for VT-x is the VMCS (vmcs_struct in the code), which represents the VMCS region.

The definition (file include/asm-x86/vmx_vmcs.h) is short:

struct vmcs_struct {
    u32 vmcs_revision_id;
    unsigned char data[0]; /* vmcs size is read from MSR */
};

The VMCS region contains six logical regions; most relevant to our discussion are the Guest-state area and the Host-state area. We will also deal with the other four regions: VM-execution control fields, VM-exit control fields, VM-entry control fields and VM-exit information fields.

Intel added 10 new opcodes in VT-x to support Intel Virtualization Technology. They are detailed at the end of this section.

When using this technology, Xen runs in "VMX root operation" mode while the guest domains (which are unmodified OSs) run in "VMX non-root operation" mode. Since the guest domains run in "non-root operation" mode, they are more restricted, meaning that certain actions will cause a "VM exit" to the VMM.

Xen enters VMX operation in the start_vmx() method (file xen/arch/x86/vmx.c).

This method is called from the init_intel() method (file xen/arch/x86/cpu/intel.c); CONFIG_VMX should be defined.

First we check the X86_FEATURE_VMXE bit in the ecx register to see whether cpuid reports VMX support in the processor. In IA-32, Intel added a bit to the CR4 control register specifying whether we want to enable VMX, so we must set this bit to enable VMX on the processor (by calling set_in_cr4(X86_CR4_VMXE)). This bit is bit 13 in CR4 (VMXE).

Then we call _vmxon to start VMX operation. If we try to start VMX operation by _vmxon when the VMXE bit in CR4 is not set, we get an exception (#UD, for undefined opcode).
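The CR4 check can be illustrated with plain bit arithmetic. This sketch only models the register as a Python integer: `X86_CR4_VMXE` mirrors the name used in the text, while `set_in_cr4` and `vmxon` here are toy functions, and the `UndefinedOpcode` exception is a hypothetical stand-in for #UD.

```python
X86_CR4_VMXE = 1 << 13  # bit 13 of CR4, as stated in the text


class UndefinedOpcode(Exception):
    """Models the #UD exception raised by VMXON when CR4.VMXE is clear."""


def set_in_cr4(cr4, mask):
    """Toy version: return the CR4 value with the given bits set."""
    return cr4 | mask


def vmxon(cr4):
    """Toy VMXON: succeeds only when CR4.VMXE is set."""
    if not cr4 & X86_CR4_VMXE:
        raise UndefinedOpcode("#UD: CR4.VMXE is not set")
    return True


cr4 = set_in_cr4(0, X86_CR4_VMXE)
print(hex(cr4))    # 0x2000
print(vmxon(cr4))  # True
```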

In IA-64, things are a little different due to the different architecture: Intel added a new bit in IA-64 to the Processor Status Register (PSR). This is bit 46 and it is called VM. It should be set to 1 in guest OSs; when its value is 1, certain instructions cause a virtualization fault.

VM exit:

Some instructions cause a VM exit unconditionally, and some cause a VM exit only under certain VM-execution control fields (see the discussion of the VMCS region above).

The following instructions cause a VM exit unconditionally: CPUID, INVD, MOV from CR3, RDMSR, WRMSR, and all the new VT-x instructions (which are listed below).

There are other instructions, like HLT, INVLPG (Invalidate TLB Entry instruction), MWAIT and others, which cause a VM exit if a corresponding VM-execution control is set.

Apart from the VM-execution control fields, there are two bitmaps which are used for determining whether to perform a VM exit. The first is the exception bitmap (see EXCEPTION_BITMAP in the vmcs_field enum, file xen/include/asm-x86/vmx_vmcs.h). This bitmap is a 32-bit field; when a bit is set in this bitmap, a VM exit occurs if the corresponding exception occurs. By default, the entries which are set are EXCEPTION_BITMAP_PG (for page fault) and EXCEPTION_BITMAP_GP (for General Protection); see MONITOR_DEFAULT_EXCEPTION_BITMAP in vmx.h.

The second bitmap is the I/O bitmap (in fact, there are two I/O bitmaps, A and B, each 4 KB in size), which controls I/O instructions on ports. I/O bitmap A covers the ports in the range 0000-7FFF and I/O bitmap B covers the ports in the range 8000-FFFF (one bit for each I/O port). See IO_BITMAP_A and IO_BITMAP_B in the vmcs_field enum (VMCS Encodings).
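The arithmetic behind the two I/O bitmaps can be worked through in a few lines: 4 KB is 32768 bits, so each bitmap covers exactly half of the 16-bit port space. The sketch below is illustrative only; the function names are hypothetical, and it assumes that within each byte the least significant bit corresponds to the lowest-numbered port.

```python
BITMAP_BYTES = 4096                  # each I/O bitmap is 4 KB
PORTS_PER_BITMAP = BITMAP_BYTES * 8  # one bit per port: 32768 ports


def locate_port(port):
    """Return (bitmap name, byte offset, bit number) for an I/O port."""
    if not 0 <= port <= 0xFFFF:
        raise ValueError("I/O ports are 16-bit values")
    name = "A" if port < PORTS_PER_BITMAP else "B"
    index = port % PORTS_PER_BITMAP
    return name, index // 8, index % 8


def port_causes_exit(bitmap_a, bitmap_b, port):
    """True if the bit for this port is set in the matching bitmap."""
    name, byte, bit = locate_port(port)
    bitmap = bitmap_a if name == "A" else bitmap_b
    return bool(bitmap[byte] & (1 << bit))


bitmap_a = bytearray(BITMAP_BYTES)
bitmap_b = bytearray(BITMAP_BYTES)
bitmap_a[0x3F8 // 8] |= 1 << (0x3F8 % 8)  # intercept port 0x3F8

print(locate_port(0x3F8))                           # ('A', 127, 0)
print(locate_port(0x8000))                          # ('B', 0, 0)
print(port_causes_exit(bitmap_a, bitmap_b, 0x3F8))  # True
```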

When there is a "VM exit" we reach vmx_vmexit_handler(struct cpu_user_regs regs) in vmx.c. We handle the VM exit according to the exit reason, which we read from the VMCS region. We read the VMCS by calling vmread(); the return value of vmread is 0 in case of success.

We sometimes also need to read some additional data (VM_EXIT_INTR_INFO) from the vmcs.

We get additional data by getting the "VM-exit interruption information" which is a 32 bit field and the "Exit qualification" (64 bit value).

For example, if the exception was NMI, we check if it is valid by checking bit 31 (valid bit) of the VM-exit interruption field. In case it is not valid we call _hvm_bug() to print some statistics and crash the domain.
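Decoding the VM-exit interruption information can be sketched as follows. Only the valid bit (bit 31) comes from the text above; the positions of the vector (bits 7:0) and the interruption type (bits 10:8) are assumptions based on the layout in Intel's manuals, and the constant and function names are hypothetical.

```python
INTR_INFO_VALID = 1 << 31     # bit 31: valid bit (from the text)
INTR_INFO_VECTOR_MASK = 0xFF  # bits 7:0: vector (assumed layout)
INTR_INFO_TYPE_SHIFT = 8      # bits 10:8: interruption type (assumed layout)
INTR_INFO_TYPE_MASK = 0x7


def decode_intr_info(info):
    """Split the 32-bit field into (valid, vector, type)."""
    valid = bool(info & INTR_INFO_VALID)
    vector = info & INTR_INFO_VECTOR_MASK
    intr_type = (info >> INTR_INFO_TYPE_SHIFT) & INTR_INFO_TYPE_MASK
    return valid, vector, intr_type


# A valid entry for vector 14 (page fault) with type 3.
print(decode_intr_info(0x80000B0E))  # (True, 14, 3)
# The same value with the valid bit cleared.
print(decode_intr_info(0x00000B0E))  # (False, 14, 3)
```

A handler would check the first element of the tuple before trusting the other two, which is the validity check described above.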

An example of reading the "Exit qualification" field is the case where the VM exit was caused by issuing the INVLPG instruction.

When we work with VT-x, the guest OSs work in shadow mode, meaning they use shadow page tables; this is because the guest kernel in a VMX guest does not know that it is being virtualized. There is no software-visible bit which indicates that the processor is in VMX non-root operation. We set shadow mode by calling shadow_mode_enable() in the vmx_final_setup_guest() method (file vmx.c).

There are 43 basic exit reasons - you can see part of them in vmx.h (fields starting with EXIT_REASON_ like EXIT_REASON_EXCEPTION_NMI, which is exit reason number 0, and so on).

In VT-x, Xen will probably use an emulated devices layer which will send virtual interrupts to the VMM. We can prevent the OS from receiving interrupts by clearing the IF flag of EFLAGS.

The ten new opcodes which Intel added in VT-x are detailed below:

1) VMCALL (VMCALL_OPCODE in vmx.h)

  • simply calls the VM monitor, causing a VM exit.

2) VMCLEAR (VMCLEAR_OPCODE in vmx.h)

  • copies the VMCS data to memory in case it is not already written there.
  • wrapper: _vmpclear(u64 addr) (file vmx.h)

3) VMLAUNCH (VMLAUNCH_OPCODE in vmx.h)

  • launches a virtual machine; changes the launch state of the VMCS to "launched" (if it is "clear").

4) VMPTRLD (VMPTRLD_OPCODE in vmx.h)

  • loads a pointer to the VMCS.
  • wrapper: _vmptrld(u64 addr) (file vmx.h)

5) VMPTRST (VMPTRST_OPCODE in vmx.h)

  • stores a pointer to the VMCS.
  • wrapper: _vmptrst(u64 addr) (file vmx.h)

6) VMREAD (VMREAD_OPCODE in vmx.h)

  • reads a specified field from the VMCS.
  • wrapper: _vmread(x, ptr) (file vmx.h)

7) VMRESUME (VMRESUME_OPCODE in vmx.h)

  • resumes a virtual machine; in order to resume the VM, the launch state of the VMCS should be "launched".

8) VMWRITE (VMWRITE_OPCODE in vmx.h)

  • writes a specified field in the VMCS.
  • wrapper: _vmwrite(field, value)

9) VMXOFF (VMXOFF_OPCODE in vmx.h)

  • terminates VMX operation.
  • wrapper: _vmxoff(void) (file vmx.h)

10) VMXON (VMXON_OPCODE in vmx.h)

  • starts VMX operation.
  • wrapper: _vmxon(u64 addr) (file vmx.h)

QEMU and VT-d

I/O in VT-x is performed by using QEMU. The QEMU code which Xen uses is under tools/ioemu. It is based on version 0.6.1 of QEMU, patched according to Xen's needs. AMD SVM also uses QEMU emulation.

The default network card which QEMU uses in VT-x is the AMD PCnet-PCI II Ethernet Controller (file tools/ioemu/hw/pcnet.c). The reason to prefer this NIC emulation over the other alternative, ne2000, is that pcnet uses DMA whereas ne2000 does not.

There is of course a performance cost for using QEMU, so QEMU may be replaced in the future with different solutions which have lower performance costs.

Intel announced its VT-d technology (Intel Virtualization Technology for Directed I/O) in March 2006. This technology makes it possible to assign devices to virtual machines. It also enables DMA remapping, which can be configured for each device. There is a cache called the IOTLB which improves performance.

Vmxloader

There are some restrictions on VMX operation. Guest OSes in VMX cannot operate in Real Mode. If bit PE (Protection Enable) of CR0 is 0 or bit PG (Paging) of CR0 is 0, then trying to start VMX operation (the VMXON instruction) fails. If you try to clear these bits after entering VMX operation, you get an exception (General Protection Exception). A Linux loader, however, starts in real mode. As a result, a vmxloader was written for VMX images (file tools/firmware/vmxassist/vmxloader.c).

(In order to build vmxloader you must have dev86 package installed; dev86 is a real mode 80x86 assembler and linker).

After installing Xen, vmxloader is under /usr/lib/xen/boot. In order to use it, you should specify kernel = "/usr/lib/xen/boot/vmxloader" in the config file (which is an input to your "xm create" command.)

The vmxloader loads ROMBIOS at 0xF0000, then VGABIOS at 0xC0000, and then VMXAssist at D000:0000.
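The address D000:0000 is given in real-mode segment:offset form, while the other two are linear addresses. As a small worked example (the helper name `linear` and the `layout` list are only for illustration), the standard real-mode translation puts all three on the same scale:

```python
def linear(segment, offset):
    """Real-mode address translation: segment * 16 + offset."""
    return (segment << 4) + offset


layout = [
    ("ROMBIOS", 0xF0000),
    ("VGABIOS", 0xC0000),
    ("VMXAssist", linear(0xD000, 0x0000)),
]

# Print the three images in ascending address order.
for name, addr in sorted(layout, key=lambda item: item[1]):
    print("%-10s 0x%05X" % (name, addr))
```

So VMXAssist lands at linear address 0xD0000, between the VGABIOS and the ROMBIOS.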

What is VMXAssist? The VMXAssist is an emulator for real mode which uses the Virtual-8086 mode of IA32. After setting Virtual-8086 mode, it executes in a 16-bit environment.

There are certain instructions which are not recognized in virtual-8086 mode, for example LIDT (Load Interrupt Descriptor Table Register) or LGDT (Load Global Descriptor Table Register).

These instructions cause #GP(0) when we try to run them in virtual-8086 mode.

So VMXAssist checks the opcode of the instructions which are being executed, and handles them so that they do not cause a General Protection Exception (as would have happened without its intervention).

VT-i (Virtualization Technology) processors - support in Xen

(Note: the files mentioned in this section are from the unstable Xen version.)

In the VT-i extension for IA-64 processors, Intel added a bit to the PSR (Processor Status Register). This bit is bit 46 of the PSR and is called PSR.vm. When this bit is set, some instructions cause a fault.

A new instruction called vmsw (Virtual Machine Switch) was added. This instruction sets the PSR.vm to 1 or 0. This instruction can be used to cause transition to or from a VM without causing an interruption.

Also, a descriptor named VPD was added; this descriptor represents the resources of a virtual processor. Its size is 64 K. (It must be 32 aligned.)

A VPD stands for "Virtual Processor Descriptor". A structure named vpd_t represents the VPD descriptor (file include/public/arch-ia64.h).

Two vectors were added to the ivt: One is the External Interrupt vector (0x3400) and the other is the Virtualization vector (0x6100).

The virtualization vector handler is called when an instruction which needs virtualization is executed. This handler cannot be raised by IA-32 instructions.

Also, nine PAL services were added. PAL stands for Processor Abstraction Layer.

The services that were added are: PAL_VPS_RESUME_NORMAL, PAL_VPS_RESUME_HANDLER, PAL_VPS_SYNC_READ, PAL_VPS_SYNC_WRITE, PAL_VPS_SET_PENDING_INTERRUPT, PAL_VPS_THASH, PAL_VPS_TTAG, PAL_VPS_RESTORE and PAL_VPS_SAVE (file include/asm-ia64/vmx_pal_vsa.h).

AMD SVM

AMD will hopefully release PACIFICA processors with virtualization support in Q2 2006 (probably in June 2006). The IOMMU virtualization support is to be out in 2007.

The xen-unstable tree now includes both Intel VT and SVM support, using a common API which is called HVM.

The inclusion of HVM in the unstable tree is since changeset 8708 from 31/1/06, which is a "Big merge the HVM full-virtualisation abstractions."

You can download the code by: hg clone http://xenbits.xenproject.org/xen-unstable.hg

The code for AMD SVM is mostly under xen/arch/x86/hvm/svm.

The code is developed by AMD team: Tom Woller, Mats Petersson, Travis Betak, Nagib Gulam, Leo Duran, Rosilmildo Dasilva and Wei Huang.

SVM stands for "Secure Virtual Machine".

One major difference between VT-x and AMD SVM is that the AMD SVM virtualization extensions include a tagged TLB (whereas the Intel virtualization extensions for IA-32 do not). The benefit of a tagged TLB is a significant reduction in the number of TLB flushes; this is achieved by using an ASID (Address Space Identifier) in the TLB. Tagged TLBs are common in RISC processors.

In AMD SVM, the most important struct (which is parallel to the VT-x vmcs_struct) is the vmcb_struct. (file xen/include/asm-x86/hvm/svm/vmcb.h). VMCB stands for Virtual Machine Control Block.

AMD added the following eight instructions to the SVM processor:

  • VMLOAD loads the processor state from the VMCB.
  • VMMCALL enables the guest to communicate with the VMM.
  • VMRUN starts the operation of a guest OS.
  • VMSAVE stores the processor state to the VMCB.
  • CLGI clears the global interrupt flag (GIF).
  • STGI sets the global interrupt flag (GIF).
  • INVLPGA invalidates the TLB mapping of a specified virtual page and a specified ASID.
  • SKINIT reinitializes the CPU.

To issue these instructions, SVM must be enabled. Enabling SVM is done by setting bit 12 of the EFER MSR.

In VT-x, the vmx_vmexit_handler() method handles VM exits. In AMD SVM, the svm_vmexit_handler() method is the one which handles VM exits (file xen/arch/x86/hvm/svm/svm.c). When a VM exit occurs, the processor saves the reason for this exit in the exit_code member of the VMCB. svm_vmexit_handler() handles the VM exit according to this exit code.

Xen On Solaris

On 13 Feb 2006, Sun released the Xen sources for Solaris x86. See: http://opensolaris.org/os/community/xen/opening-day.

This version currently supports 32 bit only; it enables OpenSolaris to be a guest OS where dom0 is a modified Linux kernel running Xen. This version is also currently only for x86 (porting to the SPARC processor is much more difficult). The members of the Solaris Xen project are Tim Marsland, John Levon, Mark Johnson, Stu Maybee, Joe Bonasera, Ryan Scott, Dave Edmondson and others. Todd Clayton is leading the 64-bit Solaris Xen project. Many changes were made in order to boot the Solaris Xen guest; you can see more details at http://blogs.sun.com/roller/page/JoeBonasera.

You can download the Xen Solaris sources from : http://dlc.sun.com/osol/xen/downloads/osox-src-12-02-2006.tar.gz

Frontend net virtual device sources are in uts/common/io/xennet/xennetf.c (xennet is the net front virtual driver).

Frontend block virtual device sources are in uts/i86xen/io/xvbd (xvbd is the block front virtual driver).

Currently the front block device does not work. There are many similarities between Xen on Solaris and Xen on Linux.

In Xen on Solaris, hypercalls are also made by calling int 0x82 (see #define TRAP_INSTR int $0x82, file /uts/i86xen/ml/hypersubr.s).

In February 2006, Sun also released the specs for the T1 processor, which supports virtualization; see: http://opensparc.sunsource.net/nonav/opensparct1.html

see: http://www.prnewswire.com/cgi-bin/stories.pl?ACCT=104&STORY=/www/story/02-14-2006/0004281587&EDATE=

Also the UltraSPARC T1 Hypervisor API Specification was released: http://opensparc.sunsource.net/specs/Hypervisor_api-26-v6.pdf

T1 virtualization:

The Hyperprivileged edition of the UltraSPARC Architecture 2005 Specification describes the Nonprivileged, Privileged, and Hyperprivileged (hypervisor/virtual machine firmware) spec.

The virtual processor on Sun supports three privilege modes:

  1) User Mode
  2) Privileged Mode
  3) Hyperprivileged Mode

Two bits determine the privilege mode of the processor: HPSTATE.hpriv and PSTATE.priv.

  • When both are 0, we are in nonprivileged mode.
  • When HPSTATE.hpriv is 0 and PSTATE.priv is 1, we are in privileged mode.
  • When HPSTATE.hpriv is 1, we are in hyperprivileged mode (regardless of the value of PSTATE.priv).

PSTATE is the Processor State register. HPSTATE is the Hyperprivileged State register (64 bit). Each virtual processor has only one instance of the PSTATE and HPSTATE registers. HPSTATE is one of the HPR state registers, and it is also called HPR 0. It can be read by the RDHPR instruction and written by the WRHPR instruction.
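The way the two bits combine can be written as a small decision function. This is a sketch, not code from any Sun source: the function name is hypothetical, and it assumes (following the UltraSPARC Architecture 2005 description) that privileged mode means HPSTATE.hpriv = 0 with PSTATE.priv = 1, since HPSTATE.hpriv = 1 always selects hyperprivileged mode.

```python
def privilege_mode(hpriv, priv):
    """Map the HPSTATE.hpriv and PSTATE.priv bits to a privilege mode."""
    if hpriv:
        # hpriv wins regardless of the value of priv
        return "hyperprivileged"
    return "privileged" if priv else "nonprivileged"


for hpriv in (0, 1):
    for priv in (0, 1):
        print(hpriv, priv, privilege_mode(hpriv, priv))
```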

Step by step example of creating guest OS with Virtual Machine Manager in Fedora Core 6

This section describes a step by step example of creating a guest OS based on FC6 i386 with Virtual Machine Manager on a Fedora Core 6 machine, installing from a web URL:

1) Go to: Application->System Tools->Virtual Machine Manager.
2) Choose: Local Xen Host.
3) Press New and enter a name for the guest.
4) You now reach the "Locating installation media" dialog. In "install media URL", enter the URL of a Fedora Core 6 i386 download, for example "http://download.fedora.redhat.com/pub/fedora/linux/core/6/i386/os/", then press forward.
5) Choose simple file, and give a path to a non-existing file in some existing folder.
6) For the file size, choose 3.5 GB for example; if you assign less space, you will not be able to finish the installation, assuming it is a typical, non-custom installation.
7) Press forward, accept the defaults for memory/cpu, and then press forward again.
8) Press finish.

That's it! When installing from the web like this, it can take 2-4 hours, depending on your bandwidth. You will get to the text mode installation of Fedora Core 6, and will have to enter parameters for the installation.

After the installation is finished, when you want to restart the guest OS, you simply run "xm create /etc/xen/NameOfGuest", where NameOfGuest is of course the name of the guest you chose in the installation.

Physical Interrupts

In Xen, only the hypervisor has access to the hardware, in order to achieve isolation (it is dangerous to share the hardware and let other domains access hardware devices directly and simultaneously).

Let's take a little walkthrough dealing with Xen interrupts:

Interrupts in Xen are handled by using event channels. Each domain can hold up to 1024 events. An event channel can have two flags associated with it: pending and mask. The mask flag can be updated only by guests; the hypervisor cannot update it. These flags are not part of the event channel structure itself (struct evtchn is defined in xen/include/xen/sched.h). Two arrays in struct shared_info contain these flags: evtchn_pending[] and evtchn_mask[]; each holds 32 elements (file xen/include/public/xen.h).

(shared_info is a member of the domain struct; it is the domain's shared data area.)
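The two flag arrays can be modelled as bitmaps: 32 elements of 32 bits each give the 1024 event channels mentioned above. The following is a toy model, not hypervisor code; the class and function names are hypothetical, and the 32-bit element width assumes a 32-bit build.

```python
BITS_PER_WORD = 32  # assumption: a 32-bit build, so each element has 32 bits
NR_WORDS = 32       # each array holds 32 elements (from the text)
NR_EVENT_CHANNELS = BITS_PER_WORD * NR_WORDS


class SharedInfo:
    """Toy model of the two flag arrays in struct shared_info."""

    def __init__(self):
        self.evtchn_pending = [0] * NR_WORDS
        self.evtchn_mask = [0] * NR_WORDS


def _locate(port):
    """Return (array index, bit mask) for an event channel port."""
    if not 0 <= port < NR_EVENT_CHANNELS:
        raise ValueError("no such event channel: %d" % port)
    return port // BITS_PER_WORD, 1 << (port % BITS_PER_WORD)


def mask_port(shared, port):
    """Guest side: mask an event channel."""
    word, bit = _locate(port)
    shared.evtchn_mask[word] |= bit


def set_pending(shared, port):
    """Hypervisor side: mark a port pending.

    Returns True when the guest should be notified (the port is not masked).
    """
    word, bit = _locate(port)
    shared.evtchn_pending[word] |= bit
    return not shared.evtchn_mask[word] & bit


shared = SharedInfo()
mask_port(shared, 40)
print(NR_EVENT_CHANNELS)        # 1024
print(set_pending(shared, 5))   # True
print(set_pending(shared, 40))  # False: pending is set, but the port is masked
```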

TBD: add info about event selectors (evtchn_pending_sel in vcpu_info).

Registration (or binding) of irqs in guest domains:

The guest OS calls init_IRQ() when it boots (the start_kernel() method calls init_IRQ(); file init/main.c).

(init_IRQ() is in file sparse/arch/xen/kernel/evtchn.c)

There can be 256 physical irqs, so there is an array called irq_desc with 256 entries (file sparse/include/linux/irq.h).

All elements in this array are initialized in init_IRQ() so that their status is disabled (IRQ_DISABLED).

Now, when a physical driver starts, it usually calls request_irq().

This method eventually calls setup_irq() (both are in sparse/kernel/irq/manage.c), which calls startup_pirq().

startup_pirq() sends a hypercall to the hypervisor (HYPERVISOR_event_channel_op) in order to bind the physical irq (pirq). The hypercall is of type EVTCHNOP_bind_pirq. See startup_pirq() (file sparse/arch/xen/kernel/evtchn.c).

On the hypervisor side, this hypercall is handled in the evtchn_bind_pirq() method (file /common/event_channel.c), which calls pirq_guest_bind() (file arch/x86/irq.c). pirq_guest_bind() changes the status of the corresponding irq_desc array element to enabled (~IRQ_DISABLED). It also calls the startup() method.

Now, when an interrupt arrives from the controller (the APIC), we arrive at the do_IRQ() method, as in the usual Linux kernel (also in arch/x86/irq.c). The hypervisor handles only timer and serial interrupts. Other interrupts are passed to the domains by calling _do_IRQ_guest(). (In fact, the IRQ_GUEST flag is set for all interrupts except for timer and serial interrupts.) _do_IRQ_guest() sends the interrupt, by calling send_guest_pirq(), to all guests which are registered on this IRQ. send_guest_pirq() creates an event channel (an instance of evtchn) and sets the pending flag of this event channel (by calling evtchn_set_pending()). Then, asynchronously, Xen notifies this domain of the interrupt (unless it is masked).

TBD: shared interrupts; avoiding problems with shared interrupts when using PCI express.

Backend Drivers:

The backend drivers are started from domain 0. We will deal mainly with the network and block drivers. The network backend drivers reside under sparse/drivers/xen/netback, and the block backend drivers reside under sparse/drivers/xen/blkback.

There are many things in common between the netback and blkback device drivers. There are some differences, though. The blkback device driver runs a kernel daemon thread (named xenblkd), whereas the netback device driver does not run any kernel thread.

The netback and blkback register themselves with XenBus by calling xenbus_register_backend().

This method simply calls xenbus_register_driver_common(); both are in sparse/drivers/xen/xenbus/xenbus_probe.c.

(The xenbus_register_driver_common() method calls the generic kernel method for registering drivers, driver_register().)

Both netback (the network backend driver) and blkback (the block backend driver) have a module named xenbus.c. There are drivers which are not split into backend/frontend drivers, for example the balloon driver. The balloon driver calls register_xenstore_notifier() in its initialization (the balloon_init() method). register_xenstore_notifier() uses a generic Linux callback mechanism for passing status changes (notifier_block in include/linux/notifier.h).

The USB driver also has backend and frontend drivers; currently it has no support for the xenbus/xenstore API, so it does not have a module named xenbus.c, but it will probably be adjusted in the future. As of the writing of this document, the USB backend/frontend code has been temporarily removed from the sparse tree.

Each of the backend drivers registers two watches: one for the backend and one for the frontend. The registration of the watches is done in the probe method:

  • In netback it is in netback_probe() method (file netback/xenbus.c).
  • In blkback it is in blkback_probe() method (file blkback/xenbus.c).

A watch is registered by calling the xenbus_watch_path2() method. This method is implemented in sparse/drivers/xen/xenbus/xenbus_client.c. Eventually the watch registration is done by calling register_xenbus_watch(), which is implemented in sparse/drivers/xen/xenbus/xenbus_xs.c.

In both cases, netback and blkback, the callback for the backend watch is called backend_changed, and the callback for the frontend watch is called frontend_changed.

xenbus_watch is a simple struct consisting of 3 elements:

  • A reference to a list of watches (list_head)
  • A pointer to a node (char*)
  • A callback function pointer
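To make the role of the three elements concrete, here is a toy Python model of a watch registry. It is not the kernel implementation: the names `XenbusWatch`, `register_watch` and `fire_watches` are hypothetical, the node path is invented for the example, and the rule that a watch fires for its node and everything beneath it is an assumption of this sketch.

```python
class XenbusWatch:
    """Toy counterpart of struct xenbus_watch: a node path and a callback."""

    def __init__(self, node, callback):
        self.node = node
        self.callback = callback


watches = []  # stands in for the list that list_head links together


def register_watch(watch):
    """Add a watch to the registry."""
    watches.append(watch)


def fire_watches(changed_path):
    """Call every watch whose node is the changed path or one of its parents."""
    fired = 0
    for watch in watches:
        if changed_path == watch.node or changed_path.startswith(watch.node + "/"):
            watch.callback(watch, changed_path)
            fired += 1
    return fired


events = []


def backend_changed(watch, path):
    """Example callback: just record what changed."""
    events.append(("backend_changed", path))


register_watch(XenbusWatch("backend/vif/14/0", backend_changed))
fire_watches("backend/vif/14/0/state")
fire_watches("backend/vif/15/0/state")  # no watch covers this node
print(events)
```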

The xenbus.c in both netback and blkback defines a struct called backend_info. These structs have much in common, with minor differences between them. One difference is that in netback the communications channel is an instance of netif_t, whereas in blkback the communications channel is an instance of blkif_t. In the case of blkback, the struct also includes the major/minor numbers of the device and the mode (these members do not exist in the backend_info struct of netback).

In the case of netback, there is also a XenbusState member. The state machine for XenBus includes seven states: Unknown, Initialising, InitWait (early initialisation was finished, and xenbus is waiting for information from the peer or hotplug scripts), Initialised (waiting for a connection from the peer), Connected, Closing (due to an error or an unplug event) and Closed.
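The seven states can be written down as an enumeration. In this sketch the numeric values are an assumption: they follow the order in which the states are listed, starting from 0. Under that assumption, a XenStore entry such as state = "4" (as in the xsls output shown earlier) reads as Connected.

```python
from enum import IntEnum


class XenbusState(IntEnum):
    """XenBus states, numbered in the order listed (assumed to start at 0)."""
    Unknown = 0
    Initialising = 1
    InitWait = 2
    Initialised = 3
    Connected = 4
    Closing = 5
    Closed = 6


def parse_state(value):
    """Turn the string stored in a XenStore 'state' node into a XenbusState."""
    return XenbusState(int(value))


print(parse_state("4").name)  # Connected
```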

One of the members of this struct (backend_info) is an instance of xenbus_device (xenbus_device is declared in sparse/include/asm-xen/xenbus.h). Its nodename looks like a directory path; for example, dev->nodename in the blkback case may look like:


backend/vbd/94e11324-7eb1-437f-86e6-3db0e145136e/771

and dev->nodename in the netback may look like:


backend/vif/94e11324-7eb1-437f-86e6-3db0e145136e/1

We create an event channel for communication between the two domains by making a bind_interdomain hypercall (HYPERVISOR_event_channel_op).

For networking, this is done in netif_map() in netback/interface.c. For the block device, this is done in blkif_map() in blkback/interface.c.

We use the grant tables to create shared memory between the frontend and backend domains. In the case of network drivers, this is done by calling gnttab_grant_foreign_transfer_ref() (called in network_alloc_rx_buffers(), file netfront.c).

gnttab_grant_foreign_transfer_ref() sets a bit named GTF_accept_transfer in the grant entry.

In the case of block drivers, this is done by calling gnttab_grant_foreign_access_ref() in blkif_queue_request() (file blkfront.c).

gnttab_grant_foreign_access_ref() sets a bit named GTF_permit_access in the grant entry. A grant entry (grant_entry_t) represents a page frame which is shared between domains.

Diagram: Virtual Split Devices

virtualDevices.png

Migration and Live Migration:

Xend must be configured so that migration (which is also termed relocation) will be enabled. In /etc/xen/xend-config.sxp, there is the definition of the relocation-port, so the following line should be uncommented:


(xend-relocation-port 8002)

The line "(xend-address localhost)" restricts xend to connections from the local host and so prevents remote connections; this line must therefore be commented out.

Notice: if this line is left uncommented on the side to which you want to migrate your domain, you will most likely get the following error after issuing the migrate command:


"Error: can't connect: Connection refused"

This error can be traced to the domain_migrate() method in /tools/python/xen/xend/XendDomain.py, which starts a TCP connection on the relocation port (8002 by default):


...
...
def domain_migrate(self, domid, dst, live=False, resource=0):
        """Start domain migration."""
        dominfo = self.domain_lookup(domid)
        port = xroot.get_xend_relocation_port()
        try:
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            sock.connect((dst, port))
        except socket.error, err:
            raise XendError("can't connect: %s" % err[1])
...
...
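The excerpt above is Python 2 code from xend. The same connect-and-report pattern can be tried in isolation with the following Python 3 sketch. It is not the xend implementation: XendError here is a local stand-in class, connect_relocation() is a hypothetical name, and the "connect" parameter exists only so that the failure path can be exercised without a real network.

```python
import socket

XEND_RELOCATION_PORT = 8002  # default relocation port quoted in the text


class XendError(Exception):
    """Local stand-in for xend's error type (hypothetical in this sketch)."""


def connect_relocation(dst, port=XEND_RELOCATION_PORT,
                       connect=socket.create_connection):
    """Open a TCP connection to the relocation port of dst.

    Any socket-level failure is turned into the same style of message
    that xm prints, e.g. "can't connect: Connection refused".
    """
    try:
        return connect((dst, port))
    except OSError as err:
        raise XendError("can't connect: %s" % (err.strerror or err)) from err


if __name__ == "__main__":
    def refuse(address):
        # Simulates a target whose relocation server is not listening.
        raise ConnectionRefusedError(111, "Connection refused")

    try:
        connect_relocation("target-host", connect=refuse)
    except XendError as exc:
        print("Error:", exc)
```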

See more details on the relocation protocol (implemented in relocate.py) below.

The line "(xend-relocation-server yes)" should be uncommented so the migration server will be running.

When we issue a migration command like "xm migrate #numOfDomain ipAddress" (or, for live migration, with the --live flag: "xm migrate #numOfDomain --live ipAddress"), the arguments are validated and then server.xend_domain_migrate() is called (file /tools/python/xen/xm/migrate.py).

We enter the domain_migrate() method of XendDomain, which first performs a domain lookup and then creates a TCP connection to the target machine for the migration; it then sends a "receive" packet (a packet which contains the string "receive") on port 8002. The other side gets this message in the dataReceived() method (see web/connection.py) and delegates it to dataReceived() of RelocationProtocol (file /tools/python/xen/xend/server/relocate.py). Eventually domain_migrate() calls the save() method of XendCheckpoint to start the migration process.

On the receiving side we end up with a call to op_receive(), which ultimately sends back a "ready receive" message (also on port 8002). See the op_receive() method in relocate.py.

The save() method of XendCheckpoint opens a thread which calls the xc_save executable (which is in /usr/lib/xen/bin/xc_save). (file /tools/python/xen/xend/XendCheckpoint.py).

The xc_save executable is built from tools/xcutils/xc_save.c.

The xc_save executable calls xc_linux_save() (file /tools/libxc/xc_linux_save.c), which in fact performs most of the migration process.

xc_linux_save() returns 0 on success and 1 on failure.

Live migration is currently supported for 32-bit and 64-bit architectures. TBD: find out if there is support for live migration on PAE architectures.

Jacob Gorm Hansen is doing interesting work on migration in Xen; see http://www.diku.dk/~jacobg/self-migration.

He wrote code which performs Xen "self-migration", which is a migration done without the involvement of the hypervisor. The migration is done by opening a user space socket and reading from a special file (/dev/checkpoint).

His code works with linux-2.6-based xen-unstable.

You can get it by: hg clone http://www.distlab.dk/hg/index.cgi/xen-gfx.hg

His work was inspired by his own 'NomadBIOS' for L4 microkernel, which also uses this approach of self-migration.

It might be that this new alternative to managed migration will be adopted in future versions of Xen (time will tell).

Migration of operating systems has much in common with migration of single processes. See the Master's thesis "Nomadic Operating Systems" by Jacob G. Hansen and Asger Kahl Henriksen, 2002: http://nomadbios.sunsite.dk/NomadicOS.ps

You can find there a discussion about migration of single processes in Mosix (Barak and La'adan).

Creating a domain - behind the scenes:

We create a domain by:


xm create configFile -c

A simple config file may look like this (using ttylinux, as in the user manual):


kernel = "/boot/vmlinuz-2.6.12-xenU"
memory = 64
name = "ttylinux"
nics = 1
ip = "10.0.0.3"
disk = ['file:/home/work/downloads/tmp/ttylinux-xen,hda3,w']
root = "/dev/hda3 ro"
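The config file above uses Python assignment syntax, so it can be inspected with a few lines of Python. The following is a minimal sketch of that idea only; load_config() is a hypothetical helper written for this example and is not the way xm itself loads or validates its configuration.

```python
CONFIG_TEXT = '''
kernel = "/boot/vmlinuz-2.6.12-xenU"
memory = 64
name = "ttylinux"
nics = 1
ip = "10.0.0.3"
disk = ['file:/home/work/downloads/tmp/ttylinux-xen,hda3,w']
root = "/dev/hda3 ro"
'''


def load_config(text):
    """Evaluate the assignments in a config text and return them as a dict.

    Builtins are removed so that the text can only assign literals;
    this is a precaution of the sketch, not a security boundary.
    """
    namespace = {}
    exec(text, {"__builtins__": {}}, namespace)
    return namespace


if __name__ == "__main__":
    config = load_config(CONFIG_TEXT)
    for key in sorted(config):
        print(key, "=", repr(config[key]))
```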

The create() method of XendDomainInfo handles creation of domains. When a domain is created it is assigned a unique id (uuid), which is created using the uuidgen command line utility of e2fsprogs (file python/xen/xend/uuid.py).

If the memory parameter specifies more memory than the hypervisor can allocate, we end up with the following message: "Error: Error creating domain: The privileged domain did not balloon!"

The devices of a domain are created using the createDevice() method, which delegates the call to the createDevice() method of the device controller, DevController (see XendDomainInfo.py). That createDevice() in turn calls the writeDetails() method (also in DevController), which writes the device details to XenStore to trigger the creation of the device. getDeviceDetails() is an abstract method which each subclass of DevController implements. Writing to the store is done by calling the Write() method of xstransact (file tools/python/xen/xend/xenstore/xstransact.py). createDevice() returns the id of the newly created device.
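The division of labour between createDevice(), getDeviceDetails() and writeDetails() is a template-method pattern, which the following Python sketch illustrates. It is not the xend code: FakeStore replaces XenStore with a dictionary, the store paths and the keys written by BlkifController are invented for the example, and only the three method names come from the text.

```python
class FakeStore(dict):
    """In-memory stand-in for XenStore: maps path -> value."""

    def write(self, path, details):
        for key, value in details.items():
            self["%s/%s" % (path, key)] = value


class DevController:
    """Base class: knows how to create a device, not what its details are."""
    device_class = None

    def __init__(self, store, domid):
        self.store = store
        self.domid = domid

    def getDeviceDetails(self, config):
        """Abstract: return (devid, backend_details, frontend_details)."""
        raise NotImplementedError

    def writeDetails(self, devid, back, front):
        # Hypothetical path layout, for illustration only.
        self.store.write("backend/%s/%d/%d"
                         % (self.device_class, self.domid, devid), back)
        self.store.write("device/%s/%d" % (self.device_class, devid), front)

    def createDevice(self, config):
        devid, back, front = self.getDeviceDetails(config)
        self.writeDetails(devid, back, front)   # this write triggers creation
        return devid


class BlkifController(DevController):
    device_class = "vbd"

    def getDeviceDetails(self, config):
        devid = config["devid"]
        back = {"params": config["uname"], "mode": config["mode"]}
        front = {"virtual-device": str(devid)}
        return devid, back, front


if __name__ == "__main__":
    store = FakeStore()
    ctrl = BlkifController(store, domid=1)
    print(ctrl.createDevice({"devid": 771, "uname": "file:/tmp/img", "mode": "w"}))
    print(sorted(store))
```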

By using a transaction you can batch together several actions to perform against xenstored (the common use is a group of read actions). You can also create a domain without Xend and without the Python bindings; Jacob Gorm Hansen demonstrated this in two little programs (businit.c and buscrate.c) (see http://lists.xenproject.org/archives/html/xen-devel/2005-10/msg00432.html).

However, as of now these programs need to be adjusted because there were some API changes; in particular, creation of an interdomain event channel is now done by sending an ioctl to event_channel (IOCTL_EVTCHN_BIND_INTERDOMAIN).

HyperCalls Mapping to code Xen 3.0.2

Mapping of HyperCalls to code :

Following is the location of all hypercalls. The hypercalls appear according to their order in xen.h.

The hypercall table itself is in xen/arch/x86/x86_32/entry.S (ENTRY(hypercall_table)).

HYPERVISOR_set_trap_table => do_set_trap_table() (file xen/arch/x86/traps.c)

HYPERVISOR_mmu_update => do_mmu_update() (file xen/arch/x86/mm.c)

HYPERVISOR_set_gdt => do_set_gdt() (file xen/arch/x86/mm.c)

HYPERVISOR_stack_switch => do_stack_switch() (file xen/arch/x86/x86_32/mm.c)

HYPERVISOR_set_callbacks => do_set_callbacks() (file xen/arch/x86/x86_32/traps.c)

HYPERVISOR_fpu_taskswitch => do_fpu_taskswitch() (file xen/arch/x86/traps.c)

HYPERVISOR_sched_op_compat => do_sched_op_compat() (file xen/common/schedule.c)

HYPERVISOR_dom0_op => do_dom0_op() (file xen/common/dom0_ops.c)

HYPERVISOR_set_debugreg => do_set_debugreg() (file xen/arch/x86/traps.c)

HYPERVISOR_get_debugreg => do_get_debugreg() (file xen/arch/x86/traps.c)

HYPERVISOR_update_descriptor => do_update_descriptor() (file xen/arch/x86/mm.c)

HYPERVISOR_memory_op => do_memory_op() (file xen/common/memory.c)

HYPERVISOR_multicall => do_multicall() (file xen/common/multicall.c)

HYPERVISOR_update_va_mapping => do_update_va_mapping() (file xen/arch/x86/mm.c)

HYPERVISOR_set_timer_op => do_set_timer_op() (file xen/common/schedule.c)

HYPERVISOR_event_channel_op => do_event_channel_op() (file xen/common/event_channel.c)

HYPERVISOR_xen_version => do_xen_version() (file xen/common/kernel.c)

HYPERVISOR_console_io => do_console_io() (file xen/drivers/char/console.c)

HYPERVISOR_physdev_op => do_physdev_op() (file xen/arch/x86/physdev.c)

HYPERVISOR_grant_table_op => do_grant_table_op() (file xen/common/grant_table.c)

HYPERVISOR_vm_assist => do_vm_assist() (file xen/common/kernel.c)

HYPERVISOR_update_va_mapping_otherdomain => do_update_va_mapping_otherdomain() (file xen/arch/x86/mm.c)

HYPERVISOR_iret => do_iret() (file xen/arch/x86/x86_32/traps.c) /* x86/32 only */

HYPERVISOR_vcpu_op => do_vcpu_op() (file xen/common/domain.c)

HYPERVISOR_set_segment_base => do_set_segment_base() (file xen/arch/x86/x86_64/mm.c) /* x86/64 only */

HYPERVISOR_mmuext_op => do_mmuext_op() (file xen/arch/x86/mm.c)

HYPERVISOR_acm_op => do_acm_op() (file xen/common/acm_ops.c)

HYPERVISOR_nmi_op => do_nmi_op() (file xen/common/kernel.c)

HYPERVISOR_sched_op => do_sched_op() (file xen/common/schedule.c)

(Note: sometimes hypercalls are also called hcalls.)
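The mapping above is, in effect, a dispatch table indexed by hypercall number. The Python sketch below illustrates that idea only; the real table is the assembly table in entry.S. The three handlers are empty placeholders, the numbers 0 to 2 are assumed to match the order of the first three entries in xen.h, and returning -ENOSYS for an unknown number is an assumption of this sketch.

```python
import errno


# Placeholder handlers; the real ones live in the files listed above.
def do_set_trap_table(traps):
    return 0


def do_mmu_update(requests, count):
    return 0


def do_set_gdt(frame_list, entries):
    return 0


# Hypercall number -> handler. Numbers are assumed for this sketch.
HYPERCALL_TABLE = {
    0: do_set_trap_table,
    1: do_mmu_update,
    2: do_set_gdt,
}


def hypercall(nr, *args):
    """Look up the handler for hypercall number nr and invoke it."""
    handler = HYPERCALL_TABLE.get(nr)
    if handler is None:
        return -errno.ENOSYS      # not implemented
    return handler(*args)


if __name__ == "__main__":
    print(hypercall(1, [], 0))
    print(hypercall(999))
```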

Virtualization and the Linux Kernel

Virtualization in a computer context can be thought of as extending the abilities of a computer beyond what a straight, non-virtual implementation allows.

In this category we can also include virtual memory, which allows a process on a 32-bit machine to access a 4GB virtual address space even though the physical RAM is usually much smaller.

We can also think of the Linux IP Virtual Server (which is now a part of the linux kernel) as a kind of virtualization. By using the Linux IP Virtual Server you can configure a router to redirect service requests from a virtual server address to other machines (called real servers).

The IP Virtual Server is part of the kernel starting with 2.6.10 (for the 2.4.* kernels it is also available as a patch; the code for 2.6.10 and above kernels is under net/ipv4/ipvs in the kernel tree; there is still no implementation for IPv6).

The Linux Virtual Server (LVS) project was started quite some time ago, in 1998; see http://www.linuxvirtualserver.org.

The idea of virtualization in the sense of running more than one operating system on a single platform is not new and has been researched for many years. However, it seems that the Xen project is the first to produce performance benchmark metrics for such a feature which make this idea more practical and more attractive.

Origins of the Xen project: Xen was originally built as part of the XenoServer project; see http://www.cl.cam.ac.uk/Research/SRG/netos/xeno.

The Arsenic project also has some ideas which were used in Xen (see http://www.cl.cam.ac.uk/Research/SRG/netos/arsenic).

In the Arsenic project, written by Ian Pratt and Keir Fraser, a big part of the Linux kernel TCP/IP stack was ported to user space. The Arsenic project is based on Linux 2.3.29. After a short look at the Arsenic project code you can find some data structures which are reminiscent of parallel data structures in Xen, like the event rings (for example, the ring_control_block struct in arsenic-1.0/acenic-fw-12.3.10/nic/common/nic_api.h).

Meiosys is a French company which was purchased by IBM. It deals with a different type of virtualization: application virtualization.

See http://www.virtual-strategy.com/article/articleview/680/1/2/ and http://www.infoworld.com/article/05/06/23/HNibmbuy_1.html

In the context of the Meiosys project, it is worth mentioning that a patch was recently sent to the Linux Kernel Mailing List by Serge E. Hallyn (IBM): see http://lwn.net/Articles/160015

This patch deals with process IDs (the pid should stay the same after the application is restarted in Meiosys).

Another article on PID virtualization is "PID virtualization: a wealth of choices": http://lwn.net/Articles/171017/?format=printable This article deals with PID virtualization in the context of a different project (OpenVZ).

There is also the coLinux open source project (see http://colinux.sourceforge.net for more details) and the OpenVZ project, which is based on Virtuozzo™ (Virtuozzo is a commercial solution).

OpenVZ offers a Linux-based server virtualization solution: see http://openvz.org.

There are other projects which have probably inspired work on virtualization; to name a few:

The Denali project (uses paravirtualization): http://denali.cs.washington.edu

A paper: Denali: Lightweight Virtual Machines for Distributed and Networked Applications By Andrew Whitaker et al. http://denali.cs.washington.edu/pubs/distpubs/papers/denali_usenix2002.pdf

Nemesis Operating System. http://www.cl.cam.ac.uk/Research/SRG/netos/old-projects/nemesis/index.html

Exokernel: see "Application Performance and Flexibility on Exokernel Systems" by M. Frans Kaashoek et al http://www.cl.cam.ac.uk/~smh22/docs/kaashoek97application.ps.gz

TBD: more details.

Pre-Virtualization

Another interesting virtualization technique is Pre-Virtualization; in this method, we rewrite sensitive instructions in the assembler files (whether generated by the compiler, as is the usual case, or written manually). There is a problem with this method, because some instructions are sensitive only when they are executed in a certain context. A solution for this is to generate profiling data for a guest OS and then recompile the OS using the profiling data.

See:

http://l4ka.org/projects/virtualization/afterburn/

and an article: Pre-Virtualization: Slashing the Cost of Virtualization Joshua LeVasseur, Volkmar Uhlig, Matthew Chapman et al. http://l4ka.org/publications/2005/previrtualization-techreport.pdf

This technique is based on a paper by Hideki Eiraku and Yasushi Shinjo, "Running BSD Kernels as User Processes by Partial Emulation and Rewriting of Machine Instructions" http://www.usenix.org/publications/library/proceedings/bsdcon03/tech/eiraku/eiraku_html/index.html
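The basic idea of rewriting sensitive instructions at the assembler level can be shown with a toy text filter in Python. This is only an illustration of the concept and not how the Afterburn tools work: the choice of cli and sti as the sensitive instructions, and the replacement routine names vmi_disable_interrupts and vmi_enable_interrupts, are hypothetical and picked for this example.

```python
import re

# Sensitive instruction -> replacement (hypothetical routine names).
REWRITES = {
    "cli": "call vmi_disable_interrupts",
    "sti": "call vmi_enable_interrupts",
}

# Matches a line holding a single operand-less mnemonic, keeping its indent.
BARE_MNEMONIC = re.compile(r"^(\s*)(\w+)\s*$")


def rewrite_sensitive(asm_text):
    """Replace bare sensitive instructions in an assembler listing."""
    out = []
    for line in asm_text.splitlines():
        match = BARE_MNEMONIC.match(line)
        if match and match.group(2) in REWRITES:
            out.append(match.group(1) + REWRITES[match.group(2)])
        else:
            out.append(line)          # everything else is left untouched
    return "\n".join(out)


if __name__ == "__main__":
    print(rewrite_sensitive("\tcli\n\tmovl %eax, %ebx\n\tsti"))
```

Note that a purely textual pass like this one cannot tell when an instruction is sensitive only in a certain context, which is exactly the problem that the profiling step described above is meant to address.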

Xen Storage

You can use iSCSI for Xen storage. The xen-tools package of openSUSE has an example of using iSCSI, called xmexample.iscsi. The disk entry for iSCSI in the configuration file may look like: disk = [ 'iscsi:iqn.2006-09.de.suse@0ac47ee2-216e-452a-a341-a12624cd0225,hda,w' ]
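A disk entry such as the one above packs three comma-separated fields into one string, and the first field carries a prefix naming the storage type. The Python sketch below splits such a string; parse_disk_spec() and the key names it returns (proto, target, dev, mode) are hypothetical and chosen for this example, not taken from xend.

```python
def parse_disk_spec(spec):
    """Split '<proto>:<target>,<dev>,<mode>' into a dictionary.

    rsplit is used so that commas inside the target, if any, are kept.
    """
    uname, dev, mode = spec.rsplit(",", 2)
    proto, _, target = uname.partition(":")
    return {"proto": proto, "target": target, "dev": dev, "mode": mode}


if __name__ == "__main__":
    print(parse_disk_spec(
        "iscsi:iqn.2006-09.de.suse@0ac47ee2-216e-452a-a341-a12624cd0225,hda,w"))
    print(parse_disk_spec(
        "file:/home/work/downloads/tmp/ttylinux-xen,hda3,w"))
```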

TBD: more on iSCSI in Xen.

Solutions for using CoW in Xen: blktap (part of the xen project).

UnionFS: a stackable filesystem (used also in Knoppix Live-CD and other Live-CDs)

dm-userspace (A tool which uses device-mapper and a daemon called cowd; written by Dan Smith) You may download dm-userspace by:

To build as a module out-of-tree, copy dm-userspace.h to: /lib/modules/`uname -r`/build/include/linux and then run "make".

Home of dm-userspace:

Copy-on-write NFS server: see http://www.russross.com/CoWNFS.html

kvm - Kernel-based Virtualization Driver

KVM is an open source virtualization project, written by Avi Kivity and Yaniv Kamay from Qumranet. See: http://kvm.sourceforge.net.

It has been included in the Linux kernel tree since 2.6.20-rc1; see: http://lkml.org/lkml/2006/12/13/361 ("kvm driver for all those crazy virtualization people to play with")

Currently it deals with Intel processors with the virtualization extensions (VT-x) and with AMD SVM processors. You can find out whether your processor has these extensions by issuing from the command line: "egrep '^flags.*(vmx|svm)' /proc/cpuinfo"
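The same check can be done from Python. The sketch below looks for the vmx and svm flags in cpuinfo-style text; virt_extensions() is a hypothetical helper written for this example. Unlike the egrep pattern, which matches the letters anywhere on the flags line, it compares whole flag words.

```python
import os


def virt_extensions(cpuinfo_text):
    """Return the subset of {'vmx', 'svm'} found on the 'flags' lines."""
    found = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # Everything after the first ':' is a space-separated flag list.
            flags = line.partition(":")[2].split()
            found.update(flag for flag in flags if flag in ("vmx", "svm"))
    return found


if __name__ == "__main__":
    path = "/proc/cpuinfo"
    if os.path.exists(path):
        with open(path) as cpuinfo:
            print(sorted(virt_extensions(cpuinfo.read())) or "no vmx/svm flag")
    else:
        print(path, "is not available on this system")
```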

kvm.ko is a kernel module which handles userspace requests through ioctls. It works with a character device (/dev/kvm). The userspace part is built from a patched qemu.

One of the advantages of KVM is that it uses Linux kernel mechanisms as they are, without change (such as the Linux scheduler). The Xen project, by contrast, made many changes to parts of the kernel to enable para-virtualization. Another advantage is the simplicity of the project: there is a kernel part and a userspace part. A further advantage of KVM is that future versions of the Linux kernel will not entail changes in the kvm module code (and of course not in the user space part). The project currently supports SMP hosts and will support SMP guests in the future.

Currently there is no support for live migration in KVM (but there is support for ordinary migration, in which the migrated OS is stopped, then transferred to the target, and then resumed).

In Intel VT-x, VM exits are handled in the kvm module by the kvm_handle_exit() method in kvm_main.c according to the reason which caused them (which is specified in, and read from, the VMCS). In AMD SVM, exits are handled by handle_exit() in svm.c.

There is an interesting usage of memory slots. There is already an rpm for openSUSE by Gerd Hoffman.

Tip: How to build Xen with your own tar ball

If you want to run "make world" without downloading the kernel (because you want to use your own tar ball, which is a bit different from the original one because you made a few changes inside the kernel), then do the following:

1) Let's say that the kernel tar ball is named my_linux-2.6.18.tar.bz2. First, move my_linux-2.6.18.tar.bz2 to the folder from which you build Xen.

2) Run from bash: XEN_LINUX_SOURCE=tarball make world

That's it; it will use the my_linux-2.6.18.tar.bz2 tar ball that you copied to that folder.

Xen in the Linux Kernel

According to the following thread from xen-devel: http://lists.xenproject.org/archives/html/xen-devel/2005-10/msg00436.html, there is a mercurial repository in which Xen is a subarch of i386 and x86_64 in the Linux kernel, and there is an intention to send the relevant parts to Andrew/Linus for the upcoming 2.6.15 kernel. On 22/3/2006, a patchset of 35 parts was sent to the Linux Kernel Mailing List (LKML) for Xen i386 paravirtualization support in the Linux kernel: see http://www.ussg.iu.edu/hypermail/linux/kernel/0603.2/2313.html

VMI : Virtual Machine Interface

On 13/3/06, a patchset titled "VMI i386 Linux virtualization interface proposal" was sent to the LKML (Linux Kernel Mailing List) by Zachary Amsden and others (see http://lkml.org/lkml/2006/3/13/140). It proposes a common interface which abstracts the specifics of each hypervisor and thus can be used by many hypervisors. According to the vmi_spec.txt of this patchset, when an OS is ported to a paravirtualizable x86 processor, it should access the hypervisor through the VMI layer.

The VMI layer interface:

The VMI is divided into the following 10 types of calls:

CORE INTERFACE CALLS (like VMI_Init)

PROCESSOR STATE CALLS (like VMI_DisableInterrupts, VMI_EnableInterrupts, VMI_GetInterruptMask)

DESCRIPTOR RELATED CALLS (like VMI_SetGDT, VMI_SetIDT)

CPU CONTROL CALLS (like VMI_WRMSR, VMI_RDMSR)

PROCESSOR INFORMATION CALLS (like VMI_CPUID)

STACK/PRIVILEGE TRANSITION CALLS (like VMI_UpdateKernelStack, VMI_IRET)

I/O CALLS (like VMI_INB, VMI_INW, VMI_INL)

APIC CALLS (like VMI_APICWrite, VMI_APICRead)

TIMER CALLS (like VMI_GetWallclockTime)

MMU CALLS (like VMI_SetLinearMapping)

Links

1) Xen Project HomePage http://www.cl.cam.ac.uk/Research/SRG/netos/xen

2) Xen Mailing Lists Page: http://lists.xenproject.org

(don't forget to read the XenUsersNetiquette before posting on the lists)

3) Article: Analysis of the Intel Pentium's Ability to Support a Secure Virtual Machine Monitor http://www.usenix.org/publications/library/proceedings/sec2000/full_papers/robin/robin_html/index.html

4) Xen Summits: 2005: http://www-archive.xenproject.org/xensummit/xensummit_2005.html 2006 fall: http://www-archive.xenproject.org/xensummit/xensummit_fall_2006.html 2006 winter: http://www-archive.xenproject.org/xensummit/xensummit_winter_2006.html 2007 spring: http://www-archive.xenproject.org/xensummit/xensummit_spring_2007.html

5) Intel Virtualization Technology: http://www.intel.com/technology/virtualization/index.htm

6) Article by Ryan Mauer in Linux Journal, in 2 parts:

6-1) Xen Virtualization and Linux Clustering, Part 1 http://www.linuxjournal.com/article/8812

6-2) Xen Virtualization and Linux Clustering, Part 2 http://www.linuxjournal.com/article/8816

Commercial Companies:

7) XenSource: http://www.xenproject.org/

8) Enomalism: http://www.enomalism.com/

9) Thoughtcrime is a brand new company specialising in opensource virtualisation solutions see http://debian.thoughtcrime.co.nz/ubuntu/README.txt

IA64:

10)IA64 Master Thesis HPC Virtualization with Xen on Itanium by Havard K. F. Bjerke http://openlab-mu-internal.web.cern.ch/openlab%2Dmu%2Dinternal/Documents/2_Technical_Documents/Master_thesis/Thesis_HarvardBjerke.pdf

11) vBlades: Optimized Paravirtualization for the Itanium Processor Family Daniel J. Magenheimer and Thomas W. Christian http://www.usenix.org/publications/library/proceedings/vm04/tech/full_papers/magenheimer/magenheimer_html/index.html

12) Now and Xen Feature Story Article Written by Andrew Warfield and Keir Fraser http://www.linux-mag.com/2004-10/xen_01.html

13) Self Migration: http://www.diku.dk/~jacobg/self-migration

14) online magazine: http://www.virtual-strategy.com

15) Denali Project http://denali.cs.washington.edu

general links about virtualization:

16) http://www.virtualization.info

17) http://www.kernelthread.com/publications/virtualization

18) A Survey on Virtualization Technologies Susanta Nanda Tzi-cker Chiueh http://www.ecsl.cs.sunysb.edu/tr/TR179.pdf

AMD:

19) AMD I/O virtualization technology (IOMMU) specification Rev 1.00 - February 03, 2006 http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/34434.pdf

20) AMD64 Architecture Programmer's Manual: Vol 2 System Programming : Revision 3.11 added chapter 15 on virtualization ("Secure Virtual Machine").(december 2005) http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf

21) AMD64 Architecture Programmer's Manual: Vol 3: General-Purpose and System Instructions Revision 3.11 added SVM instructions (december 2005) http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24594.pdf

22) AMD virtualization on the Xen Summit: http://www.xenproject.org/files/xs0106_amd_virtualization.pdf

23) AMD Press Release: SUNNYVALE, CALIF. -- May 23, 2006: availability of AMD processors with virtualization extensions: http://www.amd.com/us-en/Corporate/VirtualPressRoom/0,,51_104_543~108605,00.html

Open Solaris:

24) Open Solaris Xen Forum: http://www.opensolaris.org/jive/category.jspa?categoryID=32

25) Update to opensolaris Xen: adding OpenSolaris-based dom0 capabilities, as well as 32-bit and 64-bit MP guest. 14/07/2006 http://www.opensolaris.org/os/community/xen/announcements/?monthYear=July+2006

26) Open Sparc Hypervisor Spec: http://opensparc.sunsource.net/nonav/Hypervisor_api-26-v6.pdf

27) Open Sparc T1 page: http://opensparc.sunsource.net/nonav/opensparct1.html extension to the Solaris Zones: http://opensolaris.org/os/community/brandz

28) OSDL Development Wiki Homepage (Virtualization) http://www.osdl.org/cgi-bin/osdl_development_wiki.pl?Virtualization

29) fedora xen mailing list archive: https://www.redhat.com/archives/fedora-xen

30) Xen Quick Start for FC4 (Fedora Core 4). http://www.fedoraproject.org/wiki/FedoraXenQuickstart?highlight=%28xen%29

31) Xen Quick Start for FC5 (Fedora Core 5): http://www.fedoraproject.org/wiki/FedoraXenQuickstartFC5

32) Xen Quick Start for FC6 (Fedora Core 6): http://fedoraproject.org/wiki/FedoraXenQuickstartFC6

Fedora 7 quick start: http://fedoraproject.org/wiki/Docs/Fedora7VirtQuickStart Fedora 8 quick start: http://fedoraproject.org/wiki/Docs/Fedora8VirtQuickStart

33) The Xen repository is handled by the mercurial version system. mercurial download: http://www.selenic.com/mercurial/release

34) Measuring CPU Overhead for I/O Processing in the Xen Virtual Machine Monitor Cherkasova Ludmila and Gardner, Rob http://www.hpl.hp.com/techreports/2005/HPL-2005-62.html

35) XenMon: QoS Monitoring and Performance Profiling Tool Gupta Diwaker and Gardner Rob; Cherkasova, Ludmila http://www.hpl.hp.com/techreports/2005/HPL-2005-187.html

36) Potemkin VMM: A virtual machine based on xen-unstable ; used in a honeypot By Michael Vrable, Justin Ma, Jay Chen, David Moore, Erik Vandekieft, Alex C. Snoeren, Geoffrey M. Voelker, and Stefan Savage. http://www.cs.ucsd.edu/~savage/papers/Sosp05.pdf

37) Memory Resource Management in VMware ESX Server http://www.usenix.org/events/osdi02/tech/waldspurger.html

books:

38) Virtualization: From the Desktop to the Enterprise By Erick M. Halter , Chris Wolf Published: May 2005 http://www.apress.com/book/bookDisplay.html?bID=449

39) Virtualization with VMware ESX Server Publisher: Syngress; 2005 by Al Muller, Seburn Wilson, Don Happe, Gary J. Humphrey http://www.syngress.com/catalog/?pid=3315

40) VMware ESX Server: Advanced Technical Design Guide by Ron Oglesby, Scott Herold http://www.amazon.com/exec/obidos/ASIN/0971151067/virtualizatio-20/002-5634453-4543251?creative=327641&camp=14573&link_code=as1

41) PPC: Hollis Blanchard, IBM Linux Technology Center Jimi Xenidis, IBM Research http://wiki.xenproject.org/xenwiki/Xen/PPC

42) http://about-virtualization.com/mambo

43) PID virtualization: a wealth of choices http://lwn.net/Articles/171017/?format=printable

44) The Xen Hypervisor and its IO Subsystem: http://www.mulix.org/lectures/xen-iommu/xen-io.pdf

45) G. J. Popek and R. P. Goldberg, Formal requirements for virtualizable third generation architectures, Commun. ACM, vol. 17, no. 7, pp. 412-421, 1974. http://www.cis.upenn.edu/~cis700-6/04f/papers/popek-goldberg-requirements.pdf

46) "Running multiple operating systems concurrently on an IA32 PC using virtualization techniques" by Kevin Lawton (1999). http://www.floobydust.com/virtualization/lawton_1999.txt

47) Automating Xen Virtual Machine Deployment (talks about integrating SystemImager with Xen and more) by Kris Buytaert http://howto.x-tend.be/AutomatingVirtualMachineDeployment/

48)Virtualizing servers with Xen Evaldo Gardenali VI International Conference of Unix at UNINET http://umeet.uninet.edu/umeet2005/talks/evaldoa/xen.pdf

49)Survey of System Virtualization Techniques Robert Rose March 8, 2004 http://www.robertwrose.com/vita/rose-virtualization.pdf

NetBSD

50)http://www.netbsd.org/Ports/xen/

51) Interview on Xen with NetBSD developer Manuel Bouyer http://ezine.daemonnews.org/200602/xen.html

52) netbsd xen mailing list: http://www.netbsd.org/MailingLists/#port-xen

53) NetBSD/xen Howto http://www.netbsd.org/Ports/xen/howto.html

54) "C" API for Xen (LGPL) By Daniel Veillard and others

http://libvir.org/

55) Fraser Campbell page: http://goxen.org/

56) Another page from Fraser Campbell : http://linuxvirtualization.com/

57) Virtualization blog

58)Hardware emulation with QEMU (article)

59) http://linuxemu.retrofaction.com

60) Linux Virtualization with Xen http://www.linuxdevcenter.com/pub/a/linux/2006/01/26/xen.html

61) The virtues of Xen by Alex Maier http://www.redhat.com/magazine/014dec05/features/xen/

62) Deploying VirtualMachines as Sandboxes for the Grid Sriya Santhanam, Pradheep Elango, Andrea Arpaci Dusseau, Miron Livny http://www.cs.wisc.edu/~pradheep/SandboxingWorlds05.pdf

63) article: Xen and the new processors:

Infiniband:

64) Infiniband (Smart IO) wiki page http://wiki.xenproject.org/xenwiki/XenSmartIO hg repository: http://xenbits.xenproject.org/ext/xen-smartio.hg

65) Novell Infiniband and virtualization, Patrick Mullaney , may 1, 2007: http://www.openfabrics.org/archives/spring2007sonoma/Monday%20April%2030/Novell%20xen-ib-presentation-sonoma.ppt

66) A Case for High Performance Computing with Virtual Machines Wei Huangy, Jiuxing Liuz, Bulent Abaliz et al. http://nowlab.cse.ohio-state.edu/publications/conf-papers/2006/huangwei-ics06.pdf

67) High Performance VMM-Bypass I/O in Virtual Machines Wei Huangy, Jiuxing Liuz, Bulent Abaliz et al. (usenix 06) http://nowlab.cse.ohio-state.edu/publications/conf-papers/2006/usenix06.pdf

68) User Mode Linux , a book By Jeff Dike. Bruce Perens' Open Source Series. Published: Apr 12, 2006; http://www.phptr.com/title/0131865056

69) Xen 3.0.3 features,schedule : http://lists.xenproject.org/archives/html/xen-devel/2006-06/msg00390.html

70) Practical Taint-Based Protection using Demand Emulation Alex Ho, Michael Fetterman, Christopher Clark et al. http://www.cs.kuleuven.ac.be/conference/EuroSys2006/papers/p29-ho.pdf

71) http://stateless.geek.nz/2006/09/11/current-virtualisation-hardware/ Current Virtualisation Hardware by Nicholas Lee

72) RAID: Installing Xen with RAID 1 on a Fedora Core 4 x86 64 SMP machine: http://www.freax.be/wiki/index.php/Installing_Xen_with_RAID_1_on_a_Fedora_Core_4_x86_64_SMP_machine

73) RAID 1 and Xen (dom0) : (On Debian) http://wiki.kartbuilding.net/index.php/RAID_1_and_Xen_(dom0)

74) OpenVZ Virtualization Software Available for Power Processors http://lwn.net/Articles/204275/

75) Kernel-based Virtual Machine patchset (Avi Kivity) adding /dev/kvm which exposes the virtualization capabilities to userspace. http://lwn.net/Articles/205580

76) Intel Technology Journal: Intel Virtualization Technology: articles by Intel Staff (96 pages) http://download.intel.com/technology/itj/2006/v10i3/v10_iss03.pdf

77) kvm site: (Avi Kivity and others) Includes a howto and a white paper, download , faq sections. http://kvm.sourceforge.net/

78) kvm on debian: http://people.debian.org/~baruch/kvm http://baruch.ev-en.org/blog/Debian/kvm-in-debian

79) Linux Virtualization Wiki

80) " New virtualisation system beats Xen to Linux kernel" (about kvm) http://www.techworld.com/opsys/news/index.cfm?newsID=7586&pagtype=all

81) article about kvm: http://linux.inet.hr/finally-user-friendly-virtualization-for-linux.html

82) Virtual Linux: An overview of virtualization methods, architectures, and implementations. An article by M. Tim Jones (author of "GNU/Linux Application Programming", "AI Application Programming", and "BSD Sockets Programming from a Multilanguage Perspective"). http://www-128.ibm.com/developerworks/library/l-linuxvirt/index.html

83) Lguest: The Simple x86 Hypervisor by Rusty Russel (formerly lhype) http://lguest.ozlabs.org/

84) "Infrastructure virtualisation with Xen advisory" - a wiki article: using iSCSI for Xen clustering; shared storage http://docs.solstice.nl/index.php/Infrastructure_virtualisation_with_Xen_advisory

85) Xen with DRBD, GNBD and OCFS2 HOWTO http://xenamo.sourceforge.net/

86) Virtualization with Xen(tm): Including Xenenterprise, Xenserver, and Xenexpress (Paperback, 512 pages) by David E Williams, Syngress Publishing (April 1, 2007), ISBN-10: 1597491675, ISBN-13: 978-1597491679 http://www.amazon.com/Virtualization-Xen-Including-Xenenterprise-Xenexpress/dp/1597491675/ref=pd_bbs_sr_2/103-3574046-3264617?ie=UTF8&s=books&qid=1173899913&sr=8-2

87) Professional XEN Virtualization (Paperback) by William von Hagen (Author)


88) Xen and the Art of Consolidation Tom Eastep Linuxfest NW. April 29, 2007. http://www.shorewall.net/Linuxfest-2007.pdf

89) Optimizing Network Virtualization in Xen

By Willy Zwaenepoel, Alan L. Cox, Aravind Menon, usenix 2006 http://www.usenix.org/events/usenix06/tech/menon/menon_html/paper.html

90) Virtual Machine Checkpointing

Brendan Cully,University of British Columbia with Andrew Warfield, University of Cambridge

http://xcr.cenit.latech.edu/hapcw2006/program/papers/cr-xen-hapcw06-final.pdf

Adding new device and triggering the probe() functions

The following is a simple example which shows how to add a new device and trigger the probe() function of a backend driver using the xenstore-write tool. This is relevant for Xen 3.1.

Currently in Xen, triggering of the probe() method in a backend driver or a frontend driver is done by writing some values to the xenstore into directories on which the xenbus places watches. This writing to the xenstore is currently done from the Python code, and it is wrapped deep inside the xend and/or xm commands. Eventually it is done in the writeDetails method of the DevController class (both blkif and netif use it).

For those who want to be able to trigger the probe() function without diving too deeply into the Python code, this should suffice.

For the purposes of this little tutorial, let's assume that you have built and installed Xen 3.1 from source and have used it to fire up a guest domain at least once. After you've done that, let's say we want to add a new device. We will add a device named "mydevice". Let's begin with the backend. For this purpose, we will add a directory named "deviceback" to linux-2.6-sparse/drivers/xen. This directory will store the backend portion of our driver.

First, create linux-2.6-sparse/drivers/xen/deviceback. Next, add the following four files to that directory: deviceback.c, xenbus.c, common.h and Makefile.

Here is a minimal skeleton implementation of these files:

deviceback.c

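#include <linux/module.h>
#include <linux/init.h>
#include "common.h"

/* Register the backend with xenbus when the driver is initialised. */
static int __init deviceback_init(void)
{
        device_xenbus_init();
        return 0;
}

/* Nothing to tear down in this skeleton. */
static void __exit deviceback_cleanup(void)
{
}

module_init(deviceback_init);
module_exit(deviceback_cleanup);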

xenbus.c

#include <xen/xenbus.h>
#include <linux/module.h>
#include <linux/slab.h>
#include "common.h"
struct backendinfo
{
        struct xenbus_device* dev;
        long int frontend_id;
        struct xenbus_watch backend_watch;
        struct xenbus_watch watch;
        char* frontpath;
};
//////////////////////////////////////////////////////////////////////////////
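static int device_probe(struct xenbus_device* dev,
                        const struct xenbus_device_id* id)
{
        struct backendinfo* be;

        /* kzalloc() returns zeroed memory, so no separate memset() is needed. */
        be = kzalloc(sizeof(*be), GFP_KERNEL);
        if (!be)
                return -ENOMEM;
        be->dev = dev;
        /* Keep the state reachable from the device so remove() can free it. */
        dev->dev.driver_data = be;
        printk("Probe fired!\n");
        return 0;
}
//////////////////////////////////////////////////////////////////////////////
static int device_uevent(struct xenbus_device* xdev,
                          char** envp, int num_envp,
                          char* buffer, int buffer_size)
{
        return 0;
}
static int device_remove(struct xenbus_device* dev)
{
        /* Free the state allocated in device_probe(); kfree(NULL) is a no-op. */
        kfree(dev->dev.driver_data);
        dev->dev.driver_data = NULL;
        return 0;
}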
//////////////////////////////////////////////////////////////////////////////
static struct xenbus_device_id device_ids[] =
{
        { "mydevice" },
        { "" }
};
//////////////////////////////////////////////////////////////////////////////
static struct xenbus_driver deviceback =
{
        .name    = "mydevice",
        .owner   = THIS_MODULE,
        .ids     = device_ids,
        .probe   = device_probe,
        .remove  = device_remove,
        .uevent  = device_uevent,
};
//////////////////////////////////////////////////////////////////////////////
void device_xenbus_init(void)
{
        xenbus_register_backend(&deviceback);
}

common.h

#ifndef COMMON_H
#define COMMON_H
void device_xenbus_init(void);
#endif

Makefile

#Makefile
obj-y += xenbus.o deviceback.o

Next, we should add our new backend device to the Makefile in linux-2.6-xen-sparse/drivers/xen/Makefile. Add the following line to the bottom of that file:


obj-y += deviceback/

This will make sure that it will be included in the build.

Next, we need to add symlinks from linux-2.6-xen-sparse/drivers/xen/deviceback into linux-2.6.18-xen/drivers/xen/deviceback:

  1. Create linux-2.6.18-xen/drivers/xen/deviceback
  2. Change into that directory
  3. Add the symlinks: 'ln -s ../../../../linux-2.6-xen-sparse/./drivers/xen/deviceback/./* .'
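
The three steps above can be wrapped in a small shell helper. This is only a sketch: the function name link_sparse_dir and its arguments are made up for this tutorial, and unlike the relative 'ln -s' command above it creates absolute symlinks so that it works from any directory.

```shell
# link_sparse_dir SPARSE_ROOT KERNEL_ROOT NAME
# Hypothetical helper (not part of Xen): creates KERNEL_ROOT/drivers/xen/NAME
# and symlinks every file from SPARSE_ROOT/drivers/xen/NAME into it.
link_sparse_dir() {
    lsd_src="$1/drivers/xen/$3"
    lsd_dst="$2/drivers/xen/$3"
    [ -d "$lsd_src" ] || { echo "missing $lsd_src" >&2; return 1; }
    mkdir -p "$lsd_dst" || return 1
    # Resolve the source to an absolute path so the links stay valid
    # regardless of where the function is called from.
    lsd_abs=$(cd "$lsd_src" && pwd) || return 1
    for lsd_f in "$lsd_abs"/*; do
        [ -e "$lsd_f" ] || continue   # skip the literal pattern if the dir is empty
        ln -sf "$lsd_f" "$lsd_dst/"
    done
}
```

From the root of the source tree it would be called as: link_sparse_dir linux-2.6-xen-sparse linux-2.6.18-xen deviceback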

Now we should build the new drivers and reboot with the new Xen image. To do this, go back to the root directory of the source tree (the place where you typed 'make world' for your normal build) and run 'make install-kernels'. (Note: This will overwrite the previous Xen kernel!) Finally, reboot.

After the machine boots back up, go ahead and start a guest domain and you should notice that device_probe() does not get executed. (Check /var/log/syslog on dom0 to look for the printk() to show up.)
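
Checking the log by eye gets tedious when you repeat the experiment, so the check can be scripted. The following is a minimal sketch; the function name probe_logged is hypothetical, and the default message is the printk() string from device_probe() above.

```shell
# probe_logged LOGFILE [MESSAGE]
# Hypothetical helper: succeeds if MESSAGE (default: the printk() text of
# device_probe()) appears anywhere in LOGFILE.
probe_logged() {
    # -F matches the message as a fixed string, -q suppresses output.
    grep -qF -- "${2:-Probe fired!}" "$1"
}
```

For example: probe_logged /var/log/syslog && echo "probe has fired". At this point in the tutorial the check is expected to fail.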

How can we trigger the probe() function of our backend driver? We just need to write the correct key/value pairs into the xenstore.

The call to xenbus_register_backend() in xenbus.c causes xenbus to set a watch on /local/domain/0/backend/mydevice in the xenstore. Whenever anything is written under that location in the store, the watch fires and checks for a specific set of key/value pairs that indicate the probe should be fired.

So performing the following 4 calls using xenstore-write will trigger our probe() function. Replace X with the ID of a running guest domain. (Check 'xm list' for this. If you've only started one guest, this number is probably 1.)


xenstore-write /local/domain/X/device/mydevice/0/state 1
xenstore-write /local/domain/0/backend/mydevice/X/0/frontend-id X
xenstore-write /local/domain/0/backend/mydevice/X/0/frontend /local/domain/X/device/mydevice/0
xenstore-write /local/domain/0/backend/mydevice/X/0/state 1
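
To avoid substituting X by hand, the same four commands can be generated from a device name and a domain ID. This is a sketch only: print_backend_probe_cmds is a hypothetical name, and the function just prints the commands (a dry run) instead of touching the xenstore.

```shell
# print_backend_probe_cmds DEVICE DOMID
# Hypothetical helper: prints the four xenstore-write commands shown above,
# in the same order, for the given device name and guest domain ID.
print_backend_probe_cmds() {
    pbp_front="/local/domain/$2/device/$1/0"
    pbp_back="/local/domain/0/backend/$1/$2/0"
    echo "xenstore-write $pbp_front/state 1"
    echo "xenstore-write $pbp_back/frontend-id $2"
    echo "xenstore-write $pbp_back/frontend $pbp_front"
    # The backend state goes last, as in the list above.
    echo "xenstore-write $pbp_back/state 1"
}
```

After reviewing the output, it can be executed on dom0 with: print_backend_probe_cmds mydevice 1 | sh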

You should see the probe message appear in dom0's /var/log/syslog. Behind the scenes, without going too deep, xenbus_register_backend() put a watch on the backend directory of /local/domain/0 in the xenstore. Once frontend, frontend-id, and state are all written to the watched location, the xenbus driver gathers that information, together with the state of the frontend driver (written by the first command), and uses it to set up the appropriate data structures. From there, the probe() function is finally fired.
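
The value 1 written to the two state nodes is XenbusStateInitialising, the first real state in enum xenbus_state from xen/include/public/io/xenbus.h. The sketch below maps the numeric values to their names so that a state read back from the xenstore is easier to interpret. The function name xenbus_state_name is hypothetical, and only the values 0 to 6 are covered; later Xen versions may define more.

```shell
# xenbus_state_name NUMBER
# Hypothetical helper: prints the XenbusState name (without the
# "XenbusState" prefix) for a numeric state value.
xenbus_state_name() {
    case "$1" in
        0) echo Unknown ;;
        1) echo Initialising ;;
        2) echo InitWait ;;
        3) echo Initialised ;;
        4) echo Connected ;;
        5) echo Closing ;;
        6) echo Closed ;;
        *) echo "unrecognised($1)" ;;
    esac
}
```

For example, on dom0: xenbus_state_name "$(xenstore-read /local/domain/0/backend/mydevice/1/0/state)"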

Adding a frontend device

For this purpose, we will add a directory named "devicefront" to linux-2.6-xen-sparse/drivers/xen.

We will create 2 files there: devicefront.c and Makefile.

We will also add directories and symlinks as we did in the deviceback case.

Makefile

obj-y := devicefront.o

devicefront.c

The devicefront.c will be (a minimalist implementation):


// devicefront.c
#include <xen/xenbus.h>
#include <linux/module.h>
#include <linux/list.h>
struct device_info
{
        struct list_head list;
        struct xenbus_device* xbdev;
};
//////////////////////////////////////////////////////////////////////////////
static int devicefront_probe(struct xenbus_device* dev,
                             const struct xenbus_device_id* id)
{
        printk("Frontend Probe Fired!\n");
        return 0;
}
//////////////////////////////////////////////////////////////////////////////
static struct xenbus_device_id devicefront_ids[] =
{
        {"mydevice"},
        {}
};
static struct xenbus_driver devicefront =
{
        .name  = "mydevice",
        .owner = THIS_MODULE,
        .ids   = devicefront_ids,
        .probe = devicefront_probe,
};
static int __init devicefront_init(void)
{
        printk("%s\n",__FUNCTION__);
        /* Propagate the registration result instead of falling off the end. */
        return xenbus_register_frontend(&devicefront);
}
module_init(devicefront_init);

We should also remember to add the following to the Makefile under linux-2.6-xen-sparse/drivers/xen:


obj-y   += devicefront/

Getting the frontend driver to fire is a bit more complicated; the following bash script should help you:


#!/bin/bash
if [ $# != 2 ]
then
        echo "Usage: $0 <device name> <frontend-id>"
else
        # Write backend information into the location the frontend will look
        # for it.
        xenstore-write /local/domain/${2}/device/${1}/0/backend-id 0
        xenstore-write /local/domain/${2}/device/${1}/0/backend \
                       /local/domain/0/backend/${1}/${2}/0
        # Write frontend information into the location the backend will look
        # for it.
        xenstore-write /local/domain/0/backend/${1}/${2}/0/frontend-id ${2}
        xenstore-write /local/domain/0/backend/${1}/${2}/0/frontend \
                       /local/domain/${2}/device/${1}/0
        # Set the permissions on the backend so that the frontend can
        # actually read it.
        xenstore-chmod /local/domain/0/backend/${1}/${2}/0 r
        # Write the states.  Note that the backend state must be written
        # last because it requires a valid frontend state to already be
        # written.
        xenstore-write /local/domain/${2}/device/${1}/0/state 1
        xenstore-write /local/domain/0/backend/${1}/${2}/0/state 1
fi

Here's how to use it:

  • Start up a Xen guest that contains your frontend driver, and be sure dom0 contains the backend driver.
  • Figure out the frontend-id for the guest. This is the ID field when running xm list. Let's say that number is 3.
  • Save the script as probetest.sh and run it like so: ./probetest.sh mydevice 3

That should fire the probe() functions of both the backend and the frontend driver. (You'll have to check /var/log/messages in the guest to verify that the frontend probe was fired.)
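
Looking up the frontend-id in the second step can also be scripted. The sketch below reads the output of xm list on standard input and prints the ID of the named domain. The function name domid_from_xm_list is hypothetical, and it assumes the usual column layout in which the first column is the domain name and the second is the ID, with one header line.

```shell
# domid_from_xm_list NAME   (expects `xm list` output on stdin)
# Hypothetical helper: prints the ID column for the domain called NAME and
# fails if no such domain is listed.
domid_from_xm_list() {
    awk -v name="$1" '
        NR > 1 && $1 == name { print $2; found = 1; exit }  # NR > 1 skips the header
        END { exit !found }
    '
}
```

For example, assuming a guest named guest1: ./probetest.sh mydevice "$(xm list | domid_from_xm_list guest1)"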