Xen PCI Passthrough

PCI passthrough allows you to give control of physical devices to guests: that is, you can use PCI passthrough to assign a PCI device (NIC, disk controller, HBA, USB controller, firewire controller, soundcard, etc) to a virtual machine guest, giving it full and direct access to the PCI device.

This has several potential uses, including the following:

  • Copying is one of the major overheads of virtual networking; passing a network card through directly avoids it completely.
  • Passing through graphics cards to guests allows them full access to the 3D acceleration capabilities. This can be useful not only for desktops, but also to enable Virtual Desktops for high-end CAD users, for example.
  • You can set up a Driver Domain, which can increase both security and reliability of your system.

Overview of passthrough

PCI devices are specified by BDF Notation. You can determine the BDF for the device by running lspci in domain 0.

Domain 0 has responsibility for all devices on the system. Normally, as it discovers PCI devices, it passes those to drivers within the Linux kernel. In order for a device to be accessed by a guest, the device must instead be assigned to a special domain 0 driver. This driver is called xen-pciback in pvops kernels, and called pciback in classic kernels. PV guests access the device via a kernel driver in the guest called xen-pcifront (pcifront in classic xen kernels), which connects to pciback. HVM guests see the device on the emulated PCI bus presented by QEMU.

Guests are allowed to set up DMA for devices, but access to the PCI configuration space must be arbitrated for security reasons. For HVM guests, this is done by qemu. For PV guests, this is done by the pciback driver in dom0.

Normally devices are allowed to do DMA to and from any part of the host's physical memory. This presents two problems. First, it is a potential reliability or security issue: a guest with a buggy driver could accidentally overwrite some of Xen's memory, and a guest controlled by an attacker could read and write the memory of other guests. Secondly, the guest's idea of the memory layout is virtualized, but the device's idea isn't. PV guests can overcome this because they can "look behind" the virtualized memory layout, but HVM guests cannot.

The solution to both of these is called an IOMMU. (The Intel name for the IOMMU functionality is VT-d; this document will use IOMMU to refer to both the AMD and Intel feature.) The IOMMU allows Xen to limit what memory a device is allowed to access. It also allows Xen to give the device the same virtualized memory layout that the guest sees. This solves both the security problem and the memory virtualization problem.

(Note that IOMMU/VT-d support is not the same as HVM support; it is possible to have HVM support without an IOMMU, or vice versa.)

For these reasons, it is highly recommended to use passthrough only on systems that have an IOMMU. On systems without an IOMMU, devices can be passed through to trusted PV guests, but doing so removes the security or stability advantages (though not the performance advantages). Devices cannot be passed through to HVM guests on systems without an IOMMU.

Determining if you have IOMMU / VT-d support is covered in the FAQ below.

Using passthrough

Preparing a device for passthrough

First, determine the BDF of the device you wish to pass through. This is usually done by running lspci in dom0.

Then you need to assign the device to pciback instead of its normal driver in dom0, to make it available to pass through to guests. This can be done statically, at boot time, or dynamically, after the system has booted. Static is less flexible and requires you to reboot your system each time you want to change something. Dynamic is the most flexible and doesn't require a reboot, but requires more steps, especially before Xen 4.2. If xen-pciback is compiled into your kernel, static is the easiest option; if xen-pciback is compiled as a module, it's the hardest option.

The options are listed below in order of ease of use.

Static assignment for built-in xen-pciback (when xen-pciback is compiled into the kernel and NOT loaded as a module)

If you have pciback built into your kernel (i.e., not built as a module), pass the BDFs in the "hide" option to the xen-pciback module on the dom0 kernel command-line. For instance, if you wanted to pass through the devices at BDFs 08:00.0 and 08:00.1, you would add the following to the dom0 linux kernel command line:

xen-pciback.hide=(08:00.0)(08:00.1)

(If you're using a classic Xen kernel, use "pciback.hide=..." instead.) This will hide the devices from the normal dom0 drivers and assign them to pciback at boot.
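
As a minimal sketch, the hide option can be assembled from a list of BDFs with a small shell function. The name hide_arg is hypothetical and not part of Xen; the first argument is the driver name (xen-pciback, or pciback for classic Xen kernels).

```shell
# hide_arg DRIVER BDF...  (hypothetical helper, not part of Xen)
# Prints a "<driver>.hide=(bdf)(bdf)..." option for the dom0 kernel command line.
hide_arg() {
    driver=$1
    shift
    arg="$driver.hide="
    for bdf in "$@"; do
        arg="$arg($bdf)"      # each BDF is wrapped in its own parentheses
    done
    printf '%s\n' "$arg"
}

# Example: hide_arg xen-pciback 08:00.0 08:00.1
# prints:  xen-pciback.hide=(08:00.0)(08:00.1)
```

The output still has to be added to the dom0 kernel command line by hand; the function only builds the string.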

Dynamic assignment with xl (when xen-pciback is loaded as a module and NOT compiled into the kernel)

(This option is only available from Xen 4.2.) Begin by making sure that dom0 has the pciback module loaded:

# modprobe xen-pciback

(Or modprobe pciback for classic Xen kernels.)

Then make a device assignable by using xl pci-assignable-add. For example, if you wanted to make the device at BDF 08:00.0 available for guests, you could type the following:

# xl pci-assignable-add 08:00.0

More information on the "assignable" commands can be found here: Xen_4.2:_xl_and_pci_pass-through.

Dynamic assignment with sysfs (when xen-pciback is loaded as a module and NOT compiled into the kernel)

If you're not using xl, or are using Xen 4.1 or earlier, you can assign the device manually using Linux's sysfs commands.

Begin by making sure that dom0 has the pciback module loaded:

# modprobe xen-pciback

(Or modprobe pciback for classic Xen kernels.)

The general steps are:

  • Unbind from the old driver
  • Create a new slot in pciback for the device
  • Bind to pciback

Below are the commands to do this for BDF 08:00.0. Note that for sysfs, the domain is also required as part of the BDF. This is almost always 0000.

echo 0000:08:00.0 > /sys/bus/pci/devices/0000:08:00.0/driver/unbind
echo 0000:08:00.0 > /sys/bus/pci/drivers/pciback/new_slot
echo 0000:08:00.0 > /sys/bus/pci/drivers/pciback/bind

Note that the first command (unbind) has the BDF in the path as well as the echo. Also note that the path is "pciback", not "xen-pciback", even for pvops kernels.
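
Because sysfs needs the domain while lspci and the Xen tools usually show the short form, a script may want to normalise BDFs first. The following is a sketch only: full_bdf is a hypothetical helper, and it assumes domain 0000 whenever none is given, which matches the "almost always" case above but not every system.

```shell
# full_bdf BDF  (hypothetical helper)
# Prints the BDF in <domain>:<bus>:<slot>.<function> form.
full_bdf() {
    case $1 in
        *:*:*) printf '%s\n' "$1" ;;        # domain already present
        *)     printf '0000:%s\n' "$1" ;;   # assumption: default to domain 0000
    esac
}

# Example: full_bdf 08:00.0
# prints:  0000:08:00.0
```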

Alternatively, you can use this script (pciback.sh):

#!/bin/bash

if [ $# -eq 0 ]; then
    echo "Require PCI devices in format:  <domain>:<bus>:<slot>.<function>"
    echo "Eg: $(basename "$0") 0000:00:1b.0"
    exit 1
fi

# Load the backend driver: xen-pciback on pvops kernels, pciback on classic Xen kernels.
modprobe xen-pciback 2>/dev/null || modprobe pciback

for pcidev in "$@"; do
    if [ -h /sys/bus/pci/devices/"$pcidev"/driver ]; then
        echo "Unbinding $pcidev from $(basename "$(readlink /sys/bus/pci/devices/"$pcidev"/driver)")"
        echo -n "$pcidev" > /sys/bus/pci/devices/"$pcidev"/driver/unbind
    fi
    echo "Binding $pcidev to pciback"
    # The sysfs directory is named "pciback" even for the xen-pciback module.
    echo -n "$pcidev" > /sys/bus/pci/drivers/pciback/new_slot
    echo -n "$pcidev" > /sys/bus/pci/drivers/pciback/bind
done
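
To check the result of the bind, the driver symlink in sysfs can be read back. The sketch below uses a hypothetical helper, bound_driver. The SYSFS_ROOT variable is also an assumption, introduced only so the function can be pointed at a test directory; it defaults to /sys.

```shell
# bound_driver BDF  (hypothetical helper)
# Prints the name of the driver the device is bound to, or "none".
bound_driver() {
    link="${SYSFS_ROOT:-/sys}/bus/pci/devices/$1/driver"
    if [ -h "$link" ]; then
        # The symlink target ends in the driver's directory name.
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}
```

After a successful bind, running bound_driver 0000:08:00.0 in dom0 would be expected to print pciback.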

Static assignment for xen-pciback module (when xen-pciback is loaded as a module and NOT compiled into the kernel)

The basic parameters for static assignment when xen-pciback is built as a module are similar to those of the built-in static assignment above. The problem, however, is that xen-pciback must be loaded before any other module that might try to grab the device. To ensure this, you need to modify /etc/modprobe.conf to introduce a dependency between the normal module containing the driver for the device and pciback.

To do this, first determine the module that you need for the dependency. The easiest way to do that is to use lspci -k, which lists the driver that's currently using it. For instance, if you want to know the driver for the device at BDF 08:00.0:

# lspci -k
...
08:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5715 Gigabit Ethernet (rev a3)
	Subsystem: Broadcom Corporation NetXtreme BCM5715 Gigabit Ethernet
	Kernel driver in use: tg3
	Kernel modules: tg3

So the name of the module is tg3. Now add a line to /etc/modprobe.conf like this:

install tg3 /sbin/modprobe xen-pciback ; /sbin/modprobe --first-time --ignore-install tg3

And finally add the line to tell xen-pciback to grab it:

options xen-pciback hide=(0000:08:00.0)

(For classic Xen kernels, use pciback instead of xen-pciback in both lines.)

Note
This is distro specific. Alpine Linux configures modules in /etc/modules and does not allow script lines there. Preventing the usual driver from loading so that xen-pciback can bind to the device therefore has to be achieved differently (e.g., with an rc script that unloads the module and then loads xen-pciback).

You may also need to refresh your initrd. The next time you boot, the device should be assigned to pciback.
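
The driver lookup from lspci -k described above can be scripted. This is a sketch under the assumption that the output has the layout shown in the example (a header line starting with the BDF, followed by indented detail lines); driver_in_use is a hypothetical helper name.

```shell
# driver_in_use BDF  (hypothetical helper)
# Reads `lspci -k` output on stdin and prints the "Kernel driver in use"
# value for the given BDF. Prints nothing if the device has no driver.
driver_in_use() {
    awk -v bdf="$1" '
        /^[0-9a-f]/ { current = $1 }    # header lines start with the BDF
        current == bdf && /Kernel driver in use:/ { print $NF }
    '
}

# Usage in dom0: lspci -k | driver_in_use 08:00.0
```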

Note
For Debian Wheezy, please see PCI_Passthrough_with_Debian_Wheezy.

Verifying that the device is ready to be passed through

At this point, the device is ready to be assigned to a guest. You can verify this in one of the following ways:

  • When using xm:
# xm pci-list-assignable-devices
08:00.0
  • Using xl for Xen 4.2 or later:
# xl pci-assignable-list
08:00.0
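
In a script, the same check can be done by searching the list output for the BDF. The sketch below reads the list from stdin; is_assignable is a hypothetical helper, and stripping a leading "0000:" is an assumption made in case the tool prints the domain as well.

```shell
# is_assignable BDF  (hypothetical helper)
# Reads the assignable-device list on stdin; succeeds if BDF is in it.
is_assignable() {
    want=${1#0000:}                      # compare without the domain prefix
    sed 's/^0000://' | grep -xF "$want" >/dev/null
}

# Usage in dom0: xl pci-assignable-list | is_assignable 08:00.0 && echo ready
```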

Guest configuration

HVM guests require no special configuration for the guest kernel, as all accesses are emulated and virtualized by the IOMMU hardware.

PV guests need the xen-pcifront module (just 'pcifront' for classic Xen kernels). Additionally you must enable swiotlb on the guest kernel command-line. For pvops kernels, you add the following:

iommu=soft

For classic Xen (xenlinux) kernels, add the following instead:

swiotlb=force

Actually assigning the device to the guest can be done at VM creation time with the config file. Alternatively, if the guest has hotplug capabilities, the devices can be added dynamically as well.

Configuration file for the domU

Suppose you want to pass through the devices at BDFs 08:00.0 and 08:00.1. Add the following line to your configuration file:

pci=['08:00.0','08:00.1']
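
When the device list is generated by a script, the line can be built as follows. This is only a sketch; pci_config_line is a hypothetical helper that quotes each argument and joins them with commas.

```shell
# pci_config_line BDF...  (hypothetical helper)
# Prints a pci=[...] line for a domU configuration file.
pci_config_line() {
    list=
    for bdf in "$@"; do
        list="${list:+$list,}'$bdf'"     # add a comma only between entries
    done
    printf 'pci=[%s]\n' "$list"
}

# Example: pci_config_line 08:00.0 08:00.1
# prints:  pci=['08:00.0','08:00.1']
```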

Hotplug

Commands to hot-plug and unplug a device into a running VM are below:

xl pci-attach <domain-id> <pci device> <guest virtual slot number>
xl pci-detach <domain-id> <pci device> <guest virtual slot number>

(Replace xl with xm if you're using xend.)

PV guests and PCI quirks

As mentioned above, access to the PCI configuration space is arbitrated by pciback for PV guests. Unfortunately, pciback is frequently too strict in what it allows, and you will get an error message like this one:

pciback 0000:08:00.0: Driver tried to write to a read-only configuration space field at offset 0xe0, size 2. This may be harmless, but if you have problems with your device:
   1) see permissive attribute in sysfs
   2) report problems to the xen-devel mailing list along with details of your device obtained from lspci.

The easiest way around this is to enable permissive mode.

Permissive mode for xl

(This is only available for xl on Xen 4.2 or later.)
To enable permissive mode using xl, you can enable it for all devices of a given domain in the /etc/xen/<domain> configuration file, like this:

pci_permissive=1

Or, you can add ",permissive=1" to the BDF of a particular device as it's passed through, either in the config file:

pci=['08:00.0,permissive=1']

or when hot-plugging it:

xl pci-attach 5 '08:00.0,permissive=1'

Permissive mode for xm/xend

xend works a little differently from xl: rather than specifying a particular device for a particular domain, it has a global list of devices which it allows to be set as "permissive", kept in /etc/xen/xend-pci-permissive.sxp.

First, find the hexadecimal vendor and device ID for the device you want to pass through using lspci -nn:

# lspci -nn
08:04.0 Ethernet controller [0200]: Broadcom Corporation NetXtreme BCM5715 Gigabit Ethernet [14e4:1678] (rev a3)

So the hex vendor:device ID is 14e4:1678. Now add that code to /etc/xen/xend-pci-permissive.sxp, in the section called unconstrained_dev_ids. The end result should look like this:

(unconstrained_dev_ids
    ('14e4:1678')
)
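
Extracting the ID from the lspci -nn output can be scripted. The sketch below assumes the line layout shown above, where the vendor:device pair is the last bracketed field of the form xxxx:xxxx; pci_ids is a hypothetical helper name.

```shell
# pci_ids  (hypothetical helper)
# Reads `lspci -nn` lines on stdin and prints the vendor:device ID of each.
pci_ids() {
    # The class code, e.g. [0200], has no colon, so it is not matched.
    sed -n 's/.*\[\([0-9a-f]\{4\}:[0-9a-f]\{4\}\)\].*/\1/p'
}

# Usage in dom0: lspci -nn -s 08:04.0 | pci_ids
```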

Enabling permissive mode manually

Unfortunately xl in 4.1 didn't implement a way to automatically enable permissive mode. The only way to enable it is to do it manually. Fortunately, it's fairly straightforward. After the device is assigned to pciback but before starting the VM, do the following:

echo 0000:08:00.0 > /sys/bus/pci/drivers/pciback/permissive

Further information and FAQ

Xen dom0 pciback driver backend modes

List of Xen pciback modes that you can set in the kernel configuration (.config file) in xen/stable-2.6.32.x kernel:

  • CONFIG_XEN_PCIDEV_BACKEND_PASS=y means the PCI device gets the same PCI ID in the guest as in dom0.
  • CONFIG_XEN_PCIDEV_BACKEND_VPCI=y means the PCI device gets a virtual PCI ID in the guest, not the same PCI ID as in dom0.

Note that in upstream Linux 3.1.0 and later versions you can set PASS/VPCI as a module/driver option when loading the driver!

You can use the following on dom0 Linux kernel command line in grub.conf (if xen-pciback is built-in to the kernel):

xen-pciback.passthrough=1

or the following if loading xen-pciback driver as a module:

modprobe xen-pciback passthrough=1

In Linux 3.1+ this will give the same behaviour as earlier CONFIG_XEN_PCIDEV_BACKEND_PASS .config option.

Xen PCI passthru limitations

When the guest has PCI passthru devices in use, operations like save/restore/migration are not possible. You have to detach (unplug) the passthru device before save, restore or live migration is possible.

Xen VGA graphics adapter passthru

Please see the XenVGAPassthrough wiki page for more information about VGA graphics card passthru.

How can I tell if I have IOMMU / VT-D support?

To verify you have IOMMU support enabled:

  • Check if IOMMU (Intel VT-d or AMD IOMMU) is enabled in the system BIOS. Some BIOSes call this feature "IO virtualization" or "Directed IO". After changing settings in the BIOS make sure you completely poweroff the machine, unplug the power cord, let it be without power for a while, and then restart the system. Some systems are known to not enable IOMMU for real until you poweroff the system completely!
  • If running Xen 3.4.x (or older version) you need to add iommu=1 flag (or vtd=1 in even older versions) for Xen hypervisor (xen.gz) to grub.conf and reboot. Xen 4.0.0 and newer versions enable IOMMU support as a default if supported by the hardware and BIOS, no additional boot flags required for the hypervisor.
  • Read the Xen hypervisor boot messages with "xm dmesg" (or "xl dmesg" when using xl) and check whether "IO virtualisation" gets enabled.
  • Unfortunately there are many buggy BIOSes that cause Xen to disable IO virtualization because of errors in the BIOS DMAR/ACPI tables. Xen tries to work around these BIOS bugs, but sometimes that is not possible. Please report all the details about your hardware and software to the xen-devel mailing list if IO virtualization gets disabled due to a buggy BIOS. Also see below for troubleshooting tips.
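
The dmesg check can be automated with a small filter. This is a sketch: iommu_status is a hypothetical helper, and the pattern is an assumption about the message wording (it accepts "IO" or "I/O" and either spelling of "virtualisation"), so adjust it to what your Xen version actually prints.

```shell
# iommu_status  (hypothetical helper)
# Reads hypervisor boot messages on stdin and prints
# "enabled", "disabled" or "unknown".
iommu_status() {
    status=$(grep -iE 'I/?O virtuali[sz]ation (enabled|disabled)' | tail -n 1)
    case $status in
        *[Dd]isabled*) echo disabled ;;
        *[Ee]nabled*)  echo enabled ;;
        *)             echo unknown ;;   # no matching line found
    esac
}

# Usage in dom0: xm dmesg | iommu_status
```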

Please see the VTdHowTo wiki page for more information about PCI passthru to Xen HVM guest.

I get "non-page-aligned MMIO BAR" error when trying to start the guest

If using linux-2.6.18-xen, add these options to grub.conf for the 2.6.18.8 dom0 kernel, which should fix the alignment:

guestdev=01:00.0,01:02.0 reassign_resources

Replace "01:00.0" and "01:02.0" with the PCI devices you actually want to pass through. Note the "," separating the entries.

A change to linux-2.6.18-xen in April 2009 (http://xenbits.xenproject.org/linux-2.6.18-xen.hg?rev/a3ad7a5f2dcd) changed the syntax. The earlier syntax for linux-2.6.18-xen is:

pciback.permissive pciback.hide=(01:00.0)(02:01.0) reassigndev

If you're using a Linux 2.6.31 or newer pvops dom0 kernel, there is no guestdev/reassign_resources; instead you use:

xen-pciback.permissive xen-pciback.hide=(08:05.0)(09:06.1) pci=resource_alignment=08:05.0;09:06.1

If you're using a Linux 2.6.31 or newer dom0 kernel based on the Novell/SLES/openSUSE Xenlinux forward-ported patches, use this syntax:

pciback.permissive pciback.hide=(00:1d.7)(00:1a.0)(00:1a.1)(00:1a.7)(00:1b.0) pci=resource_alignment=00:1a.7;00:1d.7

Note the ";" to separate multiple PCI ID entries for "pci=resource_alignment".

If using GRUB2, and using resource_alignment for multiple devices, you need to wrap the resource_alignment with single quotes like this:

'pci=resource_alignment=00:1a.7;00:1d.7'

Otherwise GRUB2 will parse the line incorrectly and no resource_alignment will be applied. For more info see: http://lists.xenproject.org/archives/html/xen-users/2011-09/msg00360.html .
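
As a sketch, the quoted option can be generated from a list of BDFs so the ";" separators and the single quotes are not forgotten. resource_alignment_arg is a hypothetical helper name.

```shell
# resource_alignment_arg BDF...  (hypothetical helper)
# Prints the pci=resource_alignment option wrapped in single quotes for GRUB2.
resource_alignment_arg() {
    list=
    for bdf in "$@"; do
        list="${list:+$list;}$bdf"       # entries are separated by ";"
    done
    printf "'pci=resource_alignment=%s'\n" "$list"
}

# Example: resource_alignment_arg 00:1a.7 00:1d.7
# prints:  'pci=resource_alignment=00:1a.7;00:1d.7'
```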

I get "Error: pci: 0000:02:06.0 must be co-assigned to the same guest with 0000:02:05.0" error when trying to start the guest

This error usually happens when you try to pass through only a single function of a multi-function device (for example, a dual-port NIC), or only one of the devices behind the same PCI bridge. This is not allowed by the Intel VT-d specification. For an explanation of the issue, see this email: http://lists.xenproject.org/archives/html/xen-devel/2010-01/msg00870.html and the patch implementing the FLR (Function Level Reset) methods: http://xenbits.xenproject.org/xen-unstable.hg?rev/e61978c24d84

If you want to manually override this in Xen 4.0.0 or newer you can specify "pci-passthrough-strict-check no" in /etc/xen/xend-config.sxp, and after restarting xend passthru code won't give this error anymore. In some (many?) cases PCI passthru can work after this change.

If the PCI device is a single-function device, you can also move it to a different PCI slot to work around the issue.

With Xen 3.4.x and 3.3.x versions you can apply a "disable FLR" patch to work around this issue: http://lists.xenproject.org/archives/html/xen-devel/2008-10/binAofZNDKlrU.bin and see the discussion about it here: http://lists.xenproject.org/archives/html/xen-devel/2008-10/msg00280.html
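
To see which functions belong to the same multi-function device, and so would need to be co-assigned, the lspci output can be filtered by bus and slot. This is a sketch with a hypothetical helper, sibling_functions; it does not detect other devices behind the same PCI bridge.

```shell
# sibling_functions BDF  (hypothetical helper)
# Reads `lspci` output on stdin and prints every BDF that shares
# the bus and slot of the given BDF.
sibling_functions() {
    slot=${1%.*}                         # strip the function: 06:00.0 -> 06:00
    awk -v slot="$slot" 'index($1, slot ".") == 1 { print $1 }'
}

# Usage in dom0: lspci | sibling_functions 06:00.0
```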

I get "Kernel panic - not syncing: Failed to get contiguous memory for DMA from Xen!" when trying to start the guest with iommu=soft

Limiting maximum dom0 memory may help.

On Debian, set it in /etc/default/grub:

...
GRUB_CMDLINE_XEN="dom0_mem=512M"
...

(Don't forget to run update-grub)

For more info see:

http://lists.xen.org/archives/html/xen-users/2012-06/msg00119.html

http://lists.xen.org/archives/html/xen-users/2012-08/msg00056.html


Note: if your guest silently fails to start with iommu=soft, try adding earlyprintk=xen to guest's kernel parameters.

Xen 4.0.0 says IO virtualization is disabled, how can I enable more verbose logging to find out why it gets disabled?

Add the "iommu=verbose" option for the Xen hypervisor (xen.gz) in grub.conf and reboot. After rebooting, read the "xm dmesg" log (or set up a serial console). By default, Xen 4.0.0 is not verbose about IOMMU initialization and the related ACPI DMAR table parsing.

Does upstream kernel.org Linux 2.6.3x kernel work as PV guest (domU) kernel for PCI passthru usage?

Yes. Starting with kernel.org Linux 2.6.37, Xen PV domU PCI passthrough is supported out of the box, i.e., the xen-pcifront driver is included in the standard kernel!

My hardware/motherboard does have an IOMMU included, but Xen doesn't enable hardware assisted IO virtualization!

Unfortunately many motherboards ship with broken BIOSes (for example, with incorrect ACPI DMAR, DRHD or RMRR tables) that cause Xen to disable IO virtualization as a security measure, or to prevent crashes from happening later on.

You can check if Xen enabled IO virtualization by running "xm dmesg" command and reading through the log. There's a line about IO virtualization telling if it's enabled or disabled. You need to have at least Xen 3.4 or newer for IOMMU (VT-d) to work.

If IO virtualization gets disabled, but it's available on your hardware, you should try these steps to troubleshoot it:

  • Check the installed BIOS version, and check the vendor's support site for BIOS updates. Install the latest BIOS/firmware updates.
  • Enable "IOMMU", "IO virtualization" or "VT-d" in the BIOS and power-off, then restart the machine.
  • Set "iommu=verbose" boot option for Xen hypervisor (xen.gz) in grub.conf, if running Xen 4.0.0 or newer.
  • Read Xen hypervisor boot messages from "xm dmesg" to see if IO virtualization is enabled or disabled.
  • If Xen complains about broken BIOS, let the motherboard/system vendor know about it.
  • Intel developers also want to know about broken IOMMU/VT-d BIOS implementations, see this email: http://lists.xenproject.org/archives/html/xen-devel/2010-01/msg00841.html, so let them know all the details about your hardware and software if you have broken BIOS.
  • Upgrade to Xen hypervisor 4.0.0 or later, since this version added many workarounds for buggy BIOSes.

What is PCI device ID BDF notation?

Please see BDFNotation wiki page for more information.

How can I check if PCI device supports FLR (Function Level Reset) ?

Run "lspci -vv" (in dom0) and check if the device has "FLReset+" in the DevCap field.

On Ubuntu/Debian, run lspci with sudo. Without root privileges lspci cannot read the capability fields, so the output will be incomplete.

sudo lspci -vv | grep -i --colour flreset

This runs lspci as root and highlights each occurrence of "flreset" in the output.
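
The grep above shows FLReset flags for all devices at once. To check a single device, something like the following sketch can be used. has_flr is a hypothetical helper, and it assumes the usual lspci -vv layout, where each device starts with an unindented line beginning with its BDF.

```shell
# has_flr BDF  (hypothetical helper)
# Reads `lspci -vv` output on stdin; succeeds if the given device
# reports "FLReset+" anywhere in its block.
has_flr() {
    awk -v bdf="$1" '
        /^[0-9a-f]/ { current = $1 }                 # start of a device block
        current == bdf && /FLReset\+/ { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Usage in dom0 (as root): lspci -vv | has_flr 08:00.0 && echo "FLR supported"
```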

pci-stub?

pci-stub can be used only with Xen HVM guest PCI passthru, so it's recommended to use pciback instead, which works for both PV and HVM guests.

Passing multiple PCI devices

When passing through PCI devices rather than PCIe devices, it is necessary to include all the sub-devices before PCI passthrough works. For example:

lspci

  • 06:00.0 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 62)
  • 06:00.1 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 62)
  • 06:00.2 USB Controller: VIA Technologies, Inc. USB 2.0 (rev 65)
  • 06:03.0 FireWire (IEEE 1394): Agere Systems FW322/323 (rev 70)

To pass 06:00.0 through 06:00.2 to a domU, add the following to the dom0 kernel line in /boot/grub/grub.cfg:

xen-pciback.hide=(06:00.0)(06:00.1)(06:00.2)(06:00.3)(06:03.0)

If (06:03.0) is left out, PCI passthrough won't work.

In the /etc/xen/abc.cfg file, the following line is sufficient:

pci = ['06:00.0', '06:00.1', '06:00.2']

The above example applies to Xen 4.0.1 with Debian Squeeze as dom0.