Dm-thin for local storage
This document is a draft.
The Storage Manager (SM) currently supports 2 kinds of local storage:
- .vhd files on an ext3 filesystem on an LVM LV on a local disk
- vhd-format data written directly to LVM LVs on a local disk
We can also directly import and export .vhd-format data using HTTP PUT and GET operations; see Disk import/export.
In all cases the data path uses "blktap" (the kernel module) and "tapdisk" (the user-space process). This means that:
- constant maintenance is required because blktap is an out-of-tree kernel module
- every I/O request incurs extra latency due to kernelspace/userspace transitions, a big problem on fast flash devices (PCIe)
- we only support vhd, and not vmdk or qcow2 (and in future direct access to object stores?)
The xapi storage model assumes that all VDIs are contained within an SR, and that an SR cannot span multiple storage media. This means that cloud orchestration layers cannot create a local thin clone of a disk stored on remote template storage without first copying it entirely to local storage. This copy adds significant latency to VM.start, leading people to believe VMs are more heavyweight than they really are (reference: the current VM / container debate).
The xapi storage implementation assumes it "owns the world" and will set up the SR in SR.create. This prevents users from using tools such as bcache (for accelerating access to a spinning disk by caching on flash) and DRBD (for replication).
We currently use the vhd format and blktap/tapdisk implementation for 2 distinct purposes:
- as a convenient, reasonably efficient, standard format for sharing images such as templates
- as a means of implementing thin provisioning on the data path, where blocks are allocated on demand and storage is over-provisioned
Instead of using the vhd format and blktap/tapdisk everywhere, we could:
- use a tool (e.g. qemu-img) which reads and writes vhd, qcow2, vmdk and which can be mounted as a block device on an unmodified kernel (e.g. via NBD)
- use device-mapper modules to provide thin provisioning and low-latency access to the data
This would allow us to:
- avoid the blktap kernel module maintenance
- reduce the common-case I/O request latency by keeping it all in-kernel
- extend the number of formats we support, and make it easier to support direct object store access in future.
The xapi storage model assumption that a single SR cannot represent both remote templates and local thin clones could be removed. We would simply need to be clear about the 'type' of each VDI and which operations are valid on each, rather than assuming every VDI is the same.
Rather than always assuming we "own the world" we could also support a mode where the SR is configured, created and destroyed by the user. We would need to be careful to co-exist with LVs created by the user, and take care to mark our LVs appropriately.
We will extend the xapi storage model to support references to read-only template images as URIs. This will allow an SR to be more efficient by (for example) copying blocks on demand and using fast disks for caching.
We should create a suite of command-line tools with man pages to perform all the basic SR operations. These command-line tools should be invoked by the storage plugin script in a simple fashion. The separate command-line tools should be designed to be easy to test separately.
We will depend on existing packaged tools for useful functions such as:
- qemu: for generic image reading
- thin-provisioning-tools: for manipulating and querying dm-thin volumes
- vhd-tool: for streaming vhd export/import
Note: Attaching a file-based image to dom0
We can use qemu and NBD as follows:
sudo qemu-nbd --connect=/dev/nbd0 file.qcow2
We could also attach an S3 volume with http://www.sagaforce.com/sound/s3nbd/
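As a minimal sketch of how a command-line tool might wrap the attach and detach steps, the helpers below only build the `qemu-nbd` argument lists and do not execute them. The function names are hypothetical, and exposing the image read-only with an explicit format is an assumption made here for template images, not something this design fixes.

```python
from typing import List


def nbd_attach_argv(image_path: str, nbd_device: str, image_format: str) -> List[str]:
    """Build a qemu-nbd command line that exposes an image read-only."""
    return [
        "qemu-nbd",
        "--read-only",                # assumption: templates are never written to
        f"--format={image_format}",   # state the format instead of probing it
        f"--connect={nbd_device}",
        image_path,
    ]


def nbd_detach_argv(nbd_device: str) -> List[str]:
    """Build a qemu-nbd command line that releases the NBD device."""
    return ["qemu-nbd", "--disconnect", nbd_device]
```

The resulting lists could be handed to `subprocess.run` by a caller running with sufficient privileges.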
Note: Layering on LVM/dm-thin
We should support using an existing LVM VG, such as the one typically pre-created at distro install time. In this case we should allow the user to choose how big to make the "thin pool" (the meta-LV which contains unallocated blocks for thin provisioning). This could be described as "be nice to an existing LVM installation" mode.
If we create the LVM VG ourselves, we should let the "thin pool" fill the disk. If the user asks us to create a regular raw LV (for highest performance) then we can dynamically shrink the thin pool to create enough space. This could be described as an "own the world" mode.
You can use an external _read only_ device as an origin for a thinly-provisioned volume. Any read to an unprovisioned area of the thin device will be passed through to the origin. Writes trigger the allocation of new blocks as usual. One use case for this is VM hosts that want to run guests on thinly-provisioned volumes but have the base image on another device (possibly shared between many VMs).
When the vhd/qcow2/vmdk is attached for writing, we attach it as a read-only device to dom0 and use it as an "external origin" for a dm-thin device. If you are versed in the vhd jargon this is equivalent to creating a "raw vhd" whose parent is a vhd/qcow2/vmdk. In qemu jargon the parent is known as the "backing image".
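To make the "external origin" arrangement concrete, here is a sketch that builds the device-mapper table line for a thin device, optionally naming a read-only origin such as an NBD device. The function name and the example device paths are hypothetical. The sketch assumes the thin device with the given id has already been created inside the pool.

```python
from typing import Optional

SECTOR_SIZE = 512  # device-mapper tables are expressed in 512-byte sectors


def thin_table(virtual_size_bytes: int, pool_dev: str, dev_id: int,
               external_origin: Optional[str] = None) -> str:
    """Return a dm 'thin' target table line covering the whole virtual disk."""
    if virtual_size_bytes <= 0 or virtual_size_bytes % SECTOR_SIZE != 0:
        raise ValueError("virtual size must be a positive multiple of 512 bytes")
    sectors = virtual_size_bytes // SECTOR_SIZE
    fields = ["0", str(sectors), "thin", pool_dev, str(dev_id)]
    if external_origin is not None:
        # Reads of unprovisioned blocks are passed through to this device.
        fields.append(external_origin)
    return " ".join(fields)
```

A caller could pass the returned string to `dmsetup create <name> --table "<table>"`.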
If we need to storage migrate / export the disk, we can compose together the vhd/qcow2/vmdk allocation map with the dm-thin metadata acquired by "thin_dump".
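One possible shape for this composition is sketched below. It assumes the XML produced by "thin_dump" uses `device`, `single_mapping` and `range_mapping` elements with the attribute names shown, and that the parent image's allocation map has already been converted into a set of block numbers at the same block size. Both function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Set


def thin_allocated_blocks(thin_dump_xml: str, dev_id: int) -> Set[int]:
    """Collect the virtual block numbers mapped for one thin device."""
    root = ET.fromstring(thin_dump_xml)
    blocks: Set[int] = set()
    for device in root.iter("device"):
        if int(device.get("dev_id")) != dev_id:
            continue
        for mapping in device.iter("single_mapping"):
            blocks.add(int(mapping.get("origin_block")))
        for mapping in device.iter("range_mapping"):
            begin = int(mapping.get("origin_begin"))
            blocks.update(range(begin, begin + int(mapping.get("length"))))
    return blocks


def composed_allocation(parent_blocks: Set[int], thin_blocks: Set[int]) -> Set[int]:
    """A block must be exported if either the parent or the thin device holds it."""
    return parent_blocks | thin_blocks
```

Blocks present in both sets would be read from the dm-thin device, since writes there shadow the origin.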
The plugin script
The storage plugin should create a small LVM volume to store its own metadata. This is expected to include
- URIs of known templates
- optional 'parents' for each VDI, which reference URIs
- name_labels for each VDI
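As an illustration only, the metadata could be held as a small JSON document. The choice of JSON, the field names (`templates`, `vdis`, `name_label`, `parent`) and the use of LV names as keys are all assumptions, not part of this design.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class VdiRecord:
    """Hypothetical per-VDI metadata entry."""
    name_label: str
    parent: Optional[str] = None  # URI of a template image, if any


def encode_metadata(templates: List[str], vdis: Dict[str, VdiRecord]) -> str:
    """Serialise the SR metadata; the keys of `vdis` are LV names."""
    doc = {
        "templates": sorted(templates),
        "vdis": {lv: asdict(record) for lv, record in vdis.items()},
    }
    return json.dumps(doc, sort_keys=True)


def decode_metadata(text: str) -> Tuple[List[str], Dict[str, VdiRecord]]:
    """Inverse of encode_metadata."""
    doc = json.loads(text)
    vdis = {lv: VdiRecord(**record) for lv, record in doc["vdis"].items()}
    return doc["templates"], vdis
```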
Every VDI will be one of 2 types:
- a read-only template image with a URI. The only valid operations will be: VDI.introduce, VDI.clone, VDI.snapshot and VDI.forget. Attempts to invoke other operations will fail with VDI_IS_TEMPLATE.
- a read-write disk. These will support the full set of storage operations.
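The rule for the two VDI types could be enforced with a check along the following lines. This is a sketch: the `VdiType`, `StorageError` and `check_operation` names are hypothetical, and only the error code VDI_IS_TEMPLATE and the list of permitted operations come from the description above.

```python
from enum import Enum


class VdiType(Enum):
    TEMPLATE = "template"  # read-only image referenced by URI
    DISK = "disk"          # read-write disk


# Operations that remain valid on a read-only template image.
TEMPLATE_OPERATIONS = frozenset(
    {"VDI.introduce", "VDI.clone", "VDI.snapshot", "VDI.forget"}
)


class StorageError(Exception):
    """Hypothetical exception carrying a storage error code."""

    def __init__(self, code: str, operation: str):
        super().__init__(f"{code}: {operation}")
        self.code = code


def check_operation(vdi_type: VdiType, operation: str) -> None:
    """Raise VDI_IS_TEMPLATE if the operation is not allowed on a template."""
    if vdi_type is VdiType.TEMPLATE and operation not in TEMPLATE_OPERATIONS:
        raise StorageError("VDI_IS_TEMPLATE", operation)
```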
The following storage operations will be defined:
- *SR.introduce*: registers an existing volume group as an SR. This should default to "be nice to an existing LVM installation" mode
- *SR.forget*: deregisters an existing volume group
- *SR.create*: creates a volume group on the given device. This should default to "own the world" mode
- *SR.destroy*: actively destroys a volume group
- *VDI.introduce*: registers a template image URI with the SR. This could be something like 'nfs://server/path/foo.vmdk' or 'smb://server/path/bar.vhd'
- *VDI.forget*: deregisters a template image URI
- *VDI.create*: creates an empty dm-thin volume. For higher-performance use-cases we should support a 'preallocate' option which would make and write zeroes to a regular LV. This would be slow and would require cancellation and progress reporting.
- *VDI.destroy*: for a dm-thin volume we can remove it immediately. There is no need to provide GC, since dm-thin reclaims the blocks itself.
- *VDI.clone* and *VDI.snapshot*: create a dm-thin snapshot
- *VDI.attach*: if the VDI has an external parent then use 'qemu-nbd' to attach it and then install the dm-thin on top referencing it as an external origin
- *VDI.detach*: reverse of *VDI.attach*
- *VDI.resize*: change the virtual size of the dm-thin device
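Assuming the thin pool is managed through LVM, several of the VDI operations above might map onto LVM commands roughly as follows. The helper names are hypothetical and the functions only build argument lists. Growing is the only resize handled in this sketch.

```python
from typing import List


def vdi_create_argv(vg: str, pool: str, lv: str, size_mib: int) -> List[str]:
    """VDI.create: a new thin LV of the given virtual size inside the pool."""
    return ["lvcreate", "--name", lv, "--virtualsize", f"{size_mib}m",
            "--thinpool", pool, vg]


def vdi_snapshot_argv(vg: str, origin_lv: str, snapshot_lv: str) -> List[str]:
    """VDI.clone / VDI.snapshot: a thin snapshot of an existing thin LV.

    Thin snapshots are created with the activation-skip flag set by default,
    so a caller may need to override that before attaching the result.
    """
    return ["lvcreate", "--snapshot", "--name", snapshot_lv, f"{vg}/{origin_lv}"]


def vdi_destroy_argv(vg: str, lv: str) -> List[str]:
    """VDI.destroy: remove the LV without prompting."""
    return ["lvremove", "--yes", f"{vg}/{lv}"]


def vdi_grow_argv(vg: str, lv: str, new_size_mib: int) -> List[str]:
    """VDI.resize (grow only): set a larger virtual size."""
    return ["lvextend", "--size", f"{new_size_mib}m", f"{vg}/{lv}"]
```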
Rather than naming all LVs with uuids, we will instead create a name based upon the VDI.name_label, with invalid characters removed and a suffix added to ensure uniqueness (cf. vfat). The intention is to make the system easy for sysadmins to understand.
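A naming helper might look like the sketch below. The function name, the fallback name `vdi` and the `-N` suffix format are assumptions, and this version adds a suffix only when the cleaned name collides with an existing one. It does not handle names that LVM reserves for internal use.

```python
import re
from typing import Set

# Characters outside this set are dropped from the name_label.
_INVALID = re.compile(r"[^A-Za-z0-9_.+-]")


def lv_name_for(name_label: str, existing: Set[str]) -> str:
    """Derive a unique, human-readable LV name from a VDI name_label."""
    base = _INVALID.sub("", name_label).lstrip("-")  # LV names cannot start with '-'
    if base in ("", ".", ".."):
        base = "vdi"  # hypothetical fallback when nothing usable remains
    candidate = base
    suffix = 1
    while candidate in existing:
        candidate = f"{base}-{suffix}"
        suffix += 1
    return candidate
```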
Every basic operation should be represented by a command-line tool invocation. This will allow easy debugging and testing without having to 'stand up' a whole system.
The 'qemu-nbd' tool expects to be able to access a local path rather than something like 'smb://server/path'. We therefore need something which will map and unmap these URIs. For simplicity and extensibility we will extract the scheme from the URI ('nbd', 'smb', 'nfs') and execute a script '/usr/libexec/xapi/storage/<scheme> (attach|detach)'
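The dispatch step could be sketched as follows. Passing the full URI to the script as a second argument is an assumption, since the script interface beyond `(attach|detach)` is not specified here. The function name is hypothetical, and the strict check on the scheme is a precaution added in this sketch so that a crafted URI cannot select a script outside the directory.

```python
import re
from typing import List
from urllib.parse import urlparse

SCRIPT_DIR = "/usr/libexec/xapi/storage"
_SCHEME = re.compile(r"^[a-z][a-z0-9]*$")


def scheme_script_argv(uri: str, action: str) -> List[str]:
    """Build the command line for the per-scheme attach/detach script."""
    if action not in ("attach", "detach"):
        raise ValueError(f"unknown action: {action}")
    scheme = urlparse(uri).scheme  # urlparse lower-cases the scheme
    if not _SCHEME.match(scheme):
        raise ValueError(f"cannot extract a usable scheme from: {uri}")
    # Assumption: the script receives the URI as its second argument.
    return [f"{SCRIPT_DIR}/{scheme}", action, uri]
```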
- what's the most convenient way to extract the block allocation map from qemu? It's possible via the QMP interface: http://www.redhat.com/archives/libvir-list/2010-May/msg00381.html
- we should define some convenient hook points for interfacing with external file replication services