Difference between revisions of "Dm-thin for local storage"

From Xen

Revision as of 17:25, 14 July 2014

The Storage Manager (SM) currently supports 2 kinds of local storage:

  1. .vhd files on an ext3 filesystem on an LVM LV on a local disk
  2. vhd-format data written directly to LVM LVs on a local disk

We can also directly import and export vhd-format data using HTTP PUT and GET operations (see Disk import/export).

In all cases the data path uses "blktap" (the kernel module) and "tapdisk" (the user-space process). This means that:

  1. constant maintenance is required because blktap is an out-of-tree kernel module
  2. every I/O request incurs extra latency due to kernelspace/userspace transitions, which is a significant problem on fast PCIe flash devices
  3. we only support vhd, and not vmdk or qcow2 (and in future direct access to object stores?)

The Storage Manager does not support cloning from remote 'templates' such as those stored on read-only image servers or in cloud object stores. This forces cloud orchestration layers to copy images to local storage before VMs can be started, adding significantly to VM start latency. Extra latency in defining and starting VMs leads people to believe that VMs are more heavyweight than they are (reference: the current VM / container debate).

Analysis

We currently use the vhd format and blktap/tapdisk implementation for 2 distinct purposes:

  1. as a convenient, reasonably efficient, standard format for sharing images such as templates
  2. as a means of implementing thin provisioning on the data path, where blocks are allocated on demand and storage is over-provisioned

Instead of using vhd format and blktap/tapdisk everywhere we could:

  1. use a tool (e.g. qemu-img) which reads and writes vhd, qcow2 and vmdk images, which can then be attached as block devices on an unmodified kernel (e.g. via NBD)
  2. use device-mapper modules to provide thin provisioning and low-latency access to the data

This would allow us to:

  1. avoid the blktap kernel module maintenance
  2. reduce the common-case I/O request latency by keeping it all in-kernel
  3. extend the number of formats we support, and make it easier to support direct object store access in future.

Design

The current XenServer storage model assumes that it "owns the world" in the sense that an SR has a scope (some physical disk, some NFS export) and all VDIs should be wholly contained within that scope. It is not possible to represent a reference to some external resource; for example, one cannot say that a local thin-provisioned disk is a logical clone of an image template contained within an object store. Instead users are forced to create an SR representing the object store and to physically 'VDI.copy' the data across, which both wastes bandwidth for large images and massively increases latency.

We will extend the XenServer storage model to support references to read-only template images as URIs. This will allow an SR to be more efficient by (for example) copying blocks on demand and using fast disks for caching.

We should create a package containing a new storage plugin script, and which depends on other packages containing the necessary command-line tools. Expected dependencies include:

  • qemu: for generic image reading
  • thin-provisioning-tools: for manipulating and querying dm-thin volumes
  • vhd-tool: for streaming vhd export/import

Note: Attaching a file-based image to dom0

We can use qemu and NBD as follows:

  sudo qemu-nbd --connect=/dev/nbd0 file.qcow2

We could also attach an S3 volume with http://www.sagaforce.com/sound/s3nbd/

Note: Layering on LVM/dm-thin

We should support using an existing LVM VG, such as the one pre-created by the host installer. In this case we should allow the user to choose how big to make the "thin pool" (the meta-LV which contains unallocated blocks for thin provisioning). This could be described as "be nice to an existing LVM installation" mode.

If we create the LVM VG ourselves, we should let the "thin pool" fill the disk. If the user asks us to create a regular raw LV (for highest performance) then we can dynamically shrink the thin pool to create enough space. This could be described as an "own the world" mode.
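
The two modes could be sketched with ordinary lvcreate invocations. This is a minimal sketch, not the plugin's actual implementation: the function names, the VG and pool names, and the DRY_RUN switch (which prints each command instead of running it) are all assumptions made for illustration.

```shell
# Sketch only: function names and the DRY_RUN convention are hypothetical.

# Print the command when DRY_RUN is set, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# "Be nice" mode: carve a thin pool of a user-chosen size out of an existing VG.
create_pool_be_nice() {    # usage: create_pool_be_nice VG POOL SIZE (e.g. 50G)
    run lvcreate --type thin-pool -L "$3" -n "$2" "$1"
}

# "Own the world" mode: let the thin pool take all remaining free space in the VG.
create_pool_own_world() {  # usage: create_pool_own_world VG POOL
    run lvcreate --type thin-pool -l 100%FREE -n "$2" "$1"
}
```

Running with DRY_RUN=1 set, e.g. "create_pool_be_nice vg thinpool 50G", prints the lvcreate command line so it can be reviewed before it touches a real VG.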

From https://www.kernel.org/doc/Documentation/device-mapper/thin-provisioning.txt

You can use an external _read only_ device as an origin for a
thinly-provisioned volume.  Any read to an unprovisioned area of the
thin device will be passed through to the origin.  Writes trigger
the allocation of new blocks as usual.

One use case for this is VM hosts that want to run guests on
thinly-provisioned volumes but have the base image on another device
(possibly shared between many VMs).

When the vhd/qcow2/vmdk is attached for writing, we attach it as a read-only device to dom0 and use it as an "external origin" for a dm-thin device. If you are versed in the vhd jargon this is equivalent to creating a "raw vhd" whose parent is a vhd/qcow2/vmdk. In qemu jargon the parent is known as the "backing image".
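
The attach sequence could look roughly like the following, built from the qemu-nbd command shown earlier and the dmsetup commands given in the kernel documentation. This is a sketch under assumptions: the function names, the device paths, the thin ID and the DRY_RUN switch are illustrative, and it assumes a pool driven directly through dmsetup (an LVM-managed pool keeps its own bookkeeping of thin IDs).

```shell
# Sketch only: function names and the DRY_RUN convention are hypothetical.

# Print the command when DRY_RUN is set, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# Print the dm table line for a thin device, with an optional external origin.
thin_table() {  # usage: thin_table SECTORS POOL_DEV THIN_ID [ORIGIN_DEV]
    echo "0 $1 thin $2 $3${4:+ $4}"
}

# Attach IMAGE read-only via NBD (the nbd kernel module must be loaded), then
# create a thin device in the pool that uses the NBD device as external origin.
# SECTORS is the device length in 512-byte sectors,
# e.g. as reported by "blockdev --getsz" on the attached NBD device.
attach_with_origin() {  # usage: attach_with_origin IMAGE NBD_DEV POOL_DEV THIN_ID NAME SECTORS
    run qemu-nbd --read-only --connect="$2" "$1" &&
    run dmsetup message "$3" 0 "create_thin $4" &&
    run dmsetup create "$5" --table "$(thin_table "$6" "$3" "$4" "$2")"
}
```

Reads of unprovisioned blocks on the resulting device fall through to the NBD device, while writes allocate blocks in the pool, which matches the "raw vhd with a parent" picture above.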

If we need to storage-migrate or export the disk, we can combine the vhd/qcow2/vmdk allocation map with the dm-thin metadata obtained from "thin_dump".
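
Dumping the metadata of a live pool could be sketched as below, using the metadata snapshot messages described in the kernel documentation together with thin_dump's --metadata-snap option. The function name, the device paths and the DRY_RUN switch are assumptions; combining the resulting XML with the image's allocation map is not shown.

```shell
# Sketch only: function names and the DRY_RUN convention are hypothetical.

# Print the command when DRY_RUN is set, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# Dump the block mappings of an active pool as XML on stdout.
# A metadata snapshot is reserved first so thin_dump reads a consistent view,
# and it is released again whether or not the dump succeeded.
dump_thin_metadata() {  # usage: dump_thin_metadata POOL_DEV METADATA_DEV
    run dmsetup message "$1" 0 reserve_metadata_snap || return 1
    run thin_dump --metadata-snap "$2"
    rc=$?
    run dmsetup message "$1" 0 release_metadata_snap
    return "$rc"
}
```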

The plugin script

The storage plugin should create a small LVM volume to store its own metadata. This is expected to include

  • URIs of known templates
  • optional 'parents' for each VDI, which reference URIs
  • name_labels for each VDI

The following storage operations will be possible:

  • *VDI.introduce*: registers a template image URI with the SR. This could be something like 'nfs://server/path/foo.vmdk' or 'smb://server/path/bar.vhd'
  • *VDI.forget*: deregisters a template image URI
  • *VDI.create*: creates an empty dm-thin volume. For higher-performance use-cases we should support a 'preallocate' option which would make and write zeroes to a regular LV. This would be slow and would require cancellation and progress reporting.
  • *VDI.destroy*: for a dm-thin volume we can remove it immediately. There is no need to provide GC, since reclaiming the freed blocks is part of dm-thin.
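
As an illustration, VDI.create and VDI.destroy for thin volumes could map onto LVM commands roughly as follows. This is a sketch, not the plugin script itself: the function names, argument order and DRY_RUN switch are assumptions, and the 'preallocate' variant with its zeroing, cancellation and progress reporting is left out.

```shell
# Sketch only: function names and the DRY_RUN convention are hypothetical.

# Print the command when DRY_RUN is set, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# VDI.create: allocate an empty thin volume of virtual size SIZE in an existing pool.
vdi_create() {   # usage: vdi_create VG POOL NAME SIZE (e.g. 10G)
    run lvcreate --type thin -V "$4" -n "$3" --thinpool "$2" "$1"
}

# VDI.destroy: remove the volume immediately; no separate GC step follows.
vdi_destroy() {  # usage: vdi_destroy VG NAME
    run lvremove -f "$1/$2"
}
```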

Open issues

  1. what's the most convenient way to extract the block allocation map from qemu? It's possible via the QMP interface: http://www.redhat.com/archives/libvir-list/2010-May/msg00381.html
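
One possibility to evaluate, offered as an assumption rather than a settled answer: recent qemu releases ship a "qemu-img map" subcommand that can print the allocation map as JSON without going through QMP. The sketch below assumes that subcommand is available and that jq is installed for filtering; the function names and the DRY_RUN switch are hypothetical.

```shell
# Sketch only: function names and the DRY_RUN convention are hypothetical.

# Print the command when DRY_RUN is set, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# Print the allocation map of IMAGE as a JSON array of extents.
image_map() {           # usage: image_map IMAGE
    run qemu-img map --output=json "$1"
}

# Read that JSON on stdin and print "START LENGTH" (in bytes) for each extent
# that qemu-img reports as containing data. Requires jq.
allocated_extents() {   # usage: image_map IMAGE | allocated_extents
    jq -r '.[] | select(.data) | "\(.start) \(.length)"'
}
```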