Dm-thin for local storage

This document is a draft.

The Storage Manager (SM) currently supports 2 kinds of local storage:

  1. .vhd files on an ext3 filesystem on an LVM LV on a local disk
  2. vhd-format data written directly to LVM LVs on a local disk

We can also directly import and export .vhd-format data using HTTP PUT and GET operations; see Disk import/export.

In all cases the data path uses "blktap" (the kernel module) and "tapdisk" (the user-space process). This means that:

  1. constant maintenance is required because blktap is an out-of-tree kernel module
  2. every I/O request incurs extra latency due to kernelspace/userspace transitions, a big problem on fast flash devices (PCIe)
  3. we only support vhd, and not vmdk or qcow2 (and in future direct access to object stores?)

The xapi storage model assumes that all VDIs are contained within an SR, and an SR cannot span multiple storage media. This means that cloud orchestration layers cannot create a local thin clone of a disk stored on remote template storage without first copying it entirely to local storage. This copy adds significant latency to VM.start, leading people to believe VMs are more heavyweight than they really are (reference: the current VM / container debate)

The xapi storage implementation assumes it "owns the world" and will set up the SR in SR.create. This prevents users from using tools such as bcache (for accelerating access to a spinning disk by caching on flash) and DRBD (for replication)

Analysis

We currently use the vhd format and blktap/tapdisk implementation for 2 distinct purposes:

  1. as a convenient, reasonably efficient, standard format for sharing images such as templates
  2. as a means of implementing thin provisioning on the data path: where blocks are allocated on demand, and storage is over provisioned

Instead of using vhd format and blktap/tapdisk everywhere we could

  1. use a tool (e.g. qemu-img) which reads and writes vhd, qcow2, vmdk and which can be mounted as a block device on an unmodified kernel (e.g. via NBD)
  2. use device-mapper modules to provide thin provisioning and low-latency access to the data

This would allow us to:

  1. avoid the blktap kernel module maintenance
  2. reduce the common-case I/O request latency by keeping it all in-kernel
  3. extend the number of formats we support, and make it easier to support direct object store access in future.

The xapi storage model assumption that a single SR cannot represent both remote templates and local thin clones could be removed. We would simply need to be clear about the 'type' of each VDI and which operations are valid on each, rather than assuming every VDI is the same.

Rather than always assuming we "own the world" we could also support a mode where the SR is configured, created and destroyed by the user. We would need to be careful to co-exist with LVs created by the user, and take care to mark our LVs appropriately.

Design

We will extend the xapi storage model to support references to read-only template images as URIs. This will allow an SR to be more efficient by (for example) copying blocks on demand and using fast disks for caching.

We should create a suite of command-line tools with man pages to perform all the basic SR operations. The storage plugin script should invoke these tools in a simple fashion, and each tool should be easy to test in isolation.

We will depend on existing packaged tools for useful functions such as:

  • qemu: for generic image reading
  • thin-provisioning-tools: for manipulating and querying dm-thin volumes
  • vhd-tool: for streaming vhd export/import

Note: Attaching a file-based image to dom0

We can use qemu and NBD as follows:

  sudo qemu-nbd --connect=/dev/nbd0 file.qcow2
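
For completeness, a minimal sketch (assuming the nbd driver is built as a module, as is typical): the module must be loaded before /dev/nbd0 appears, and the mapping can later be removed with the matching disconnect.

  sudo modprobe nbd
  sudo qemu-nbd --disconnect /dev/nbd0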

We could also attach an S3 volume with http://www.sagaforce.com/sound/s3nbd/

Note: Layering on LVM/dm-thin

We should support using an existing LVM VG, such as the one typically pre-created at distro install time. In this case we should allow the user to choose how big to make the "thin pool" (the meta-LV which contains unallocated blocks for thin provisioning). This could be described as "be nice to an existing LVM installation" mode.

If we create the LVM VG ourselves, we should let the "thin pool" fill the disk. If the user asks us to create a regular raw LV (for highest performance) then we can dynamically shrink the thin pool to create enough space. This could be described as an "own the world" mode.
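
As a rough illustration of how these modes map onto ordinary lvm2 thin-pool commands (a sketch only; the VG name, LV names and sizes below are hypothetical):

  # "be nice" mode: carve a thin pool of a user-chosen size out of an existing VG
  lvcreate --type thin-pool -L 100G -n sm-pool VG_XenStorage

  # create a thin-provisioned volume (a VDI) inside the pool
  lvcreate --type thin -V 20G -n vdi-disk0 --thinpool sm-pool VG_XenStorage

  # VDI.clone / VDI.snapshot then map naturally onto thin snapshots
  lvcreate -s -n vdi-disk0-clone VG_XenStorage/vdi-disk0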

From https://www.kernel.org/doc/Documentation/device-mapper/thin-provisioning.txt

You can use an external _read only_ device as an origin for a
thinly-provisioned volume.  Any read to an unprovisioned area of the
thin device will be passed through to the origin.  Writes trigger
the allocation of new blocks as usual.

One use case for this is VM hosts that want to run guests on
thinly-provisioned volumes but have the base image on another device
(possibly shared between many VMs).

When the vhd/qcow2/vmdk is attached for writing, we attach it as a read-only device to dom0 and use it as an "external origin" for a dm-thin device. If you are versed in the vhd jargon this is equivalent to creating a "raw vhd" whose parent is a vhd/qcow2/vmdk. In qemu jargon the parent is known as the "backing image".
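
A minimal sketch of the resulting data path, following the device-mapper documentation quoted above (the device names, thin device id and size are hypothetical):

  # 1. attach the read-only template image to dom0
  sudo qemu-nbd --read-only --connect=/dev/nbd0 template.qcow2

  # 2. allocate a new thin device (id 1) in the pool; the origin is not named yet
  dmsetup message /dev/mapper/pool 0 "create_thin 1"

  # 3. activate it with the template as its external origin
  #    (table: <start> <length in 512-byte sectors> thin <pool> <dev id> <origin>;
  #     the length must match the virtual size of the template)
  dmsetup create vdi --table "0 2097152 thin /dev/mapper/pool 1 /dev/nbd0"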

If we need to storage-migrate or export the disk, we can compose the vhd/qcow2/vmdk allocation map with the dm-thin metadata acquired by "thin_dump".
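
For example (a sketch; the pool and metadata device names are hypothetical, and the reserve/release messages are described in the device-mapper documentation linked above):

  # reserve a metadata snapshot so the mappings can be read while the pool is live
  dmsetup message /dev/mapper/pool 0 reserve_metadata_snap

  # dump the dm-thin block mappings as XML
  thin_dump --metadata-snap /dev/mapper/vg-pool_tmeta > local-mappings.xml

  dmsetup message /dev/mapper/pool 0 release_metadata_snap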

The plugin script

The storage plugin should create a small LVM volume to store its own metadata. This is expected to include:

  • URIs of known templates
  • optional 'parents' for each VDI, which reference URIs
  • name_labels for each VDI

Every VDI will be one of 2 types:

  1. a read-only template image with a URI. The only valid operations will be: VDI.introduce, VDI.clone, VDI.snapshot and VDI.forget. Attempts to invoke other operations will fail with VDI_IS_TEMPLATE.
  2. a read-write disk. These will support the full set of storage operations.

The following storage operations will be defined:

  • *SR.introduce*: registers an existing volume group as an SR. This should default to "be nice to an existing LVM installation" mode
  • *SR.forget*: deregisters an existing volume group
  • *SR.create*: creates a volume group on the given device. This should default to "own the world" mode
  • *SR.destroy*: actively destroys a volume group
  • *VDI.introduce*: registers a template image URI with the SR. This could be something like 'nfs://server/path/foo.vmdk' or 'smb://server/path/bar.vhd'
  • *VDI.forget*: deregisters a template image URI
  • *VDI.create*: creates an empty dm-thin volume. For higher-performance use-cases we should support a 'preallocate' option which would create a regular LV and write zeroes to it. This would be slow and would require cancellation and progress reporting.
  • *VDI.destroy*: for a dm-thin volume we can remove it immediately. There is no need to provide GC, this is part of dm-thin.
  • *VDI.clone* and *VDI.snapshot*: create a dm-thin snapshot
  • *VDI.attach*: if the VDI has an external parent then use 'qemu-nbd' to attach it and then set up the dm-thin device on top, referencing it as an external origin
  • *VDI.detach*: reverse of *VDI.attach*
  • *VDI.resize*: change the virtual size of the dm-thin device

Volume naming

Rather than naming all LVs with uuids, we will instead create a name based upon the VDI.name_label, with invalid characters removed and a suffix added to ensure uniqueness (cf. vfat). The intention is to make the system easy for sysadmins to understand.
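
A purely illustrative sketch of such a naming scheme (the example label and the suffix length are invented for this example):

  name_label="Database disk #1"
  # keep only characters that are valid in an LVM LV name
  base=$(printf '%s' "$name_label" | tr -cd 'A-Za-z0-9+_.-')
  # append a short random suffix to ensure uniqueness
  lv_name="${base}-$(uuidgen | cut -c1-8)"
  echo "$lv_name"    # e.g. Databasedisk1-4f3a2b1c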

Tools

Every basic operation should be represented by a command-line tool invocation. This will allow easy debugging and testing without having to 'stand up' a whole system.

URIs

The 'qemu-nbd' tool expects to be able to access a local path rather than something like 'smb://server/path'. We therefore need something which will map and unmap these URIs. For simplicity and extensibility we will extract the scheme from the URI ('nbd', 'smb', 'nfs') and execute a script '/usr/libexec/xapi/storage/<scheme> (attach|detach)'
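
Illustratively (a sketch only; the convention that the helper prints a local block device path is an assumption based on the s3 example below):

  uri='smb://server/path/bar.vhd'
  scheme="${uri%%:*}"                                   # -> "smb"
  dev=$(/usr/libexec/xapi/storage/"$scheme" attach "$uri")
  # ... use "$dev" as the dm-thin external origin ...
  /usr/libexec/xapi/storage/"$scheme" detach "$uri"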

Copy-on-read for local storage

When a cloud orchestration system 'introduces' a template VDI with a URI, clones it and starts a VM using it, we will use a tool (such as qemu-nbd) to expose the template over NBD, which is attached to dom0 and used as an "origin" for dm-thin. For reads of template data this could be quite slow, especially if the read path involves remote HTTP GET operations. We can accelerate this by interposing an NBD proxy which, for every read, writes the data to the dm-thin device, forcibly caching it.

The interposition could be handled by the storage script, e.g. 's3://path', where the '/usr/libexec/xapi/storage/s3' script starts qemu-nbd, then the proxy, and then returns the block device path suitable for dm-thin.


Open issues

  1. what's the most convenient way to extract the block allocation map from qemu? It's possible via the QMP interface: http://www.redhat.com/archives/libvir-list/2010-May/msg00381.html (one possible approach is sketched after this list)
  2. we should define some convenient hook points for interfacing with external file replication services
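
One possible approach to the first question (offered as a sketch; recent qemu versions include a 'map' subcommand that reports the allocated extents of an image as JSON):

  qemu-img map --output=json template.qcow2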