Difference between revisions of "Dm-thin for local storage"

From Xen

Revision as of 14:16, 14 July 2014

The Storage Manager (SM) currently supports 2 kinds of local storage:

  1. .vhd files on an ext3 filesystem on an LVM LV on a local disk
  2. vhd-format data written directly to LVM LVs on a local disk

We can also directly import and export .vhd-format data using HTTP PUT and GET operations; see Disk import/export.

In all cases the data path uses "blktap" (the kernel module) and "tapdisk" (the user-space process). This means that:

  1. constant maintenance is required because blktap is an out-of-tree kernel module
  2. every I/O request incurs extra latency due to kernelspace/userspace transitions, a big problem on fast flash devices (PCIe)
  3. we only support vhd, and not vmdk or qcow2 (and in future direct access to object stores?)

Analysis

We currently use the vhd format and blktap/tapdisk implementation for 2 distinct purposes:

  1. as a convenient, reasonably efficient, standard format for sharing images such as templates
  2. as a means of implementing thin provisioning on the data path, where blocks are allocated on demand and storage is over-provisioned

If, instead of using the vhd format and blktap/tapdisk everywhere, we

  1. use a tool (e.g. qemu-img) which reads and writes vhd, qcow2 and vmdk, and which can expose an image as a block device on an unmodified kernel (e.g. via NBD)
  2. use device-mapper modules to provide thin provisioning and low-latency access to the data

then we

  1. avoid the blktap kernel module maintenance
  2. reduce the common-case I/O request latency by keeping it all in-kernel
  3. extend the number of formats we support, and make it easier to support direct object store access in future.

Design

Attaching a file-based image to dom0

We can use qemu and NBD as follows:

  sudo qemu-nbd --connect=/dev/nbd0 file.qcow2
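A slightly fuller version of the attach step might look like the sketch below. It assumes the image should be attached read-only and that the format can be inferred from the file extension; the helper names qemu_format and attach_cmd are hypothetical, and paths are assumed to contain no whitespace. Note that qemu calls its vhd driver "vpc". The functions only print the command, so that it can be inspected before being run as root.

```shell
# Map an image file name to the qemu block-driver name.
# Assumption: the extension reflects the real format.
qemu_format() {   # usage: qemu_format IMAGE
    case "$1" in
        *.vhd)   echo vpc ;;     # qemu's name for the vhd format
        *.qcow2) echo qcow2 ;;
        *.vmdk)  echo vmdk ;;
        *)       echo "unknown image type: $1" >&2; return 1 ;;
    esac
}

# Print the command that would attach IMAGE read-only to NBD_DEVICE.
attach_cmd() {   # usage: attach_cmd IMAGE NBD_DEVICE
    fmt=$(qemu_format "$1") || return 1
    echo "qemu-nbd --read-only --format=$fmt --connect=$2 $1"
}
```

The nbd kernel module has to be loaded first (modprobe nbd), and "qemu-nbd --disconnect /dev/nbd0" detaches the device afterwards.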

We could also attach an S3 volume with http://www.sagaforce.com/sound/s3nbd/

Layering on LVM/dm-thin

We should support using an existing LVM VG, such as the one pre-created by the host installer. In this case we should allow the user to choose how big to make the "thin pool" (the meta-LV which contains unallocated blocks for thin provisioning). This could be described as "be nice to an existing LVM installation" mode.

If we create the LVM VG ourselves, we should let the "thin pool" fill the disk. If the user asks us to create a regular raw LV (for highest performance) then we can dynamically shrink the thin pool to create enough space. This could be described as an "own the world" mode.
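The two modes could map onto lvcreate roughly as in the sketch below. The function names, VG name and pool name are hypothetical, and the functions only print the commands rather than running them. The sketch covers pool and volume creation only; it does not attempt the shrink step, since whether the installed LVM can reduce a thin pool would need checking.

```shell
# Print the lvcreate command that creates the thin pool.
# With a SIZE (e.g. 100G) the pool takes only that much of the VG
# ("be nice" mode); without one it takes all free space ("own the world").
pool_cmd() {   # usage: pool_cmd VG POOL [SIZE]
    if [ -n "${3:-}" ]; then
        echo "lvcreate --type thin-pool -L $3 -n $2 $1"
    else
        echo "lvcreate --type thin-pool -l 100%FREE -n $2 $1"
    fi
}

# Print the lvcreate command that creates a thin volume of virtual
# size VSIZE inside the pool.
thin_cmd() {   # usage: thin_cmd VG POOL NAME VSIZE
    echo "lvcreate --type thin -V $4 --thinpool $2 -n $3 $1"
}
```

For example, "pool_cmd vg0 pool 100G" prints a command for a 100G pool in an existing VG, while "pool_cmd vg0 pool" prints one that fills the remaining free space.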

From https://www.kernel.org/doc/Documentation/device-mapper/thin-provisioning.txt

You can use an external _read only_ device as an origin for a
thinly-provisioned volume.  Any read to an unprovisioned area of the
thin device will be passed through to the origin.  Writes trigger
the allocation of new blocks as usual.

One use case for this is VM hosts that want to run guests on
thinly-provisioned volumes but have the base image on another device
(possibly shared between many VMs).

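When a vhd/qcow2/vmdk image is attached for writing, we attach the image itself to dom0 as a read-only device and use it as an "external origin" for a dm-thin device. Reads of unprovisioned blocks then pass through to the image, while writes allocate blocks in the thin pool and never modify the image.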
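At the device-mapper level this could look like the sketch below, which follows the table format given in the kernel documentation quoted above. The function names and the device paths in the example are hypothetical; the functions print the dmsetup commands instead of executing them.

```shell
# Print the device-mapper table line for a thin device backed by an
# external origin: "<start> <length> thin <pool dev> <dev id> <origin dev>".
# SECTORS is the device length in 512-byte sectors.
thin_table() {   # usage: thin_table SECTORS POOL_DEV DEV_ID ORIGIN_DEV
    echo "0 $1 thin $2 $3 $4"
}

# Print the two dmsetup commands: allocate the thin device id in the
# pool, then activate it on top of the read-only origin.
thin_origin_cmds() {   # usage: thin_origin_cmds NAME SECTORS POOL_DEV DEV_ID ORIGIN_DEV
    echo "dmsetup message $3 0 \"create_thin $4\""
    echo "dmsetup create $1 --table \"$(thin_table "$2" "$3" "$4" "$5")\""
}
```

The sector count of the origin can be obtained with "blockdev --getsz /dev/nbd0", assuming the image was attached as /dev/nbd0.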

If we need to storage-migrate or export the disk, we can convert it on-the-fly to .vhd format (like we currently do for VDI exports) by combining the vhd metadata with the thin device metadata, acquired by running "thin_dump".
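As a rough sketch of the metadata side, the function below extracts the allocated extents of one thin device from thin_dump's XML output. These are the blocks that differ from the external origin. The function name thin_extents is hypothetical, and the sketch assumes the XML layout emitted by thin_dump, with one range_mapping or single_mapping element per line inside each device element.

```shell
# List the allocated extents of one thin device from thin_dump XML on stdin.
# Output: one "origin_begin length" pair per line, in units of pool blocks.
thin_extents() {   # usage: thin_extents DEV_ID < dump.xml
    awk -v want="$1" '
        # Return the numeric value of attribute NAME on this line, or "".
        function attr(line, name,    re, s) {
            re = name "=\"[0-9]+\""
            if (!match(line, re)) return ""
            s = substr(line, RSTART, RLENGTH)
            gsub(/[^0-9]/, "", s)
            return s
        }
        /<device /   { active = (attr($0, "dev_id") == want "") }
        /<\/device>/ { active = 0 }
        active && /<range_mapping /  { print attr($0, "origin_begin"), attr($0, "length") }
        active && /<single_mapping / { print attr($0, "origin_block"), 1 }
    '
}
```

On a live pool the dump would normally be taken from a metadata snapshot: send the pool the "reserve_metadata_snap" message with dmsetup, run thin_dump with its metadata-snapshot option against the pool's metadata device, then send "release_metadata_snap".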