Proposal: Disk import/export

Latest revision as of 16:53, 20 January 2015

Note: incremental disk import/export APIs have been merged and are documented here: http://xapi-project.github.io/xen-api/snapshots.html

Proposal to improve disk import/export

There are various ways to move a disk into or out of a system via the XenAPI, both "official" (ie. via an API) and "unofficial" (ie. by working around missing APIs). Many of the "official" ways to move disks around use non-standard and poorly-documented protocols and formats, which inhibits interoperability with other services such as CloudStack and OpenStack. The "unofficial" ways are often risky since they potentially conflict with running operations (e.g. vhd coalesce).

Disk import/export is a fundamental ability of a virtualisation platform, and good support for it is demanded by all users: traditional server virt users need off-site incremental backup; cloud orchestration layers need to deploy images from central repositories; everyone benefits from being able to quickly move gold image disk volumes around.

This document starts with an overview of what we currently have; followed by an analysis of use-cases we'd like to support (or improve our support for); followed by a set of principles to govern our designs; and finally by a proposed set of APIs and CLI commands.

What do we currently have?

The following sections describe the current mechanisms, who is known to use them, advantages and disadvantages of each.

HTTP GET raw disk contents

(In XenServer 6.2 and above)

  1. (Optionally) The client calls XenAPI Task.create if it wants to be able to tell whether the operation succeeded or failed. This is only optional so that quick downloads can be driven entirely over HTTP.
  2. The client sends an authenticated HTTP GET to /export_raw_vdi?vdi=(ref or uuid). If the client called Task.create it can add a "task_id=task reference" query parameter or cookie. Authentication can be either
    1. basic auth (convenient for commandline usage with wget/curl)
    2. a pre-created session_id query parameter or cookie header
  3. The server
    1. if the authentication cannot be verified then returns an HTTP 403 forbidden
    2. if the VDI query parameter (or cookie) is not present then an uncaught exception causes the server to return HTTP 500 Internal server error
    3. if the VDI ref or uuid doesn't exist then an uncaught exception causes the server to return HTTP 500 Internal server error
    4. if the VDI is only accessible on a remote host then an HTTP 302 redirect is returned using the Host.address field (this won't work if a client is behind a NAT)
    5. If all looks ok, the server returns HTTP 200 OK with headers
      1. content-type: application/octet-stream
      2. connection: close
      3. task-id: XenAPI task ID
  4. The server writes the unencoded disk contents
  5. The server closes the connection at the end
  6. (Optionally) The client waits until Task.get_finished is true, and then checks the value of Task.get_status to find out whether the operation succeeded or failed. Reasons the task may fail include
    1. I/O errors reading from the backend substrate
  7. If the client called Task.create then it now calls Task.destroy

There is no xe CLI command.
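As an illustrative sketch (the host name, credentials and VDI reference below are placeholders), the request in steps 1-3 can be assembled with the Python standard library; only the URL and header construction is shown, since the actual transfer is an ordinary authenticated GET:

```python
import base64
from urllib.parse import urlencode

def export_raw_vdi_url(host, vdi, task_id=None):
    """Build the /export_raw_vdi URL; 'vdi' may be a ref or a uuid."""
    params = {"vdi": vdi}
    if task_id is not None:
        params["task_id"] = task_id  # lets the client poll Task.get_finished
    return "https://%s/export_raw_vdi?%s" % (host, urlencode(params))

def basic_auth_header(user, password):
    """HTTP basic auth, one of the two accepted authentication methods."""
    token = base64.b64encode(("%s:%s" % (user, password)).encode()).decode()
    return {"Authorization": "Basic " + token}
```

The same request can be made from the command line with curl using `-u user:password`, which is what makes this interface convenient for scripting despite the missing CLI command.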

Advantages:

  1. simple, can be driven entirely through a wget/curl invocation (without error reporting)

Disadvantages:

  1. missing CLI command
  2. only supports raw format images
  3. no support for downloading deltas
  4. doesn't allow an aborted transfer to be resumed

HTTP PUT raw disk contents

  1. (Optionally) The client calls XenAPI Task.create if it wants to be able to tell if the operation succeeded or failed. This is only optional to allow quick uploads entirely over HTTP.
  2. The client sends an authenticated HTTP PUT to /import_raw_vdi?vdi=(ref or uuid). If the client called Task.create it can add a "task_id=task reference" query parameter or cookie. Authentication can be either
    1. basic auth (convenient for commandline usage with wget/curl)
    2. a pre-created session_id query parameter or cookie header
  3. The server
    1. if the authentication cannot be verified then returns an HTTP 403 forbidden
    2. if the VDI query parameter (or cookie) is not present then an uncaught exception causes the server to return HTTP 500 Internal server error
    3. if the VDI ref or uuid doesn't exist then an uncaught exception causes the server to return HTTP 500 Internal server error
    4. if the VDI is only accessible on a remote host then an HTTP 302 redirect is returned using the Host.address field (this won't work if a client is behind a NAT)
    5. If the client requests any HTTP transfer-encoding, the server returns HTTP 403 forbidden
    6. If all looks ok, the server returns HTTP 200 OK with headers
      1. content-type: application/octet-stream
      2. connection: close
      3. task-id: XenAPI task ID
  4. The client
    1. writes the unencoded disk contents
    2. closes the connection at the end
  5. (Optionally) The client waits until Task.get_finished is true, and then checks the value of Task.get_status to find out whether the operation succeeded or failed. Reasons the task may fail include
    1. insufficient space in the VDI for the data provided
    2. I/O errors writing to the backend substrate
  6. If the client called Task.create then it now calls Task.destroy

This can all be driven through the xe CLI command:

 xe vdi-import uuid=<target VDI> filename=<raw image>

This command takes care of authentication, Task handling and error reporting.

Advantages:

  1. simple, can be driven entirely through a wget/curl invocation (without error reporting) or via the CLI

Disadvantages:

  1. only supports raw format images
  2. no support for uploading deltas
  3. requires you to pre-create an image of the right size
  4. doesn't allow an aborted transfer to be resumed
  5. CLI has no progress monitoring

HTTP PUT with 'chunked' encoding

This is a version of the HTTP PUT with raw encoding with the following differences:

  1. The client's HTTP PUT request contains the key "chunked" in either a query parameter or a cookie
  2. The client sends the data in a stream of chunks, consisting of a header followed by a payload

The header format is:

  1. 64-bit little-endian: offset: the offset within the target disk in bytes
  2. 32-bit little-endian: length: the number of bytes of data in this 'chunk'

The length field measures the length of the payload which follows the header.

The stream is terminated by a single chunk with both offset and length set to 0.
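Since the protocol is otherwise undocumented, here is a small sketch of the chunk framing described above, using the stated little-endian layout (64-bit offset, 32-bit length):

```python
import struct

CHUNK_HEADER = struct.Struct("<QI")  # 64-bit LE offset, 32-bit LE length

def encode_chunk(offset, payload):
    """One 'chunked' upload record: header followed by the payload bytes."""
    return CHUNK_HEADER.pack(offset, len(payload)) + payload

def end_of_stream():
    """The stream terminator: a chunk with offset and length both zero."""
    return CHUNK_HEADER.pack(0, 0)

def decode_chunks(data):
    """Parse a byte string of chunks back into (offset, payload) pairs."""
    out, pos = [], 0
    while pos < len(data):
        offset, length = CHUNK_HEADER.unpack_from(data, pos)
        pos += CHUNK_HEADER.size
        if offset == 0 and length == 0:
            break
        out.append((offset, data[pos:pos + length]))
        pos += length
    return out
```

Sparse uploads simply omit chunks for the all-zero regions of the disk, which is where the bandwidth saving comes from.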

There is no support in the CLI. This is known to be used by XenDesktop.

Advantages:

  1. allows sparse disks (containing big holes full of zeroes) to be uploaded efficiently

Disadvantages:

  1. upload only
  2. no ability to resume an interrupted download
  3. no CLI support
  4. non-standard protocol: the only available documentation seems to be this wiki page (!) after the previous page was deleted
  5. the only opportunity for error reporting is at the end

Network Block Device (NBD) access

This applies to XenServer 6.2 and later.

For all the HTTP requests described below, authentication can be either

  1. basic auth (convenient for commandline usage with wget/curl)
  2. a pre-created session_id query parameter or cookie header

For a given VDI uuid $VDI contained within SR uuid $SR:

  1. The client generates a fresh $UUID
  2. The client calls the internal storage API VDI.attach by
    1. making an HTTP POST to /services/SM
    2. containing an XMLRPC request. TODO: generate documentation for this.
  3. The client calls the internal storage API VDI.activate by
    1. making an HTTP POST to /services/SM
    2. containing an XMLRPC request. TODO: generate documentation for this.
  4. The client makes an HTTP PUT to /services/SM/nbd/$SR/$VDI/$UUID
  5. The server replies with HTTP 200 OK including the header
    1. Transfer-encoding: nbd

At this point the client is connected to an NBD server and can access the disk.

To finalise:

  1. The client closes the NBD connection
  2. The client calls the internal storage API VDI.deactivate by
    1. making an HTTP POST to /services/SM
    2. containing an XMLRPC request. TODO: generate documentation for this
  3. The client calls the internal storage API VDI.detach by
    1. making an HTTP POST to /services/SM
    2. containing an XMLRPC request. TODO: generate documentation for this


Advantages:

  1. The NBD protocol is well supported by Linux and BSD
  2. The NBD protocol is used by qemu/KVM for easy interop
  3. The NBD protocol is understood by wireshark for easy debugging

Disadvantages:

  1. The current "wrapping" of the protocol is barking mad:
    1. It exposes internal storage APIs across host, holding back storage API evolution
    2. It requires the client to clean up everything at the end: the tapdisk resource is left lying around until VDI.detach is called

VM import/export

We support two VM export formats:

  1. "geneva": the primary export format before XS 4.0. It should not be generated any more; we only support it in case very old VM exports are still around, and it will not be documented here.
  2. "xva" (also known as "rio", after the XS 4.0 codename): the primary export format used in XS 4.0 and later.

An "xva" export consists of a tar file containing VM metadata and disk blocks. Example contents of a VM export:

$ tar -tvf test.xva
---------- 0/0           17391 1970-01-01 01:00 ova.xml
---------- 0/0         1048576 1970-01-01 01:00 Ref:946/00000000
---------- 0/0              40 1970-01-01 01:00 Ref:946/00000000.checksum
---------- 0/0         1048576 1970-01-01 01:00 Ref:946/00000007
---------- 0/0              40 1970-01-01 01:00 Ref:946/00000007.checksum
---------- 0/0         1048576 1970-01-01 01:00 Ref:949/00000000
---------- 0/0              40 1970-01-01 01:00 Ref:949/00000000.checksum
---------- 0/0         1048576 1970-01-01 01:00 Ref:949/00000003
---------- 0/0              40 1970-01-01 01:00 Ref:949/00000003.checksum

The ova.xml contains a set of records, one per object in the xapi database. For a VM it will include VBDs, VDIs, SRs, VIFs, Networks. If requested at export time it will also include details of VM snapshots, their VBDs, VDIs etc. The objects are linked together by references (Ref:xxx). The disk content corresponding to VDI with reference Ref:946 will be in a subdirectory called "Ref:946" (see the listing above).

The disk data is stored in a sequence of chunks. The size of each chunk is arbitrary but the common size is 1MiB. The chunks are ordered in a sequence, and the filename encodes a *sequence number* (not a disk offset). The first block is always present and given sequence number 00000000. Gaps in the sequence number space signal the presence of empty blocks, each the same size as the first block. For example, in the Ref:946 disk above, there is a gap of 6 between block 00000000 and 00000007, signalling the presence of 6 empty blocks of size 1MiB. The last block is always present to give a reader confidence that the stream has not been truncated.
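The gap arithmetic above can be expressed directly; a sketch, where the input is the sorted list of sequence numbers that actually have files in the tar:

```python
def implied_empty_blocks(present):
    """Count blocks implied empty by gaps in the sequence numbers.

    'present' is the sorted list of sequence numbers with files present;
    e.g. [0, 7] implies blocks 1..6 are empty (6 blocks), matching the
    Ref:946 disk in the listing above."""
    return sum(b - a - 1 for a, b in zip(present, present[1:]))
```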

Each block is followed by a checksum. For a block with filename 00000000, the checksum would be 00000000.checksum. The checksum file contains the SHA1 of the disk contents in the corresponding data block, printed as a hex string i.e. the same output as given by

$ sha1sum Ref\:946/00000000 | cut -f 1 -d " "
3b71f43ff30f4b15b5cd85dd9e95ebc7e84eb5a3
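A reader can therefore verify each block with a comparison like this (a sketch; the function name is illustrative):

```python
import hashlib

def checksum_ok(block, checksum_text):
    """Check a data block against its .checksum companion file, which
    holds the SHA1 of the block contents as a hex string."""
    return hashlib.sha1(block).hexdigest() == checksum_text.strip()
```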

Note it is legal to insert empty data blocks into the stream, which consume a sequence number and which have a file_size of 0. Exporters may wish to insert these empty blocks if they are taking the time to scan the outgoing data for blocks of zeroes, in order to keep the data flowing and to prevent a network middlebox from silently closing the (apparently idle) connection.

HTTP BITS via transfer VM

TODO: fill this in later

vhd manipulation through plugins

OpenStack installs "plugins" in domain 0 and invokes these via the XenAPI. The plugins directly manipulate .vhd files contained within local "ext" and remote "nfs" SRs by:

  1. moving new files in
  2. updating the parent locator to create chains

Advantages:

  1. the plugins can be developed separately to XenServer

Disadvantages:

  1. the plugins run concurrently with coalesce and GC, potentially corrupting vhd chains or losing data
  2. the plugins typically only support the file-based SR types, and leave out LVM. Supporting LVM is genuinely quite difficult and getting it wrong can cause data loss (e.g. through truncated device mapper entries)

Desired use-cases

From Mate Lakat on xs-devel:

  • Full disk input/output
    • User wants to stream a virtual disk to OpenStack glance quickly and efficiently, where Glance acts as an HTTP server
    • User wants to fetch a virtual disk from OpenStack glance quickly and efficiently, where Glance acts as an HTTP server

Regarding "quickly and efficiently", this was explained by John Garbutt on xs-devel:

A key requirement is to not use large amounts of memory or disk space (10s of MB at most), when exporting disks up to about 1.5 TB in size, where they might need to be tarballed, then gzipped, before being sent to a client. That's why I suggested a bash pipe, but it doesn't feel like the best approach.
  • Incremental backup/restore (for regular server virt as well as CloudStack)
    • User wants to make a full copy of a virtual disk (e.g. weekly)
    • User wants to take frequent (e.g. daily) incremental snapshots of the disk
    • User wants to export only the snapshot deltas (to save bandwidth)
    • User wants to limit the number of snapshots present (to save space and chain length, on both client and server)
    • User wants to import a full snapshot plus a set of deltas, to recover the disk
  • User wants to pipe a virtual disk through a bit-torrent client (or other third-party transcoding software) (see John Garbutt on xs-devel)

Principles

This section proposes some principles to guide the shape of the API.

We shall

  1. continue to support all currently supported protocols and formats.
  2. default to generating export disk images in a well-known format with good documentation and sample code.
    1. We could either invent a format and spend enough time to make it "well-known" or we could use someone else's format and share the documentation burden with them
  3. work across all XenServer SR types, whether they happen to use our well-known format (e.g. vhd) or not
    1. NB the format we choose to export and import in may be different to the common runtime format, since export/import may be optimised for readability while runtime will be optimised for random access performance. This is ok as long as efficient streaming conversion is possible.
  4. allow access to disk data using well-known protocols with good documentation and sample code.
  5. support both streaming input/output and random-access to/from external hosts
    1. when streaming we shall support resumption of interrupted transfers
  6. maximise efficiency by keeping the datapath short; by avoiding unnecessary copying; by using aligned I/O everywhere.
  7. keep it simple by hiding internal XenServer interfaces from clients.
  8. provide CLI wrappers around all of our lower-level APIs.

API changes

We will continue to use the same authentication strategy for HTTP requests i.e.

  • we will accept basic auth for convenient commandline use with wget/curl
  • we will accept pre-allocated session_id parameters via query parameters or cookies

We will build upon the concept of a *content_id*, first introduced for storage migration. Unlike a VDI's uuid, which uniquely identifies the storage *container* whose data constantly changes, the *content_id* identifies the state of the data blocks at a point in time. When a snapshot is taken, the content_id remains the same since the content is now read-only. When a disk is written to, the content_id is changed to a fresh unique value. If two VDIs have the same content_id then, irrespective of their heritage and tree structure, we know they are equivalent.
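These rules can be modelled in a few lines; a toy sketch (the class and helper names are illustrative, not part of the API):

```python
import uuid

class VDI:
    """Toy model of the content_id rules: a snapshot shares the
    content_id of its source; any write invalidates it."""
    def __init__(self):
        self.uuid = str(uuid.uuid4())        # identifies the container
        self.content_id = str(uuid.uuid4())  # identifies the data state

    def snapshot(self):
        snap = VDI()
        snap.content_id = self.content_id  # read-only copy: same content
        return snap

    def write(self, *_data):
        self.content_id = str(uuid.uuid4())  # content has diverged

def same_content(a, b):
    """Equivalent data, irrespective of heritage or tree structure."""
    return a.content_id == b.content_id
```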

During an HTTP download we will allow the client to request a range of bytes, in order to resume an aborted download.

HTTP full disk download

  • Client:
    • HTTP GET /SM/sr-uuid/vdi-uuid
  • Server:
    • Accept-Ranges: bytes
      • indicating that byte range requests are acceptable
    • Content-Disposition: attachment; filename=content_id.vhd
      • encouraging the client to set the filename to the content_id, rather than the uuid. This is needed to handle chains properly, see below.
    • Streams out data as a consolidated dynamic .vhd file
      • NB .vhd can be constructed on the fly: there is no need to use large amounts of temporary memory or space on the server.

For convenience this will be wrapped in the xe command:

xe vdi-export uuid=<vdi-uuid> --progress
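Because the server advertises Accept-Ranges: bytes, a client can resume an aborted download with a standard Range request; a sketch (path layout as proposed above, helper names illustrative):

```python
def download_path(sr_uuid, vdi_uuid):
    """The proposed full-disk download path."""
    return "/SM/%s/%s" % (sr_uuid, vdi_uuid)

def resume_headers(bytes_already_written):
    """Request the remainder of the image from the given offset; a server
    honouring Accept-Ranges: bytes replies 206 Partial Content."""
    return {"Range": "bytes=%d-" % bytes_already_written}
```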

HTTP delta disk download

To download the differences which would need to be applied on top of vdi-uuid2 (with content_id2) in order to end up with the same content as vdi-uuid (with content_id):

  • Client:
    • HTTP GET /SM/sr-uuid/vdi-uuid/differences-from/sr-uuid2/vdi-uuid2
  • Server:
    • Accept-Ranges: bytes
    • Content-Disposition: attachment; filename=content_id.vhd
    • Streams out a .vhd differencing disk with the parent locator set to "content_id2.vhd"

For convenience this will be wrapped in the xe command:

xe vdi-export uuid=<vdi-uuid> differences-from=<vdi-uuid2> --progress

Coalesce backups

Coalescing needs to happen in two places:

  1. on the XenServer
  2. on the machine handling the backups

For the XenServer case, a coalesce is triggered as a background task when snapshots are deleted. The user needs to select which snapshots to delete based on their preferences, which could be:

  1. delete all snapshots older than 'n' days
  2. delete all snapshots except the newest 'm'

The low-level interface is simply

xe vdi-destroy uuid=...

TODO: do we need to provide extra help in the form of policy-specific high-level commands?

To delete all snapshots older than 'n' days we could use a CLI command:

xe vdi-destroy snapshot-of=<vdi uuid> snapshot-time="older than n days" --multiple

To delete all snapshots except the newest 'm'

TODO: figure out something reasonable here
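One possible shape for both retention policies is a pure selection function that a high-level command (or the backup machine's scripts) could call before issuing vdi-destroy; this is only a sketch, and the function name and parameters are hypothetical:

```python
from datetime import datetime, timedelta

def to_delete(snapshots, now, older_than_days=None, keep_newest=None):
    """Select snapshot uuids to destroy under the two policies above.

    'snapshots' is a list of (uuid, snapshot_time) pairs;
    'keep_newest', if given, is assumed to be >= 1."""
    snapshots = sorted(snapshots, key=lambda s: s[1])  # oldest first
    doomed = set()
    if older_than_days is not None:
        cutoff = now - timedelta(days=older_than_days)
        doomed |= {u for u, t in snapshots if t < cutoff}
    if keep_newest is not None:
        doomed |= {u for u, _ in snapshots[:-keep_newest]}
    return doomed
```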

For the machine handling the backups, we'll provide a simple command-line tool which will coalesce a snapshot into its parent eg

vhd-tool coalesce ...

HTTP full/delta disk upload

A full disk is a base dynamic .vhd plus zero or more differencing .vhds. The full disk is identified by the last delta disk.

We shall support a low-level HTTP primitive:

  • Client:
    • HTTP PUT /SM/sr-uuid
    • Content-type: application/vhd
    • Client uploads the first few kb of the .vhd, which includes the header and backup footer
  • Server:
    • HTTP/201 Created
    • Location: /SM/sr-uuid/vdi-uuid

The server will detect whether the .vhd is a dynamic vhd or a differencing .vhd. If it's a differencing .vhd it can create the new VDI as a snapshot of the one given by the content_id in the parent locator.

If the .vhd is a differencing .vhd but the parent locator cannot be found then an HTTP/404 error code will be returned.
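The detection step can be done from the backup footer alone, which is why uploading the first few KiB is sufficient. A sketch based on the published VHD footer layout (512-byte footer starting with the "conectix" cookie; disk type is a big-endian 32-bit field at offset 60, with 3 = dynamic and 4 = differencing):

```python
import struct

VHD_COOKIE = b"conectix"
DISK_TYPES = {2: "fixed", 3: "dynamic", 4: "differencing"}

def vhd_disk_type(footer):
    """Classify a VHD from its footer; dynamic and differencing VHDs
    also carry a copy of this footer at the start of the file."""
    if len(footer) < 64 or footer[:8] != VHD_COOKIE:
        raise ValueError("not a VHD footer")
    (disk_type,) = struct.unpack_from(">I", footer, 60)  # big-endian
    return DISK_TYPES.get(disk_type, "unknown")
```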

For convenience this will be wrapped in the xe command:

xe vdi-import filename=foo.vhd

If foo.vhd is a differencing .vhd then the CLI will upload the base disks first, snapshot those and continue.
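The base-first ordering can be sketched as a walk up the parent locators (the mapping and function name below are illustrative):

```python
def upload_order(parent_of, leaf):
    """Given a mapping child -> parent (None for the base disk) and the
    leaf .vhd to import, return the chain ordered base-first: the order
    in which the CLI must upload and snapshot the disks."""
    chain = []
    node = leaf
    while node is not None:
        chain.append(node)
        node = parent_of[node]
    chain.reverse()
    return chain
```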

Attaching VDIs to remote hosts

We shall support NBD over pre-authenticated connections, but without clients having to call SMAPI VDI.{attach,activate,deactivate,detach} functions.

  • the Client:
    • HTTP CONNECT /SM/sr-uuid/vdi-uuid
    • Accept: nbd
  • the Server:
    • HTTP/200 OK
  • both client and server can exchange NBD messages
  • when the client closes the connection to the server then all the data is cleaned up
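The handshake above amounts to a single HTTP CONNECT before the socket switches to raw NBD; a sketch of the request bytes (path layout as proposed above):

```python
def nbd_connect_request(sr_uuid, vdi_uuid):
    """The proposed pre-authenticated NBD handshake: an HTTP CONNECT to
    the disk's path with 'Accept: nbd'. After the 200 OK reply the same
    socket carries NBD messages until the client hangs up, at which
    point the server cleans up all attach/activate state itself."""
    return ("CONNECT /SM/%s/%s HTTP/1.0\r\n"
            "Accept: nbd\r\n"
            "\r\n" % (sr_uuid, vdi_uuid)).encode("ascii")
```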

VM import/export

We shall replace the ad-hoc 'chunked' encoding with .vhd in the export format. We will continue to support both 'chunked' and 'vhd' for input.


Open issues

The focus here is on shifting the bits around. We've not discussed uploading/downloading/backing up VDI metadata (name-label, other-config keys etc)