Difference between revisions of "Proposal: Disk import/export"

From Xen

Revision as of 10:12, 15 October 2013

Proposal to improve disk import/export

There are various ways to move a disk into or out of a system via the XenAPI, both "official" (i.e. via an API) and "unofficial" (i.e. by working around missing APIs). Many of the "official" ways to move disks around use non-standard and poorly-documented protocols and formats, which inhibits interoperability with other services such as CloudStack and OpenStack. The "unofficial" ways are often risky since they potentially conflict with running operations (e.g. vhd coalesce).

Disk import/export is a fundamental ability of a virtualisation platform, and good support for it is demanded by all users: traditional server virt users need off-site incremental backup; cloud orchestration layers need to deploy images from central repositories; everyone benefits from being able to quickly move gold image disk volumes around.

This document starts with an overview of what we currently have; followed by an analysis of use-cases we'd like to support (or improve our support for); followed by a set of principles to govern our designs; and finally by a proposed set of APIs and CLI commands.

What do we currently have?

The following sections describe the current mechanisms, who is known to use them, advantages and disadvantages of each.

HTTP GET raw disk contents

(In XenServer 6.2 and above)

HTTP PUT raw disk contents

  1. (Optionally) The client calls XenAPI Task.create if it wants to be able to tell if the operation succeeded or failed. This step is optional so that quick uploads can be performed entirely over HTTP.
  2. The client sends an authenticated HTTP PUT to /import_raw_vdi?vdi=(ref or uuid). If the client called Task.create it can add a "task_id=task reference" query parameter or cookie. Authentication can be either
    1. basic auth (convenient for commandline usage with wget/curl)
    2. a pre-created session_id query parameter or cookie header
  3. The server
    1. if the authentication cannot be verified then returns an HTTP 403 forbidden
    2. if the VDI query parameter (or cookie) is not present then an uncaught exception causes the server to return HTTP 500 Internal server error
    3. if the VDI ref or uuid doesn't exist then an uncaught exception causes the server to return HTTP 500 Internal server error
    4. if the VDI is only accessible on a remote host then an HTTP 302 redirect is returned using the Host.address field (this won't work if a client is behind a NAT)
    5. If the client requests any HTTP transfer-encoding, the server returns HTTP 403 forbidden
    6. If all looks ok, the server returns HTTP 200 OK with headers
      1. content-type: application/octet-stream
      2. connection: close
      3. task-id: XenAPI task ID
  4. The client
    1. writes the unencoded disk contents
    2. closes the connection at the end
  5. (Optionally) The client waits until Task.get_finished is true, and then checks the value of Task.get_status to find out whether the operation succeeded or failed. Reasons the task may fail include
    1. insufficient space in the VDI for the data provided
    2. I/O errors writing to the backend substrate
  6. If the client called Task.create then it now calls Task.destroy

This can all be driven through the xe CLI command:

 xe vdi-import uuid=<target VDI> filename=<raw image>

This command takes care of authentication, Task handling and error reporting.
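
For a client that drives the protocol directly rather than through xe, the request in step 2 could be assembled roughly as follows. This is only a sketch in Python: the /import_raw_vdi path and the "vdi" and "task_id" query keys come from the steps above, while the function name, its return shape and the example credentials are made up for illustration. Sending the body is deliberately left out, because in the steps above the server answers before the client writes the disk contents, and ordinary HTTP client libraries may not follow that ordering.

```python
import base64
from urllib.parse import urlencode


def build_import_request(vdi, username, password, task_id=None):
    """Return (path, headers) for the HTTP PUT described in step 2.

    Hypothetical helper: only the path and the "vdi" / "task_id" query
    keys are taken from the text, everything else is an assumption.
    """
    params = {"vdi": vdi}
    if task_id is not None:
        # Only useful if the client called Task.create first (step 1).
        params["task_id"] = task_id
    path = "/import_raw_vdi?" + urlencode(params)

    # HTTP basic auth: base64 of "user:password".
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    headers = {"Authorization": "Basic " + token}
    return path, headers


# Example values only; the uuid and credentials are placeholders.
path, headers = build_import_request(
    "6b2e0a1c-0000-4000-8000-000000000000", "root", "secret"
)
print(path)
```

Note that no transfer-encoding header is set, since step 3.5 says the server rejects any HTTP transfer-encoding.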

Advantages:

  1. simple, can be driven entirely through a wget/curl invocation (without error reporting) or via the CLI

Disadvantages:

  1. import only
  2. only supports raw format images
  3. no support for uploading deltas
  4. requires you to pre-create an image of the right size
  5. doesn't allow an aborted transfer to be resumed
  6. no progress monitoring

HTTP PUT with 'chunked' encoding

This is a version of the HTTP PUT with raw encoding with the following differences:

  1. The client's HTTP PUT request contains the key "chunked" in either a query parameter or a cookie
  2. The client sends the data in a stream of chunks, consisting of a header followed by a payload

The header format is:

  1. 64-bit little-endian: offset: the offset within the target disk in bytes
  2. 32-bit little-endian: length: the number of bytes of data in this 'chunk'

The length field measures the length of the payload which follows the header.

The stream is terminated by a single chunk with both offset and length set to 0.
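
The framing above can be sketched in Python with the struct module. The header layout (64-bit little-endian offset, 32-bit little-endian length) and the terminating chunk are taken from the description; treating both fields as unsigned, the function names, and the policy of skipping all-zero blocks are assumptions made for this illustration.

```python
import struct

# 64-bit little-endian offset followed by 32-bit little-endian length.
# Unsigned fields are an assumption; the text does not state signedness.
HEADER = struct.Struct("<QI")


def encode_chunk(offset, payload):
    """Frame one payload with its offset/length header."""
    return HEADER.pack(offset, len(payload)) + payload


def encode_sparse_stream(data, block_size=1024 * 1024):
    """Encode data block by block, omitting blocks that are all zeroes."""
    out = bytearray()
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        if block.count(0) == len(block):
            continue  # a hole: nothing is sent for this range
        out += encode_chunk(offset, block)
    out += HEADER.pack(0, 0)  # terminator: offset and length both 0
    return bytes(out)


def decode_stream(stream):
    """Return the list of (offset, payload) pairs up to the terminator."""
    pos = 0
    chunks = []
    while True:
        offset, length = HEADER.unpack_from(stream, pos)
        pos += HEADER.size
        if offset == 0 and length == 0:
            return chunks
        chunks.append((offset, stream[pos:pos + length]))
        pos += length


# A tiny 8-byte "disk" with a 4-byte hole at the start.
stream = encode_sparse_stream(b"\x00\x00\x00\x00abcd", block_size=4)
print(decode_stream(stream))
```

Only the non-zero block is transmitted here, which is the saving that makes this encoding attractive for sparse disks.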

There is no support in the CLI. This is known to be used by XenDesktop.

Advantages:

  1. allows sparse disks (containing big holes full of zeroes) to be uploaded efficiently

Disadvantages:

  1. upload only
  2. no ability to resume an interrupted upload
  3. no CLI support
  4. non-standard protocol, only available documentation seems to be this wiki page (!) after previous page was deleted
  5. the only opportunity for error reporting is at the end

Network Block Device (NBD) access

VM import/export

We support 2 VM export formats:

  1. "geneva": this was the primary export format before XS 4.0 and should not be generated any more. We only support it just in case very old VM exports are still around. It will not be documented here.
  2. "xva": this is the primary export format used in XS 4.0 and later

An "xva" export (also referred to as the "rio" format) consists of a tar file containing VM metadata and disk blocks. Example contents of a VM export:

$ tar -tvf test.xva
---------- 0/0           17391 1970-01-01 01:00 ova.xml
---------- 0/0         1048576 1970-01-01 01:00 Ref:946/00000000
---------- 0/0              40 1970-01-01 01:00 Ref:946/00000000.checksum
---------- 0/0         1048576 1970-01-01 01:00 Ref:946/00000007
---------- 0/0              40 1970-01-01 01:00 Ref:946/00000007.checksum
---------- 0/0         1048576 1970-01-01 01:00 Ref:949/00000000
---------- 0/0              40 1970-01-01 01:00 Ref:949/00000000.checksum
---------- 0/0         1048576 1970-01-01 01:00 Ref:949/00000003
---------- 0/0              40 1970-01-01 01:00 Ref:949/00000003.checksum

The ova.xml contains a set of records, one per object in the xapi database. For a VM it will include VBDs, VDIs, SRs, VIFs, Networks. If requested at export time it will also include details of VM snapshots, their VBDs, VDIs etc. The objects are linked together by references (Ref:xxx). The disk content corresponding to VDI with reference Ref:946 will be in a subdirectory called "Ref:946" (see the listing above).

The disk data is stored in a sequence of chunks. The size of each chunk is arbitrary but the common size is 1MiB. The chunks are ordered in a sequence, and the filename encodes a *sequence number* (not a disk offset). The first block is always present and given sequence number 00000000. Gaps in the sequence number space signal the presence of empty blocks, each the same size as the first block. For example, in the Ref:946 disk above, there is a gap of 6 between block 00000000 and 00000007, signalling the presence of 6 empty blocks of size 1MiB. The last block is always present to give a reader confidence that the stream has not been truncated.
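
As a rough illustration of the sequence-number rule, the following Python sketch takes the member names from a listing like the one above and works out where each present block would sit and how many empty blocks the gaps imply. It assumes that sequence numbers are decimal and that every block has the size of the first one; the function names are invented for the example.

```python
def parse_block_name(name):
    """Split "Ref:946/00000007" into ("Ref:946", 7).

    Assumes the sequence number is written in decimal.
    """
    ref, _, seq = name.rpartition("/")
    return ref, int(seq)


def layout_blocks(names, ref, block_size=1024 * 1024):
    """Return (offsets, empty_count) for the data blocks of one disk.

    offsets maps each present sequence number to its byte offset,
    assuming every block is block_size bytes long.
    """
    present = sorted(
        parse_block_name(n)[1]
        for n in names
        if "/" in n
        and not n.endswith(".checksum")
        and parse_block_name(n)[0] == ref
    )
    if not present or present[0] != 0:
        raise ValueError("the first block (sequence number 0) must be present")
    offsets = {seq: seq * block_size for seq in present}
    # Every sequence number below the last one that has no file is an empty block.
    empty_count = (present[-1] + 1) - len(present)
    return offsets, empty_count


names = [
    "ova.xml",
    "Ref:946/00000000",
    "Ref:946/00000000.checksum",
    "Ref:946/00000007",
    "Ref:946/00000007.checksum",
]
offsets, empty_count = layout_blocks(names, "Ref:946")
print(offsets, empty_count)
```

For the Ref:946 disk this reports 6 empty blocks, matching the gap between 00000000 and 00000007 described above.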

Each block is followed by a checksum. For a block with filename 00000000, the checksum would be 00000000.checksum. The checksum file contains the SHA1 of the disk contents in the corresponding data block, printed as a hex string i.e. the same output as given by

$ sha1sum Ref\:946/00000000 | cut -f 1 -d " "
3b71f43ff30f4b15b5cd85dd9e95ebc7e84eb5a3
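
The same check can be done programmatically. The sketch below uses Python's hashlib to compute the hex SHA1 of a block and compare it with the contents of the matching .checksum file; the function names are illustrative, and stripping whitespace before comparing is a defensive assumption rather than something the format requires.

```python
import hashlib


def block_checksum(data):
    """Hex SHA1 of a data block, as stored in the .checksum file."""
    return hashlib.sha1(data).hexdigest()


def verify_block(data, checksum_file_contents):
    """Compare a block against the bytes read from its .checksum file."""
    expected = checksum_file_contents.decode("ascii").strip().lower()
    return block_checksum(data) == expected


# The hex digest is 40 characters long, which matches the 40-byte
# .checksum entries in the tar listing.
digest = block_checksum(b"abc")
print(len(digest), verify_block(b"abc", digest.encode("ascii")))
```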

Note it is legal to insert empty data blocks into the stream, which consume a sequence number and which have a file_size of 0. Exporters may wish to insert these empty blocks if they are taking the time to scan the outgoing data for blocks of zeroes, in order to keep the data flowing and to prevent a network middlebox from silently closing the (apparently idle) connection.

HTTP BITS via transfer VM

vhd manipulation through plugins

Desired use-cases

Principles

This section proposes some principles to guide the shape of the API.

Possibilities include:

  1. use of a nominated standard format for all exported and imported disk data (e.g. vhd now; qcow2 or vhdx later?). Note this doesn't say anything about the runtime format of the disk.
  2. always supporting resumption of interrupted transfers, since vhds are large

API changes