Difference between revisions of "Proposal: Disk import/export"

Revision as of 08:27, 17 October 2013

Proposal to improve disk import/export

There are various ways to move a disk into or out of a system via the XenAPI, both "official" (i.e. via an API) and "unofficial" (i.e. by working around missing APIs). Many of the "official" ways to move disks around use non-standard and poorly-documented protocols and formats, which inhibits interoperability with other services such as CloudStack and OpenStack. The "unofficial" ways are often risky since they potentially conflict with running operations (e.g. vhd coalesce).

Disk import/export is a fundamental ability of a virtualisation platform, and good support for it is demanded by all users: traditional server virt users need off-site incremental backup; cloud orchestration layers need to deploy images from central repositories; everyone benefits from being able to quickly move gold image disk volumes around.

This document starts with an overview of what we currently have; followed by an analysis of use-cases we'd like to support (or improve our support for); followed by a set of principles to govern our designs; and finally by a proposed set of APIs and CLI commands.

What do we currently have?

The following sections describe the current mechanisms, who is known to use them, advantages and disadvantages of each.

HTTP GET raw disk contents

(In XenServer 6.2 and above)

(Optionally) The client calls XenAPI Task.create if it wants to be able to tell if the operation succeeded or failed. This is only optional to allow quick downloads entirely over HTTP.
The client sends an authenticated HTTP GET to /export_raw_vdi?vdi=(ref or uuid). If the client called Task.create it can add a "task_id=task reference" query parameter or cookie. Authentication can be either
1. basic auth (convenient for commandline usage with wget/curl)
2. a pre-created session_id query parameter or cookie header
The server
1. if the authentication cannot be verified then returns an HTTP 403 forbidden
2. if the VDI query parameter (or cookie) is not present then an uncaught exception causes the server to return HTTP 500 Internal server error
3. if the VDI ref or uuid doesn't exist then an uncaught exception causes the server to return HTTP 500 Internal server error
4. if the VDI is only accessible on a remote host then an HTTP 302 redirect is returned using the Host.address field (this won't work if a client is behind a NAT)
5. If all looks ok, the server returns HTTP 200 OK with headers
  1. content-type: application/octet-stream
  2. connection: close
  3. task-id: XenAPI task ID
The server writes the unencoded disk contents
The server closes the connection at the end
(Optionally) The client waits until Task.get_finished is true, and then checks the value of Task.get_status to find out whether the operation succeeded or failed. Reasons the task may fail include
1. I/O errors reading from the backend substrate
If the client called Task.create then it now calls Task.destroy

There is no xe CLI command.
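
In the absence of a CLI command, a client has to assemble the request itself. The following is a minimal sketch of building the request URL from the query parameters described above. The function name, the use of https and the host name are assumptions made for illustration; only the path and the parameter names (vdi, session_id, task_id) come from the description.

```python
from urllib.parse import urlencode, urlunsplit


def export_raw_vdi_url(host, vdi, session_id=None, task_id=None):
    """Build the /export_raw_vdi URL for a VDI ref or uuid.

    session_id and task_id are optional: with basic auth and no Task,
    the vdi parameter alone is enough.
    """
    params = {"vdi": vdi}
    if session_id is not None:
        params["session_id"] = session_id
    if task_id is not None:
        params["task_id"] = task_id
    # The scheme is an assumption; the description does not specify one.
    return urlunsplit(("https", host, "/export_raw_vdi", urlencode(params), ""))
```

The resulting URL can be handed to wget/curl or to any HTTP library. A client that wants error reporting would still need to create the Task first and poll it afterwards.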

Advantages:

simple, can be driven entirely through a wget/curl invocation (without error reporting)

Disadvantages:

missing CLI command
only supports raw format images
no support for downloading deltas
doesn't allow an aborted transfer to be resumed

HTTP PUT raw disk contents

(Optionally) The client calls XenAPI Task.create if it wants to be able to tell if the operation succeeded or failed. This is only optional to allow quick uploads entirely over HTTP.
The client sends an authenticated HTTP PUT to /import_raw_vdi?vdi=(ref or uuid). If the client called Task.create it can add a "task_id=task reference" query parameter or cookie. Authentication can be either
1. basic auth (convenient for commandline usage with wget/curl)
2. a pre-created session_id query parameter or cookie header
The server
1. if the authentication cannot be verified then returns an HTTP 403 forbidden
2. if the VDI query parameter (or cookie) is not present then an uncaught exception causes the server to return HTTP 500 Internal server error
3. if the VDI ref or uuid doesn't exist then an uncaught exception causes the server to return HTTP 500 Internal server error
4. if the VDI is only accessible on a remote host then an HTTP 302 redirect is returned using the Host.address field (this won't work if a client is behind a NAT)
5. If the client requests any HTTP transfer-encoding, the server returns HTTP 403 forbidden
6. If all looks ok, the server returns HTTP 200 OK with headers
  1. content-type: application/octet-stream
  2. connection: close
  3. task-id: XenAPI task ID
The client
1. writes the unencoded disk contents
2. closes the connection at the end
(Optionally) The client waits until Task.get_finished is true, and then checks the value of Task.get_status to find out whether the operation succeeded or failed. Reasons the task may fail include
1. insufficient space in the VDI for the data provided
2. I/O errors writing to the backend substrate
If the client called Task.create then it now calls Task.destroy

This can all be driven through the xe CLI command:

 xe vdi-import uuid=<target VDI> filename=<raw image>

This command takes care of authentication, Task handling and error reporting.
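
For clients that do not use the CLI, the upload request could be prepared roughly as follows. This is a sketch under assumptions: the function name, the https scheme and the credentials are illustrative, and the body is passed as bytes only to keep the example small.

```python
import base64
import urllib.request


def import_raw_vdi_request(host, vdi, username, password, data):
    """Prepare an HTTP PUT to /import_raw_vdi using basic auth."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode("ascii")
    req = urllib.request.Request(
        f"https://{host}/import_raw_vdi?vdi={vdi}",  # scheme is an assumption
        data=data,
        method="PUT",
    )
    req.add_header("Authorization", f"Basic {token}")
    return req
```

Because the body is a bytes object, urllib sends it with a Content-Length header. When passed a file object without an explicit Content-Length, urllib falls back to "Transfer-Encoding: chunked", which the server described above rejects, so a streaming client would need to set Content-Length itself.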

Advantages:

simple, can be driven entirely through a wget/curl invocation (without error reporting) or via the CLI

Disadvantages:

only supports raw format images
no support for uploading deltas
requires you to pre-create an image of the right size
doesn't allow an aborted transfer to be resumed
CLI has no progress monitoring

HTTP PUT with 'chunked' encoding

This is a version of the HTTP PUT with raw encoding with the following differences:

The client's HTTP PUT request contains the key "chunked" in either a query parameter or a cookie
The client sends the data in a stream of chunks, consisting of a header followed by a payload

The header format is:

64-bit little-endian: offset: the offset within the target disk in bytes
32-bit little-endian: length: the number of bytes of data in this 'chunk'

The length field measures the length of the payload which follows the header.

The stream is terminated by a single chunk with both offset and length set to 0.
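
The framing above can be sketched in a few lines. The header layout (64-bit little-endian offset, 32-bit little-endian length) and the all-zero terminator are taken from the description; the 4096-byte block size, the function names and the decision to skip all-zero blocks are assumptions made for illustration.

```python
import struct

# 64-bit offset followed by 32-bit length, both little-endian, no padding.
HEADER = struct.Struct("<QI")


def encode_chunks(disk, block_size=4096):
    """Yield header+payload for each non-zero block, then the terminator."""
    zero = bytes(block_size)
    for offset in range(0, len(disk), block_size):
        payload = disk[offset:offset + block_size]
        if payload != zero[:len(payload)]:
            yield HEADER.pack(offset, len(payload)) + payload
    # A chunk with offset and length both 0 ends the stream.
    yield HEADER.pack(0, 0)


def decode_chunks(stream, disk_size):
    """Apply a chunk stream to a zero-filled disk of the given size."""
    disk = bytearray(disk_size)
    pos = 0
    while True:
        offset, length = HEADER.unpack_from(stream, pos)
        pos += HEADER.size
        if offset == 0 and length == 0:
            return bytes(disk)
        disk[offset:offset + length] = stream[pos:pos + length]
        pos += length
```

A real data chunk at offset 0 always has a non-zero length, so it cannot be confused with the terminator.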

There is no support in the CLI. This is known to be used by XenDesktop.

Advantages:

allows sparse disks (containing big holes full of zeroes) to be uploaded efficiently

Disadvantages:

upload only
no ability to resume an interrupted upload
no CLI support
non-standard protocol, only available documentation seems to be this wiki page (!) after previous page was deleted
the only opportunity for error reporting is at the end

Network Block Device (NBD) access

This applies to XenServer 6.2 and later.

For all the HTTP requests described below, authentication can be either

basic auth (convenient for commandline usage with wget/curl)
a pre-created session_id query parameter or cookie header

For a given VDI uuid $VDI contained within SR uuid $SR:

The client generates a fresh $UUID
The client calls the internal storage API VDI.attach by
1. making an HTTP POST to /services/SM
2. containing an XMLRPC request. TODO: generate documentation for this.
The client calls the internal storage API VDI.activate by
1. making an HTTP POST to /services/SM
2. containing an XMLRPC request. TODO: generate documentation for this.
The client makes an HTTP PUT to /services/SM/nbd/$SR/$VDI/$UUID
The server replies with HTTP 200 OK including the header
1. Transfer-encoding: nbd

At this point the client is connected to an NBD server and can access the disk.

To finalise:

The client closes the NBD connection
The client calls the internal storage API VDI.deactivate by
1. making an HTTP POST to /services/SM
2. containing an XMLRPC request. TODO: generate documentation for this
The client calls the internal storage API VDI.detach by
1. making an HTTP POST to /services/SM
2. containing an XMLRPC request. TODO: generate documentation for this
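
The ordering of the setup and finalise steps can be sketched as a context manager, which makes the clean-up happen even if the client fails part-way through. Since the XMLRPC request format is not documented here, the `storage` object and the argument lists of its attach/activate/deactivate/detach methods are hypothetical stand-ins; only the call order and the NBD URL path come from the description above.

```python
import uuid
from contextlib import contextmanager


def nbd_path(sr, vdi, client_id):
    """Path for the HTTP PUT that switches the connection to NBD."""
    return f"/services/SM/nbd/{sr}/{vdi}/{client_id}"


@contextmanager
def attached_vdi(storage, sr, vdi):
    """Attach and activate a VDI, yielding the NBD path, then undo both.

    `storage` is a hypothetical wrapper around the internal storage API.
    """
    client_id = str(uuid.uuid4())  # the fresh $UUID
    storage.attach(client_id, sr, vdi)
    try:
        storage.activate(client_id, sr, vdi)
        try:
            yield nbd_path(sr, vdi, client_id)
        finally:
            storage.deactivate(client_id, sr, vdi)
    finally:
        # Always detach, otherwise resources are left behind on the host.
        storage.detach(client_id, sr, vdi)
```

This only helps while the client process stays alive: if the client is killed outright, nothing calls VDI.detach.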

Advantages:

The NBD protocol is well supported by Linux and BSD
The NBD protocol is used by qemu/KVM for easy interop
The NBD protocol is understood by wireshark for easy debugging

Disadvantages:

The current "wrapping" of the protocol is barking mad:
1. It exposes internal storage APIs across hosts, holding back storage API evolution
2. It requires the client to clean up everything at the end: the tapdisk resource is left lying around until VDI.detach is called

VM import/export

We support 2 VM export formats:
1. "geneva": this was the primary export format before XS 4.0 and should not be generated any more. We only support it just in case very old VM exports are still around. It will not be documented here.
2. "xva": this is the primary export format used in XS 4.0 and later

A "rio" export format consists of a tar file containing VM metadata and disk blocks. Example contents of a VM export:

$ tar -tvf test.xva
---------- 0/0           17391 1970-01-01 01:00 ova.xml
---------- 0/0         1048576 1970-01-01 01:00 Ref:946/00000000
---------- 0/0              40 1970-01-01 01:00 Ref:946/00000000.checksum
---------- 0/0         1048576 1970-01-01 01:00 Ref:946/00000007
---------- 0/0              40 1970-01-01 01:00 Ref:946/00000007.checksum
---------- 0/0         1048576 1970-01-01 01:00 Ref:949/00000000
---------- 0/0              40 1970-01-01 01:00 Ref:949/00000000.checksum
---------- 0/0         1048576 1970-01-01 01:00 Ref:949/00000003
---------- 0/0              40 1970-01-01 01:00 Ref:949/00000003.checksum

The ova.xml contains a set of records, one per object in the xapi database. For a VM it will include VBDs, VDIs, SRs, VIFs, Networks. If requested at export time it will also include details of VM snapshots, their VBDs, VDIs etc. The objects are linked together by references (Ref:xxx). The disk content corresponding to VDI with reference Ref:946 will be in a subdirectory called "Ref:946" (see the listing above).

The disk data is stored in a sequence of chunks. The size of each chunk is arbitrary but the common size is 1MiB. The chunks are ordered in a sequence, and the filename encodes a *sequence number* (not a disk offset). The first block is always present and given sequence number 00000000. Gaps in the sequence number space signal the presence of empty blocks, each the same size as the first block. For example, in the Ref:946 disk above, there is a gap of 6 between block 00000000 and 00000007, signalling the presence of 6 empty blocks of size 1MiB. The last block is always present to give a reader confidence that the stream has not been truncated.
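
To illustrate how a reader might turn sequence numbers back into a disk layout, here is a minimal sketch. The function name is hypothetical, and it assumes every block has the same size as the first one (1MiB here); it does not handle zero-length blocks.

```python
def expand_blocks(sequence_numbers, block_size=1024 * 1024):
    """Map sorted chunk sequence numbers to (disk offset, present) pairs.

    Every sequence number missing from the input stands for one empty
    block of block_size bytes.
    """
    layout = []
    expected = 0
    for seq in sequence_numbers:
        for missing in range(expected, seq):
            layout.append((missing * block_size, False))
        layout.append((seq * block_size, True))
        expected = seq + 1
    return layout
```

For the Ref:946 disk in the listing, `expand_blocks([0, 7])` describes eight blocks: the first and last are present and the six in between are empty.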

Each block is followed by a checksum. For a block with filename 00000000, the checksum would be 00000000.checksum. The checksum file contains the SHA1 of the disk contents in the corresponding data block, printed as a hex string i.e. the same output as given by

$ sha1sum Ref\:946/00000000 | cut -f 1 -d " "
3b71f43ff30f4b15b5cd85dd9e95ebc7e84eb5a3
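
An importer can verify each block in the same way. A minimal sketch, assuming the block and its .checksum file have already been read from the tar stream into memory (the function name is illustrative):

```python
import hashlib


def checksum_matches(block, checksum_file_contents):
    """Compare a data block against the contents of its .checksum file."""
    expected = checksum_file_contents.decode("ascii").strip()
    return hashlib.sha1(block).hexdigest() == expected
```

The hex digest is 40 characters long, which matches the size of the .checksum entries in the listing above.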

Note it is legal to insert empty data blocks into the stream, which consume a sequence number and which have a file_size of 0. Exporters may wish to insert these empty blocks if they are taking the time to scan the outgoing data for blocks of zeroes, in order to keep the data flowing and to prevent a network middlebox from silently closing the (apparently idle) connection.

HTTP BITS via transfer VM

TODO: fill this in later

vhd manipulation through plugins

OpenStack installs "plugins" in domain 0 and invokes these via the XenAPI. The plugins directly manipulate .vhd files contained within local "ext" and remote "nfs" SRs by:

moving new files in
updating the parent locator to create chains

Advantages:

the plugins can be developed separately to XenServer

Disadvantages:

the plugins run concurrently with coalesce and GC, potentially corrupting vhd chains or losing data
the plugins typically only support the file-based SR types, and leave out LVM. Supporting LVM is genuinely quite difficult and getting it wrong can cause data loss (e.g. through truncated device mapper entries)

Desired use-cases

From Mate Lakat on xs-devel:

Full disk input/output
- User wants to stream a virtual disk to OpenStack glance quickly and efficiently, where Glance acts as an HTTP server
- User wants to fetch a virtual disk from OpenStack glance quickly and efficiently, where Glance acts as an HTTP server

Regarding "quickly and efficiently", this was explained by John Garbutt on xs-devel:

A key requirement is to not use large amounts of memory or disk space
(10s of MB at most), when exporting disks up to about 1.5 TB in size,
where they might need to be tarballed, then gzipped, before being set
to a client. Thats why I suggested a bash pipe, but doesn't feel like
the best approach.

Incremental backup/restore (for regular server virt as well as CloudStack)
- User wants to make a full copy of a virtual disk (e.g. weekly)
- User wants to take frequent (e.g. daily) incremental snapshots of the disk
- User wants to export only the snapshot deltas (to save bandwidth)
- User wants to limit the number of snapshots present (to save space and chain length, on both client and server)
- User wants to import a full snapshot plus a set of deltas, to recover the disk

User wants to pipe a virtual disk through a bit-torrent client (or other third-party transcoding software) (see John Garbutt on xs-devel)

Principles

This section proposes some principles to guide the shape of the API.

We shall

support all currently supported protocols and formats.
default to generating export disk images in a well-known format with good documentation and sample code.
allow access to disk data using well-known protocols with good documentation and sample code.
support both streaming input/output and random-access.
1. when streaming we shall support resumption of interrupted transfers
maximise efficiency by keeping the datapath short; by avoiding unnecessary copying; by using aligned I/O everywhere.
keep it simple by hiding internal XenServer interfaces from clients.
provide CLI wrappers around all of our lower-level APIs.

API changes

We will continue to use the same authentication strategy for HTTP requests i.e.

we will accept basic auth for convenient commandline use with wget/curl
we will accept pre-allocated session_id parameters via query parameters or cookies

We will build upon the concept of a *content_id* first introduced for storage migration. Unlike a VDI's uuid, which uniquely identifies the storage *container* whose data constantly changes, the *content_id* identifies the state of the data blocks at a point in time. When a snapshot is taken the content_id will remain the same since the content is now read-only. When a disk is written to, the content_id is changed to a unique value. If two VDIs have the same content_id then, irrespective of their heritage and tree structure, we know they are equivalent.
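
The intended behaviour of content_id can be modelled with a toy class. This is only an illustration of the rules in the paragraph above, not of how xapi stores or updates the field; the class and method names are hypothetical.

```python
import uuid
from dataclasses import dataclass, field


def fresh_id():
    return str(uuid.uuid4())


@dataclass
class Vdi:
    vdi_uuid: str = field(default_factory=fresh_id)    # names the container
    content_id: str = field(default_factory=fresh_id)  # names the contents
    data: bytearray = field(default_factory=bytearray)

    def snapshot(self):
        """New container, same contents, therefore same content_id."""
        return Vdi(content_id=self.content_id, data=bytearray(self.data))

    def write(self, offset, payload):
        """Any write changes the contents, so the content_id is replaced."""
        self.data[offset:offset + len(payload)] = payload
        self.content_id = fresh_id()
```

A snapshot and its parent compare equal by content_id until the parent is next written to, which is what lets a client recognise a base image it already holds regardless of which VDI it was exported from.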

HTTP full disk download

Client:
- HTTP GET /SM/sr-uuid/vdi-uuid
Server:
- Accept-Ranges: bytes
  - indicating that byte range requests are acceptable
- Content-Disposition: attachment; filename=content_id.vhd
  - encouraging the client to set the filename to the content_id, rather than the uuid. This is needed to handle chains properly, see below.
- Streams out data as a consolidated dynamic .vhd file
  - NB .vhd can be constructed on the fly: there is no need to use large amounts of temporary memory or space on the server.

For convenience this will be wrapped in the xe command:

xe vdi-export uuid=<vdi-uuid> --progress
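
Because the server advertises "Accept-Ranges: bytes", a client could resume an interrupted download by asking only for the bytes it is missing. A minimal sketch, assuming the partial download is kept in a local file; the function name and the URL passed in are illustrative.

```python
import os
import urllib.request


def resume_request(url, partial_path):
    """Build a GET that asks only for the bytes not yet on disk."""
    have = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    req = urllib.request.Request(url)
    if have:
        # Open-ended byte range: from the first missing byte to the end.
        req.add_header("Range", f"bytes={have}-")
    return req
```

A careful client would check for a 206 Partial Content response before appending, since a plain 200 means the server sent the whole file again.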

HTTP delta disk download

To download the differences which would need to be applied on top of vdi-uuid2 (with content_id2) in order to end up with the same content as vdi-uuid (with content_id):

Client:
- HTTP GET /SM/sr-uuid/vdi-uuid/differences-from/sr-uuid2/vdi-uuid2
Server:
- Accept-Ranges: bytes
- Content-Disposition: attachment; filename=content_id.vhd
- Streams out a .vhd differencing disk with the parent locator set to "content_id2.vhd"

For convenience this will be wrapped in the xe command:

xe vdi-export uuid=<vdi-uuid> differences-from=<vdi-uuid2> --progress
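
Naming each file after its content_id and pointing the parent locator at the parent's content_id lets a client work out the restore order from the files alone. The sketch below assumes the client has already read each file's parent locator into a mapping from filename to parent filename (None for a full image); the function name and the mapping are illustrative, and no .vhd parsing is shown.

```python
def restore_order(parent_of, newest):
    """Return filenames from the full image up to `newest`, oldest first."""
    chain = []
    seen = set()
    name = newest
    while name is not None:
        if name not in parent_of:
            raise KeyError(f"missing file in chain: {name}")
        if name in seen:
            raise ValueError(f"parent locator loop at {name}")
        seen.add(name)
        chain.append(name)
        name = parent_of[name]
    chain.reverse()
    return chain
```

For a weekly full export followed by daily deltas, the chain starts at the full image and each later file is applied on top of the previous one.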

