Live-Updating Xen

== Current State ==
 
=== Merged upstream ===
* [https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/log/?h=v2.0.20-rc1 kexec work merged for v2.0.20]
 
* [https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/log/?h=v2.0.20 Multiboot2 support (i.e. relocation support) merged in kexec-tools v2.0.20]
=== Posted upstream, in review ===
Early cleanups and fixes (v1): https://lists.xenproject.org/archives/html/xen-devel/2020-02/msg00000.html
TODO: For early `vmap()` we really want to make it officially OK to free boot-allocated pages with `free_xenheap_pages()` and even `free_domheap_pages()`. This involves fixing the esoteric corner cases in which it currently (rarely) doesn't quite work. The plan is to merge the `PGC_allocated` bit into the `PGC_state` bits, giving us 3 bits that can encode 8 states, of which 6 are currently valid: { inuse, offlining, broken_offlining, offline, broken, free }. We use the all-zeroes value for 'never touched by the heap' and move inuse to 1, so that `free_xenheap_pages()` and `free_domheap_pages()` can check for it and call `init_heap_pages()` instead of `free_heap_pages()` when necessary. That still leaves one spare state for future use. (Varad)
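A minimal sketch of what that encoding could look like; the state names follow the list above, but the bit position, macro names and the helper are illustrative assumptions rather than the posted patches:

 /* Illustrative only: 3 PGC_state bits encode 8 states, 6 of them valid
  * today.  The all-zeroes value means "never touched by the heap", so
  * boot-allocated pages can be detected on free and routed through
  * init_heap_pages() instead of free_heap_pages(). */
 #define PGC_state_shift            9                        /* assumed bit position */
 #define PGC_state_mask             (7UL << PGC_state_shift)
 
 #define PGC_state_uninitialised    (0UL << PGC_state_shift) /* never in the heap */
 #define PGC_state_inuse            (1UL << PGC_state_shift)
 #define PGC_state_offlining        (2UL << PGC_state_shift)
 #define PGC_state_broken_offlining (3UL << PGC_state_shift)
 #define PGC_state_offline          (4UL << PGC_state_shift)
 #define PGC_state_broken           (5UL << PGC_state_shift)
 #define PGC_state_free             (6UL << PGC_state_shift)
 /* value 7 remains spare for future use */
 
 #define page_state_is(pg, st) \
     (((pg)->count_info & PGC_state_mask) == PGC_state_##st)

With an encoding along those lines, `free_xenheap_pages()` could test for the 'uninitialised' state and hand such pages to `init_heap_pages()` first.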
=== Posted as RFC ===
* Physical memory management over kexec: [https://xenbits.xen.org/gitweb/?p=people/dwmw2/xen.git;a=blob;f=docs/specs/live-update-handover.pandoc;hb=refs/heads/lu-master Handover protocol documentation] ([http://david.woodhou.se/live-update-handover.pdf potentially out-of-date PDF version])
* Management of the live update data stream passing domains' state from Running Xen to Target Xen.
* Definition of the state record format, based on the migration stream record format (see the sketch after this list).
* Reservation of domain-owned pages in Target Xen as the heap allocator starts up.
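To illustrate the "based on the migration stream record format" point, the sketch below mirrors the libxc migration stream v2 layout of a 32-bit type, a 32-bit body length and a body padded to an 8-byte boundary; the struct name and the `LU_*` type values (apart from `LU_DOMAIN_INFO`, which is mentioned below) are assumptions for this page, not the posted RFC:

 /* Hypothetical live-update record header, modelled on libxc migration
  * stream v2 records: 32-bit type, 32-bit body length, body padded with
  * zeroes to a multiple of 8 bytes. */
 #include <stdint.h>
 
 struct lu_record_header {
     uint32_t type;     /* e.g. LU_DOMAIN_INFO, LU_PAGE_LIST (names assumed) */
     uint32_t length;   /* body length in bytes, excluding padding */
     /* body follows */
 };
 
 /* Illustrative record type values only. */
 #define LU_END          0x00000000U
 #define LU_DOMAIN_INFO  0x00000001U
 #define LU_PAGE_LIST    0x00000002U
 
 /* Total space a record occupies in the stream. */
 static inline uint64_t lu_record_size(const struct lu_record_header *h)
 {
     return sizeof(*h) + ((h->length + 7ULL) & ~7ULL);
 }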
=== In development hacks ===
* PV domain save/restore over kexec with certain caveats.
IRC log of an early successful run, showing dom0 surviving a kexec of Xen (note the changed `cc_compile_date`):

 <dwmw2_gone> [root@localhost ~]# xl info | grep cc_compile_date
 <dwmw2_gone> cc_compile_date : Wed Jan 22 21:10:38 GMT 2020
 <dwmw2_gone> [root@localhost ~]# KEXEC_LIVE_UPATE=1 ./kexec-tools/build/sbin/kexec xen2 --append="console=vga,com1 crashkernel=128M<4G no-real-mode insert_l1d_flush=0 dom0_max_vcpus=1 liveupdate=128M@2936M:0xb7800000" --mem-min=0xaf800000 -t multiboot2-x86 -f
 <dwmw2_gone> can't get linerar framebuffer address
 <dwmw2_gone> kexec failed: Invalid argument
 <dwmw2_gone> [root@localhost ~]# xl info | grep cc_compile_date
 <dwmw2_gone> cc_compile_date : Wed Jan 22 21:45:36 GMT 2020
 <dwmw2_gone> Wheee. Really must fix that -EINVAL :)
 <andyhhp> is that a kexec reload actually preserving dom0?
 <dwmw2_gone> yep
 <andyhhp> ship it :)
 <dwmw2_gone> a carefully configured dom0 with 2l event channels, one vcpu
=== Being worked on ===
* Pass the M2P table over (dwmw2)
* Refactor the internal LU_DOMAIN_INFO record to post upstream (dwmw2)
* Refactor the internal page list record into a single uint64_t per contiguous range of MFNs of the same type (dwmw2)
* Continue fixing PV Dom0 (Julien / Varad)
** Support more than one vCPU (Varad)
** FIFO event channels (Varad)
* Refactor the internal PV save/restore (once it is all working perfectly) for posting upstream, especially all the hacks through domain creation (TBD)
* Save/restore HVM domains (Julien)
* Upstream guest-transparent HVM migration support (Paul)
* kexec-tools `--live-update` support, including memory layout based on `KEXEC_RANGE_MA_LIVEUPDATE` and the `liveupdate=` command line (Varad); see the sketch after this list.
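A rough sketch of the kind of query that `--live-update` support implies. `xc_kexec_get_range()` is the existing libxc call kexec-tools already uses for ranges such as `KEXEC_RANGE_MA_CRASH`; the `KEXEC_RANGE_MA_LIVEUPDATE` value below is a placeholder, since the real definition comes from the live-update patch set rather than upstream Xen:

 /* Sketch only: discover the reserved live-update region so the Target
  * Xen can be loaded there instead of into the crash region. */
 #include <stdio.h>
 #include <inttypes.h>
 #include <xenctrl.h>
 
 #ifndef KEXEC_RANGE_MA_LIVEUPDATE
 #define KEXEC_RANGE_MA_LIVEUPDATE 5   /* placeholder; the real value lives in the patched public/kexec.h */
 #endif
 
 static int get_liveupdate_range(uint64_t *start, uint64_t *size)
 {
     xc_interface *xch = xc_interface_open(NULL, NULL, 0);
     int rc;
 
     if ( !xch )
         return -1;
 
     /* Same pattern kexec-tools uses for KEXEC_RANGE_MA_CRASH when
      * loading a crash kernel. */
     rc = xc_kexec_get_range(xch, KEXEC_RANGE_MA_LIVEUPDATE, 0, size, start);
     xc_interface_close(xch);
 
     if ( rc == 0 )
         printf("live-update region: 0x%" PRIx64 " + 0x%" PRIx64 "\n", *start, *size);
 
     return rc;
 }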
Initial proof-of-concept with patches from Varad's tree (link below) - no kexec involved:
Boot Xen with the `domkill_leakguest` command line parameter.
 
Save a PV domain's state, leaving the guest memory in RAM:

 # xl save -s domU domU.img

Restore the domain state, reusing the magic MFNs; the shared_info page contents are preserved:

 # xl restore -T domU.img <l3tab_mfn> <l2tab_mfn> <shared_info_mfn>

TODO: Restore the console, reconstruct guest pagetables from shared_info.
   
 
== Development trees ==

* http://git.infradead.org/users/dwmw2/xen.git/shortlog/refs/heads/bootcleanup
* https://xenbits.xen.org/gitweb/?p=people/dwmw2/xen.git;a=shortlog;h=refs/heads/lu-master
* https://github.com/varadgautam/xen/tree/liveupdate-devel
   
 
== TODO ==

This list will move to the JIRA instance.

* Devel milestone: PV domU persists across domain destroy/create
* Dom0 persists across kexec
* HVM guests persist across kexec
* PV guests persist across kexec
* One guest persists across kexec
* Multiple guests persist across kexec
* Guests exercise workloads
* Update to same Xen binary as the Target Xen
* Update to a Xen binary with a minor change, like a new printk
* Update to a Xen binary with a fix for an XSA
* Update to a new minor version
* Update to a new major version
 
== More information ==

* [https://static.sched.com/hosted_files/xensummit19/1f/20190710-xensummit-live-updating-xen.pdf Slides from Xen Summit 2019]
* [https://www.youtube.com/watch?v=ANaDS9BUhuA&list=PLYyw7IQjL-zHmP6CuqwuqfXNK5QLmU7Ur&index=15&t=0s Video recording of the Xen Summit 2019 talk]
=== Design Session Notes from Xen Summit 2019 ===
* Brief project overview:
** We want to build Xen Live-update
** Early prototyping phase
** IDEA: change the running hypervisor to a new one without disrupting guests
** Reasons:
*** Security - we might need an updated version for vulnerability mitigation
*** Development cycle acceleration - fast switching to a new hypervisor during development
*** Maintainability - reduce version diversity in the fleet
** We are currently eyeing a combination of guest-transparent live migration and kexec into a new Xen build
** For more details, see the Live-Update talk linked above
* Terminology:
** Running Xen -> the Xen running on the host before the update (the source)
** Target Xen -> the Xen we are updating ''to''
* Design discussions:
* Live-update ties into multiple other projects currently being done in the Xen Project:
** Secret-free Xen: reduce the footprint of guest-relevant data in Xen
*** less state we might have to handle in the live update case
** dom0less: bootstrap domains without the involvement of dom0
*** this might come in handy to at least set up and continue dom0 on the Target Xen
*** if we have this, it might also enable us to deserialize the state for the other guest domains in Xen and not have to wait for dom0 to do it
* We want to keep just domain and hardware state
** Xen itself is supposed to be exchanged completely
** We have to keep the IOMMU page tables around and not touch them
*** this might also come in handy for some newer UEFI-boot-related issues?
*** We might have to go and reinject certain interrupts
** do we need to disaggregate the xenheap and domheap here?
*** We are currently trying to avoid this
* A key stepping stone for Live-update is guest-transparent live migration
** This means we are using a well-defined ABI for saving/restoring domain state
*** We rely only on domain state, not on internal Xen state
** The idea is to migrate the guest not from one machine to another (in space) but on the same machine from one hypervisor to another (in time)
** In addition we want to keep as much memory as possible unchanged and feed it back to the target domain in order to save time
** This means we will need additional info on those memory areas and have to be super careful not to stomp over them while starting the Target Xen
** for live migration: the domid is a problem in this case
*** randomize-and-pray does not work on smaller fleets
*** this is not a problem for live-update
*** BUT: as a community we should make this restriction go away
* Exchanging the hypervisor using kexec
** We have patches merged in upstream kexec-tools that enable multiboot2 for Xen
** We can now load the Target Xen binary into the crashdump region so as not to stomp over any valuable data we might need later
** But using the crashdump region for this has drawbacks when it comes to debugging, and we might want to think about this later
*** What happens when live-update goes wrong?
*** Option: increase the crashdump region size and partition it, or have a separate reserved live-update region to load the Target Xen into
*** A separate or partitioned region is not a priority for V1 but should be on the road map for future versions
* Who serializes and deserializes domain state?
** dom0: this should work fine, but who does it for dom0 itself?
** Xen: this will need some more work, but might be covered mostly by the dom0less effort on the Arm side
*** this will need some work for x86, but Stefano does not consider it a lot of work
** This would mean: serialize domain state into a multiboot module and set domains up after kexecing Xen, in the dom0less manner
*** make the multiboot module general enough that we can tag it as boot/resume/create/etc. (see the sketch at the end of these notes)
**** this will also enable us to do per-guest feature enablement
**** finer-grained than specifying it on the command line
**** the command line handling is mostly broken and needs to be fixed for nested virtualisation either way
**** domain create flags are a mess
* Live update instead of crashdump?
** Can we use these capabilities to recover from a crash by "restarting" Xen on a crash?
*** i.e. live updating into (the same) Xen on a crash
** Crashing is a good mechanism precisely because it happens when something is really broken and most likely not recoverable
** Live update should be a conscious process, not something you do as a reaction to a crash
*** something is really broken if we crash
*** we should not proactively restart Xen on a crash
**** we might run into crash loops
** maybe this can be done in the future, but it does not change anything for the design
*** if anybody wants to wire this up once live update is there, that should not be too hard
*** then you want to think about scattering the domains to multiple other hosts so as not to keep them on broken machines
* We should use this opportunity to clean up certain parts of the code base:
** the interface for domain information is a mess
*** HVM and PV have some shared data but completely different ways of accessing it
* Volume of patches:
** Live update: still developing, we do not know yet
** guest-transparent live migration:
*** we have roughly 100 patches over time
*** we believe most of these just need to be cleaned up/squashed, which will land us at a much lower, more reasonable number
*** this also needs 2-3 dom0 kernel patches
* Summary of action items:
** coordinate with the dom0less effort on what we can use and contribute there
** fix the domid clash problem
** decide on usage of the crash kernel area
** fix the live migration patch set to include the as-yet-unsupported backends
*** clean up the patch set
*** upstream it
* Longer-term vision:
** have a tiny hypervisor between the guest and Xen that handles the common cases
*** this enables (almost) zero downtime for the guest
*** the tiny hypervisor will maintain the guest while the underlying Xen is kexecing into the new build
* Somebody someday will want to get rid of the long tail of old Xen versions in a fleet
** live patch old running versions to gain live update capability?
** crashdumping into a new hypervisor?
*** a "crazy idea", but this will likely come up at some point
