Live-Updating Xen

Current State

Initial proof-of-concept with patches from Varad's tree (link below) - no kexec involved:

Boot Xen with the domkill_leakguest cmdline param.
Save a PV domain's state, leaving guest memory in RAM:
# xl save -s domU domU.img 

Restore domain state reusing magic mfns. The shared_info page contents are preserved:
# xl restore -T domU.img <l3tab_mfn> <l2tab_mfn> <shared_info_mfn>

TODO: Restore console, reconstruct guest pagetables from shared_info.

Development trees

http://git.infradead.org/users/dwmw2/xen.git/shortlog/refs/heads/bootcleanup

https://github.com/varadgautam/xen/tree/liveupdate-devel

TODO

This list will move to the JIRA instance

  • Devel milestone: PV domU persists across domain destroy/create
  • Dom0 persists across kexec
  • HVM guests persist across kexec
  • PV guests persist across kexec
  • One guest persists across kexec
  • Multiple guests persist across kexec
  • Guests exercise workloads
  • Update to same Xen binary as the Target Xen
  • Update to a Xen binary with a minor change, like a new printk
  • Update to a Xen binary with a fix for an XSA
  • Update to a new minor version
  • Update to a new major version

More information

Design Session Notes from Xen Summit 2019

  • Brief project overview:
    • We want to build Xen Live-update
    • early prototyping phase
    • IDEA: change running hypervisor to new one without guest disruptions
    • Reasons:
      • Security - we might need an updated version for vuln mitigation
      • Development cycle acceleration - fast switch to a new hypervisor during dev
      • Maintainability - reduce version diversity in the fleet
    • We are currently eyeing a combination of guest transparent live migration and kexec into a new xen build
    • For more details: Live-Update talk
  • Terminology:
    • Running Xen -> The xen running on the host before update (Source)
    • Target Xen -> The xen we are updating *to*
  • Design discussions:
  • Live-update ties into multiple other projects currently done in the Xen-project:
    • Secret free Xen: reduce the footprint of guest relevant data in Xen
      • less state we might have to handle in the live update case
    • dom0less: bootstrap domains without the involvement of dom0
      • this might come in handy to at least setup and continue dom0 on target xen
      • If we have this, it might also enable us to de-serialize the state for other guest-domains in xen and not have to wait for dom0 to do this
  • We want to just keep domain and hardware state
    • Xen itself is supposed to be exchanged completely
    • We have to keep the IOMMU page tables around and must not touch them
      • this might also come in handy for some newer UEFI boot related issues?
      • We might have to go and reinject certain interrupts
    • do we need to dis-aggregate xenheap and domheap here?
      • We are currently trying to avoid this
  • A key stepping stone for Live-update is guest transparent live migration
    • This means we are using a well defined ABI for saving/restoring domain state
      • We rely only on domain state and not on any internal xen state
    • The idea is to migrate the guest not from one machine to another (in space) but on the same machine from one hypervisor to another (in time)
    • In addition we want to keep as much as possible in memory unchanged and feed this back to the target domain in order to save time
    • This means we will need additional info on those memory areas and have to be super careful not to stomp over them while starting the target xen
    • for live migration: domid is a problem in this case
      • randomize and pray does not work on smaller fleets
      • this is not a problem for live-update
      • BUT: as a community we should make this restriction go away
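
One way to read the "randomize and pray" remark is as a collision estimate. The sketch below computes the chance that at least one guest in a batch of incoming guests carries a domid that is already in use on the target host. It assumes domids are drawn uniformly and without repetition from the range below DOMID_FIRST_RESERVED (0x7FF0 in Xen's public headers); the function name and the guest counts are illustrative and do not come from the session.

```python
# Ordinary domids lie below DOMID_FIRST_RESERVED (0x7FF0 in Xen's public
# headers). Domid 0 is dom0, so guests are assumed to use 1..0x7FEF.
DOMID_FIRST_RESERVED = 0x7FF0
USABLE_DOMIDS = DOMID_FIRST_RESERVED - 1


def clash_probability(resident: int, incoming: int,
                      space: int = USABLE_DOMIDS) -> float:
    """Chance that at least one incoming guest's domid is already in use.

    Assumption: both the resident and the incoming guests hold distinct
    domids chosen uniformly at random from `space` possible values.
    """
    if resident + incoming > space:
        return 1.0  # pigeonhole: a clash is unavoidable
    no_clash = 1.0
    for i in range(incoming):
        # The i-th incoming guest must avoid the resident domids, given
        # that the previous i incoming guests all avoided them too.
        no_clash *= (space - resident - i) / (space - i)
    return 1.0 - no_clash


if __name__ == "__main__":
    # Hypothetical scenario: evacuate 100 guests onto a host running 100.
    print(f"{clash_probability(100, 100):.3f}")
```

On a large fleet a clashing guest can simply be sent to a different host; with only a handful of candidate hosts, that escape route shrinks, which is one plausible reason the approach does not scale down.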
  • Exchanging the Hypervisor using kexec
    • We have patches on upstream kexec-tools merged that enable multiboot2 for Xen
    • We can now load the target xen binary into the crashdump region so that we do not stomp over any valuable data we might need later
    • But using the crashdump region for this has drawbacks when it comes to debugging and we might want to think about this later
      • What happens when live-update goes wrong?
      • Option: Increase Crashdump region size and partition it or have a separate reserved live-update region to load the target xen into
      • A separate or partitioned region is not a priority for V1 but should be on the road map for future versions
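
The partitioning option and the "do not stomp" requirement can be sketched as simple address-range arithmetic. This is a minimal illustration only: the Range type, both helper functions, and every address below are hypothetical, and real code would work from the firmware memory map and the reserved regions set up at boot.

```python
from typing import List, NamedTuple, Tuple


class Range(NamedTuple):
    start: int  # first byte of the range (inclusive)
    end: int    # one past the last byte (exclusive)


def split_region(region: Range, liveupdate_size: int) -> Tuple[Range, Range]:
    """Partition a reserved region into (crash part, live-update part).

    The live-update part is carved from the top of the region.
    """
    if not 0 < liveupdate_size < region.end - region.start:
        raise ValueError("live-update part must fit strictly inside region")
    boundary = region.end - liveupdate_size
    return Range(region.start, boundary), Range(boundary, region.end)


def stomped(load: Range, preserved: List[Range]) -> List[Range]:
    """Return every preserved range that the load range would overwrite."""
    return [r for r in preserved if load.start < r.end and r.start < load.end]


if __name__ == "__main__":
    # Hypothetical 128 MiB reserved region, 64 MiB of it for live update.
    crash, liveupdate = split_region(Range(0x1000_0000, 0x1800_0000),
                                     0x0400_0000)
    guest_memory = [Range(0x2000_0000, 0x6000_0000)]
    # An empty list means loading the target Xen touches no preserved range.
    print(stomped(liveupdate, guest_memory))
```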
  • Who serializes and deserializes domain state?
    • dom0: This should work fine, but who does this for dom0 itself?
    • Xen: This will need some more work, but might be covered mostly by the dom0less effort on the arm side
      • this will need some work for x86, but Stefano does not consider this a lot of work
    • This would mean: serialize domain state into multiboot module and set domains up after kexecing xen in the dom0less manner
      • make multiboot module general enough so we can tag it as boot/resume/create/etc.
        • this will also enable us to do per-guest feature enablement
        • finer-grained than specifying on cmdline
        • cmdline stuff is mostly broken, needs to be fixed for nested either way
        • domain create flags are a mess
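
As a rough illustration of a module that is "general enough" to be tagged, the sketch below wraps serialized domain state in a small header carrying a kind tag (boot/resume/create) and per-guest feature flags. The magic value, field layout, and all names are assumptions made up for this example; the session did not define a format.

```python
import enum
import struct


class ModuleKind(enum.IntEnum):
    """Hypothetical tags telling the target Xen what to do with a module."""
    BOOT = 0
    RESUME = 1
    CREATE = 2


# Hypothetical little-endian header:
#   4-byte magic, u32 kind, u32 domid, u32 feature flags, u64 payload length
HEADER = struct.Struct("<4sIIIQ")
MAGIC = b"LUDM"  # made-up magic value for this sketch


def pack_module(kind: ModuleKind, domid: int, features: int,
                payload: bytes) -> bytes:
    """Prefix serialized domain state with a tagged header."""
    return HEADER.pack(MAGIC, kind, domid, features, len(payload)) + payload


def unpack_module(blob: bytes):
    """Parse a module; returns (kind, domid, features, payload)."""
    magic, kind, domid, features, length = HEADER.unpack_from(blob)
    if magic != MAGIC:
        raise ValueError("not a live-update module")
    payload = blob[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated module")
    return ModuleKind(kind), domid, features, payload


if __name__ == "__main__":
    blob = pack_module(ModuleKind.RESUME, domid=5, features=0x1,
                       payload=b"serialized domain state")
    print(unpack_module(blob))
```

Carrying the feature flags in the module rather than on the cmdline is what would allow per-guest enablement in such a scheme.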
  • Live update instead of crashdump?
    • Can we use such capabilities to recover from a crash by "restarting" xen on a crash?
      • live updating into (the same) xen on crash
    • crashing is a good mechanism because it happens if something is really broken and most likely not recoverable
    • Live update should be a conscious process and not something you do as a reaction to a crash
      • something is really broken if we crash
      • we should not proactively restart xen on crash
        • we might run into crash loops
    • maybe this can be done in the future, but it is not changing anything for the design
      • if anybody wants to wire this up once live update is there, that should not be too hard
      • then you want to think about: scattering the domains to multiple other hosts to not keep them on broken machines
  • We should use this opportunity to clean up certain parts of the code base:
    • interface for domain information is a mess
      • HVM and PV have some shared data but completely different ways of accessing it
  • Volume of patches:
    • Live update: still developing, we do not know yet
    • guest transparent live migration:
      • We have roughly 100 patches over time
      • we believe most of this just has to be cleaned up/squashed, which will land us at a reasonable, much lower number
      • this also needs 2-3 dom0 kernel patches
  • Summary of action items:
    • coordinate with dom0less effort on what we can use and contribute there
    • fix the domid clash problem
    • Decision on usage of crash kernel area
    • fix live migration patch set to include yet unsupported backends
      • clean up the patch set
      • upstream it
  • Longer term vision:
    • Have a tiny hypervisor between Guest and Xen that handles the common cases
      • this enables (almost) zero downtime for the guest
      • the tiny hypervisor will maintain the guest while the underlying xen is kexecing into new build
  • Somebody someday will want to get rid of the long tail of old xen versions in a fleet
    • live patch old running versions with live update capability?
    • crashdumping into a new hypervisor?
      • "crazy idea" but this will likely come up at some point