XCP Archive/XCP diagnostic messages

In future we would like to have a diagnostic console that helps us understand the state of an individual host (not a pool). We would like to see:

  1. instantaneous load information, including
    1. disk and network throughput (bytes per sec)
    2. memory usage
    3. CPU usage
    4. number of messages per second across various internal control interfaces
    5. message latency distribution per internal service
  2. the states of "alarms", where an "alarm" is set when one or more of the load metrics crosses some threshold for some period of time. Perhaps we could have 3 states and use red/amber/green (people love dashboards with traffic lights)
  3. the contents of message queues containing JSON control messages
  4. a live stream of messages, filtered with some expression
  5. a live stream of logs, filtered with some expression

Ideally the data would be very easy to access and understand, so I've tried to make it "RESTful" to allow people to explore its structure using only their web browser. Comments on this are appreciated!

Example messages

The message format is not fixed; it can still be adjusted. Everything here is just an example.

Load information example

There will be lots of load metrics. It's unlikely that many of these will be displayed at once; rather, they'll be consulted on demand, or perhaps a small set will be compared together.

The load metrics are grouped by the type of thing they refer to:

$ wget http://server/metrics
{
  "SR": {"href": "/metrics/SR", "description": "1"},
  "VM": {"href": "/metrics/VM", "description": "2"}
}

where "1" and "2" are description ids, which can be mapped onto text separately in each supported language.
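The id-to-text mapping could be as simple as one lookup table per language. The following is a minimal sketch; the table contents, the `describe` name and the fall-back-to-English behaviour are all assumptions made for illustration, not part of the proposal.

```python
# Hypothetical catalogue: description id -> text, one table per language.
DESCRIPTIONS = {
    "en": {"1": "Storage repositories", "2": "Virtual machines"},
    "fr": {"1": "Dépôts de stockage", "2": "Machines virtuelles"},
}


def describe(description_id, language="en"):
    """Return localized text for an id.

    Falls back to English, then to the raw id, so a missing translation
    never breaks the display (an assumed policy).
    """
    for lang in (language, "en"):
        text = DESCRIPTIONS.get(lang, {}).get(description_id)
        if text is not None:
            return text
    return description_id
```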

Then each type of thing has a number of instances:

$ wget http://server/metrics/VM
[ { "uuid": "foo",
    "href": "/metrics/VM/foo" },
  ...
]

and each instance has a number of available metrics:

$ wget http://server/metrics/VM/foo
[
  { "name": "throughput",
    "units": "bytes/sec",
    "description": "3",
    "instantaneous_href": "/metrics/VM/foo/throughput/instantaneous",
    "history_href": "/metrics/VM/foo/throughput/history"
  }
]

where "instantaneous" can be polled to retrieve a single value:

$ wget http://server/metrics/VM/foo/throughput/instantaneous
{ "value": 15 }

and "history" can be used to fetch an array of old values, plus a URI which will block until new values are available:

$ wget http://server/metrics/VM/foo/throughput/history
{ "data": [ 1.0, 2.0, 3.0 ],
  "next_href": "/metrics/VM/foo/throughput/history/5"
}

(where "5" is some kind of next data id)
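A client could consume the history by repeatedly following "next_href". The sketch below assumes the response shape shown above; `follow_history`, `http_get_json` and the `max_pages` limit are hypothetical names introduced here. The fetch function is passed in so the loop can be exercised without a server.

```python
import json
from urllib.request import urlopen


def http_get_json(base, href):
    """Fetch base + href and decode the JSON body."""
    with urlopen(base + href) as response:
        return json.load(response)


def follow_history(href, fetch, max_pages=None):
    """Yield values from a history resource, following next_href links.

    fetch(href) must return the decoded JSON document for that href.
    Against a real server each fetch of a next_href would block until
    new values are available, so this generator may never finish unless
    max_pages is given.
    """
    pages = 0
    while href is not None and (max_pages is None or pages < max_pages):
        page = fetch(href)
        yield from page["data"]
        href = page.get("next_href")
        pages += 1
```

Against a live host this might be called as `follow_history("/metrics/VM/foo/throughput/history", lambda h: http_get_json("http://server", h))`.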

Alarm state example

An alarm is a named variable whose value ranges over a small set, e.g. { set, unset } or { red, amber, green } (people love dashboards full of traffic lights). An alarm would be configured to become "set" (the alarm is "triggered") if a particular condition is true for more than a defined interval of time (e.g. if the host CPUs are more than 90% utilised for 5 minutes). Unsetting an alarm would be a manual act, taken after an admin has been able to examine the relevant historical data, either logs or historical load metrics.
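The trigger rule can be sketched as a small state machine. This is only an illustration of the two-state { set, unset } case with a single "value above threshold" condition; the `Alarm` class and its method names are assumptions, and timestamps are plain seconds supplied by the caller.

```python
class Alarm:
    """Becomes "set" once a metric has stayed above a threshold for
    longer than hold_seconds; only clear() returns it to "unset"."""

    def __init__(self, threshold, hold_seconds):
        self.threshold = threshold
        self.hold_seconds = hold_seconds
        self.state = "unset"
        self._breach_started = None  # time the current breach began

    def observe(self, value, now):
        """Record one sample taken at time `now` (seconds)."""
        if value > self.threshold:
            if self._breach_started is None:
                self._breach_started = now
            if now - self._breach_started > self.hold_seconds:
                self.state = "set"
        else:
            # The condition stopped holding, so the interval restarts.
            # A triggered alarm stays set until cleared manually.
            self._breach_started = None
        return self.state

    def clear(self):
        """Manual reset by an admin."""
        self.state = "unset"
        self._breach_started = None
```

For the CPU example above this would be `Alarm(threshold=90, hold_seconds=300)`, fed with utilisation samples.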

Since people love panels full of coloured lights, we would need to be able to fetch the state of all alarms:

$ wget http://server/alarms
[ { "href": "/alarms/1",
    "state": "red" },
  { "href": "/alarms/2",
    "state": "green" }
]

When an alarm is set (or "red" or whatever means "bad things happened") the user will want to query the specific alarm for more information:

$ wget http://server/alarms/1
{ "description": "5",
  "update_datetime": "ISO8601 date",
  "metrics_hrefs": [ "/metrics/VM/foo/throughput",
                     "/metrics/VM/foo/latency"
                   ]
}

where the "metrics_hrefs" constitute the raw data input to the alarm, whose history the admin will want to consult. We may wish to actually link to the time period in question?

Message queue example

Each service on the system will be contactable through one or more named persistent message queues. A high-level operation such as "start VM" will cause a cascade of messages between components. If a service starts to overload, or becomes stuck, it will be helpful to be able to inspect the individual message queues to check the queueing time of an individual message. It will also be helpful to see distributions of queueing times to help understand and improve predictability. Each message within a high-level operation will have a tag allowing us to filter and display a message sequence chart for a single "VM start".

It will be possible to list all the named queues:

$ wget http://server/queues
[ { "name": "org.xen.xapi",
    "description": "6",
    "href": "/queues/xapi" },
  { "name": "org.xen.xenopsd",
    "description": "7",
    "href": "/queues/xenopsd" }
]

and examine individual queues in more detail:

$ wget http://server/queues/xapi
{ "contents": [ { "from": "IP:port",
                  "arrival_datetime": "ISO8601 date",
                  "reply_to": "",
                  "contents": { "method": "VM.start",
                                "tag": "some unique id",
                                "args": [] }
                 } 
               ],
  "last_service_datetime": "ISO8601 date",
  "latency_href": "/metrics/queue/latency",
  "occupancy_href": "/metrics/queue/occupancy"
}

where "contents" shows the full contents of the queue; "last_service_datetime" records when a message was last processed by the service; "latency_href" links to a record of recent message processing latencies; and "occupancy_href" links to a record of queue occupancy over time.
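Given such a queue document, the queueing time of each pending message follows from its "arrival_datetime". A minimal sketch, assuming the timestamps are in a form that Python's `datetime.fromisoformat` accepts (for example with an explicit `+00:00` offset); the `queueing_times` helper is hypothetical.

```python
from datetime import datetime


def queueing_times(queue, now):
    """Return a list of (tag, seconds waiting so far), one entry per
    message in a decoded queue document."""
    result = []
    for message in queue["contents"]:
        arrived = datetime.fromisoformat(message["arrival_datetime"])
        waited = (now - arrived).total_seconds()
        result.append((message["contents"]["tag"], waited))
    return result
```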

Live stream of messages / logs example

All message events (send, receive) will be logged to a fixed-size buffer which can be queried:

$ wget http://server/messages
{ "messages": [ { "from": "org.xen.xapi",
                  "to": "org.xen.xenopsd",
                  "when_datetime": "ISO8601",
                  "contents": { "method": "VM.start",
                                "tag": "some unique id",
                                "args": []
                              }
                } ],
  "next_href": "/messages/2",
  "dropped": 0
}

where "next_href" will block until more messages are available and "dropped" is non-zero if some messages have been overwritten since the last call.
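One way the fixed-size buffer and the "dropped" count could fit together is sketched below. It assumes every message gets a sequence id and that the id in "next_href" is the first id the caller has not yet seen; the `MessageLog` class is hypothetical, and the blocking behaviour is left out.

```python
from collections import deque


class MessageLog:
    """Fixed-size log of messages, each numbered with a sequence id."""

    def __init__(self, capacity):
        # A deque with maxlen silently discards the oldest entry when full.
        self._buffer = deque(maxlen=capacity)
        self._next_id = 0

    def append(self, message):
        self._buffer.append((self._next_id, message))
        self._next_id += 1

    def read(self, since=0):
        """Return messages with id >= since, plus how many of the
        requested messages were overwritten before they could be read."""
        oldest = self._buffer[0][0] if self._buffer else self._next_id
        dropped = max(0, oldest - since)
        messages = [m for (i, m) in self._buffer if i >= since]
        return {"messages": messages,
                "next_href": "/messages/%d" % self._next_id,
                "dropped": dropped}
```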

It will be possible to request only those messages which contain a "tag" within the contents:

$ wget http://server/messages/tag/some-unique-id

and only those messages involving a particular named service:

$ wget http://server/messages/name/org.xen.xapi
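The two filters could be sketched as follows, operating on the decoded "messages" array. Treating "involving" as "appears in either the from or the to field" is an assumption, as are the helper names.

```python
def by_tag(messages, tag):
    """Keep messages whose contents carry the given tag."""
    return [m for m in messages if m["contents"].get("tag") == tag]


def by_name(messages, name):
    """Keep messages sent by or addressed to the named service."""
    return [m for m in messages if name in (m.get("from"), m.get("to"))]
```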

Fetching system logs will be similar, although the exact details of the "contents" will be different.