aboutsummaryrefslogtreecommitdiffstats
path: root/docs/src/interface.tex
diff options
context:
space:
mode:
Diffstat (limited to 'docs/src/interface.tex')
-rw-r--r--docs/src/interface.tex2035
1 files changed, 2003 insertions, 32 deletions
diff --git a/docs/src/interface.tex b/docs/src/interface.tex
index 41721b04e2..5169b3878c 100644
--- a/docs/src/interface.tex
+++ b/docs/src/interface.tex
@@ -3,6 +3,10 @@
\usepackage{comment,parskip}
\setstretch{1.15}
+% LIBRARY FUNCTIONS
+
+\newcommand{\hypercall}[1]{\vspace{2mm}{\sf #1}}
+
\begin{document}
% TITLE PAGE
@@ -15,19 +19,18 @@
\vfill
\begin{tabular}{l}
{\Huge \bf Interface manual} \\[4mm]
-{\huge Xen v2.0 for x86} \\[80mm]
+{\huge Xen v3.0 for x86} \\[80mm]
-{\Large Xen is Copyright (c) 2002-2004, The Xen Team} \\[3mm]
+{\Large Xen is Copyright (c) 2002-2005, The Xen Team} \\[3mm]
{\Large University of Cambridge, UK} \\[20mm]
\end{tabular}
\end{center}
-{\bf
-DISCLAIMER: This documentation is currently under active development
+{\bf DISCLAIMER: This documentation is always under active development
and as such there may be mistakes and omissions --- watch out for
these and please report any you find to the developer's mailing list.
-Contributions of material, suggestions and corrections are welcome.
-}
+The latest version is always available on-line. Contributions of
+material, suggestions and corrections are welcome. }
\vfill
\cleardoublepage
@@ -67,7 +70,7 @@ operating system.
Xen essentially takes a `whole machine' virtualization approach as
pioneered by IBM VM/370. However, unlike VM/370 or more recent
-efforts such as VMWare and Virtual PC, Xen does not attempt to
+efforts such as VMware and Virtual PC, Xen does not attempt to
completely virtualize the underlying hardware. Instead parts of the
hosted guest operating systems are modified to work with the VMM; the
operating system is effectively ported to a new target architecture,
@@ -87,48 +90,2016 @@ itself, allows the Xen framework to separate the notions of
mechanism and policy within the system.
-%% chapter Virtual Architecture moved to architecture.tex
-\include{src/interface/architecture}
+\chapter{Virtual Architecture}
-%% chapter Memory moved to memory.tex
-\include{src/interface/memory}
+In a Xen/x86 system, only the hypervisor runs with full processor
+privileges ({\it ring 0} in the x86 four-ring model). It has full
+access to the physical memory available in the system and is
+responsible for allocating portions of it to running domains.
-%% chapter Devices moved to devices.tex
-\include{src/interface/devices}
+On a 32-bit x86 system, guest operating systems may use {\it rings 1},
+{\it 2} and {\it 3} as they see fit. Segmentation is used to prevent
+the guest OS from accessing the portion of the address space that is
+reserved for Xen. We expect most guest operating systems will use
+ring 1 for their own operation and place applications in ring 3.
-%% chapter Further Information moved to further_info.tex
-\include{src/interface/further_info}
+On 64-bit systems it is not possible to protect the hypervisor from
+untrusted guest code running in rings 1 and 2. Guests are therefore
+restricted to run in ring 3 only. The guest kernel is protected from its
+applications by context switching between the kernel and currently
+running application.
+In this chapter we consider the basic virtual architecture provided by
+Xen: CPU state, exception and interrupt handling, and time.
+Other aspects such as memory and device access are discussed in later
+chapters.
-\appendix
-%% chapter hypercalls moved to hypercalls.tex
-\include{src/interface/hypercalls}
+\section{CPU state}
+
+All privileged state must be handled by Xen. The guest OS has no
+direct access to CR3 and is not permitted to update privileged bits in
+EFLAGS. Guest OSes use \emph{hypercalls} to invoke operations in Xen;
+these are analogous to system calls but occur from ring 1 to ring 0.
+
+A list of all hypercalls is given in Appendix~\ref{a:hypercalls}.
+
+
+\section{Exceptions}
+
+A virtual IDT is provided --- a domain can submit a table of trap
+handlers to Xen via the {\tt set\_trap\_table()} hypercall. The
+exception stack frame presented to a virtual trap handler is identical
+to its native equivalent.
+
+
+\section{Interrupts and events}
+
+Interrupts are virtualized by mapping them to \emph{event channels},
+which are delivered asynchronously to the target domain using a callback
+supplied via the {\tt set\_callbacks()} hypercall. A guest OS can map
+these events onto its standard interrupt dispatch mechanisms. Xen is
+responsible for determining the target domain that will handle each
+physical interrupt source. For more details on the binding of event
+sources to event channels, see Chapter~\ref{c:devices}.
+
+
+\section{Time}
+
+Guest operating systems need to be aware of the passage of both real
+(or wallclock) time and their own `virtual time' (the time for which
+they have been executing). Furthermore, Xen has a notion of time which
+is used for scheduling. The following notions of time are provided:
+
+\begin{description}
+\item[Cycle counter time.]
+
+ This provides a fine-grained time reference. The cycle counter time
+ is used to accurately extrapolate the other time references. On SMP
+ machines it is currently assumed that the cycle counter time is
+ synchronized between CPUs. The current x86-based implementation
+ achieves this within inter-CPU communication latencies.
+
+\item[System time.]
+
+ This is a 64-bit counter which holds the number of nanoseconds that
+ have elapsed since system boot.
+
+\item[Wall clock time.]
+
+ This is the time of day in a Unix-style {\tt struct timeval}
+ (seconds and microseconds since 1 January 1970, adjusted by leap
+ seconds). An NTP client hosted by {\it domain 0} can keep this
+ value accurate.
+
+\item[Domain virtual time.]
+
+ This progresses at the same pace as system time, but only while a
+ domain is executing --- it stops while a domain is de-scheduled.
+ Therefore the share of the CPU that a domain receives is indicated
+ by the rate at which its virtual time increases.
+
+\end{description}
+
+
+Xen exports timestamps for system time and wall-clock time to guest
+operating systems through a shared page of memory. Xen also provides
+the cycle counter time at the instant the timestamps were calculated,
+and the CPU frequency in Hertz. This allows the guest to extrapolate
+system and wall-clock times accurately based on the current cycle
+counter time.
+
+Since all time stamps need to be updated and read \emph{atomically}
+a version number is also stored in the shared info page, which is
+incremented before and after updating the timestamps. Thus a guest can
+be sure that it read a consistent state by checking the two version
+numbers are equal and even.
+
+Xen includes a periodic ticker which sends a timer event to the
+currently executing domain every 10ms. The Xen scheduler also sends a
+timer event whenever a domain is scheduled; this allows the guest OS
+to adjust for the time that has passed while it has been inactive. In
+addition, Xen allows each domain to request that they receive a timer
+event sent at a specified system time by using the {\tt
+ set\_timer\_op()} hypercall. Guest OSes may use this timer to
+implement timeout values when they block.
+
+
+
+%% % akw: demoting this to a section -- not sure if there is any point
+%% % though, maybe just remove it.
+
+% KAF: Remove these random sections!
+\begin{comment}
+\section{Xen CPU Scheduling}
+
+Xen offers a uniform API for CPU schedulers. It is possible to choose
+from a number of schedulers at boot and it should be easy to add more.
+The BVT, Atropos and Round Robin schedulers are part of the normal Xen
+distribution. BVT provides proportional fair shares of the CPU to the
+running domains. Atropos can be used to reserve absolute shares of
+the CPU for each domain. Round-robin is provided as an example of
+Xen's internal scheduler API.
+
+\paragraph*{Note: SMP host support}
+Xen has always supported SMP host systems. Domains are statically
+assigned to CPUs, either at creation time or when manually pinning to
+a particular CPU. The current schedulers then run locally on each CPU
+to decide which of the assigned domains should be run there. The
+user-level control software can be used to perform coarse-grain
+load-balancing between CPUs.
+\end{comment}
+
+
+%% More information on the characteristics and use of these schedulers
+%% is available in {\tt Sched-HOWTO.txt}.
+
+
+\section{Privileged operations}
+
+Xen exports an extended interface to privileged domains (viz.\ {\it
+ Domain 0}). This allows such domains to build and boot other domains
+on the server, and provides control interfaces for managing
+scheduling, memory, networking, and block devices.
+
+\chapter{Memory}
+\label{c:memory}
+
+Xen is responsible for managing the allocation of physical memory to
+domains, and for ensuring safe use of the paging and segmentation
+hardware.
+
+
+\section{Memory Allocation}
+
+As well as allocating a portion of physical memory for its own private
+use, Xen also reserves s small fixed portion of every virtual address
+space. This is located in the top 64MB on 32-bit systems, the top
+168MB on PAE systems, and a larger portion in the middle of the
+address space on 64-bit systems. Unreserved physical memory is
+available for allocation to domains at a page granularity. Xen tracks
+the ownership and use of each page, which allows it to enforce secure
+partitioning between domains.
+
+Each domain has a maximum and current physical memory allocation. A
+guest OS may run a `balloon driver' to dynamically adjust its current
+memory allocation up to its limit.
+
+
+\section{Pseudo-Physical Memory}
+
+Since physical memory is allocated and freed on a page granularity,
+there is no guarantee that a domain will receive a contiguous stretch
+of physical memory. However most operating systems do not have good
+support for operating in a fragmented physical address space. To aid
+porting such operating systems to run on top of Xen, we make a
+distinction between \emph{machine memory} and \emph{pseudo-physical
+ memory}.
+
+Put simply, machine memory refers to the entire amount of memory
+installed in the machine, including that reserved by Xen, in use by
+various domains, or currently unallocated. We consider machine memory
+to comprise a set of 4kB \emph{machine page frames} numbered
+consecutively starting from 0. Machine frame numbers mean the same
+within Xen or any domain.
+
+Pseudo-physical memory, on the other hand, is a per-domain
+abstraction. It allows a guest operating system to consider its memory
+allocation to consist of a contiguous range of physical page frames
+starting at physical frame 0, despite the fact that the underlying
+machine page frames may be sparsely allocated and in any order.
+
+To achieve this, Xen maintains a globally readable {\it
+ machine-to-physical} table which records the mapping from machine
+page frames to pseudo-physical ones. In addition, each domain is
+supplied with a {\it physical-to-machine} table which performs the
+inverse mapping. Clearly the machine-to-physical table has size
+proportional to the amount of RAM installed in the machine, while each
+physical-to-machine table has size proportional to the memory
+allocation of the given domain.
+
+Architecture dependent code in guest operating systems can then use
+the two tables to provide the abstraction of pseudo-physical memory.
+In general, only certain specialized parts of the operating system
+(such as page table management) needs to understand the difference
+between machine and pseudo-physical addresses.
+
+
+\section{Page Table Updates}
+
+In the default mode of operation, Xen enforces read-only access to
+page tables and requires guest operating systems to explicitly request
+any modifications. Xen validates all such requests and only applies
+updates that it deems safe. This is necessary to prevent domains from
+adding arbitrary mappings to their page tables.
+
+To aid validation, Xen associates a type and reference count with each
+memory page. A page has one of the following mutually-exclusive types
+at any point in time: page directory ({\sf PD}), page table ({\sf
+ PT}), local descriptor table ({\sf LDT}), global descriptor table
+({\sf GDT}), or writable ({\sf RW}). Note that a guest OS may always
+create readable mappings of its own memory regardless of its current
+type.
+
+%%% XXX: possibly explain more about ref count 'lifecyle' here?
+This mechanism is used to maintain the invariants required for safety;
+for example, a domain cannot have a writable mapping to any part of a
+page table as this would require the page concerned to simultaneously
+be of types {\sf PT} and {\sf RW}.
+
+\hypercall{mmu\_update(mmu\_update\_t *req, int count, int *success\_count, domid\_t domid)}
+
+This hypercall is used to make updates to either the domain's
+pagetables or to the machine to physical mapping table. It supports
+submitting a queue of updates, allowing batching for maximal
+performance. Explicitly queuing updates using this interface will
+cause any outstanding writable pagetable state to be flushed from the
+system.
+
+\section{Writable Page Tables}
+
+Xen also provides an alternative mode of operation in which guests
+have the illusion that their page tables are directly writable. Of
+course this is not really the case, since Xen must still validate
+modifications to ensure secure partitioning. To this end, Xen traps
+any write attempt to a memory page of type {\sf PT} (i.e., that is
+currently part of a page table). If such an access occurs, Xen
+temporarily allows write access to that page while at the same time
+\emph{disconnecting} it from the page table that is currently in use.
+This allows the guest to safely make updates to the page because the
+newly-updated entries cannot be used by the MMU until Xen revalidates
+and reconnects the page. Reconnection occurs automatically in a
+number of situations: for example, when the guest modifies a different
+page-table page, when the domain is preempted, or whenever the guest
+uses Xen's explicit page-table update interfaces.
+
+Writable pagetable functionality is enabled when the guest requests
+it, using a {\tt vm\_assist} hypercall. Writable pagetables do {\em
+not} provide full virtualisation of the MMU, so the memory management
+code of the guest still needs to be aware that it is running on Xen.
+Since the guest's page tables are used directly, it must translate
+pseudo-physical addresses to real machine addresses when building page
+table entries. The guest may not attempt to map its own pagetables
+writably, since this would violate the memory type invariants; page
+tables will automatically be made writable by the hypervisor, as
+necessary.
+
+\section{Shadow Page Tables}
+
+Finally, Xen also supports a form of \emph{shadow page tables} in
+which the guest OS uses a independent copy of page tables which are
+unknown to the hardware (i.e.\ which are never pointed to by {\tt
+ cr3}). Instead Xen propagates changes made to the guest's tables to
+the real ones, and vice versa. This is useful for logging page writes
+(e.g.\ for live migration or checkpoint). A full version of the shadow
+page tables also allows guest OS porting with less effort.
+
+
+\section{Segment Descriptor Tables}
+
+At start of day a guest is supplied with a default GDT, which does not reside
+within its own memory allocation. If the guest wishes to use other
+than the default `flat' ring-1 and ring-3 segments that this GDT
+provides, it must register a custom GDT and/or LDT with Xen, allocated
+from its own memory.
+
+The following hypercall is used to specify a new GDT:
+
+\begin{quote}
+ int {\bf set\_gdt}(unsigned long *{\em frame\_list}, int {\em
+ entries})
+
+ \emph{frame\_list}: An array of up to 14 machine page frames within
+ which the GDT resides. Any frame registered as a GDT frame may only
+ be mapped read-only within the guest's address space (e.g., no
+ writable mappings, no use as a page-table page, and so on). Only 14
+ pages may be specified because pages 15 and 16 are reserved for
+ the hypervisor's GDT entries.
+
+ \emph{entries}: The number of descriptor-entry slots in the GDT.
+\end{quote}
+
+The LDT is updated via the generic MMU update mechanism (i.e., via the
+{\tt mmu\_update()} hypercall.
+
+\section{Start of Day}
+
+The start-of-day environment for guest operating systems is rather
+different to that provided by the underlying hardware. In particular,
+the processor is already executing in protected mode with paging
+enabled.
+
+{\it Domain 0} is created and booted by Xen itself. For all subsequent
+domains, the analogue of the boot-loader is the {\it domain builder},
+user-space software running in {\it domain 0}. The domain builder is
+responsible for building the initial page tables for a domain and
+loading its kernel image at the appropriate virtual address.
+
+\section{VM assists}
+
+Xen provides a number of ``assists'' for guest memory management.
+These are available on an ``opt-in'' basis to provide commonly-used
+extra functionality to a guest.
+
+\hypercall{vm\_assist(unsigned int cmd, unsigned int type)}
+
+The {\tt cmd} parameter describes the action to be taken, whilst the
+{\tt type} parameter describes the kind of assist that is being
+referred to. Available commands are as follows:
+
+\begin{description}
+\item[VMASST\_CMD\_enable] Enable a particular assist type
+\item[VMASST\_CMD\_disable] Disable a particular assist type
+\end{description}
+
+And the available types are:
+
+\begin{description}
+\item[VMASST\_TYPE\_4gb\_segments] Provide emulated support for
+ instructions that rely on 4GB segments (such as the techniques used
+ by some TLS solutions).
+\item[VMASST\_TYPE\_4gb\_segments\_notify] Provide a callback to the
+ guest if the above segment fixups are used: allows the guest to
+ display a warning message during boot.
+\item[VMASST\_TYPE\_writable\_pagetables] Enable writable pagetable
+ mode - described above.
+\end{description}
+
+
+\chapter{Xen Info Pages}
+
+The {\bf Shared info page} is used to share various CPU-related state
+between the guest OS and the hypervisor. This information includes VCPU
+status, time information and event channel (virtual interrupt) state.
+The {\bf Start info page} is used to pass build-time information to
+the guest when it boots and when it is resumed from a suspended state.
+This chapter documents the fields included in the {\tt
+shared\_info\_t} and {\tt start\_info\_t} structures for use by the
+guest OS.
+
+\section{Shared info page}
+
+The {\tt shared\_info\_t} is accessed at run time by both Xen and the
+guest OS. It is used to pass information relating to the
+virtual CPU and virtual machine state between the OS and the
+hypervisor.
+
+The structure is declared in {\tt xen/include/public/xen.h}:
+
+\scriptsize
+\begin{verbatim}
+typedef struct shared_info {
+ vcpu_info_t vcpu_info[MAX_VIRT_CPUS];
+
+ /*
+ * A domain can create "event channels" on which it can send and receive
+ * asynchronous event notifications. There are three classes of event that
+ * are delivered by this mechanism:
+ * 1. Bi-directional inter- and intra-domain connections. Domains must
+ * arrange out-of-band to set up a connection (usually by allocating
+ * an unbound 'listener' port and avertising that via a storage service
+ * such as xenstore).
+ * 2. Physical interrupts. A domain with suitable hardware-access
+ * privileges can bind an event-channel port to a physical interrupt
+ * source.
+ * 3. Virtual interrupts ('events'). A domain can bind an event-channel
+ * port to a virtual interrupt source, such as the virtual-timer
+ * device or the emergency console.
+ *
+ * Event channels are addressed by a "port index". Each channel is
+ * associated with two bits of information:
+ * 1. PENDING -- notifies the domain that there is a pending notification
+ * to be processed. This bit is cleared by the guest.
+ * 2. MASK -- if this bit is clear then a 0->1 transition of PENDING
+ * will cause an asynchronous upcall to be scheduled. This bit is only
+ * updated by the guest. It is read-only within Xen. If a channel
+ * becomes pending while the channel is masked then the 'edge' is lost
+ * (i.e., when the channel is unmasked, the guest must manually handle
+ * pending notifications as no upcall will be scheduled by Xen).
+ *
+ * To expedite scanning of pending notifications, any 0->1 pending
+ * transition on an unmasked channel causes a corresponding bit in a
+ * per-vcpu selector word to be set. Each bit in the selector covers a
+ * 'C long' in the PENDING bitfield array.
+ */
+ unsigned long evtchn_pending[sizeof(unsigned long) * 8];
+ unsigned long evtchn_mask[sizeof(unsigned long) * 8];
+
+ /*
+ * Wallclock time: updated only by control software. Guests should base
+ * their gettimeofday() syscall on this wallclock-base value.
+ */
+ uint32_t wc_version; /* Version counter: see vcpu_time_info_t. */
+ uint32_t wc_sec; /* Secs 00:00:00 UTC, Jan 1, 1970. */
+ uint32_t wc_nsec; /* Nsecs 00:00:00 UTC, Jan 1, 1970. */
+
+ arch_shared_info_t arch;
+
+} shared_info_t;
+\end{verbatim}
+\normalsize
+
+\begin{description}
+\item[vcpu\_info] An array of {\tt vcpu\_info\_t} structures, each of
+ which holds either runtime information about a virtual CPU, or is
+ ``empty'' if the corresponding VCPU does not exist.
+\item[evtchn\_pending] Guest-global array, with one bit per event
+ channel. Bits are set if an event is currently pending on that
+ channel.
+\item[evtchn\_mask] Guest-global array for masking notifications on
+ event channels.
+\item[wc\_version] Version counter for current wallclock time.
+\item[wc\_sec] Whole seconds component of current wallclock time.
+\item[wc\_nsec] Nanoseconds component of current wallclock time.
+\item[arch] Host architecture-dependent portion of the shared info
+ structure.
+\end{description}
+
+\subsection{vcpu\_info\_t}
+
+\scriptsize
+\begin{verbatim}
+typedef struct vcpu_info {
+ /*
+ * 'evtchn_upcall_pending' is written non-zero by Xen to indicate
+ * a pending notification for a particular VCPU. It is then cleared
+ * by the guest OS /before/ checking for pending work, thus avoiding
+ * a set-and-check race. Note that the mask is only accessed by Xen
+ * on the CPU that is currently hosting the VCPU. This means that the
+ * pending and mask flags can be updated by the guest without special
+ * synchronisation (i.e., no need for the x86 LOCK prefix).
+ * This may seem suboptimal because if the pending flag is set by
+ * a different CPU then an IPI may be scheduled even when the mask
+ * is set. However, note:
+ * 1. The task of 'interrupt holdoff' is covered by the per-event-
+ * channel mask bits. A 'noisy' event that is continually being
+ * triggered can be masked at source at this very precise
+ * granularity.
+ * 2. The main purpose of the per-VCPU mask is therefore to restrict
+ * reentrant execution: whether for concurrency control, or to
+ * prevent unbounded stack usage. Whatever the purpose, we expect
+ * that the mask will be asserted only for short periods at a time,
+ * and so the likelihood of a 'spurious' IPI is suitably small.
+ * The mask is read before making an event upcall to the guest: a
+ * non-zero mask therefore guarantees that the VCPU will not receive
+ * an upcall activation. The mask is cleared when the VCPU requests
+ * to block: this avoids wakeup-waiting races.
+ */
+ uint8_t evtchn_upcall_pending;
+ uint8_t evtchn_upcall_mask;
+ unsigned long evtchn_pending_sel;
+ arch_vcpu_info_t arch;
+ vcpu_time_info_t time;
+} vcpu_info_t; /* 64 bytes (x86) */
+\end{verbatim}
+\normalsize
+
+\begin{description}
+\item[evtchn\_upcall\_pending] This is set non-zero by Xen to indicate
+ that there are pending events to be received.
+\item[evtchn\_upcall\_mask] This is set non-zero to disable all
+ interrupts for this CPU for short periods of time. If individual
+ event channels need to be masked, the {\tt evtchn\_mask} in the {\tt
+ shared\_info\_t} is used instead.
+\item[evtchn\_pending\_sel] When an event is delivered to this VCPU, a
+ bit is set in this selector to indicate which word of the {\tt
+ evtchn\_pending} array in the {\tt shared\_info\_t} contains the
+ event in question.
+\item[arch] Architecture-specific VCPU info. On x86 this contains the
+ virtualized CR2 register (page fault linear address) for this VCPU.
+\item[time] Time values for this VCPU.
+\end{description}
+
+\subsection{vcpu\_time\_info}
+
+\scriptsize
+\begin{verbatim}
+typedef struct vcpu_time_info {
+ /*
+ * Updates to the following values are preceded and followed by an
+ * increment of 'version'. The guest can therefore detect updates by
+ * looking for changes to 'version'. If the least-significant bit of
+ * the version number is set then an update is in progress and the guest
+ * must wait to read a consistent set of values.
+ * The correct way to interact with the version number is similar to
+ * Linux's seqlock: see the implementations of read_seqbegin/read_seqretry.
+ */
+ uint32_t version;
+ uint32_t pad0;
+ uint64_t tsc_timestamp; /* TSC at last update of time vals. */
+ uint64_t system_time; /* Time, in nanosecs, since boot. */
+ /*
+ * Current system time:
+ * system_time + ((tsc - tsc_timestamp) << tsc_shift) * tsc_to_system_mul
+ * CPU frequency (Hz):
+ * ((10^9 << 32) / tsc_to_system_mul) >> tsc_shift
+ */
+ uint32_t tsc_to_system_mul;
+ int8_t tsc_shift;
+ int8_t pad1[3];
+} vcpu_time_info_t; /* 32 bytes */
+\end{verbatim}
+\normalsize
+
+\begin{description}
+\item[version] Used to ensure the guest gets consistent time updates.
+\item[tsc\_timestamp] Cycle counter timestamp of last time value;
+ could be used to expolate in between updates, for instance.
+\item[system\_time] Time since boot (nanoseconds).
+\item[tsc\_to\_system\_mul] Cycle counter to nanoseconds multiplier
+(used in extrapolating current time).
+\item[tsc\_shift] Cycle counter to nanoseconds shift (used in
+extrapolating current time).
+\end{description}
+
+\subsection{arch\_shared\_info\_t}
+
+On x86, the {\tt arch\_shared\_info\_t} is defined as follows (from
+xen/public/arch-x86\_32.h):
+
+\scriptsize
+\begin{verbatim}
+typedef struct arch_shared_info {
+ unsigned long max_pfn; /* max pfn that appears in table */
+ /* Frame containing list of mfns containing list of mfns containing p2m. */
+ unsigned long pfn_to_mfn_frame_list_list;
+} arch_shared_info_t;
+\end{verbatim}
+\normalsize
+\begin{description}
+\item[max\_pfn] The maximum PFN listed in the physical-to-machine
+ mapping table (P2M table).
+\item[pfn\_to\_mfn\_frame\_list\_list] Machine address of the frame
+ that contains the machine addresses of the P2M table frames.
+\end{description}
+\section{Start info page}
+
+The start info structure is declared as the following (in {\tt
+xen/include/public/xen.h}):
+
+\scriptsize
+\begin{verbatim}
+#define MAX_GUEST_CMDLINE 1024
+typedef struct start_info {
+ /* THE FOLLOWING ARE FILLED IN BOTH ON INITIAL BOOT AND ON RESUME. */
+ char magic[32]; /* "Xen-<version>.<subversion>". */
+ unsigned long nr_pages; /* Total pages allocated to this domain. */
+ unsigned long shared_info; /* MACHINE address of shared info struct. */
+ uint32_t flags; /* SIF_xxx flags. */
+ unsigned long store_mfn; /* MACHINE page number of shared page. */
+ uint32_t store_evtchn; /* Event channel for store communication. */
+ unsigned long console_mfn; /* MACHINE address of console page. */
+ uint32_t console_evtchn; /* Event channel for console messages. */
+ /* THE FOLLOWING ARE ONLY FILLED IN ON INITIAL BOOT (NOT RESUME). */
+ unsigned long pt_base; /* VIRTUAL address of page directory. */
+ unsigned long nr_pt_frames; /* Number of bootstrap p.t. frames. */
+ unsigned long mfn_list; /* VIRTUAL address of page-frame list. */
+ unsigned long mod_start; /* VIRTUAL address of pre-loaded module. */
+ unsigned long mod_len; /* Size (bytes) of pre-loaded module. */
+ int8_t cmd_line[MAX_GUEST_CMDLINE];
+} start_info_t;
+\end{verbatim}
+\normalsize
+
+The fields are in two groups: the first group are always filled in
+when a domain is booted or resumed, the second set are only used at
+boot time.
+
+The always-available group is as follows:
+
+\begin{description}
+\item[magic] A text string identifying the Xen version to the guest.
+\item[nr\_pages] The number of real machine pages available to the
+ guest.
+\item[shared\_info] Machine address of the shared info structure,
+ allowing the guest to map it during initialisation.
+\item[flags] Flags for describing optional extra settings to the
+ guest.
+\item[store\_mfn] Machine address of the Xenstore communications page.
+\item[store\_evtchn] Event channel to communicate with the store.
+\item[console\_mfn] Machine address of the console data page.
+\item[console\_evtchn] Event channel to notify the console backend.
+\end{description}
+
+The boot-only group may only be safely referred to during system boot:
+
+\begin{description}
+\item[pt\_base] Virtual address of the page directory created for us
+ by the domain builder.
+\item[nr\_pt\_frames] Number of frames used by the builders' bootstrap
+ pagetables.
+\item[mfn\_list] Virtual address of the list of machine frames this
+ domain owns.
+\item[mod\_start] Virtual address of any pre-loaded modules
+ (e.g. ramdisk)
+\item[mod\_len] Size of pre-loaded module (if any).
+\item[cmd\_line] Kernel command line passed by the domain builder.
+\end{description}
+
+
+% by Mark Williamson <mark.williamson@cl.cam.ac.uk>
+
+\chapter{Event Channels}
+\label{c:eventchannels}
+
+Event channels are the basic primitive provided by Xen for event
+notifications. An event is the Xen equivalent of a hardware
+interrupt. They essentially store one bit of information, the event
+of interest is signalled by transitioning this bit from 0 to 1.
+
+Notifications are received by a guest via an upcall from Xen,
+indicating when an event arrives (setting the bit). Further
+notifications are masked until the bit is cleared again (therefore,
+guests must check the value of the bit after re-enabling event
+delivery to ensure no missed notifications).
+
+Event notifications can be masked by setting a flag; this is
+equivalent to disabling interrupts and can be used to ensure atomicity
+of certain operations in the guest kernel.
+
+\section{Hypercall interface}
+
+\hypercall{event\_channel\_op(evtchn\_op\_t *op)}
+
+The event channel operation hypercall is used for all operations on
+event channels / ports. Operations are distinguished by the value of
+the {\tt cmd} field of the {\tt op} structure. The possible commands
+are described below:
+
+\begin{description}
+
+\item[EVTCHNOP\_alloc\_unbound]
+ Allocate a new event channel port, ready to be connected to by a
+ remote domain.
+ \begin{itemize}
+ \item Specified domain must exist.
+ \item A free port must exist in that domain.
+ \end{itemize}
+ Unprivileged domains may only allocate their own ports, privileged
+ domains may also allocate ports in other domains.
+\item[EVTCHNOP\_bind\_interdomain]
+ Bind an event channel for interdomain communications.
+ \begin{itemize}
+ \item Caller domain must have a free port to bind.
+ \item Remote domain must exist.
+ \item Remote port must be allocated and currently unbound.
+ \item Remote port must be expecting the caller domain as the ``remote''.
+ \end{itemize}
+\item[EVTCHNOP\_bind\_virq]
+ Allocate a port and bind a VIRQ to it.
+ \begin{itemize}
+ \item Caller domain must have a free port to bind.
+ \item VIRQ must be valid.
+ \item VCPU must exist.
+ \item VIRQ must not currently be bound to an event channel.
+ \end{itemize}
+\item[EVTCHNOP\_bind\_ipi]
+ Allocate and bind a port for notifying other virtual CPUs.
+ \begin{itemize}
+ \item Caller domain must have a free port to bind.
+ \item VCPU must exist.
+ \end{itemize}
+\item[EVTCHNOP\_bind\_pirq]
+ Allocate and bind a port to a real IRQ.
+ \begin{itemize}
+ \item Caller domain must have a free port to bind.
+ \item PIRQ must be within the valid range.
+ \item Another binding for this PIRQ must not exist for this domain.
+ \item Caller must have an available port.
+ \end{itemize}
+\item[EVTCHNOP\_close]
+ Close an event channel (no more events will be received).
+ \begin{itemize}
+ \item Port must be valid (currently allocated).
+ \end{itemize}
+\item[EVTCHNOP\_send] Send a notification on an event channel attached
+ to a port.
+ \begin{itemize}
+ \item Port must be valid.
+ \item Only valid for Interdomain, IPI or Allocated Unbound ports.
+ \end{itemize}
+\item[EVTCHNOP\_status] Query the status of a port; what kind of port,
+ whether it is bound, what remote domain is expected, what PIRQ or
+ VIRQ it is bound to, what VCPU will be notified, etc.
+ Unprivileged domains may only query the state of their own ports.
+ Privileged domains may query any port.
+\item[EVTCHNOP\_bind\_vcpu] Bind event channel to a particular VCPU -
+ receive notification upcalls only on that VCPU.
+ \begin{itemize}
+ \item VCPU must exist.
+ \item Port must be valid.
+ \item Event channel must be either: allocated but unbound, bound to
+ an interdomain event channel, bound to a PIRQ.
+ \end{itemize}
+
+\end{description}
+
+%%
+%% grant_tables.tex
%%
-%% XXX SMH: not really sure how useful below is -- if it's still
-%% actually true, might be useful for someone wanting to write a
-%% new scheduler... not clear how many of them there are...
+%% Made by Mark Williamson
+%% Login <mark@maw48>
%%
-%% \include{src/interface/scheduling}
-%% scheduling information moved to scheduling.tex
-%% still commented out
+\chapter{Grant tables}
+\label{c:granttables}
+
+Xen's grant tables provide a generic mechanism to memory sharing
+between domains. This shared memory interface underpins the split
+device drivers for block and network IO.
+
+Each domain has its own {\bf grant table}. This is a data structure
+that is shared with Xen; it allows the domain to tell Xen what kind of
+permissions other domains have on its pages. Entries in the grant
+table are identified by {\bf grant references}. A grant reference is
+an integer, which indexes into the grant table. It acts as a
+capability which the grantee can use to perform operations on the
+granter's memory.
+
+This capability-based system allows shared-memory communications
+between unprivileged domains. A grant reference also encapsulates the
+details of a shared page, removing the need for a domain to know the
+real machine address of a page it is sharing. This makes it possible
+to share memory correctly with domains running in fully virtualised
+memory.
+
+\section{Interface}
+
+\subsection{Grant table manipulation}
+Creating and destroying grant references is done by direct access to
+the grant table. This removes the need to involve Xen when creating
+grant references, modifying access permissions, etc. The grantee
+domain will invoke hypercalls to use the grant references. Four main
+operations can be accomplished by directly manipulating the table:
+\begin{description}
+\item[Grant foreign access] allocate a new entry in the grant table
+ and fill out the access permissions accordingly. The access
+ permissions will be looked up by Xen when the grantee attempts to
+ use the reference to map the granted frame.
+\item[End foreign access] check that the grant reference is not
+ currently in use, then remove the mapping permissions for the frame.
+ This prevents further mappings from taking place but does not allow
+ forced revocations of existing mappings.
+\item[Grant foreign transfer] allocate a new entry in the table
+ specifying transfer permissions for the grantee. Xen will look up
+ this entry when the grantee attempts to transfer a frame to the
+ granter.
+\item[End foreign transfer] remove permissions to prevent a transfer
+ occurring in future. If the transfer is already committed,
+ modifying the grant table cannot prevent it from completing.
+\end{description}
+
+\subsection{Hypercalls}
+
+Use of grant references is accomplished via a hypercall. The grant
+table op hypercall takes three arguments:
+
+\hypercall{grant\_table\_op(unsigned int cmd, void *uop, unsigned int count)}
+
+{\tt cmd} indicates the grant table operation of interest. {\tt uop}
+is a pointer to a structure (or an array of structures) describing the
+operation to be performed. The {\tt count} field describes how many
+grant table operations are being batched together.
+
+The core logic is situated in {\tt xen/common/grant\_table.c}. The
+grant table operation hypercall can be used to perform the following
+actions:
+
+\begin{description}
+\item[GNTTABOP\_map\_grant\_ref] Given a grant reference from another
+ domain, map the referred page into the caller's address space.
+\item[GNTTABOP\_unmap\_grant\_ref] Remove a mapping to a granted frame
+ from the caller's address space. This is used to voluntarily
+ relinquish a mapping to a granted page.
+\item[GNTTABOP\_setup\_table] Setup grant table for caller domain.
+\item[GNTTABOP\_dump\_table] Debugging operation.
+\item[GNTTABOP\_transfer] Given a transfer reference from another
+ domain, transfer ownership of a page frame to that domain.
+\end{description}
%%
-%% XXX SMH: we probably should have something in here on debugging
-%% etc; this is a kinda developers manual and many devs seem to
-%% like debugging support :^)
-%% Possibly sanitize below, else wait until new xendbg stuff is in
-%% (and/or kip's stuff?) and write about that instead?
+%% xenstore.tex
+%%
+%% Made by Mark Williamson
+%% Login <mark@maw48>
+%%
+
+\chapter{Xenstore}
+
+Xenstore is the mechanism by which control-plane activities occur.
+These activities include:
+
+\begin{itemize}
+\item Setting up shared memory regions and event channels for use with
+ the split device drivers.
+\item Notifying the guest of control events (e.g. balloon driver
+ requests)
+\item Reporting back status information from the guest
+ (e.g. performance-related statistics, etc).
+\end{itemize}
+
+The store is arranged as a hierachical collection of key-value pairs.
+Each domain has a directory hierarchy containing data related to its
+configuration. Domains are permitted to register for notifications
+about changes in subtrees of the store, and to apply changes to the
+store transactionally.
+
+\section{Guidelines}
+
+A few principles govern the operation of the store:
+
+\begin{itemize}
+\item Domains should only modify the contents of their own
+ directories.
+\item The setup protocol for a device channel should simply consist of
+ entering the configuration data into the store.
+\item The store should allow device discovery without requiring the
+ relevant device drivers to be loaded: a Xen ``bus'' should be
+ visible to probing code in the guest.
+\item The store should be usable for inter-tool communications,
+ allowing the tools themselves to be decomposed into a number of
+ smaller utilities, rather than a single monolithic entity. This
+ also facilitates the development of alternate user interfaces to the
+ same functionality.
+\end{itemize}
+
+\section{Store layout}
+
+There are three main paths in XenStore:
+
+\begin{description}
+\item[/vm] stores configuration information about domain
+\item[/local/domain] stores information about the domain on the local node (domid, etc.)
+\item[/tool] stores information for the various tools
+\end{description}
+
+The {\tt /vm} path stores configuration information for a domain.
+This information doesn't change and is indexed by the domain's UUID.
+A {\tt /vm} entry contains the following information:
+
+\begin{description}
+\item[ssidref] ssid reference for domain
+\item[uuid] uuid of the domain (somewhat redundant)
+\item[on\_reboot] the action to take on a domain reboot request (destroy or restart)
+\item[on\_poweroff] the action to take on a domain halt request (destroy or restart)
+\item[on\_crash] the action to take on a domain crash (destroy or restart)
+\item[vcpus] the number of allocated vcpus for the domain
+\item[memory] the amount of memory (in megabytes) for the domain Note: appears to sometimes be empty for domain-0
+\item[vcpu\_avail] the number of active vcpus for the domain (vcpus - number of disabled vcpus)
+\item[name] the name of the domain
+\end{description}
+
+
+{\tt /vm/$<$uuid$>$/image/}
+
+The image path is only available for Domain-Us and contains:
+\begin{description}
+\item[ostype] identifies the builder type (linux or vmx)
+\item[kernel] path to kernel on domain-0
+\item[cmdline] command line to pass to domain-U kernel
+\item[ramdisk] path to ramdisk on domain-0
+\end{description}
+
+{\tt /local}
+
+The {\tt /local} path currently only contains one directory, {\tt
+/local/domain} that is indexed by domain id. It contains the running
+domain information. The reason to have two storage areas is that
+during migration, the uuid doesn't change but the domain id does. The
+{\tt /local/domain} directory can be created and populated before
+finalizing the migration enabling localhost to localhost migration.
+
+{\tt /local/domain/$<$domid$>$}
+
+This path contains:
+
+\begin{description}
+\item[cpu\_time] xend start time (this is only around for domain-0)
+\item[handle] private handle for xend
+\item[name] see /vm
+\item[on\_reboot] see /vm
+\item[on\_poweroff] see /vm
+\item[on\_crash] see /vm
+\item[vm] the path to the VM directory for the domain
+\item[domid] the domain id (somewhat redundant)
+\item[running] indicates that the domain is currently running
+\item[memory] the current memory in megabytes for the domain (empty for domain-0?)
+\item[maxmem\_KiB] the maximum memory for the domain (in kilobytes)
+\item[memory\_KiB] the memory allocated to the domain (in kilobytes)
+\item[cpu] the current CPU the domain is pinned to (empty for domain-0?)
+\item[cpu\_weight] the weight assigned to the domain
+\item[vcpu\_avail] a bitmap telling the domain whether it may use a given VCPU
+\item[online\_vcpus] how many vcpus are currently online
+\item[vcpus] the total number of vcpus allocated to the domain
+\item[console/] a directory for console information
+ \begin{description}
+ \item[ring-ref] the grant table reference of the console ring queue
+ \item[port] the event channel being used for the console ring queue (local port)
+ \item[tty] the current tty the console data is being exposed of
+ \item[limit] the limit (in bytes) of console data to buffer
+ \end{description}
+\item[backend/] a directory containing all backends the domain hosts
+ \begin{description}
+ \item[vbd/] a directory containing vbd backends
+ \begin{description}
+ \item[$<$domid$>$/] a directory containing vbd's for domid
+ \begin{description}
+ \item[$<$virtual-device$>$/] a directory for a particular
+ virtual-device on domid
+ \begin{description}
+ \item[frontend-id] domain id of frontend
+ \item[frontend] the path to the frontend domain
+ \item[physical-device] backend device number
+ \item[sector-size] backend sector size
+ \item[info] 0 read/write, 1 read-only (is this right?)
+ \item[domain] name of frontend domain
+ \item[params] parameters for device
+ \item[type] the type of the device
+ \item[dev] the virtual device (as given by the user)
+ \item[node] output from block creation script
+ \end{description}
+ \end{description}
+ \end{description}
+
+ \item[vif/] a directory containing vif backends
+ \begin{description}
+ \item[$<$domid$>$/] a directory containing vif's for domid
+ \begin{description}
+ \item[$<$vif number$>$/] a directory for each vif
+ \item[frontend-id] the domain id of the frontend
+ \item[frontend] the path to the frontend
+ \item[mac] the mac address of the vif
+ \item[bridge] the bridge the vif is connected to
+ \item[handle] the handle of the vif
+ \item[script] the script used to create/stop the vif
+ \item[domain] the name of the frontend
+ \end{description}
+ \end{description}
+ \end{description}
+
+ \item[device/] a directory containing the frontend devices for the
+ domain
+ \begin{description}
+ \item[vbd/] a directory containing vbd frontend devices for the
+ domain
+ \begin{description}
+ \item[$<$virtual-device$>$/] a directory containing the vbd frontend for
+ virtual-device
+ \begin{description}
+ \item[virtual-device] the device number of the frontend device
+ \item[backend-id] the domain id of the backend
+ \item[backend] the path of the backend in the store (/local/domain
+ path)
+ \item[ring-ref] the grant table reference for the block request
+ ring queue
+ \item[event-channel] the event channel used for the block request
+ ring queue
+ \end{description}
+
+ \item[vif/] a directory containing vif frontend devices for the
+ domain
+ \begin{description}
+ \item[$<$id$>$/] a directory for vif id frontend device for the domain
+ \begin{description}
+ \item[backend-id] the backend domain id
+ \item[mac] the mac address of the vif
+ \item[handle] the internal vif handle
+ \item[backend] a path to the backend's store entry
+ \item[tx-ring-ref] the grant table reference for the transmission ring queue
+ \item[rx-ring-ref] the grant table reference for the receiving ring queue
+ \item[event-channel] the event channel used for the two ring queues
+ \end{description}
+ \end{description}
+
+ \item[device-misc/] miscellanous information for devices
+ \begin{description}
+ \item[vif/] miscellanous information for vif devices
+ \begin{description}
+ \item[nextDeviceID] the next device id to use
+ \end{description}
+ \end{description}
+ \end{description}
+ \end{description}
+
+ \item[store/] per-domain information for the store
+ \begin{description}
+ \item[port] the event channel used for the store ring queue
+ \item[ring-ref] - the grant table reference used for the store's
+ communication channel
+ \end{description}
+
+ \item[image] - private xend information
+\end{description}
+
+
+\chapter{Devices}
+\label{c:devices}
+
+Virtual devices under Xen are provided by a {\bf split device driver}
+architecture. The illusion of the virtual device is provided by two
+co-operating drivers: the {\bf frontend}, which runs an the
+unprivileged domain and the {\bf backend}, which runs in a domain with
+access to the real device hardware (often called a {\bf driver
+domain}; in practice domain 0 usually fulfills this function).
+
+The frontend driver appears to the unprivileged guest as if it were a
+real device, for instance a block or network device. It receives IO
+requests from its kernel as usual, however since it does not have
+access to the physical hardware of the system it must then issue
+requests to the backend. The backend driver is responsible for
+receiving these IO requests, verifying that they are safe and then
+issuing them to the real device hardware. The backend driver appears
+to its kernel as a normal user of in-kernel IO functionality. When
+the IO completes the backend notifies the frontend that the data is
+ready for use; the frontend is then able to report IO completion to
+its own kernel.
+
+Frontend drivers are designed to be simple; most of the complexity is
+in the backend, which has responsibility for translating device
+addresses, verifying that requests are well-formed and do not violate
+isolation guarantees, etc.
+
+Split drivers exchange requests and responses in shared memory, with
+an event channel for asynchronous notifications of activity. When the
+frontend driver comes up, it uses Xenstore to set up a shared memory
+frame and an interdomain event channel for communications with the
+backend. Once this connection is established, the two can communicate
+directly by placing requests / responses into shared memory and then
+sending notifications on the event channel. This separation of
+notification from data transfer allows message batching, and results
+in very efficient device access.
+
+This chapter focuses on some individual split device interfaces
+available to Xen guests.
+
+
+\section{Network I/O}
+
+Virtual network device services are provided by shared memory
+communication with a backend domain. From the point of view of other
+domains, the backend may be viewed as a virtual ethernet switch
+element with each domain having one or more virtual network interfaces
+connected to it.
+
+From the point of view of the backend domain itself, the network
+backend driver consists of a number of ethernet devices. Each of
+these has a logical direct connection to a virtual network device in
+another domain. This allows the backend domain to route, bridge,
+firewall, etc the traffic to / from the other domains using normal
+operating system mechanisms.
+
+\subsection{Backend Packet Handling}
+
+The backend driver is responsible for a variety of actions relating to
+the transmission and reception of packets from the physical device.
+With regard to transmission, the backend performs these key actions:
+
+\begin{itemize}
+\item {\bf Validation:} To ensure that domains do not attempt to
+ generate invalid (e.g. spoofed) traffic, the backend driver may
+ validate headers ensuring that source MAC and IP addresses match the
+ interface that they have been sent from.
+
+ Validation functions can be configured using standard firewall rules
+ ({\small{\tt iptables}} in the case of Linux).
+
+\item {\bf Scheduling:} Since a number of domains can share a single
+ physical network interface, the backend must mediate access when
+ several domains each have packets queued for transmission. This
+ general scheduling function subsumes basic shaping or rate-limiting
+ schemes.
+
+\item {\bf Logging and Accounting:} The backend domain can be
+ configured with classifier rules that control how packets are
+ accounted or logged. For example, log messages might be generated
+ whenever a domain attempts to send a TCP packet containing a SYN.
+\end{itemize}
+
+On receipt of incoming packets, the backend acts as a simple
+demultiplexer: Packets are passed to the appropriate virtual interface
+after any necessary logging and accounting have been carried out.
+
+\subsection{Data Transfer}
+
+Each virtual interface uses two ``descriptor rings'', one for
+transmit, the other for receive. Each descriptor identifies a block
+of contiguous machine memory allocated to the domain.
+
+The transmit ring carries packets to transmit from the guest to the
+backend domain. The return path of the transmit ring carries messages
+indicating that the contents have been physically transmitted and the
+backend no longer requires the associated pages of memory.
+
+To receive packets, the guest places descriptors of unused pages on
+the receive ring. The backend will return received packets by
+exchanging these pages in the domain's memory with new pages
+containing the received data, and passing back descriptors regarding
+the new packets on the ring. This zero-copy approach allows the
+backend to maintain a pool of free pages to receive packets into, and
+then deliver them to appropriate domains after examining their
+headers.
+
+% Real physical addresses are used throughout, with the domain
+% performing translation from pseudo-physical addresses if that is
+% necessary.
+
+If a domain does not keep its receive ring stocked with empty buffers
+then packets destined to it may be dropped. This provides some
+defence against receive livelock problems because an overloaded domain
+will cease to receive further data. Similarly, on the transmit path,
+it provides the application with feedback on the rate at which packets
+are able to leave the system.
+
+Flow control on rings is achieved by including a pair of producer
+indexes on the shared ring page. Each side will maintain a private
+consumer index indicating the next outstanding message. In this
+manner, the domains cooperate to divide the ring into two message
+lists, one in each direction. Notification is decoupled from the
+immediate placement of new messages on the ring; the event channel
+will be used to generate notification when {\em either} a certain
+number of outstanding messages are queued, {\em or} a specified number
+of nanoseconds have elapsed since the oldest message was placed on the
+ring.
+
+%% Not sure if my version is any better -- here is what was here
+%% before: Synchronization between the backend domain and the guest is
+%% achieved using counters held in shared memory that is accessible to
+%% both. Each ring has associated producer and consumer indices
+%% indicating the area in the ring that holds descriptors that contain
+%% data. After receiving {\it n} packets or {\t nanoseconds} after
+%% receiving the first packet, the hypervisor sends an event to the
+%% domain.
+
+
+\subsection{Network ring interface}
+
+The network device uses two shared memory rings for communication: one
+for transmit, one for receieve.
+
+Transmit requests are described by the following structure:
+
+\scriptsize
+\begin{verbatim}
+typedef struct netif_tx_request {
+ grant_ref_t gref; /* Reference to buffer page */
+ uint16_t offset; /* Offset within buffer page */
+ uint16_t flags; /* NETTXF_* */
+ uint16_t id; /* Echoed in response message. */
+ uint16_t size; /* Packet size in bytes. */
+} netif_tx_request_t;
+\end{verbatim}
+\normalsize
+
+\begin{description}
+\item[gref] Grant reference for the network buffer
+\item[offset] Offset to data
+\item[flags] Transmit flags (currently only NETTXF\_csum\_blank is
+ supported, to indicate that the protocol checksum field is
+ incomplete).
+\item[id] Echoed to guest by the backend in the ring-level response so
+ that the guest can match it to this request
+\item[size] Buffer size
+\end{description}
+
+Each transmit request is followed by a transmit response at some later
+date. This is part of the shared-memory communication protocol and
+allows the guest to (potentially) retire internal structures related
+to the request. It does not imply a network-level response. This
+structure is as follows:
+
+\scriptsize
+\begin{verbatim}
+typedef struct netif_tx_response {
+ uint16_t id;
+ int16_t status;
+} netif_tx_response_t;
+\end{verbatim}
+\normalsize
+
+\begin{description}
+\item[id] Echo of the ID field in the corresponding transmit request.
+\item[status] Success / failure status of the transmit request.
+\end{description}
+
+Receive requests must be queued by the frontend, accompanied by a
+donation of page-frames to the backend. The backend transfers page
+frames full of data back to the guest
+
+\scriptsize
+\begin{verbatim}
+typedef struct {
+ uint16_t id; /* Echoed in response message. */
+ grant_ref_t gref; /* Reference to incoming granted frame */
+} netif_rx_request_t;
+\end{verbatim}
+\normalsize
+
+\begin{description}
+\item[id] Echoed by the frontend to identify this request when
+ responding.
+\item[gref] Transfer reference - the backend will use this reference
+ to transfer a frame of network data to us.
+\end{description}
+
+Receive response descriptors are queued for each received frame. Note
+that these may only be queued in reply to an existing receive request,
+providing an in-built form of traffic throttling.
+
+\scriptsize
+\begin{verbatim}
+typedef struct {
+ uint16_t id;
+ uint16_t offset; /* Offset in page of start of received packet */
+ uint16_t flags; /* NETRXF_* */
+ int16_t status; /* -ve: BLKIF_RSP_* ; +ve: Rx'ed pkt size. */
+} netif_rx_response_t;
+\end{verbatim}
+\normalsize
+
+\begin{description}
+\item[id] ID echoed from the original request, used by the guest to
+ match this response to the original request.
+\item[offset] Offset to data within the transferred frame.
+\item[flags] Transmit flags (currently only NETRXF\_csum\_valid is
+ supported, to indicate that the protocol checksum field has already
+ been validated).
+\item[status] Success / error status for this operation.
+\end{description}
+
+Note that the receive protocol includes a mechanism for guests to
+receive incoming memory frames but there is no explicit transfer of
+frames in the other direction. Guests are expected to return memory
+to the hypervisor in order to use the network interface. They {\em
+must} do this or they will exceed their maximum memory reservation and
+will not be able to receive incoming frame transfers. When necessary,
+the backend is able to replenish its pool of free network buffers by
+claiming some of this free memory from the hypervisor.
+
+\section{Block I/O}
+
+All guest OS disk access goes through the virtual block device VBD
+interface. This interface allows domains access to portions of block
+storage devices visible to the the block backend device. The VBD
+interface is a split driver, similar to the network interface
+described above. A single shared memory ring is used between the
+frontend and backend drivers for each virtual device, across which
+IO requests and responses are sent.
+
+Any block device accessible to the backend domain, including
+network-based block (iSCSI, *NBD, etc), loopback and LVM/MD devices,
+can be exported as a VBD. Each VBD is mapped to a device node in the
+guest, specified in the guest's startup configuration.
+
+\subsection{Data Transfer}
+
+The per-(virtual)-device ring between the guest and the block backend
+supports two messages:
+
+\begin{description}
+\item [{\small {\tt READ}}:] Read data from the specified block
+ device. The front end identifies the device and location to read
+ from and attaches pages for the data to be copied to (typically via
+ DMA from the device). The backend acknowledges completed read
+ requests as they finish.
+
+\item [{\small {\tt WRITE}}:] Write data to the specified block
+ device. This functions essentially as {\small {\tt READ}}, except
+ that the data moves to the device instead of from it.
+\end{description}
+
+%% Rather than copying data, the backend simply maps the domain's
+%% buffers in order to enable direct DMA to them. The act of mapping
+%% the buffers also increases the reference counts of the underlying
+%% pages, so that the unprivileged domain cannot try to return them to
+%% the hypervisor, install them as page tables, or any other unsafe
+%% behaviour.
%%
+%% % block API here
+
+\subsection{Block ring interface}
+
+The block interface is defined by the structures passed over the
+shared memory interface. These structures are either requests (from
+the frontend to the backend) or responses (from the backend to the
+frontend).
+
+The request structure is defined as follows:
+
+\scriptsize
+\begin{verbatim}
+typedef struct blkif_request {
+ uint8_t operation; /* BLKIF_OP_??? */
+ uint8_t nr_segments; /* number of segments */
+ blkif_vdev_t handle; /* only for read/write requests */
+ uint64_t id; /* private guest value, echoed in resp */
+ blkif_sector_t sector_number;/* start sector idx on disk (r/w only) */
+ struct blkif_request_segment {
+ grant_ref_t gref; /* reference to I/O buffer frame */
+ /* @first_sect: first sector in frame to transfer (inclusive). */
+ /* @last_sect: last sector in frame to transfer (inclusive). */
+ uint8_t first_sect, last_sect;
+ } seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+} blkif_request_t;
+\end{verbatim}
+\normalsize
+
+The fields are as follows:
+
+\begin{description}
+\item[operation] operation ID: one of the operations described above
+\item[nr\_segments] number of segments for scatter / gather IO
+ described by this request
+\item[handle] identifier for a particular virtual device on this
+ interface
+\item[id] this value is echoed in the response message for this IO;
+ the guest may use it to identify the original request
+\item[sector\_number] start sector on the virtal device for this
+ request
+\item[frame\_and\_sects] This array contains structures encoding
+ scatter-gather IO to be performed:
+ \begin{description}
+ \item[gref] The grant reference for the foreign I/O buffer page.
+ \item[first\_sect] First sector to access within the buffer page (0 to 7).
+ \item[last\_sect] Last sector to access within the buffer page (0 to 7).
+ \end{description}
+ Data will be transferred into frames at an offset determined by the
+ value of {\tt first\_sect}.
+\end{description}
+
+
+\chapter{Further Information}
+
+If you have questions that are not answered by this manual, the
+sources of information listed below may be of interest to you. Note
+that bug reports, suggestions and contributions related to the
+software (or the documentation) should be sent to the Xen developers'
+mailing list (address below).
+
+
+\section{Other documentation}
+
+If you are mainly interested in using (rather than developing for)
+Xen, the \emph{Xen Users' Manual} is distributed in the {\tt docs/}
+directory of the Xen source distribution.
+
+% Various HOWTOs are also available in {\tt docs/HOWTOS}.
+
+
+\section{Online references}
+
+The official Xen web site can be found at:
+\begin{quote} {\tt http://www.xensource.com}
+\end{quote}
+
+
+This contains links to the latest versions of all online
+documentation, including the latest version of the FAQ.
+
+Information regarding Xen is also available at the Xen Wiki at
+\begin{quote} {\tt http://wiki.xensource.com/xenwiki/}\end{quote}
+The Xen project uses Bugzilla as its bug tracking system. You'll find
+the Xen Bugzilla at http://bugzilla.xensource.com/bugzilla/.
+
+
+\section{Mailing lists}
+
+There are several mailing lists that are used to discuss Xen related
+topics. The most widely relevant are listed below. An official page of
+mailing lists and subscription information can be found at \begin{quote}
+ {\tt http://lists.xensource.com/} \end{quote}
+
+\begin{description}
+\item[xen-devel@lists.xensource.com] Used for development
+ discussions and bug reports. Subscribe at: \\
+ {\small {\tt http://lists.xensource.com/xen-devel}}
+\item[xen-users@lists.xensource.com] Used for installation and usage
+ discussions and requests for help. Subscribe at: \\
+ {\small {\tt http://lists.xensource.com/xen-users}}
+\item[xen-announce@lists.xensource.com] Used for announcements only.
+ Subscribe at: \\
+ {\small {\tt http://lists.xensource.com/xen-announce}}
+\item[xen-changelog@lists.xensource.com] Changelog feed
+ from the unstable and 2.0 trees - developer oriented. Subscribe at: \\
+ {\small {\tt http://lists.xensource.com/xen-changelog}}
+\end{description}
+
+\appendix
+
+
+\chapter{Xen Hypercalls}
+\label{a:hypercalls}
+
+Hypercalls represent the procedural interface to Xen; this appendix
+categorizes and describes the current set of hypercalls.
+
+\section{Invoking Hypercalls}
+
+Hypercalls are invoked in a manner analogous to system calls in a
+conventional operating system; a software interrupt is issued which
+vectors to an entry point within Xen. On x86/32 machines the
+instruction required is {\tt int \$82}; the (real) IDT is setup so
+that this may only be issued from within ring 1. The particular
+hypercall to be invoked is contained in {\tt EAX} --- a list
+mapping these values to symbolic hypercall names can be found
+in {\tt xen/include/public/xen.h}.
+
+On some occasions a set of hypercalls will be required to carry
+out a higher-level function; a good example is when a guest
+operating wishes to context switch to a new process which
+requires updating various privileged CPU state. As an optimization
+for these cases, there is a generic mechanism to issue a set of
+hypercalls as a batch:
+
+\begin{quote}
+\hypercall{multicall(void *call\_list, int nr\_calls)}
+
+Execute a series of hypervisor calls; {\tt nr\_calls} is the length of
+the array of {\tt multicall\_entry\_t} structures pointed to be {\tt
+call\_list}. Each entry contains the hypercall operation code followed
+by up to 7 word-sized arguments.
+\end{quote}
+
+Note that multicalls are provided purely as an optimization; there is
+no requirement to use them when first porting a guest operating
+system.
+
+
+\section{Virtual CPU Setup}
+
+At start of day, a guest operating system needs to setup the virtual
+CPU it is executing on. This includes installing vectors for the
+virtual IDT so that the guest OS can handle interrupts, page faults,
+etc. However the very first thing a guest OS must setup is a pair
+of hypervisor callbacks: these are the entry points which Xen will
+use when it wishes to notify the guest OS of an occurrence.
+
+\begin{quote}
+\hypercall{set\_callbacks(unsigned long event\_selector, unsigned long
+ event\_address, unsigned long failsafe\_selector, unsigned long
+ failsafe\_address) }
+
+Register the normal (``event'') and failsafe callbacks for
+event processing. In each case the code segment selector and
+address within that segment are provided. The selectors must
+have RPL 1; in XenLinux we simply use the kernel's CS for both
+{\tt event\_selector} and {\tt failsafe\_selector}.
+
+The value {\tt event\_address} specifies the address of the guest OSes
+event handling and dispatch routine; the {\tt failsafe\_address}
+specifies a separate entry point which is used only if a fault occurs
+when Xen attempts to use the normal callback.
+
+\end{quote}
+
+On x86/64 systems the hypercall takes slightly different
+arguments. This is because callback CS does not need to be specified
+(since teh callbacks are entered via SYSRET), and also because an
+entry address needs to be specified for SYSCALLs from guest user
+space:
+
+\begin{quote}
+\hypercall{set\_callbacks(unsigned long event\_address, unsigned long
+ failsafe\_address, unsigned long syscall\_address)}
+\end{quote}
+
+
+After installing the hypervisor callbacks, the guest OS can
+install a `virtual IDT' by using the following hypercall:
+
+\begin{quote}
+\hypercall{set\_trap\_table(trap\_info\_t *table)}
+
+Install one or more entries into the per-domain
+trap handler table (essentially a software version of the IDT).
+Each entry in the array pointed to by {\tt table} includes the
+exception vector number with the corresponding segment selector
+and entry point. Most guest OSes can use the same handlers on
+Xen as when running on the real hardware.
+
+
+\end{quote}
+
+A further hypercall is provided for the management of virtual CPUs:
+
+\begin{quote}
+\hypercall{vcpu\_op(int cmd, int vcpuid, void *extra\_args)}
+
+This hypercall can be used to bootstrap VCPUs, to bring them up and
+down and to test their current status.
+
+\end{quote}
+
+\section{Scheduling and Timer}
+
+Domains are preemptively scheduled by Xen according to the
+parameters installed by domain 0 (see Section~\ref{s:dom0ops}).
+In addition, however, a domain may choose to explicitly
+control certain behavior with the following hypercall:
+
+\begin{quote}
+\hypercall{sched\_op(unsigned long op)}
+
+Request scheduling operation from hypervisor. The options are: {\it
+SCHEDOP\_yield}, {\it SCHEDOP\_block}, and {\it SCHEDOP\_shutdown}.
+{\it yield} keeps the calling domain runnable but may cause a
+reschedule if other domains are runnable. {\it block} removes the
+calling domain from the run queue and cause is to sleeps until an
+event is delivered to it. {\it shutdown} is used to end the domain's
+execution; the caller can additionally specify whether the domain
+should reboot, halt or suspend.
+\end{quote}
+
+To aid the implementation of a process scheduler within a guest OS,
+Xen provides a virtual programmable timer:
+
+\begin{quote}
+\hypercall{set\_timer\_op(uint64\_t timeout)}
+
+Request a timer event to be sent at the specified system time (time
+in nanoseconds since system boot). The hypercall actually passes the
+64-bit timeout value as a pair of 32-bit values.
+
+\end{quote}
+
+Note that calling {\tt set\_timer\_op()} prior to {\tt sched\_op}
+allows block-with-timeout semantics.
+
+
+\section{Page Table Management}
+
+Since guest operating systems have read-only access to their page
+tables, Xen must be involved when making any changes. The following
+multi-purpose hypercall can be used to modify page-table entries,
+update the machine-to-physical mapping table, flush the TLB, install
+a new page-table base pointer, and more.
+
+\begin{quote}
+\hypercall{mmu\_update(mmu\_update\_t *req, int count, int *success\_count)}
+
+Update the page table for the domain; a set of {\tt count} updates are
+submitted for processing in a batch, with {\tt success\_count} being
+updated to report the number of successful updates.
+
+Each element of {\tt req[]} contains a pointer (address) and value;
+the least significant 2-bits of the pointer are used to distinguish
+the type of update requested as follows:
+\begin{description}
+
+\item[\it MMU\_NORMAL\_PT\_UPDATE:] update a page directory entry or
+page table entry to the associated value; Xen will check that the
+update is safe, as described in Chapter~\ref{c:memory}.
+
+\item[\it MMU\_MACHPHYS\_UPDATE:] update an entry in the
+ machine-to-physical table. The calling domain must own the machine
+ page in question (or be privileged).
+\end{description}
+
+\end{quote}
+
+Explicitly updating batches of page table entries is extremely
+efficient, but can require a number of alterations to the guest
+OS. Using the writable page table mode (Chapter~\ref{c:memory}) is
+recommended for new OS ports.
+
+Regardless of which page table update mode is being used, however,
+there are some occasions (notably handling a demand page fault) where
+a guest OS will wish to modify exactly one PTE rather than a
+batch, and where that PTE is mapped into the current address space.
+This is catered for by the following:
+
+\begin{quote}
+\hypercall{update\_va\_mapping(unsigned long va, uint64\_t val,
+ unsigned long flags)}
+
+Update the currently installed PTE that maps virtual address {\tt va}
+to new value {\tt val}. As with {\tt mmu\_update()}, Xen checks the
+modification is safe before applying it. The {\tt flags} determine
+which kind of TLB flush, if any, should follow the update.
+
+\end{quote}
+
+Finally, sufficiently privileged domains may occasionally wish to manipulate
+the pages of others:
+
+\begin{quote}
+\hypercall{update\_va\_mapping(unsigned long va, uint64\_t val,
+ unsigned long flags, domid\_t domid)}
+
+Identical to {\tt update\_va\_mapping()} save that the pages being
+mapped must belong to the domain {\tt domid}.
+
+\end{quote}
+
+An additional MMU hypercall provides an ``extended command''
+interface. This provides additional functionality beyond the basic
+table updating commands:
+
+\begin{quote}
+
+\hypercall{mmuext\_op(struct mmuext\_op *op, int count, int *success\_count, domid\_t domid)}
+
+This hypercall is used to perform additional MMU operations. These
+include updating {\tt cr3} (or just re-installing it for a TLB flush),
+requesting various kinds of TLB flush, flushing the cache, installing
+a new LDT, or pinning \& unpinning page-table pages (to ensure their
+reference count doesn't drop to zero which would require a
+revalidation of all entries). Some of the operations available are
+restricted to domains with sufficient system privileges.
+
+It is also possible for privileged domains to reassign page ownership
+via an extended MMU operation, although grant tables are used instead
+of this where possible; see Section~\ref{s:idc}.
+
+\end{quote}
+
+Finally, a hypercall interface is exposed to activate and deactivate
+various optional facilities provided by Xen for memory management.
+
+\begin{quote}
+\hypercall{vm\_assist(unsigned int cmd, unsigned int type)}
+
+Toggle various memory management modes (in particular writable page
+tables).
+
+\end{quote}
+
+\section{Segmentation Support}
+
+Xen allows guest OSes to install a custom GDT if they require it;
+this is context switched transparently whenever a domain is
+[de]scheduled. The following hypercall is effectively a
+`safe' version of {\tt lgdt}:
+
+\begin{quote}
+\hypercall{set\_gdt(unsigned long *frame\_list, int entries)}
+
+Install a global descriptor table for a domain; {\tt frame\_list} is
+an array of up to 16 machine page frames within which the GDT resides,
+with {\tt entries} being the actual number of descriptor-entry
+slots. All page frames must be mapped read-only within the guest's
+address space, and the table must be large enough to contain Xen's
+reserved entries (see {\tt xen/include/public/arch-x86\_32.h}).
+
+\end{quote}
+
+Many guest OSes will also wish to install LDTs; this is achieved by
+using {\tt mmu\_update()} with an extended command, passing the
+linear address of the LDT base along with the number of entries. No
+special safety checks are required; Xen needs to perform this task
+simply since {\tt lldt} requires CPL 0.
+
+
+Xen also allows guest operating systems to update just an
+individual segment descriptor in the GDT or LDT:
+
+\begin{quote}
+\hypercall{update\_descriptor(uint64\_t ma, uint64\_t desc)}
+
+Update the GDT/LDT entry at machine address {\tt ma}; the new
+8-byte descriptor is stored in {\tt desc}.
+Xen performs a number of checks to ensure the descriptor is
+valid.
+
+\end{quote}
+
+Guest OSes can use the above in place of context switching entire
+LDTs (or the GDT) when the number of changing descriptors is small.
+
+\section{Context Switching}
+
+When a guest OS wishes to context switch between two processes,
+it can use the page table and segmentation hypercalls described
+above to perform the the bulk of the privileged work. In addition,
+however, it will need to invoke Xen to switch the kernel (ring 1)
+stack pointer:
+
+\begin{quote}
+\hypercall{stack\_switch(unsigned long ss, unsigned long esp)}
+
+Request kernel stack switch from hypervisor; {\tt ss} is the new
+stack segment, which {\tt esp} is the new stack pointer.
+
+\end{quote}
+
+A useful hypercall for context switching allows ``lazy'' save and
+restore of floating point state:
+
+\begin{quote}
+\hypercall{fpu\_taskswitch(int set)}
+
+This call instructs Xen to set the {\tt TS} bit in the {\tt cr0}
+control register; this means that the next attempt to use floating
+point will cause a trap which the guest OS can trap. Typically it will
+then save/restore the FP state, and clear the {\tt TS} bit, using the
+same call.
+\end{quote}
+
+This is provided as an optimization only; guest OSes can also choose
+to save and restore FP state on all context switches for simplicity.
+
+Finally, a hypercall is provided for entering vm86 mode:
+
+\begin{quote}
+\hypercall{switch\_vm86}
+
+This allows the guest to run code in vm86 mode, which is needed for
+some legacy software.
+\end{quote}
+
+\section{Physical Memory Management}
+
+As mentioned previously, each domain has a maximum and current
+memory allocation. The maximum allocation, set at domain creation
+time, cannot be modified. However a domain can choose to reduce
+and subsequently grow its current allocation by using the
+following call:
+
+\begin{quote}
+\hypercall{memory\_op(unsigned int op, void *arg)}
+
+Increase or decrease current memory allocation (as determined by
+the value of {\tt op}). The available operations are:
+
+\begin{description}
+\item[XENMEM\_increase\_reservation] Request an increase in machine
+ memory allocation; {\tt arg} must point to a {\tt
+ xen\_memory\_reservation} structure.
+\item[XENMEM\_decrease\_reservation] Request a decrease in machine
+ memory allocation; {\tt arg} must point to a {\tt
+ xen\_memory\_reservation} structure.
+\item[XENMEM\_maximum\_ram\_page] Request the frame number of the
+ highest-addressed frame of machine memory in the system. {\tt arg}
+ must point to an {\tt unsigned long} where this value will be
+ stored.
+\item[XENMEM\_current\_reservation] Returns current memory reservation
+ of the specified domain.
+\item[XENMEM\_maximum\_reservation] Returns maximum memory resrevation
+ of the specified domain.
+\end{description}
+
+\end{quote}
+
+In addition to simply reducing or increasing the current memory
+allocation via a `balloon driver', this call is also useful for
+obtaining contiguous regions of machine memory when required (e.g.
+for certain PCI devices, or if using superpages).
+
+
+\section{Inter-Domain Communication}
+\label{s:idc}
+
+Xen provides a simple asynchronous notification mechanism via
+\emph{event channels}. Each domain has a set of end-points (or
+\emph{ports}) which may be bound to an event source (e.g. a physical
+IRQ, a virtual IRQ, or an port in another domain). When a pair of
+end-points in two different domains are bound together, then a `send'
+operation on one will cause an event to be received by the destination
+domain.
+
+The control and use of event channels involves the following hypercall:
+
+\begin{quote}
+\hypercall{event\_channel\_op(evtchn\_op\_t *op)}
+
+Inter-domain event-channel management; {\tt op} is a discriminated
+union which allows the following 7 operations:
+
+\begin{description}
+
+\item[\it alloc\_unbound:] allocate a free (unbound) local
+ port and prepare for connection from a specified domain.
+\item[\it bind\_virq:] bind a local port to a virtual
+IRQ; any particular VIRQ can be bound to at most one port per domain.
+\item[\it bind\_pirq:] bind a local port to a physical IRQ;
+once more, a given pIRQ can be bound to at most one port per
+domain. Furthermore the calling domain must be sufficiently
+privileged.
+\item[\it bind\_interdomain:] construct an interdomain event
+channel; in general, the target domain must have previously allocated
+an unbound port for this channel, although this can be bypassed by
+privileged domains during domain setup.
+\item[\it close:] close an interdomain event channel.
+\item[\it send:] send an event to the remote end of a
+interdomain event channel.
+\item[\it status:] determine the current status of a local port.
+\end{description}
+
+For more details see
+{\tt xen/include/public/event\_channel.h}.
+
+\end{quote}
+
+Event channels are the fundamental communication primitive between
+Xen domains and seamlessly support SMP. However they provide little
+bandwidth for communication {\sl per se}, and hence are typically
+married with a piece of shared memory to produce effective and
+high-performance inter-domain communication.
+
+Safe sharing of memory pages between guest OSes is carried out by
+granting access on a per page basis to individual domains. This is
+achieved by using the {\tt grant\_table\_op()} hypercall.
+
+\begin{quote}
+\hypercall{grant\_table\_op(unsigned int cmd, void *uop, unsigned int count)}
+
+Used to invoke operations on a grant reference, to setup the grant
+table and to dump the tables' contents for debugging.
+
+\end{quote}
+
+\section{IO Configuration}
+
+Domains with physical device access (i.e.\ driver domains) receive
+limited access to certain PCI devices (bus address space and
+interrupts). However many guest operating systems attempt to
+determine the PCI configuration by directly access the PCI BIOS,
+which cannot be allowed for safety.
+
+Instead, Xen provides the following hypercall:
+
+\begin{quote}
+\hypercall{physdev\_op(void *physdev\_op)}
+
+Set and query IRQ configuration details, set the system IOPL, set the
+TSS IO bitmap.
+
+\end{quote}
+
+
+For examples of using {\tt physdev\_op()}, see the
+Xen-specific PCI code in the linux sparse tree.
+
+\section{Administrative Operations}
+\label{s:dom0ops}
+
+A large number of control operations are available to a sufficiently
+privileged domain (typically domain 0). These allow the creation and
+management of new domains, for example. A complete list is given
+below: for more details on any or all of these, please see
+{\tt xen/include/public/dom0\_ops.h}
+
+
+\begin{quote}
+\hypercall{dom0\_op(dom0\_op\_t *op)}
+
+Administrative domain operations for domain management. The options are:
+
+\begin{description}
+\item [\it DOM0\_GETMEMLIST:] get list of pages used by the domain
+
+\item [\it DOM0\_SCHEDCTL:]
+
+\item [\it DOM0\_ADJUSTDOM:] adjust scheduling priorities for domain
+
+\item [\it DOM0\_CREATEDOMAIN:] create a new domain
+
+\item [\it DOM0\_DESTROYDOMAIN:] deallocate all resources associated
+with a domain
+
+\item [\it DOM0\_PAUSEDOMAIN:] remove a domain from the scheduler run
+queue.
+
+\item [\it DOM0\_UNPAUSEDOMAIN:] mark a paused domain as schedulable
+ once again.
+
+\item [\it DOM0\_GETDOMAININFO:] get statistics about the domain
+
+\item [\it DOM0\_SETDOMAININFO:] set VCPU-related attributes
+
+\item [\it DOM0\_MSR:] read or write model specific registers
+
+\item [\it DOM0\_DEBUG:] interactively invoke the debugger
+
+\item [\it DOM0\_SETTIME:] set system time
+
+\item [\it DOM0\_GETPAGEFRAMEINFO:]
+
+\item [\it DOM0\_READCONSOLE:] read console content from hypervisor buffer ring
+
+\item [\it DOM0\_PINCPUDOMAIN:] pin domain to a particular CPU
+
+\item [\it DOM0\_TBUFCONTROL:] get and set trace buffer attributes
+
+\item [\it DOM0\_PHYSINFO:] get information about the host machine
+
+\item [\it DOM0\_SCHED\_ID:] get the ID of the current Xen scheduler
+
+\item [\it DOM0\_SHADOW\_CONTROL:] switch between shadow page-table modes
+
+\item [\it DOM0\_SETDOMAINMAXMEM:] set maximum memory allocation of a domain
+
+\item [\it DOM0\_GETPAGEFRAMEINFO2:] batched interface for getting
+page frame info
+
+\item [\it DOM0\_ADD\_MEMTYPE:] set MTRRs
+
+\item [\it DOM0\_DEL\_MEMTYPE:] remove a memory type range
+
+\item [\it DOM0\_READ\_MEMTYPE:] read MTRR
+
+\item [\it DOM0\_PERFCCONTROL:] control Xen's software performance
+counters
+
+\item [\it DOM0\_MICROCODE:] update CPU microcode
+
+\item [\it DOM0\_IOPORT\_PERMISSION:] modify domain permissions for an
+IO port range (enable / disable a range for a particular domain)
+
+\item [\it DOM0\_GETVCPUCONTEXT:] get context from a VCPU
+
+\item [\it DOM0\_GETVCPUINFO:] get current state for a VCPU
+\item [\it DOM0\_GETDOMAININFOLIST:] batched interface to get domain
+info
+
+\item [\it DOM0\_PLATFORM\_QUIRK:] inform Xen of a platform quirk it
+needs to handle (e.g. noirqbalance)
+
+\item [\it DOM0\_PHYSICAL\_MEMORY\_MAP:] get info about dom0's memory
+map
+
+\item [\it DOM0\_MAX\_VCPUS:] change max number of VCPUs for a domain
+
+\item [\it DOM0\_SETDOMAINHANDLE:] set the handle for a domain
+
+\end{description}
+\end{quote}
+
+Most of the above are best understood by looking at the code
+implementing them (in {\tt xen/common/dom0\_ops.c}) and in
+the user-space tools that use them (mostly in {\tt tools/libxc}).
+
+Hypercalls relating to the management of the Access Control Module are
+also restricted to domain 0 access for now:
+
+\begin{quote}
+
+\hypercall{acm\_op(struct acm\_op * u\_acm\_op)}
+
+This hypercall can be used to configure the state of the ACM, query
+that state, request access control decisions and dump additional
+information.
+
+\end{quote}
+
+
+\section{Debugging Hypercalls}
+
+A few additional hypercalls are mainly useful for debugging:
+
+\begin{quote}
+\hypercall{console\_io(int cmd, int count, char *str)}
+
+Use Xen to interact with the console; operations are:
+
+{\it CONSOLEIO\_write}: Output count characters from buffer str.
+
+{\it CONSOLEIO\_read}: Input at most count characters into buffer str.
+\end{quote}
+
+A pair of hypercalls allows access to the underlying debug registers:
+\begin{quote}
+\hypercall{set\_debugreg(int reg, unsigned long value)}
+
+Set debug register {\tt reg} to {\tt value}
+
+\hypercall{get\_debugreg(int reg)}
+
+Return the contents of the debug register {\tt reg}
+\end{quote}
+
+And finally:
+\begin{quote}
+\hypercall{xen\_version(int cmd)}
+
+Request Xen version number.
+\end{quote}
-%% \include{src/interface/debugging}
-%% debugging information moved to debugging.tex
-%% still commented out
+This is useful to ensure that user-space tools are in sync
+with the underlying hypervisor.
\end{document}