Diffstat (limited to 'docs/src/interface.tex')
-rw-r--r--  docs/src/interface.tex | 2216
1 file changed, 0 insertions(+), 2216 deletions(-)
diff --git a/docs/src/interface.tex b/docs/src/interface.tex
deleted file mode 100644
index dd061cbfff..0000000000
--- a/docs/src/interface.tex
+++ /dev/null
@@ -1,2216 +0,0 @@
-\documentclass[11pt,twoside,final,openright,a4paper]{report}
-\usepackage{graphicx,html,setspace,times}
-\usepackage{parskip}
-\setstretch{1.15}
-
-% LIBRARY FUNCTIONS
-
-\newcommand{\hypercall}[1]{\vspace{2mm}{\sf #1}}
-
-\begin{document}
-
-% TITLE PAGE
-\pagestyle{empty}
-\begin{center}
-\vspace*{\fill}
-\includegraphics{figs/xenlogo.eps}
-\vfill
-\vfill
-\vfill
-\begin{tabular}{l}
-{\Huge \bf Interface manual} \\[4mm]
-{\huge Xen v3.0 for x86} \\[80mm]
-
-{\Large Xen is Copyright (c) 2002-2005, The Xen Team} \\[3mm]
-{\Large University of Cambridge, UK} \\[20mm]
-\end{tabular}
-\end{center}
-
-{\bf DISCLAIMER: This documentation is always under active development
-and as such there may be mistakes and omissions --- watch out for
-these and please report any you find to the developer's mailing list.
-The latest version is always available on-line. Contributions of
-material, suggestions and corrections are welcome. }
-
-\vfill
-\cleardoublepage
-
-% TABLE OF CONTENTS
-\pagestyle{plain}
-\pagenumbering{roman}
-{ \parskip 0pt plus 1pt
- \tableofcontents }
-\cleardoublepage
-
-% PREPARE FOR MAIN TEXT
-\pagenumbering{arabic}
-\raggedbottom
-\widowpenalty=10000
-\clubpenalty=10000
-\parindent=0pt
-\parskip=5pt
-\renewcommand{\topfraction}{.8}
-\renewcommand{\bottomfraction}{.8}
-\renewcommand{\textfraction}{.2}
-\renewcommand{\floatpagefraction}{.8}
-\setstretch{1.1}
-
-\chapter{Introduction}
-
-Xen allows the hardware resources of a machine to be virtualized and
-dynamically partitioned, allowing multiple different {\em guest}
-operating system images to be run simultaneously. Virtualizing the
-machine in this manner provides considerable flexibility, for example
-allowing different users to choose their preferred operating system
-(e.g., Linux, NetBSD, or a custom operating system). Furthermore, Xen
-provides secure partitioning between virtual machines (known as
-{\em domains} in Xen terminology), and enables better resource
-accounting and QoS isolation than can be achieved with a conventional
-operating system.
-
-Xen essentially takes a `whole machine' virtualization approach as
-pioneered by IBM VM/370. However, unlike VM/370 or more recent
-efforts such as VMware and Virtual PC, Xen does not attempt to
-completely virtualize the underlying hardware. Instead parts of the
-hosted guest operating systems are modified to work with the VMM; the
-operating system is effectively ported to a new target architecture,
-typically requiring changes in just the machine-dependent code. The
-user-level API is unchanged, and so existing binaries and operating
-system distributions work without modification.
-
-In addition to exporting virtualized instances of CPU, memory, network
-and block devices, Xen exposes a control interface to manage how these
-resources are shared between the running domains. Access to the
-control interface is restricted: it may only be used by one
-specially-privileged VM, known as {\em domain 0}. This domain is a
-required part of any Xen-based server and runs the application software
-that manages the control-plane aspects of the platform. Running the
-control software in {\it domain 0}, distinct from the hypervisor
-itself, allows the Xen framework to separate the notions of
-mechanism and policy within the system.
-
-
-\chapter{Virtual Architecture}
-
-In a Xen/x86 system, only the hypervisor runs with full processor
-privileges ({\it ring 0} in the x86 four-ring model). It has full
-access to the physical memory available in the system and is
-responsible for allocating portions of it to running domains.
-
-On a 32-bit x86 system, guest operating systems may use {\it rings 1},
-{\it 2} and {\it 3} as they see fit. Segmentation is used to prevent
-the guest OS from accessing the portion of the address space that is
-reserved for Xen. We expect most guest operating systems will use
-ring 1 for their own operation and place applications in ring 3.
-
-On 64-bit systems it is not possible to protect the hypervisor from
-untrusted guest code running in rings 1 and 2. Guests are therefore
-restricted to run in ring 3 only. The guest kernel is protected from
-its applications by switching the address space when context
-switching between the kernel and the currently running application.
-
-In this chapter we consider the basic virtual architecture provided by
-Xen: CPU state, exception and interrupt handling, and time.
-Other aspects such as memory and device access are discussed in later
-chapters.
-
-
-\section{CPU state}
-
-All privileged state must be handled by Xen. The guest OS has no
-direct access to CR3 and is not permitted to update privileged bits in
-EFLAGS. Guest OSes use \emph{hypercalls} to invoke operations in Xen;
-these are analogous to system calls but occur from ring 1 to ring 0.
-
-A list of all hypercalls is given in Appendix~\ref{a:hypercalls}.
-
-
-\section{Exceptions}
-
-A virtual IDT is provided --- a domain can submit a table of trap
-handlers to Xen via the {\bf set\_trap\_table} hypercall. The
-exception stack frame presented to a virtual trap handler is identical
-to its native equivalent.
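-
-As an illustration, a minimal sketch of registering a virtual IDT is
-shown below. It assumes the usual guest-side hypercall wrapper {\bf
-HYPERVISOR\_set\_trap\_table} and hypothetical handler symbols; the
-{\bf trap\_info\_t} layout (vector, privilege flags, code segment
-selector, handler address) follows {\bf xen/include/public/xen.h},
-and the table is terminated by a zeroed entry:
-
-\scriptsize
-\begin{verbatim}
-/* Hypothetical example: page-fault and breakpoint handlers. The
- * flags field gives the privilege level allowed to invoke the trap
- * (3 for int3, so applications can trigger it directly). */
-static trap_info_t traps[] = {
-    { 14, 0, FLAT_KERNEL_CS, (unsigned long)page_fault_handler },
-    {  3, 3, FLAT_KERNEL_CS, (unsigned long)int3_handler       },
-    {  0, 0, 0, 0 }                        /* zeroed terminator */
-};
-
-HYPERVISOR_set_trap_table(traps);
-\end{verbatim}
-\normalsize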
-
-
-\section{Interrupts and events}
-
-Interrupts are virtualized by mapping them to \emph{event channels},
-which are delivered asynchronously to the target domain using a callback
-supplied via the {\bf set\_callbacks} hypercall. A guest OS can map
-these events onto its standard interrupt dispatch mechanisms. Xen is
-responsible for determining the target domain that will handle each
-physical interrupt source. For more details on the binding of event
-sources to event channels, see Chapter~\ref{c:devices}.
-
-
-\section{Time}
-
-Guest operating systems need to be aware of the passage of both real
-(or wallclock) time and their own `virtual time' (the time for which
-they have been executing). Furthermore, Xen has a notion of time which
-is used for scheduling. The following notions of time are provided:
-
-\begin{description}
-\item[Cycle counter time.]
-
- This provides a fine-grained time reference. The cycle counter time
- is used to accurately extrapolate the other time references. On SMP
- machines it is currently assumed that the cycle counter time is
- synchronized between CPUs. The current x86-based implementation
- achieves this within inter-CPU communication latencies.
-
-\item[System time.]
-
- This is a 64-bit counter which holds the number of nanoseconds that
- have elapsed since system boot.
-
-\item[Wall clock time.]
-
- This is the time of day in a Unix-style {\bf struct timeval}
- (seconds and microseconds since 1 January 1970, adjusted by leap
- seconds). An NTP client hosted by {\it domain 0} can keep this
- value accurate.
-
-\item[Domain virtual time.]
-
- This progresses at the same pace as system time, but only while a
- domain is executing --- it stops while a domain is de-scheduled.
- Therefore the share of the CPU that a domain receives is indicated
- by the rate at which its virtual time increases.
-
-\end{description}
-
-
-Xen exports timestamps for system time and wall-clock time to guest
-operating systems through a shared page of memory. Xen also provides
-the cycle counter time at the instant the timestamps were calculated,
-and the CPU frequency in Hertz. This allows the guest to extrapolate
-system and wall-clock times accurately based on the current cycle
-counter time.
-
-Since the timestamps must be updated and read \emph{atomically}, a
-version number is also stored in the shared info page; it is
-incremented before and after the timestamps are updated. A guest can
-therefore be sure it has read a consistent state by reading the
-version before and after reading the timestamps, and checking that
-the two values are equal and even.
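-
-A minimal sketch of this check follows, assuming a pointer {\bf
-shared} to the mapped shared info page and a read barrier {\bf
-rmb()} supplied by the guest:
-
-\scriptsize
-\begin{verbatim}
-uint32_t version, sec, nsec;
-
-do {
-    version = shared->wc_version;   /* odd => update in progress */
-    rmb();
-    sec  = shared->wc_sec;
-    nsec = shared->wc_nsec;
-    rmb();
-} while ((version & 1) || (version != shared->wc_version));
-\end{verbatim}
-\normalsize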
-
-Xen includes a periodic ticker which sends a timer event to the
-currently executing domain every 10ms. The Xen scheduler also sends a
-timer event whenever a domain is scheduled; this allows the guest OS
-to adjust for the time that has passed while it has been inactive. In
-addition, Xen allows each domain to request that it receive a timer
-event at a specified system time, using the {\bf set\_timer\_op}
-hypercall. Guest OSes may use this timer to implement timeout values
-when they block.
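-
-For example, a guest that wishes to be woken in 10ms might issue the
-following (a sketch, assuming a hypothetical helper that returns the
-current extrapolated system time; the deadline passed to Xen is an
-absolute system time in nanoseconds):
-
-\scriptsize
-\begin{verbatim}
-uint64_t now = read_system_time_ns();       /* assumed helper */
-HYPERVISOR_set_timer_op(now + 10000000ULL); /* wake in 10ms   */
-\end{verbatim}
-\normalsize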
-
-
-\section{Xen CPU Scheduling}
-
-Xen offers a uniform API for CPU schedulers. It is possible to choose
-from a number of schedulers at boot and it should be easy to add more.
-The SEDF and Credit schedulers are part of the normal Xen
-distribution. SEDF is scheduled for removal and its use should be
-avoided once the Credit scheduler has stabilized and become the default.
-The Credit scheduler provides proportional fair shares of the
-host's CPUs to the running domains. It does this while transparently
-load balancing runnable VCPUs across the whole system.
-
-\paragraph*{Note: SMP host support}
-Xen has always supported SMP host systems. When using the credit scheduler,
-a domain's VCPUs will be dynamically moved across physical CPUs to maximise
-domain and system throughput. VCPUs can also be manually restricted to be
-mapped only on a subset of the host's physical CPUs, using the pinning
-mechanism.
-
-
-%% More information on the characteristics and use of these schedulers
-%% is available in {\bf Sched-HOWTO.txt}.
-
-
-\section{Privileged operations}
-
-Xen exports an extended interface to privileged domains (viz.\ {\it
- Domain 0}). This allows such domains to build and boot other domains
-on the server, and provides control interfaces for managing
-scheduling, memory, networking, and block devices.
-
-\chapter{Memory}
-\label{c:memory}
-
-Xen is responsible for managing the allocation of physical memory to
-domains, and for ensuring safe use of the paging and segmentation
-hardware.
-
-
-\section{Memory Allocation}
-
-As well as allocating a portion of physical memory for its own private
-use, Xen also reserves a small fixed portion of every virtual address
-space. This is located in the top 64MB on 32-bit systems, the top
-168MB on PAE systems, and a larger portion in the middle of the
-address space on 64-bit systems. Unreserved physical memory is
-available for allocation to domains at a page granularity. Xen tracks
-the ownership and use of each page, which allows it to enforce secure
-partitioning between domains.
-
-Each domain has a maximum and current physical memory allocation. A
-guest OS may run a `balloon driver' to dynamically adjust its current
-memory allocation up to its limit.
-
-
-\section{Pseudo-Physical Memory}
-
-Since physical memory is allocated and freed on a page granularity,
-there is no guarantee that a domain will receive a contiguous stretch
-of physical memory. However most operating systems do not have good
-support for operating in a fragmented physical address space. To aid
-porting such operating systems to run on top of Xen, we make a
-distinction between \emph{machine memory} and \emph{pseudo-physical
- memory}.
-
-Put simply, machine memory refers to the entire amount of memory
-installed in the machine, including that reserved by Xen, in use by
-various domains, or currently unallocated. We consider machine memory
-to comprise a set of 4kB \emph{machine page frames} numbered
-consecutively starting from 0. A machine frame number has the same
-meaning within Xen and within any domain.
-
-Pseudo-physical memory, on the other hand, is a per-domain
-abstraction. It allows a guest operating system to consider its memory
-allocation to consist of a contiguous range of physical page frames
-starting at physical frame 0, despite the fact that the underlying
-machine page frames may be sparsely allocated and in any order.
-
-To achieve this, Xen maintains a globally readable {\it
- machine-to-physical} table which records the mapping from machine
-page frames to pseudo-physical ones. In addition, each domain is
-supplied with a {\it physical-to-machine} table which performs the
-inverse mapping. Clearly the machine-to-physical table has size
-proportional to the amount of RAM installed in the machine, while each
-physical-to-machine table has size proportional to the memory
-allocation of the given domain.
-
-Architecture dependent code in guest operating systems can then use
-the two tables to provide the abstraction of pseudo-physical memory.
-In general, only certain specialized parts of the operating system
-(such as page table management) need to understand the difference
-between machine and pseudo-physical addresses.
-
-
-\section{Page Table Updates}
-
-In the default mode of operation, Xen enforces read-only access to
-page tables and requires guest operating systems to explicitly request
-any modifications. Xen validates all such requests and only applies
-updates that it deems safe. This is necessary to prevent domains from
-adding arbitrary mappings to their page tables.
-
-To aid validation, Xen associates a type and reference count with each
-memory page. A page has one of the following mutually-exclusive types
-at any point in time: page directory ({\sf PD}), page table ({\sf
- PT}), local descriptor table ({\sf LDT}), global descriptor table
-({\sf GDT}), or writable ({\sf RW}). Note that a guest OS may always
-create readable mappings of its own memory regardless of its current
-type.
-
-%%% XXX: possibly explain more about ref count 'lifecyle' here?
-This mechanism is used to maintain the invariants required for safety;
-for example, a domain cannot have a writable mapping to any part of a
-page table as this would require the page concerned to simultaneously
-be of types {\sf PT} and {\sf RW}.
-
-\hypercall{mmu\_update(mmu\_update\_t *req, int count, int *success\_count, domid\_t domid)}
-
-This hypercall is used to make updates to either the domain's
-pagetables or to the machine-to-physical mapping table. It supports
-submitting a queue of updates, allowing batching for maximal
-performance. Explicitly queuing updates using this interface will
-cause any outstanding writable pagetable state to be flushed from the
-system.
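-
-The sketch below batches two pagetable-entry writes into a single
-hypercall. It assumes a hypothetical helper {\bf
-pte\_machine\_addr()} returning the machine address of the PTE that
-maps a given virtual address; {\bf MMU\_NORMAL\_PT\_UPDATE} in the
-low bits of {\bf ptr} selects an ordinary pagetable update:
-
-\scriptsize
-\begin{verbatim}
-mmu_update_t req[2];
-int done;
-
-req[0].ptr = pte_machine_addr(va0) | MMU_NORMAL_PT_UPDATE;
-req[0].val = new_pte0;
-req[1].ptr = pte_machine_addr(va1) | MMU_NORMAL_PT_UPDATE;
-req[1].val = new_pte1;
-
-if (HYPERVISOR_mmu_update(req, 2, &done, DOMID_SELF) < 0)
-    /* Xen judged one of the updates unsafe */ ;
-\end{verbatim}
-\normalsize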
-
-\section{Writable Page Tables}
-
-Xen also provides an alternative mode of operation in which guests
-have the illusion that their page tables are directly writable. Of
-course this is not really the case, since Xen must still validate
-modifications to ensure secure partitioning. To this end, Xen traps
-any write attempt to a memory page of type {\sf PT} (i.e., that is
-currently part of a page table). If such an access occurs, Xen
-temporarily allows write access to that page while at the same time
-\emph{disconnecting} it from the page table that is currently in use.
-This allows the guest to safely make updates to the page because the
-newly-updated entries cannot be used by the MMU until Xen revalidates
-and reconnects the page. Reconnection occurs automatically in a
-number of situations: for example, when the guest modifies a different
-page-table page, when the domain is preempted, or whenever the guest
-uses Xen's explicit page-table update interfaces.
-
-Writable pagetable functionality is enabled when the guest requests
-it, using a {\bf vm\_assist} hypercall. Writable pagetables do {\em
-not} provide full virtualisation of the MMU, so the memory management
-code of the guest still needs to be aware that it is running on Xen.
-Since the guest's page tables are used directly, it must translate
-pseudo-physical addresses to real machine addresses when building page
-table entries. The guest may not attempt to map its own pagetables
-writably, since this would violate the memory type invariants; page
-tables will automatically be made writable by the hypervisor, as
-necessary.
-
-\section{Shadow Page Tables}
-
-Finally, Xen also supports a form of \emph{shadow page tables} in
-which the guest OS uses an independent copy of page tables which are
-unknown to the hardware (i.e.\ which are never pointed to by {\tt
- cr3}). Instead Xen propagates changes made to the guest's tables to
-the real ones, and vice versa. This is useful for logging page writes
-(e.g.\ for live migration or checkpointing). A full version of the shadow
-page tables also allows guest OS porting with less effort.
-
-
-\section{Segment Descriptor Tables}
-
-At start of day a guest is supplied with a default GDT, which does not
-reside within its own memory allocation. If the guest wishes to use
-segments other than the default `flat' ring-1 and ring-3 segments that
-this GDT provides, it must register a custom GDT and/or LDT with Xen,
-allocated from its own memory.
-
-The following hypercall is used to specify a new GDT:
-
-\begin{quote}
- int {\bf set\_gdt}(unsigned long *{\em frame\_list}, int {\em
- entries})
-
- \emph{frame\_list}: An array of up to 14 machine page frames within
- which the GDT resides. Any frame registered as a GDT frame may only
- be mapped read-only within the guest's address space (e.g., no
- writable mappings, no use as a page-table page, and so on). Only 14
- pages may be specified because pages 15 and 16 are reserved for
- the hypervisor's GDT entries.
-
- \emph{entries}: The number of descriptor-entry slots in the GDT.
-\end{quote}
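-
-As a sketch, registering a single-page GDT allocated from the guest's
-own memory might look as follows (assuming a hypothetical {\bf
-virt\_to\_mfn()} pseudo-physical translation helper; one 4kB page
-holds 512 eight-byte descriptors):
-
-\scriptsize
-\begin{verbatim}
-unsigned long frames[1] = { virt_to_mfn(my_gdt_page) };
-
-if (HYPERVISOR_set_gdt(frames, 512) != 0)
-    /* registration was rejected */ ;
-\end{verbatim}
-\normalsize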
-
-The LDT is updated via the generic MMU update mechanism (i.e., via the
-{\bf mmu\_update} hypercall).
-
-\section{Start of Day}
-
-The start-of-day environment for guest operating systems is rather
-different to that provided by the underlying hardware. In particular,
-the processor is already executing in protected mode with paging
-enabled.
-
-{\it Domain 0} is created and booted by Xen itself. For all subsequent
-domains, the analogue of the boot-loader is the {\it domain builder},
-user-space software running in {\it domain 0}. The domain builder is
-responsible for building the initial page tables for a domain and
-loading its kernel image at the appropriate virtual address.
-
-\section{VM assists}
-
-Xen provides a number of ``assists'' for guest memory management.
-These are available on an ``opt-in'' basis to provide commonly-used
-extra functionality to a guest.
-
-\hypercall{vm\_assist(unsigned int cmd, unsigned int type)}
-
-The {\bf cmd} parameter describes the action to be taken, whilst the
-{\bf type} parameter describes the kind of assist that is being
-referred to. Available commands are as follows:
-
-\begin{description}
-\item[VMASST\_CMD\_enable] Enable a particular assist type
-\item[VMASST\_CMD\_disable] Disable a particular assist type
-\end{description}
-
-And the available types are:
-
-\begin{description}
-\item[VMASST\_TYPE\_4gb\_segments] Provide emulated support for
- instructions that rely on 4GB segments (such as the techniques used
- by some TLS solutions).
-\item[VMASST\_TYPE\_4gb\_segments\_notify] Provide a callback (via trap number
- 15) to the guest if the above segment fixups are used: allows the guest to
- display a warning message during boot.
-\item[VMASST\_TYPE\_writable\_pagetables] Enable writable pagetable
- mode, described above.
-\end{description}
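-
-For example, a guest that wants writable pagetable mode might make
-the following call at boot (a sketch, assuming the usual {\bf
-HYPERVISOR\_vm\_assist} hypercall wrapper):
-
-\scriptsize
-\begin{verbatim}
-HYPERVISOR_vm_assist(VMASST_CMD_enable,
-                     VMASST_TYPE_writable_pagetables);
-\end{verbatim}
-\normalsize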
-
-
-\chapter{Xen Info Pages}
-
-The {\bf Shared info page} is used to share various CPU-related state
-between the guest OS and the hypervisor. This information includes VCPU
-status, time information and event channel (virtual interrupt) state.
-The {\bf Start info page} is used to pass build-time information to
-the guest when it boots and when it is resumed from a suspended state.
-This chapter documents the fields included in the {\bf
-shared\_info\_t} and {\bf start\_info\_t} structures for use by the
-guest OS.
-
-\section{Shared info page}
-
-The {\bf shared\_info\_t} is accessed at run time by both Xen and the
-guest OS. It is used to pass information relating to the
-virtual CPU and virtual machine state between the OS and the
-hypervisor.
-
-The structure is declared in {\bf xen/include/public/xen.h}:
-
-\scriptsize
-\begin{verbatim}
-typedef struct shared_info {
- vcpu_info_t vcpu_info[XEN_LEGACY_MAX_VCPUS];
-
- /*
- * A domain can create "event channels" on which it can send and receive
- * asynchronous event notifications. There are three classes of event that
- * are delivered by this mechanism:
- * 1. Bi-directional inter- and intra-domain connections. Domains must
- * arrange out-of-band to set up a connection (usually by allocating
- * an unbound 'listener' port and advertising that via a storage service
- * such as xenstore).
- * 2. Physical interrupts. A domain with suitable hardware-access
- * privileges can bind an event-channel port to a physical interrupt
- * source.
- * 3. Virtual interrupts ('events'). A domain can bind an event-channel
- * port to a virtual interrupt source, such as the virtual-timer
- * device or the emergency console.
- *
- * Event channels are addressed by a "port index". Each channel is
- * associated with two bits of information:
- * 1. PENDING -- notifies the domain that there is a pending notification
- * to be processed. This bit is cleared by the guest.
- * 2. MASK -- if this bit is clear then a 0->1 transition of PENDING
- * will cause an asynchronous upcall to be scheduled. This bit is only
- * updated by the guest. It is read-only within Xen. If a channel
- * becomes pending while the channel is masked then the 'edge' is lost
- * (i.e., when the channel is unmasked, the guest must manually handle
- * pending notifications as no upcall will be scheduled by Xen).
- *
- * To expedite scanning of pending notifications, any 0->1 pending
- * transition on an unmasked channel causes a corresponding bit in a
- * per-vcpu selector word to be set. Each bit in the selector covers a
- * 'C long' in the PENDING bitfield array.
- */
- unsigned long evtchn_pending[sizeof(unsigned long) * 8];
- unsigned long evtchn_mask[sizeof(unsigned long) * 8];
-
- /*
- * Wallclock time: updated only by control software. Guests should base
- * their gettimeofday() syscall on this wallclock-base value.
- */
- uint32_t wc_version; /* Version counter: see vcpu_time_info_t. */
- uint32_t wc_sec; /* Secs 00:00:00 UTC, Jan 1, 1970. */
- uint32_t wc_nsec; /* Nsecs 00:00:00 UTC, Jan 1, 1970. */
-
- arch_shared_info_t arch;
-
-} shared_info_t;
-\end{verbatim}
-\normalsize
-
-\begin{description}
-\item[vcpu\_info] An array of {\bf vcpu\_info\_t} structures, each of
- which holds either runtime information about a virtual CPU, or is
- ``empty'' if the corresponding VCPU does not exist.
-\item[evtchn\_pending] Guest-global array, with one bit per event
- channel. Bits are set if an event is currently pending on that
- channel.
-\item[evtchn\_mask] Guest-global array for masking notifications on
- event channels.
-\item[wc\_version] Version counter for current wallclock time.
-\item[wc\_sec] Whole seconds component of current wallclock time.
-\item[wc\_nsec] Nanoseconds component of current wallclock time.
-\item[arch] Host architecture-dependent portion of the shared info
- structure.
-\end{description}
-
-\subsection{vcpu\_info\_t}
-
-\scriptsize
-\begin{verbatim}
-typedef struct vcpu_info {
- /*
- * 'evtchn_upcall_pending' is written non-zero by Xen to indicate
- * a pending notification for a particular VCPU. It is then cleared
- * by the guest OS /before/ checking for pending work, thus avoiding
- * a set-and-check race. Note that the mask is only accessed by Xen
- * on the CPU that is currently hosting the VCPU. This means that the
- * pending and mask flags can be updated by the guest without special
- * synchronisation (i.e., no need for the x86 LOCK prefix).
- * This may seem suboptimal because if the pending flag is set by
- * a different CPU then an IPI may be scheduled even when the mask
- * is set. However, note:
- * 1. The task of 'interrupt holdoff' is covered by the per-event-
- * channel mask bits. A 'noisy' event that is continually being
- * triggered can be masked at source at this very precise
- * granularity.
- * 2. The main purpose of the per-VCPU mask is therefore to restrict
- * reentrant execution: whether for concurrency control, or to
- * prevent unbounded stack usage. Whatever the purpose, we expect
- * that the mask will be asserted only for short periods at a time,
- * and so the likelihood of a 'spurious' IPI is suitably small.
- * The mask is read before making an event upcall to the guest: a
- * non-zero mask therefore guarantees that the VCPU will not receive
- * an upcall activation. The mask is cleared when the VCPU requests
- * to block: this avoids wakeup-waiting races.
- */
- uint8_t evtchn_upcall_pending;
- uint8_t evtchn_upcall_mask;
- unsigned long evtchn_pending_sel;
- arch_vcpu_info_t arch;
- vcpu_time_info_t time;
-} vcpu_info_t; /* 64 bytes (x86) */
-\end{verbatim}
-\normalsize
-
-\begin{description}
-\item[evtchn\_upcall\_pending] This is set non-zero by Xen to indicate
- that there are pending events to be received.
-\item[evtchn\_upcall\_mask] This is set non-zero to disable all
- interrupts for this CPU for short periods of time. If individual
- event channels need to be masked, the {\bf evtchn\_mask} in the {\bf
- shared\_info\_t} is used instead.
-\item[evtchn\_pending\_sel] When an event is delivered to this VCPU, a
- bit is set in this selector to indicate which word of the {\bf
- evtchn\_pending} array in the {\bf shared\_info\_t} contains the
- event in question.
-\item[arch] Architecture-specific VCPU info. On x86 this contains the
- virtualized CR2 register (page fault linear address) for this VCPU.
-\item[time] Time values for this VCPU.
-\end{description}
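-
-To illustrate how these fields cooperate, the sketch below gives the
-shape of a guest's event upcall handler. The bit operations ({\bf
-xchg}, {\bf \_\_ffs}, {\bf clear\_bit}) are assumed to be provided by
-the guest kernel, and {\bf handle\_port()} is a hypothetical per-port
-dispatch routine:
-
-\scriptsize
-\begin{verbatim}
-static void event_upcall(shared_info_t *s, vcpu_info_t *v)
-{
-    v->evtchn_upcall_pending = 0;
-    unsigned long sel = xchg(&v->evtchn_pending_sel, 0);
-
-    while (sel != 0) {
-        unsigned int word = __ffs(sel);
-        sel &= ~(1UL << word);
-
-        unsigned long bits = s->evtchn_pending[word] &
-                             ~s->evtchn_mask[word];
-        while (bits != 0) {
-            unsigned int bit = __ffs(bits);
-            bits &= ~(1UL << bit);
-            clear_bit(bit, &s->evtchn_pending[word]);
-            handle_port(word * sizeof(long) * 8 + bit);
-        }
-    }
-}
-\end{verbatim}
-\normalsize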
-
-\subsection{vcpu\_time\_info}
-
-\scriptsize
-\begin{verbatim}
-typedef struct vcpu_time_info {
- /*
- * Updates to the following values are preceded and followed by an
- * increment of 'version'. The guest can therefore detect updates by
- * looking for changes to 'version'. If the least-significant bit of
- * the version number is set then an update is in progress and the guest
- * must wait to read a consistent set of values.
- * The correct way to interact with the version number is similar to
- * Linux's seqlock: see the implementations of read_seqbegin/read_seqretry.
- */
- uint32_t version;
- uint32_t pad0;
- uint64_t tsc_timestamp; /* TSC at last update of time vals. */
- uint64_t system_time; /* Time, in nanosecs, since boot. */
- /*
- * Current system time:
- * system_time + ((((tsc - tsc_timestamp) << tsc_shift) * tsc_to_system_mul) >> 32)
- * CPU frequency (Hz):
- * ((10^9 << 32) / tsc_to_system_mul) >> tsc_shift
- */
- uint32_t tsc_to_system_mul;
- int8_t tsc_shift;
- int8_t pad1[3];
-} vcpu_time_info_t; /* 32 bytes */
-\end{verbatim}
-\normalsize
-
-\begin{description}
-\item[version] Used to ensure the guest gets consistent time updates.
-\item[tsc\_timestamp] Cycle counter timestamp of the last time value;
- can be used to extrapolate between updates, for instance.
-\item[system\_time] Time since boot (nanoseconds).
-\item[tsc\_to\_system\_mul] Cycle counter to nanoseconds multiplier
-(used in extrapolating current time).
-\item[tsc\_shift] Cycle counter to nanoseconds shift (used in
-extrapolating current time).
-\end{description}
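-
-Putting these fields together, a guest might extrapolate the current
-system time as sketched below (assuming an {\bf rdtsc()} helper and a
-consistent snapshot {\bf t} obtained with the version protocol above;
-{\bf tsc\_to\_system\_mul} is a 32.32 fixed-point fraction, hence the
-final shift right by 32, and {\bf tsc\_shift} may be negative):
-
-\scriptsize
-\begin{verbatim}
-static uint64_t system_time_ns(const vcpu_time_info_t *t)
-{
-    uint64_t delta = rdtsc() - t->tsc_timestamp;
-
-    if (t->tsc_shift >= 0)
-        delta <<= t->tsc_shift;
-    else
-        delta >>= -t->tsc_shift;
-
-    /* 64x32-bit multiply, keeping the top 64 of 96 bits
-     * (a compiler-provided 128-bit type is assumed here). */
-    return t->system_time +
-           (uint64_t)(((__uint128_t)delta * t->tsc_to_system_mul) >> 32);
-}
-\end{verbatim}
-\normalsize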
-
-\subsection{arch\_shared\_info\_t}
-
-On x86, the {\bf arch\_shared\_info\_t} is defined as follows (from
-xen/public/arch-x86\_32.h):
-
-\scriptsize
-\begin{verbatim}
-typedef struct arch_shared_info {
- unsigned long max_pfn; /* max pfn that appears in table */
- /* Frame containing list of mfns containing list of mfns containing p2m. */
- unsigned long pfn_to_mfn_frame_list_list;
-} arch_shared_info_t;
-\end{verbatim}
-\normalsize
-
-\begin{description}
-\item[max\_pfn] The maximum PFN listed in the physical-to-machine
- mapping table (P2M table).
-\item[pfn\_to\_mfn\_frame\_list\_list] Machine address of the frame
- that contains the machine addresses of the P2M table frames.
-\end{description}
-
-\section{Start info page}
-
-The start info structure is declared as the following (in {\bf
-xen/include/public/xen.h}):
-
-\scriptsize
-\begin{verbatim}
-#define MAX_GUEST_CMDLINE 1024
-typedef struct start_info {
- /* THE FOLLOWING ARE FILLED IN BOTH ON INITIAL BOOT AND ON RESUME. */
- char magic[32]; /* "Xen-<version>.<subversion>". */
- unsigned long nr_pages; /* Total pages allocated to this domain. */
- unsigned long shared_info; /* MACHINE address of shared info struct. */
- uint32_t flags; /* SIF_xxx flags. */
- unsigned long store_mfn; /* MACHINE page number of shared page. */
- uint32_t store_evtchn; /* Event channel for store communication. */
- unsigned long console_mfn; /* MACHINE address of console page. */
- uint32_t console_evtchn; /* Event channel for console messages. */
- /* THE FOLLOWING ARE ONLY FILLED IN ON INITIAL BOOT (NOT RESUME). */
- unsigned long pt_base; /* VIRTUAL address of page directory. */
- unsigned long nr_pt_frames; /* Number of bootstrap p.t. frames. */
- unsigned long mfn_list; /* VIRTUAL address of page-frame list. */
- unsigned long mod_start; /* VIRTUAL address of pre-loaded module. */
- unsigned long mod_len; /* Size (bytes) of pre-loaded module. */
- int8_t cmd_line[MAX_GUEST_CMDLINE];
-} start_info_t;
-\end{verbatim}
-\normalsize
-
-The fields fall into two groups: the first group is filled in both
-when a domain is booted and when it is resumed; the second is used
-only at boot time.
-
-The always-available group is as follows:
-
-\begin{description}
-\item[magic] A text string identifying the Xen version to the guest.
-\item[nr\_pages] The number of real machine pages available to the
- guest.
-\item[shared\_info] Machine address of the shared info structure,
- allowing the guest to map it during initialisation.
-\item[flags] Flags for describing optional extra settings to the
- guest.
-\item[store\_mfn] Machine address of the Xenstore communications page.
-\item[store\_evtchn] Event channel to communicate with the store.
-\item[console\_mfn] Machine address of the console data page.
-\item[console\_evtchn] Event channel to notify the console backend.
-\end{description}
-
-The boot-only group may only be safely referred to during system boot:
-
-\begin{description}
-\item[pt\_base] Virtual address of the page directory created for us
- by the domain builder.
-\item[nr\_pt\_frames] Number of frames used by the builder's bootstrap
- pagetables.
-\item[mfn\_list] Virtual address of the list of machine frames this
- domain owns.
-\item[mod\_start] Virtual address of any pre-loaded modules
 (e.g.\ a ramdisk).
-\item[mod\_len] Size of pre-loaded module (if any).
-\item[cmd\_line] Kernel command line passed by the domain builder.
-\end{description}
-
-
-% by Mark Williamson <mark.williamson@cl.cam.ac.uk>
-
-\chapter{Event Channels}
-\label{c:eventchannels}
-
-Event channels are the basic primitive provided by Xen for event
-notifications. An event is the Xen equivalent of a hardware
-interrupt. Event channels essentially store one bit of information;
-the event of interest is signalled by transitioning this bit from 0 to 1.
-
-Notifications are received by a guest via an upcall from Xen,
-indicating when an event arrives (setting the bit). Further
-notifications are masked until the bit is cleared again (therefore,
-guests must check the value of the bit after re-enabling event
-delivery to ensure no missed notifications).
-
-Event notifications can be masked by setting a flag; this is
-equivalent to disabling interrupts and can be used to ensure atomicity
-of certain operations in the guest kernel.
-
-\section{Hypercall interface}
-
-\hypercall{event\_channel\_op(evtchn\_op\_t *op)}
-
-The event channel operation hypercall is used for all operations on
-event channels / ports. Operations are distinguished by the value of
-the {\bf cmd} field of the {\bf op} structure. The possible commands
-are described below:
-
-\begin{description}
-
-\item[EVTCHNOP\_alloc\_unbound]
- Allocate a new event channel port, ready to be connected to by a
- remote domain.
- \begin{itemize}
- \item Specified domain must exist.
- \item A free port must exist in that domain.
- \end{itemize}
- Unprivileged domains may only allocate their own ports, privileged
- domains may also allocate ports in other domains.
-\item[EVTCHNOP\_bind\_interdomain]
- Bind an event channel for interdomain communications.
- \begin{itemize}
- \item Caller domain must have a free port to bind.
- \item Remote domain must exist.
- \item Remote port must be allocated and currently unbound.
- \item Remote port must be expecting the caller domain as the ``remote''.
- \end{itemize}
-\item[EVTCHNOP\_bind\_virq]
- Allocate a port and bind a VIRQ to it.
- \begin{itemize}
- \item Caller domain must have a free port to bind.
- \item VIRQ must be valid.
- \item VCPU must exist.
- \item VIRQ must not currently be bound to an event channel.
- \end{itemize}
-\item[EVTCHNOP\_bind\_ipi]
- Allocate and bind a port for notifying other virtual CPUs.
- \begin{itemize}
- \item Caller domain must have a free port to bind.
- \item VCPU must exist.
- \end{itemize}
-\item[EVTCHNOP\_bind\_pirq]
- Allocate and bind a port to a real IRQ.
- \begin{itemize}
- \item Caller domain must have a free port to bind.
- \item PIRQ must be within the valid range.
- \item Another binding for this PIRQ must not exist for this domain.
- \item Caller must have an available port.
- \end{itemize}
-\item[EVTCHNOP\_close]
- Close an event channel (no more events will be received).
- \begin{itemize}
- \item Port must be valid (currently allocated).
- \end{itemize}
-\item[EVTCHNOP\_send] Send a notification on an event channel attached
- to a port.
- \begin{itemize}
- \item Port must be valid.
- \item Only valid for Interdomain, IPI or Allocated Unbound ports.
- \end{itemize}
-\item[EVTCHNOP\_status] Query the status of a port; what kind of port,
- whether it is bound, what remote domain is expected, what PIRQ or
- VIRQ it is bound to, what VCPU will be notified, etc.
- Unprivileged domains may only query the state of their own ports.
- Privileged domains may query any port.
-\item[EVTCHNOP\_bind\_vcpu] Bind event channel to a particular VCPU -
- receive notification upcalls only on that VCPU.
- \begin{itemize}
- \item VCPU must exist.
- \item Port must be valid.
- \item Event channel must be either: allocated but unbound, bound to
- an interdomain event channel, bound to a PIRQ.
- \end{itemize}
-
-\end{description}
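-
-As a concrete sketch, allocating an unbound port that a remote domain
-can later connect to might look as follows (assuming the usual {\bf
-HYPERVISOR\_event\_channel\_op} wrapper; the resulting port would
-typically be advertised to the peer via xenstore):
-
-\scriptsize
-\begin{verbatim}
-evtchn_op_t op;
-
-op.cmd = EVTCHNOP_alloc_unbound;
-op.u.alloc_unbound.dom        = DOMID_SELF;   /* our port space  */
-op.u.alloc_unbound.remote_dom = remote_domid; /* expected binder */
-
-if (HYPERVISOR_event_channel_op(&op) == 0)
-    listener_port = op.u.alloc_unbound.port;
-\end{verbatim}
-\normalsize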
-
-%%
-%% grant_tables.tex
-%%
-%% Made by Mark Williamson
-%% Login <mark@maw48>
-%%
-
-\chapter{Grant tables}
-\label{c:granttables}
-
-Xen's grant tables provide a generic mechanism for memory sharing
-between domains. This shared memory interface underpins the split
-device drivers for block and network IO.
-
-Each domain has its own {\bf grant table}. This is a data structure
-that is shared with Xen; it allows the domain to tell Xen what kind of
-permissions other domains have on its pages. Entries in the grant
-table are identified by {\bf grant references}. A grant reference is
-an integer, which indexes into the grant table. It acts as a
-capability which the grantee can use to perform operations on the
-granter's memory.
-
-This capability-based system allows shared-memory communications
-between unprivileged domains. A grant reference also encapsulates the
-details of a shared page, removing the need for a domain to know the
-real machine address of a page it is sharing. This makes it possible
-to share memory correctly with domains running in fully virtualised
-memory.
-
-\section{Interface}
-
-\subsection{Grant table manipulation}
-
-Creating and destroying grant references is done by direct access to
-the grant table. This removes the need to involve Xen when creating
-grant references, modifying access permissions, etc. The grantee
-domain will invoke hypercalls to use the grant references. Four main
-operations can be accomplished by directly manipulating the table:
-
-\begin{description}
-\item[Grant foreign access] allocate a new entry in the grant table
- and fill out the access permissions accordingly. The access
- permissions will be looked up by Xen when the grantee attempts to
- use the reference to map the granted frame.
-\item[End foreign access] check that the grant reference is not
- currently in use, then remove the mapping permissions for the frame.
- This prevents further mappings from taking place but does not allow
- forced revocations of existing mappings.
-\item[Grant foreign transfer] allocate a new entry in the table
- specifying transfer permissions for the grantee. Xen will look up
- this entry when the grantee attempts to transfer a frame to the
- granter.
-\item[End foreign transfer] remove permissions to prevent a transfer
- occurring in future. If the transfer is already committed,
- modifying the grant table cannot prevent it from completing.
-\end{description}
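-
-To make the first of these operations concrete, the sketch below
-grants a remote domain access to one of the granter's frames by
-filling in a {\bf grant\_entry\_t} directly. The {\bf wmb()} write
-barrier is assumed to be supplied by the guest; the flags must be
-written last so that the entry only becomes live once fully
-initialised:
-
-\scriptsize
-\begin{verbatim}
-typedef struct grant_entry {
-    uint16_t flags;   /* GTF_* access flags; written last    */
-    domid_t  domid;   /* domain allowed to use this entry    */
-    uint32_t frame;   /* machine frame number being granted  */
-} grant_entry_t;
-
-static void grant_foreign_access(grant_entry_t *gnttab, grant_ref_t ref,
-                                 domid_t remote, uint32_t mfn)
-{
-    gnttab[ref].frame = mfn;
-    gnttab[ref].domid = remote;
-    wmb();            /* payload visible before the live flag */
-    gnttab[ref].flags = GTF_permit_access;
-}
-\end{verbatim}
-\normalsize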
-
-\subsection{Hypercalls}
-
-Use of grant references is accomplished via a hypercall. The grant
-table op hypercall takes three arguments:
-
-\hypercall{grant\_table\_op(unsigned int cmd, void *uop, unsigned int count)}
-
-{\bf cmd} indicates the grant table operation of interest. {\bf uop}
-is a pointer to a structure (or an array of structures) describing the
-operation to be performed. The {\bf count} field describes how many
-grant table operations are being batched together.
-
-The core logic is situated in {\bf xen/common/grant\_table.c}. The
-grant table operation hypercall can be used to perform the following
-actions:
-
-\begin{description}
-\item[GNTTABOP\_map\_grant\_ref] Given a grant reference from another
- domain, map the referred page into the caller's address space.
-\item[GNTTABOP\_unmap\_grant\_ref] Remove a mapping to a granted frame
- from the caller's address space. This is used to voluntarily
- relinquish a mapping to a granted page.
-\item[GNTTABOP\_setup\_table] Setup grant table for caller domain.
-\item[GNTTABOP\_dump\_table] Debugging operation.
-\item[GNTTABOP\_transfer] Given a transfer reference from another
- domain, transfer ownership of a page frame to that domain.
-\end{description}
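-
-For example, mapping a frame that another domain has granted to the
-caller might look like the following sketch (assuming the usual {\bf
-HYPERVISOR\_grant\_table\_op} wrapper):
-
-\scriptsize
-\begin{verbatim}
-struct gnttab_map_grant_ref map;
-
-map.host_addr = map_vaddr;        /* where to map the frame      */
-map.flags     = GNTMAP_host_map;  /* regular host mapping        */
-map.ref       = remote_gref;      /* reference from the granter  */
-map.dom       = remote_domid;     /* the granting domain         */
-
-if (HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, &map, 1) != 0 ||
-    map.status != GNTST_okay)
-    /* mapping failed */ ;
-\end{verbatim}
-\normalsize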
-
-%%
-%% xenstore.tex
-%%
-%% Made by Mark Williamson
-%% Login <mark@maw48>
-%%
-
-\chapter{Xenstore}
-
-Xenstore is the mechanism by which control-plane activities occur.
-These activities include:
-
-\begin{itemize}
-\item Setting up shared memory regions and event channels for use with
- the split device drivers.
-\item Notifying the guest of control events (e.g.\ balloon driver
- requests).
-\item Reporting status information back from the guest
- (e.g.\ performance-related statistics).
-\end{itemize}
-
-The store is arranged as a hierarchical collection of key-value pairs.
-Each domain has a directory hierarchy containing data related to its
-configuration. Domains are permitted to register for notifications
-about changes in subtrees of the store, and to apply changes to the
-store transactionally.
-
-\section{Guidelines}
-
-A few principles govern the operation of the store:
-
-\begin{itemize}
-\item Domains should only modify the contents of their own
- directories.
-\item The setup protocol for a device channel should simply consist of
- entering the configuration data into the store.
-\item The store should allow device discovery without requiring the
- relevant device drivers to be loaded: a Xen ``bus'' should be
- visible to probing code in the guest.
-\item The store should be usable for inter-tool communications,
- allowing the tools themselves to be decomposed into a number of
- smaller utilities, rather than a single monolithic entity. This
- also facilitates the development of alternate user interfaces to the
- same functionality.
-\end{itemize}
-
-\section{Store layout}
-
-There are three main paths in XenStore:
-
-\begin{description}
-\item[/vm] stores configuration information about a domain
-\item[/local/domain] stores information about the domain on the local node (domid, etc.)
-\item[/tool] stores information for the various tools
-\end{description}
-
-The {\bf /vm} path stores configuration information for a domain.
-This information doesn't change and is indexed by the domain's UUID.
-A {\bf /vm} entry contains the following information:
-
-\begin{description}
-\item[uuid] uuid of the domain (somewhat redundant)
-\item[on\_reboot] the action to take on a domain reboot request (destroy or restart)
-\item[on\_poweroff] the action to take on a domain halt request (destroy or restart)
-\item[on\_crash] the action to take on a domain crash (destroy or restart)
-\item[vcpus] the number of allocated vcpus for the domain
-\item[memory] the amount of memory (in megabytes) for the domain. Note: this appears to sometimes be empty for domain-0
-\item[vcpu\_avail] the number of active vcpus for the domain (vcpus - number of disabled vcpus)
-\item[name] the name of the domain
-\end{description}
-
-
-{\bf /vm/$<$uuid$>$/image/}
-
-The image path is only available for Domain-Us and contains:
-\begin{description}
-\item[ostype] identifies the builder type (linux or vmx)
-\item[kernel] path to kernel on domain-0
-\item[cmdline] command line to pass to domain-U kernel
-\item[ramdisk] path to ramdisk on domain-0
-\end{description}
-
-{\bf /local}
-
-The {\tt /local} path currently contains only one directory, {\tt
-/local/domain}, which is indexed by domain id and contains the
-running domain information. The reason for having two storage areas
-is that during migration the uuid doesn't change but the domain id
-does. The {\tt /local/domain} directory can be created and populated
-before finalizing the migration, enabling localhost-to-localhost migration.
-
-{\bf /local/domain/$<$domid$>$}
-
-This path contains:
-
-\begin{description}
-\item[cpu\_time] xend start time (this is only around for domain-0)
-\item[handle] private handle for xend
-\item[name] see /vm
-\item[on\_reboot] see /vm
-\item[on\_poweroff] see /vm
-\item[on\_crash] see /vm
-\item[vm] the path to the VM directory for the domain
-\item[domid] the domain id (somewhat redundant)
-\item[running] indicates that the domain is currently running
-\item[memory] the current memory in megabytes for the domain (empty for domain-0?)
-\item[maxmem\_KiB] the maximum memory for the domain (in kilobytes)
-\item[memory\_KiB] the memory allocated to the domain (in kilobytes)
-\item[cpu] the current CPU the domain is pinned to (empty for domain-0?)
-\item[cpu\_weight] the weight assigned to the domain
-\item[vcpu\_avail] a bitmap telling the domain whether it may use a given VCPU
-\item[online\_vcpus] how many vcpus are currently online
-\item[vcpus] the total number of vcpus allocated to the domain
-\item[console/] a directory for console information
- \begin{description}
- \item[ring-ref] the grant table reference of the console ring queue
- \item[port] the event channel being used for the console ring queue (local port)
- \item[tty] the current tty via which the console data is exposed
- \item[limit] the limit (in bytes) of console data to buffer
- \end{description}
-\item[backend/] a directory containing all backends the domain hosts
- \begin{description}
- \item[vbd/] a directory containing vbd backends
- \begin{description}
- \item[$<$domid$>$/] a directory containing vbds for domid
- \begin{description}
- \item[$<$virtual-device$>$/] a directory for a particular
- virtual-device on domid
- \begin{description}
- \item[frontend-id] domain id of frontend
- \item[frontend] the path to the frontend domain
- \item[physical-device] backend device number
- \item[sector-size] backend sector size
- \item[info] 0 read/write, 1 read-only (is this right?)
- \item[domain] name of frontend domain
- \item[params] parameters for device
- \item[type] the type of the device
- \item[dev] the virtual device (as given by the user)
- \item[node] output from block creation script
- \end{description}
- \end{description}
- \end{description}
-
- \item[vif/] a directory containing vif backends
- \begin{description}
- \item[$<$domid$>$/] a directory containing vifs for domid
- \begin{description}
- \item[$<$vif number$>$/] a directory for each vif
- \item[frontend-id] the domain id of the frontend
- \item[frontend] the path to the frontend
- \item[mac] the mac address of the vif
- \item[bridge] the bridge the vif is connected to
- \item[handle] the handle of the vif
- \item[script] the script used to create/stop the vif
- \item[domain] the name of the frontend
- \end{description}
- \end{description}
-
- \item[vtpm/] a directory containing vtpm backends
- \begin{description}
- \item[$<$domid$>$/] a directory containing vtpms for domid
- \begin{description}
- \item[$<$vtpm number$>$/] a directory for each vtpm
- \item[frontend-id] the domain id of the frontend
- \item[frontend] the path to the frontend
- \item[instance] the instance of the virtual TPM that is used
- \item[pref{\textunderscore}instance] the instance number as given in the VM configuration file;
- may be different from {\bf instance}
- \item[domain] the name of the domain of the frontend
- \end{description}
- \end{description}
-
- \end{description}
-
- \item[device/] a directory containing the frontend devices for the
- domain
- \begin{description}
- \item[vbd/] a directory containing vbd frontend devices for the
- domain
- \begin{description}
- \item[$<$virtual-device$>$/] a directory containing the vbd frontend for
- virtual-device
- \begin{description}
- \item[virtual-device] the device number of the frontend device
- \item[backend-id] the domain id of the backend
- \item[backend] the path of the backend in the store (/local/domain
- path)
- \item[ring-ref] the grant table reference for the block request
- ring queue
- \item[event-channel] the event channel used for the block request
- ring queue
- \end{description}
-
- \item[vif/] a directory containing vif frontend devices for the
- domain
- \begin{description}
- \item[$<$id$>$/] a directory for vif id frontend device for the domain
- \begin{description}
- \item[backend-id] the backend domain id
- \item[mac] the mac address of the vif
- \item[handle] the internal vif handle
- \item[backend] a path to the backend's store entry
- \item[tx-ring-ref] the grant table reference for the transmission ring queue
- \item[rx-ring-ref] the grant table reference for the receiving ring queue
- \item[event-channel] the event channel used for the two ring queues
- \end{description}
- \end{description}
-
- \item[vtpm/] a directory containing the vtpm frontend device for the
- domain
- \begin{description}
- \item[$<$id$>$] a directory for vtpm id frontend device for the domain
- \begin{description}
- \item[backend-id] the backend domain id
- \item[backend] a path to the backend's store entry
- \item[ring-ref] the grant table reference for the tx/rx ring
- \item[event-channel] the event channel used for the ring
- \end{description}
- \end{description}
-
- \item[device-misc/] miscellaneous information for devices
- \begin{description}
- \item[vif/] miscellaneous information for vif devices
- \begin{description}
- \item[nextDeviceID] the next device id to use
- \end{description}
- \end{description}
- \end{description}
- \end{description}
-
- \item[security/] access control information for the domain
- \begin{description}
- \item[ssidref] security reference identifier used inside the hypervisor
- \item[access\_control/] security label used by management tools
- \begin{description}
- \item[label] security label name
- \item[policy] security policy name
- \end{description}
- \end{description}
-
- \item[store/] per-domain information for the store
- \begin{description}
- \item[port] the event channel used for the store ring queue
- \item[ring-ref] the grant table reference used for the store's
- communication channel
- \end{description}
-
- \item[image] private xend information
-\end{description}
-
-
-\chapter{Devices}
-\label{c:devices}
-
-Virtual devices under Xen are provided by a {\bf split device driver}
-architecture. The illusion of the virtual device is provided by two
-co-operating drivers: the {\bf frontend}, which runs in the
-unprivileged domain, and the {\bf backend}, which runs in a domain with
-access to the real device hardware (often called a {\bf driver
-domain}; in practice domain 0 usually fulfills this function).
-
-The frontend driver appears to the unprivileged guest as if it were a
-real device, for instance a block or network device. It receives IO
-requests from its kernel as usual; however, since it does not have
-access to the physical hardware of the system it must then issue
-requests to the backend. The backend driver is responsible for
-receiving these IO requests, verifying that they are safe and then
-issuing them to the real device hardware. The backend driver appears
-to its kernel as a normal user of in-kernel IO functionality. When
-the IO completes the backend notifies the frontend that the data is
-ready for use; the frontend is then able to report IO completion to
-its own kernel.
-
-Frontend drivers are designed to be simple; most of the complexity is
-in the backend, which has responsibility for translating device
-addresses, verifying that requests are well-formed and do not violate
-isolation guarantees, etc.
-
-Split drivers exchange requests and responses in shared memory, with
-an event channel for asynchronous notifications of activity. When the
-frontend driver comes up, it uses Xenstore to set up a shared memory
-frame and an interdomain event channel for communications with the
-backend. Once this connection is established, the two can communicate
-directly by placing requests / responses into shared memory and then
-sending notifications on the event channel. This separation of
-notification from data transfer allows message batching, and results
-in very efficient device access.
-
-This chapter focuses on some individual split device interfaces
-available to Xen guests.
-
-
-\section{Network I/O}
-
-Virtual network device services are provided by shared memory
-communication with a backend domain. From the point of view of other
-domains, the backend may be viewed as a virtual ethernet switch
-element with each domain having one or more virtual network interfaces
-connected to it.
-
-From the point of view of the backend domain itself, the network
-backend driver consists of a number of ethernet devices. Each of
-these has a logical direct connection to a virtual network device in
-another domain. This allows the backend domain to route, bridge,
-firewall, etc.\ the traffic to/from the other domains using normal
-operating system mechanisms.
-
-\subsection{Backend Packet Handling}
-
-The backend driver is responsible for a variety of actions relating to
-the transmission and reception of packets from the physical device.
-With regard to transmission, the backend performs these key actions:
-
-\begin{itemize}
-\item {\bf Validation:} To ensure that domains do not attempt to
- generate invalid (e.g. spoofed) traffic, the backend driver may
- validate headers ensuring that source MAC and IP addresses match the
- interface that they have been sent from.
-
- Validation functions can be configured using standard firewall rules
- ({\small{\tt iptables}} in the case of Linux).
-
-\item {\bf Scheduling:} Since a number of domains can share a single
- physical network interface, the backend must mediate access when
- several domains each have packets queued for transmission. This
- general scheduling function subsumes basic shaping or rate-limiting
- schemes.
-
-\item {\bf Logging and Accounting:} The backend domain can be
- configured with classifier rules that control how packets are
- accounted or logged. For example, log messages might be generated
- whenever a domain attempts to send a TCP packet containing a SYN.
-\end{itemize}
-
-On receipt of incoming packets, the backend acts as a simple
-demultiplexer: Packets are passed to the appropriate virtual interface
-after any necessary logging and accounting have been carried out.
-
-\subsection{Data Transfer}
-
-Each virtual interface uses two ``descriptor rings'', one for
-transmit, the other for receive. Each descriptor identifies a block
-of contiguous machine memory allocated to the domain.
-
-The transmit ring carries packets to transmit from the guest to the
-backend domain. The return path of the transmit ring carries messages
-indicating that the contents have been physically transmitted and the
-backend no longer requires the associated pages of memory.
-
-To receive packets, the guest places descriptors of unused pages on
-the receive ring. The backend will return received packets by
-exchanging these pages in the domain's memory with new pages
-containing the received data, and passing back descriptors regarding
-the new packets on the ring. This zero-copy approach allows the
-backend to maintain a pool of free pages to receive packets into, and
-then deliver them to appropriate domains after examining their
-headers.
-
-% Real physical addresses are used throughout, with the domain
-% performing translation from pseudo-physical addresses if that is
-% necessary.
-
-If a domain does not keep its receive ring stocked with empty buffers
-then packets destined to it may be dropped. This provides some
-defence against receive livelock problems because an overloaded domain
-will cease to receive further data. Similarly, on the transmit path,
-it provides the application with feedback on the rate at which packets
-are able to leave the system.
-
-Flow control on rings is achieved by including a pair of producer
-indexes on the shared ring page. Each side will maintain a private
-consumer index indicating the next outstanding message. In this
-manner, the domains cooperate to divide the ring into two message
-lists, one in each direction. Notification is decoupled from the
-immediate placement of new messages on the ring; the event channel
-will be used to generate notification when {\em either} a certain
-number of outstanding messages are queued, {\em or} a specified number
-of nanoseconds have elapsed since the oldest message was placed on the
-ring.
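-
-The sketch below shows the producer side of this scheme in the style
-of the ring macros in {\bf xen/include/public/io/ring.h}; the
-event-channel notification helper is assumed to be provided by the
-guest:
-
-\scriptsize
-\begin{verbatim}
-/* Queue one transmit request and notify the backend if needed. */
-static void queue_tx_request(netif_tx_front_ring_t *ring,
-                             const netif_tx_request_t *req,
-                             int evtchn_port)
-{
-    int notify;
-
-    *RING_GET_REQUEST(ring, ring->req_prod_pvt) = *req;
-    ring->req_prod_pvt++;
-
-    /* Publish the private producer index; notify only if the
-     * consumer may have stopped polling the ring. */
-    RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(ring, notify);
-    if (notify)
-        notify_remote_via_evtchn(evtchn_port);  /* assumed helper */
-}
-\end{verbatim}
-\normalsize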
-
-%% Not sure if my version is any better -- here is what was here
-%% before: Synchronization between the backend domain and the guest is
-%% achieved using counters held in shared memory that is accessible to
-%% both. Each ring has associated producer and consumer indices
-%% indicating the area in the ring that holds descriptors that contain
-%% data. After receiving {\it n} packets or {\t nanoseconds} after
-%% receiving the first packet, the hypervisor sends an event to the
-%% domain.
-
-
-\subsection{Network ring interface}
-
-The network device uses two shared memory rings for communication: one
-for transmit, one for receive.
-
-Transmit requests are described by the following structure:
-
-\scriptsize
-\begin{verbatim}
-typedef struct netif_tx_request {
- grant_ref_t gref; /* Reference to buffer page */
- uint16_t offset; /* Offset within buffer page */
- uint16_t flags; /* NETTXF_* */
- uint16_t id; /* Echoed in response message. */
- uint16_t size; /* Packet size in bytes. */
-} netif_tx_request_t;
-\end{verbatim}
-\normalsize
-
-\begin{description}
-\item[gref] Grant reference for the network buffer
-\item[offset] Offset to data
-\item[flags] Transmit flags (currently only NETTXF\_csum\_blank is
- supported, to indicate that the protocol checksum field is
- incomplete).
-\item[id] Echoed to guest by the backend in the ring-level response so
- that the guest can match it to this request
-\item[size] Buffer size
-\end{description}
-
-Each transmit request is followed by a transmit response at some later
-time. This is part of the shared-memory communication protocol and
-allows the guest to (potentially) retire internal structures related
-to the request. It does not imply a network-level response. This
-structure is as follows:
-
-\scriptsize
-\begin{verbatim}
-typedef struct netif_tx_response {
- uint16_t id;
- int16_t status;
-} netif_tx_response_t;
-\end{verbatim}
-\normalsize
-
-\begin{description}
-\item[id] Echo of the ID field in the corresponding transmit request.
-\item[status] Success / failure status of the transmit request.
-\end{description}
-
-Receive requests must be queued by the frontend, accompanied by a
-donation of page-frames to the backend. The backend transfers page
-frames full of data back to the guest:
-
-\scriptsize
-\begin{verbatim}
-typedef struct {
- uint16_t id; /* Echoed in response message. */
- grant_ref_t gref; /* Reference to incoming granted frame */
-} netif_rx_request_t;
-\end{verbatim}
-\normalsize
-
-\begin{description}
-\item[id] Echoed by the frontend to identify this request when
- responding.
-\item[gref] Transfer reference - the backend will use this reference
- to transfer a frame of network data to us.
-\end{description}
-
-Receive response descriptors are queued for each received frame. Note
-that these may only be queued in reply to an existing receive request,
-providing an in-built form of traffic throttling.
-
-\scriptsize
-\begin{verbatim}
-typedef struct {
- uint16_t id;
- uint16_t offset; /* Offset in page of start of received packet */
- uint16_t flags; /* NETRXF_* */
-  int16_t status; /* -ve: NETIF_RSP_* ; +ve: Rx'ed pkt size. */
-} netif_rx_response_t;
-\end{verbatim}
-\normalsize
-
-\begin{description}
-\item[id] ID echoed from the original request, used by the guest to
- match this response to the original request.
-\item[offset] Offset to data within the transferred frame.
-\item[flags] Receive flags (currently only NETRXF\_csum\_valid is
- supported, to indicate that the protocol checksum field has already
- been validated).
-\item[status] Success / error status for this operation.
-\end{description}
-
-Note that the receive protocol includes a mechanism for guests to
-receive incoming memory frames but there is no explicit transfer of
-frames in the other direction. Guests are expected to return memory
-to the hypervisor in order to use the network interface. They {\em
-must} do this or they will exceed their maximum memory reservation and
-will not be able to receive incoming frame transfers. When necessary,
-the backend is able to replenish its pool of free network buffers by
-claiming some of this free memory from the hypervisor.
-
-\section{Block I/O}
-
-All guest OS disk access goes through the virtual block device (VBD)
-interface. This interface allows domains access to portions of the
-block storage devices visible to the block backend domain. The VBD
-interface is a split driver, similar to the network interface
-described above. A single shared memory ring is used between the
-frontend and backend drivers for each virtual device, across which
-IO requests and responses are sent.
-
-Any block device accessible to the backend domain, including
-network-based block devices (iSCSI, *NBD, etc.), loopback and LVM/MD
-devices, can be exported as a VBD. Each VBD is mapped to a device node
-in the
-guest, specified in the guest's startup configuration.
-
-\subsection{Data Transfer}
-
-The per-(virtual)-device ring between the guest and the block backend
-supports two messages:
-
-\begin{description}
-\item [{\small {\tt READ}}:] Read data from the specified block
- device. The front end identifies the device and location to read
- from and attaches pages for the data to be copied to (typically via
- DMA from the device). The backend acknowledges completed read
- requests as they finish.
-
-\item [{\small {\tt WRITE}}:] Write data to the specified block
- device. This functions essentially as {\small {\tt READ}}, except
- that the data moves to the device instead of from it.
-\end{description}
-
-
-\subsection{Block ring interface}
-
-The block interface is defined by the structures passed over the
-shared memory interface. These structures are either requests (from
-the frontend to the backend) or responses (from the backend to the
-frontend).
-
-The request structure is defined as follows:
-
-\scriptsize
-\begin{verbatim}
-typedef struct blkif_request {
- uint8_t operation; /* BLKIF_OP_??? */
- uint8_t nr_segments; /* number of segments */
- blkif_vdev_t handle; /* only for read/write requests */
- uint64_t id; /* private guest value, echoed in resp */
- blkif_sector_t sector_number;/* start sector idx on disk (r/w only) */
- struct blkif_request_segment {
- grant_ref_t gref; /* reference to I/O buffer frame */
- /* @first_sect: first sector in frame to transfer (inclusive). */
- /* @last_sect: last sector in frame to transfer (inclusive). */
- uint8_t first_sect, last_sect;
- } seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
-} blkif_request_t;
-\end{verbatim}
-\normalsize
-
-The fields are as follows:
-
-\begin{description}
-\item[operation] operation ID: one of the operations described above
-\item[nr\_segments] number of segments for scatter / gather IO
- described by this request
-\item[handle] identifier for a particular virtual device on this
- interface
-\item[id] this value is echoed in the response message for this IO;
- the guest may use it to identify the original request
-\item[sector\_number] start sector on the virtual device for this
- request
-\item[seg] This array contains structures encoding the
-  scatter-gather IO to be performed:
- \begin{description}
- \item[gref] The grant reference for the foreign I/O buffer page.
- \item[first\_sect] First sector to access within the buffer page (0 to 7).
- \item[last\_sect] Last sector to access within the buffer page (0 to 7).
- \end{description}
- Data will be transferred into frames at an offset determined by the
- value of {\tt first\_sect}.
-\end{description}
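-
-For example, a single-segment read of two sectors might be assembled
-as follows ({\tt RING\_GET\_REQUEST} is the shared-ring access macro;
-the other names are hypothetical):
-
-\scriptsize
-\begin{verbatim}
-blkif_request_t *req = RING_GET_REQUEST(&blk_ring, prod);
-
-req->operation     = BLKIF_OP_READ;   /* read from the device        */
-req->nr_segments   = 1;               /* one scatter-gather segment  */
-req->handle        = vdev_handle;     /* which virtual device        */
-req->id            = alloc_blk_id();  /* echoed back in the response */
-req->sector_number = start_sector;    /* where to start on the disk  */
-
-req->seg[0].gref       = buf_gref;    /* grant for the I/O buffer page */
-req->seg[0].first_sect = 0;           /* sectors 0..1 of that page     */
-req->seg[0].last_sect  = 1;
-\end{verbatim}
-\normalsize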
-
-\section{Virtual TPM}
-
-Virtual TPM (VTPM) support provides TPM functionality to each virtual
-machine that requests it in its configuration file.
-The interface enables a domain to access its own private TPM as if it
-were a hardware TPM built into the machine.
-
-The virtual TPM interface is implemented as a split driver,
-similar to the network and block interfaces described above.
-The user domain hosting the frontend exports a character device
-{\tt /dev/tpm0} to user-level applications for communicating with the
-virtual TPM. This is the same device interface that is offered if a
-hardware TPM is available in the system. The backend provides a single
-interface {\tt /dev/vtpm} on which the virtual TPM listens for commands
-from all domains whose backend is located in that domain.
-
-\subsection{Data Transfer}
-
-A single shared memory ring is used between the frontend and backend
-drivers. TPM requests and responses are sent in pages where a pointer
-to those pages and other information is placed into the ring such that
-the backend can map the pages into its memory space using the grant
-table mechanism.
-
-The backend driver accepts only well-formed TPM requests: the length
-indicator in the TPM request must correctly indicate the length of the
-request, otherwise the device driver automatically sends back an error
-message.
-
-The virtual TPM implementation listens for TPM requests on
-{\tt /dev/vtpm}. Since it must be able to apply each TPM request packet
-to the virtual TPM instance associated with the requesting virtual
-machine, a 4-byte virtual TPM instance identifier is prepended to each
-packet by the backend driver (in network byte order) for internal
-routing of the request.
-
-\subsection{Virtual TPM ring interface}
-
-The TPM protocol is a strict request/response protocol, so a single
-ring carries requests from the frontend to the backend and responses
-in the opposite direction.
-
-The request/response structure is defined as follows:
-
-\scriptsize
-\begin{verbatim}
-typedef struct {
- unsigned long addr; /* Machine address of packet. */
- grant_ref_t ref; /* grant table access reference. */
- uint16_t unused; /* unused */
- uint16_t size; /* Packet size in bytes. */
-} tpmif_tx_request_t;
-\end{verbatim}
-\normalsize
-
-The fields are as follows:
-
-\begin{description}
-\item[addr] The machine address of the page associated with the TPM
-  request/response; a request/response may span multiple
-  pages.
-\item[ref] The grant table reference associated with the address.
-\item[size] The size of the remaining packet; up to
-  PAGE\_SIZE bytes can be found in the page referenced by {\tt addr}.
-\end{description}
-
-The frontend initially allocates several pages whose addresses
-are stored in the ring. Only these pages are used for the exchange of
-requests and responses.
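-
-For illustration, the frontend might describe a TPM command to the
-backend as follows; the helper names and ring layout here are
-hypothetical:
-
-\scriptsize
-\begin{verbatim}
-tpmif_tx_request_t *tx = next_free_slot(&tpm_ring);
-
-tx->addr = virt_to_machine(cmd_page); /* machine address of the packet */
-tx->ref  = grant_ref_for(cmd_page);   /* lets the backend map the page */
-tx->size = cmd_len;                   /* length of the TPM command     */
-\end{verbatim}
-\normalsize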
-
-
-\chapter{Further Information}
-
-If you have questions that are not answered by this manual, the
-sources of information listed below may be of interest to you. Note
-that bug reports, suggestions and contributions related to the
-software (or the documentation) should be sent to the Xen developers'
-mailing list (address below).
-
-
-\section{Other documentation}
-
-If you are mainly interested in using (rather than developing for)
-Xen, the \emph{Xen Users' Manual} is distributed in the {\tt docs/}
-directory of the Xen source distribution.
-
-
-\section{Online references}
-
-The official Xen web site can be found at:
-\begin{quote} {\tt http://www.xensource.com}
-\end{quote}
-
-
-This contains links to the latest versions of all online
-documentation, including the latest version of the FAQ.
-
-Information regarding Xen is also available at the Xen Wiki at
-\begin{quote} {\tt http://wiki.xen.org/wiki/}\end{quote}
-The Xen project uses Bugzilla as its bug tracking system. You'll find
-the Xen Bugzilla at
-\begin{quote} {\tt http://bugzilla.xensource.com/bugzilla/}\end{quote}
-
-
-\section{Mailing lists}
-
-There are several mailing lists that are used to discuss Xen-related
-topics. The most widely relevant are listed below. An official page of
-mailing lists and subscription information can be found at \begin{quote}
- {\tt http://lists.xensource.com/} \end{quote}
-
-\begin{description}
-\item[xen-devel@lists.xensource.com] Used for development
- discussions and bug reports. Subscribe at: \\
- {\small {\tt http://lists.xensource.com/xen-devel}}
-\item[xen-users@lists.xensource.com] Used for installation and usage
- discussions and requests for help. Subscribe at: \\
- {\small {\tt http://lists.xensource.com/xen-users}}
-\item[xen-announce@lists.xensource.com] Used for announcements only.
- Subscribe at: \\
- {\small {\tt http://lists.xensource.com/xen-announce}}
-\item[xen-changelog@lists.xensource.com] Changelog feed
-  from the unstable and 2.0 trees (developer-oriented). Subscribe at: \\
- {\small {\tt http://lists.xensource.com/xen-changelog}}
-\end{description}
-
-\appendix
-
-
-\chapter{Xen Hypercalls}
-\label{a:hypercalls}
-
-Hypercalls represent the procedural interface to Xen; this appendix
-categorizes and describes the current set of hypercalls.
-
-\section{Invoking Hypercalls}
-
-Hypercalls are invoked in a manner analogous to system calls in a
-conventional operating system; a software interrupt is issued which
-vectors to an entry point within Xen. On x86/32 machines the
-instruction required is {\tt int \$0x82}; the (real) IDT is set up so
-that this may only be issued from within ring 1. The particular
-hypercall to be invoked is contained in {\tt EAX} --- a list
-mapping these values to symbolic hypercall names can be found
-in {\tt xen/include/public/xen.h}.
-
-On some occasions a set of hypercalls will be required to carry
-out a higher-level function; a good example is when a guest
-operating system wishes to context switch to a new process, which
-requires updating various pieces of privileged CPU state. As an optimization
-for these cases, there is a generic mechanism to issue a set of
-hypercalls as a batch:
-
-\begin{quote}
-\hypercall{multicall(void *call\_list, int nr\_calls)}
-
-Execute a series of hypervisor calls; {\tt nr\_calls} is the length of
-the array of {\tt multicall\_entry\_t} structures pointed to by {\tt
-call\_list}. Each entry contains the hypercall operation code followed
-by up to 7 word-sized arguments.
-\end{quote}
-
-Note that multicalls are provided purely as an optimization; there is
-no requirement to use them when first porting a guest operating
-system.
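-
-As an illustration, a guest might batch the two context-switch
-operations described later in this appendix into a single multicall;
-the {\tt HYPERVISOR\_multicall} wrapper name follows the convention
-used in XenLinux, and the local variables are hypothetical:
-
-\scriptsize
-\begin{verbatim}
-multicall_entry_t calls[2];
-
-calls[0].op      = __HYPERVISOR_stack_switch;
-calls[0].args[0] = new_ss;            /* new kernel stack segment  */
-calls[0].args[1] = new_esp;           /* new kernel stack pointer  */
-
-calls[1].op      = __HYPERVISOR_fpu_taskswitch;
-calls[1].args[0] = 1;                 /* set cr0.TS for lazy FPU   */
-
-(void)HYPERVISOR_multicall(calls, 2); /* one trap instead of two   */
-\end{verbatim}
-\normalsize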
-
-
-\section{Virtual CPU Setup}
-
-At start of day, a guest operating system needs to set up the virtual
-CPU it is executing on. This includes installing vectors for the
-virtual IDT so that the guest OS can handle interrupts, page faults,
-etc. However the very first thing a guest OS must set up is a pair
-of hypervisor callbacks: these are the entry points which Xen will
-use when it wishes to notify the guest OS of an occurrence.
-
-\begin{quote}
-\hypercall{set\_callbacks(unsigned long event\_selector, unsigned long
- event\_address, unsigned long failsafe\_selector, unsigned long
- failsafe\_address) }
-
-Register the normal (``event'') and failsafe callbacks for
-event processing. In each case the code segment selector and
-address within that segment are provided. The selectors must
-have RPL 1; in XenLinux we simply use the kernel's CS for both
-{\bf event\_selector} and {\bf failsafe\_selector}.
-
-The value {\bf event\_address} specifies the address of the guest OS's
-event handling and dispatch routine; the {\bf failsafe\_address}
-specifies a separate entry point which is used only if a fault occurs
-when Xen attempts to use the normal callback.
-
-\end{quote}
-
-On x86/64 systems the hypercall takes slightly different
-arguments. This is because callback CS does not need to be specified
-(since the callbacks are entered via SYSRET), and also because an
-entry address needs to be specified for SYSCALLs from guest user
-space:
-
-\begin{quote}
-\hypercall{set\_callbacks(unsigned long event\_address, unsigned long
- failsafe\_address, unsigned long syscall\_address)}
-\end{quote}
-
-
-After installing the hypervisor callbacks, the guest OS can
-install a `virtual IDT' by using the following hypercall:
-
-\begin{quote}
-\hypercall{set\_trap\_table(trap\_info\_t *table)}
-
-Install one or more entries into the per-domain
-trap handler table (essentially a software version of the IDT).
-Each entry in the array pointed to by {\bf table} includes the
-exception vector number with the corresponding segment selector
-and entry point. Most guest OSes can use the same handlers on
-Xen as when running on the real hardware.
-
-
-\end{quote}
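-
-For example, a guest might route the page-fault vector to its own
-handler as follows; {\tt FLAT\_KERNEL\_CS} is the flat kernel code
-segment selector from the public headers, and the wrapper name follows
-the XenLinux convention:
-
-\scriptsize
-\begin{verbatim}
-trap_info_t traps[] = {
-    /* vector, flags (DPL), code segment, handler address */
-    { 14, 0, FLAT_KERNEL_CS, (unsigned long)page_fault_handler },
-    {  0, 0, 0, 0 }                   /* zero entry terminates table */
-};
-
-(void)HYPERVISOR_set_trap_table(traps);
-\end{verbatim}
-\normalsize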
-
-A further hypercall is provided for the management of virtual CPUs:
-
-\begin{quote}
-\hypercall{vcpu\_op(int cmd, int vcpuid, void *extra\_args)}
-
-This hypercall can be used to bootstrap VCPUs, to bring them up and
-down and to test their current status.
-
-\end{quote}
-
-\section{Scheduling and Timer}
-
-Domains are preemptively scheduled by Xen according to the
-parameters installed by domain 0 (see Section~\ref{s:dom0ops}).
-In addition, however, a domain may choose to explicitly
-control certain behavior with the following hypercall:
-
-\begin{quote}
-\hypercall{sched\_op\_new(int cmd, void *extra\_args)}
-
-Request scheduling operation from hypervisor. The following
-sub-commands are available:
-
-\begin{description}
-\item[SCHEDOP\_yield] voluntarily yields the CPU, but leaves the
-caller marked as runnable. No extra arguments are passed to this
-command.
-\item[SCHEDOP\_block] removes the calling domain from the run queue
-and causes it to sleep until an event is delivered to it. No extra
-arguments are passed to this command.
-\item[SCHEDOP\_shutdown] is used to end the calling domain's
-execution. The extra argument is a {\bf sched\_shutdown} structure
-which indicates the reason why the domain shut down (e.g., for reboot,
-halt, power-off).
-\item[SCHEDOP\_poll] allows a VCPU to wait on a set of event channels
-with an optional timeout (all of which are specified in the {\bf
-sched\_poll} extra argument). The semantics are similar to the UNIX
-{\bf poll} system call. The caller must have event-channel upcalls
-masked when executing this command.
-\end{description}
-\end{quote}
-
-{\bf sched\_op\_new} was not available prior to Xen 3.0.2. Older versions
-provide only the following hypercall:
-
-\begin{quote}
-\hypercall{sched\_op(int cmd, unsigned long extra\_arg)}
-
-This hypercall supports the following subset of {\bf sched\_op\_new} commands:
-
-\begin{description}
-\item[SCHEDOP\_yield] (extra argument is 0).
-\item[SCHEDOP\_block] (extra argument is 0).
-\item[SCHEDOP\_shutdown] (extra argument is numeric reason code).
-\end{description}
-\end{quote}
-
-To aid the implementation of a process scheduler within a guest OS,
-Xen provides a virtual programmable timer:
-
-\begin{quote}
-\hypercall{set\_timer\_op(uint64\_t timeout)}
-
-Request a timer event to be sent at the specified system time (time
-in nanoseconds since system boot).
-
-\end{quote}
-
-Note that calling {\bf set\_timer\_op} prior to {\bf sched\_op}
-allows block-with-timeout semantics.
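-
-A sketch of this idiom follows; the {\tt NOW()} and {\tt MILLISECS()}
-time helpers and the wrapper names are illustrative:
-
-\scriptsize
-\begin{verbatim}
-/* Sleep until an event arrives, or for at most 10ms. */
-(void)HYPERVISOR_set_timer_op(NOW() + MILLISECS(10));
-(void)HYPERVISOR_sched_op(SCHEDOP_block, 0);
-\end{verbatim}
-\normalsize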
-
-
-\section{Page Table Management}
-
-Since guest operating systems have read-only access to their page
-tables, Xen must be involved when making any changes. The following
-multi-purpose hypercall can be used to modify page-table entries,
-update the machine-to-physical mapping table, flush the TLB, install
-a new page-table base pointer, and more.
-
-\begin{quote}
-\hypercall{mmu\_update(mmu\_update\_t *req, int count, int *success\_count)}
-
-Update the page table for the domain; a set of {\bf count} updates is
-submitted for processing in a batch, with {\bf success\_count} being
-updated to report the number of successful updates.
-
-Each element of {\bf req[]} contains a pointer (address) and value;
-the least significant 2 bits of the pointer are used to distinguish
-the type of update requested as follows:
-\begin{description}
-
-\item[MMU\_NORMAL\_PT\_UPDATE:] update a page directory entry or
-page table entry to the associated value; Xen will check that the
-update is safe, as described in Chapter~\ref{c:memory}.
-
-\item[MMU\_MACHPHYS\_UPDATE:] update an entry in the
- machine-to-physical table. The calling domain must own the machine
- page in question (or be privileged).
-\end{description}
-
-\end{quote}
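-
-For illustration, a guest might batch a PTE write together with the
-corresponding machine-to-physical update; a wrapper matching the
-signature above is assumed, and the variable names are hypothetical:
-
-\scriptsize
-\begin{verbatim}
-mmu_update_t req[2];
-int done;
-
-/* The type is encoded in the low 2 bits of the pointer. */
-req[0].ptr = pte_machine_addr | MMU_NORMAL_PT_UPDATE;
-req[0].val = new_pte;                 /* new page table entry        */
-req[1].ptr = (mfn << PAGE_SHIFT) | MMU_MACHPHYS_UPDATE;
-req[1].val = pfn;                     /* new machine-to-physical map */
-
-(void)HYPERVISOR_mmu_update(req, 2, &done);
-\end{verbatim}
-\normalsize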
-
-Explicitly updating batches of page table entries is extremely
-efficient, but can require a number of alterations to the guest
-OS. Using the writable page table mode (Chapter~\ref{c:memory}) is
-recommended for new OS ports.
-
-Regardless of which page table update mode is being used, however,
-there are some occasions (notably handling a demand page fault) where
-a guest OS will wish to modify exactly one PTE rather than a
-batch, and where that PTE is mapped into the current address space.
-This is catered for by the following:
-
-\begin{quote}
-\hypercall{update\_va\_mapping(unsigned long va, uint64\_t val,
- unsigned long flags)}
-
-Update the currently installed PTE that maps virtual address {\bf va}
-to new value {\bf val}. As with {\bf mmu\_update}, Xen checks the
-modification is safe before applying it. The {\bf flags} determine
-which kind of TLB flush, if any, should follow the update.
-
-\end{quote}
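-
-For example, after constructing a new PTE in a page-fault handler
-(variable names hypothetical):
-
-\scriptsize
-\begin{verbatim}
-/* Install the new PTE and flush just that TLB entry. */
-(void)HYPERVISOR_update_va_mapping(fault_va & PAGE_MASK, new_pte,
-                                   UVMF_INVLPG);
-\end{verbatim}
-\normalsize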
-
-Finally, sufficiently privileged domains may occasionally wish to manipulate
-the pages of others:
-
-\begin{quote}
-\hypercall{update\_va\_mapping\_otherdomain(unsigned long va, uint64\_t val,
- unsigned long flags, domid\_t domid)}
-
-Identical to {\bf update\_va\_mapping} save that the pages being
-mapped must belong to the domain {\bf domid}.
-
-\end{quote}
-
-An additional MMU hypercall provides an ``extended command''
-interface, offering functionality beyond the basic
-table-updating commands:
-
-\begin{quote}
-
-\hypercall{mmuext\_op(struct mmuext\_op *op, int count, int *success\_count, domid\_t domid)}
-
-This hypercall is used to perform additional MMU operations. These
-include updating {\tt cr3} (or just re-installing it for a TLB flush),
-requesting various kinds of TLB flush, flushing the cache, installing
-a new LDT, or pinning \& unpinning page-table pages (to ensure their
-reference count doesn't drop to zero which would require a
-revalidation of all entries). Some of the operations available are
-restricted to domains with sufficient system privileges.
-
-It is also possible for privileged domains to reassign page ownership
-via an extended MMU operation, although grant tables are used instead
-of this where possible; see Section~\ref{s:idc}.
-
-\end{quote}
-
-Finally, a hypercall interface is exposed to activate and deactivate
-various optional facilities provided by Xen for memory management.
-
-\begin{quote}
-\hypercall{vm\_assist(unsigned int cmd, unsigned int type)}
-
-Toggle various memory management modes (in particular writable page
-tables).
-
-\end{quote}
-
-\section{Segmentation Support}
-
-Xen allows guest OSes to install a custom GDT if they require it;
-this is context switched transparently whenever a domain is
-[de]scheduled. The following hypercall is effectively a
-`safe' version of {\tt lgdt}:
-
-\begin{quote}
-\hypercall{set\_gdt(unsigned long *frame\_list, int entries)}
-
-Install a global descriptor table for a domain; {\bf frame\_list} is
-an array of up to 16 machine page frames within which the GDT resides,
-with {\bf entries} being the actual number of descriptor-entry
-slots. All page frames must be mapped read-only within the guest's
-address space, and the table must be large enough to contain Xen's
-reserved entries (see {\bf xen/include/public/arch-x86\_32.h}).
-
-\end{quote}
-
-Many guest OSes will also wish to install LDTs; this is achieved via
-the extended command interface ({\bf mmuext\_op}), passing the
-linear address of the LDT base along with the number of entries. No
-special safety checks are required; Xen needs to perform this task
-simply because {\tt lldt} requires CPL 0.
-
-
-Xen also allows guest operating systems to update just an
-individual segment descriptor in the GDT or LDT:
-
-\begin{quote}
-\hypercall{update\_descriptor(uint64\_t ma, uint64\_t desc)}
-
-Update the GDT/LDT entry at machine address {\bf ma}; the new
-8-byte descriptor is stored in {\bf desc}.
-Xen performs a number of checks to ensure the descriptor is
-valid.
-
-\end{quote}
-
-Guest OSes can use the above in place of context switching entire
-LDTs (or the GDT) when the number of changing descriptors is small.
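-
-For instance, refreshing a single descriptor on context switch
-(variable names hypothetical):
-
-\scriptsize
-\begin{verbatim}
-/* Machine address of the descriptor slot, 8 bytes per entry. */
-uint64_t ma = gdt_frame_maddr + desc_index * 8;
-
-(void)HYPERVISOR_update_descriptor(ma, new_desc);
-\end{verbatim}
-\normalsize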
-
-\section{Context Switching}
-
-When a guest OS wishes to context switch between two processes,
-it can use the page table and segmentation hypercalls described
-above to perform the bulk of the privileged work. In addition,
-however, it will need to invoke Xen to switch the kernel (ring 1)
-stack pointer:
-
-\begin{quote}
-\hypercall{stack\_switch(unsigned long ss, unsigned long esp)}
-
-Request kernel stack switch from hypervisor; {\bf ss} is the new
-stack segment and {\bf esp} is the new stack pointer.
-
-\end{quote}
-
-A useful hypercall for context switching allows ``lazy'' save and
-restore of floating point state:
-
-\begin{quote}
-\hypercall{fpu\_taskswitch(int set)}
-
-This call instructs Xen to set the {\tt TS} bit in the {\tt cr0}
-control register; this means that the next attempt to use floating
-point will cause a fault which the guest OS can catch. Typically it will
-then save/restore the FP state, and clear the {\tt TS} bit, using the
-same call.
-\end{quote}
-
-This is provided as an optimization only; guest OSes can also choose
-to save and restore FP state on all context switches for simplicity.
-
-Finally, a hypercall is provided for entering vm86 mode:
-
-\begin{quote}
-\hypercall{switch\_vm86}
-
-This allows the guest to run code in vm86 mode, which is needed for
-some legacy software.
-\end{quote}
-
-\section{Physical Memory Management}
-
-As mentioned previously, each domain has a maximum and current
-memory allocation. The maximum allocation, set at domain creation
-time, cannot be modified. However a domain can choose to reduce
-and subsequently grow its current allocation by using the
-following call:
-
-\begin{quote}
-\hypercall{memory\_op(unsigned int op, void *arg)}
-
-Increase or decrease current memory allocation (as determined by
-the value of {\bf op}). The available operations are:
-
-\begin{description}
-\item[XENMEM\_increase\_reservation] Request an increase in machine
- memory allocation; {\bf arg} must point to a {\bf
- xen\_memory\_reservation} structure.
-\item[XENMEM\_decrease\_reservation] Request a decrease in machine
- memory allocation; {\bf arg} must point to a {\bf
- xen\_memory\_reservation} structure.
-\item[XENMEM\_maximum\_ram\_page] Request the frame number of the
- highest-addressed frame of machine memory in the system. {\bf arg}
- must point to an {\bf unsigned long} where this value will be
- stored.
-\item[XENMEM\_current\_reservation] Returns current memory reservation
- of the specified domain.
-\item[XENMEM\_maximum\_reservation] Returns maximum memory reservation
- of the specified domain.
-\end{description}
-
-\end{quote}
-
-In addition to simply reducing or increasing the current memory
-allocation via a `balloon driver', this call is also useful for
-obtaining contiguous regions of machine memory when required (e.g.
-for certain PCI devices, or if using superpages).
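-
-For illustration, a balloon driver might return pages to Xen as
-follows (variable names hypothetical):
-
-\scriptsize
-\begin{verbatim}
-struct xen_memory_reservation res;
-
-res.extent_start = pfn_list;      /* frame numbers being released */
-res.nr_extents   = 16;            /* how many extents              */
-res.extent_order = 0;             /* order 0: individual pages     */
-res.domid        = DOMID_SELF;
-
-/* Returns the number of extents actually released. */
-released = HYPERVISOR_memory_op(XENMEM_decrease_reservation, &res);
-\end{verbatim}
-\normalsize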
-
-
-\section{Inter-Domain Communication}
-\label{s:idc}
-
-Xen provides a simple asynchronous notification mechanism via
-\emph{event channels}. Each domain has a set of end-points (or
-\emph{ports}) which may be bound to an event source (e.g.\ a physical
-IRQ, a virtual IRQ, or a port in another domain). When a pair of
-end-points in two different domains are bound together, then a `send'
-operation on one will cause an event to be received by the destination
-domain.
-
-The control and use of event channels involves the following hypercall:
-
-\begin{quote}
-\hypercall{event\_channel\_op(evtchn\_op\_t *op)}
-
-Inter-domain event-channel management; {\bf op} is a discriminated
-union which allows the following 7 operations:
-
-\begin{description}
-
-\item[alloc\_unbound:] allocate a free (unbound) local
- port and prepare for connection from a specified domain.
-\item[bind\_virq:] bind a local port to a virtual
-IRQ; any particular VIRQ can be bound to at most one port per domain.
-\item[bind\_pirq:] bind a local port to a physical IRQ;
-once more, a given pIRQ can be bound to at most one port per
-domain. Furthermore the calling domain must be sufficiently
-privileged.
-\item[bind\_interdomain:] construct an interdomain event
-channel; in general, the target domain must have previously allocated
-an unbound port for this channel, although this can be bypassed by
-privileged domains during domain setup.
-\item[close:] close an interdomain event channel.
-\item[send:] send an event to the remote end of a
-interdomain event channel.
-\item[status:] determine the current status of a local port.
-\end{description}
-
-For more details see
-{\bf xen/include/public/event\_channel.h}.
-
-\end{quote}
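-
-For illustration, a frontend might allocate an unbound port for its
-backend to connect to; the wrapper name follows the XenLinux
-convention, and the local variables are hypothetical:
-
-\scriptsize
-\begin{verbatim}
-evtchn_op_t op;
-
-op.cmd                        = EVTCHNOP_alloc_unbound;
-op.u.alloc_unbound.dom        = DOMID_SELF;
-op.u.alloc_unbound.remote_dom = backend_domid;
-
-if (HYPERVISOR_event_channel_op(&op) == 0)
-    local_port = op.u.alloc_unbound.port;  /* returned by Xen */
-\end{verbatim}
-\normalsize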
-
-Event channels are the fundamental communication primitive between
-Xen domains and seamlessly support SMP. However they provide little
-bandwidth for communication {\sl per se}, and hence are typically
-married with a piece of shared memory to produce effective and
-high-performance inter-domain communication.
-
-Safe sharing of memory pages between guest OSes is carried out by
-granting access on a per page basis to individual domains. This is
-achieved by using the {\tt grant\_table\_op} hypercall.
-
-\begin{quote}
-\hypercall{grant\_table\_op(unsigned int cmd, void *uop, unsigned int count)}
-
-Used to invoke operations on a grant reference, to set up the grant
-table and to dump the table's contents for debugging.
-
-\end{quote}
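-
-For illustration, a guest might ask Xen where its grant table lives
-(the wrapper name follows the XenLinux convention; {\tt frames} is a
-hypothetical pre-allocated array):
-
-\scriptsize
-\begin{verbatim}
-struct gnttab_setup_table setup;
-
-setup.dom        = DOMID_SELF;
-setup.nr_frames  = 1;             /* request one grant-table frame */
-setup.frame_list = frames;        /* filled in with frame numbers  */
-
-(void)HYPERVISOR_grant_table_op(GNTTABOP_setup_table, &setup, 1);
-\end{verbatim}
-\normalsize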
-
-\section{IO Configuration}
-
-Domains with physical device access (i.e.\ driver domains) receive
-limited access to certain PCI devices (bus address space and
-interrupts). However, many guest operating systems attempt to
-determine the PCI configuration by directly accessing the PCI BIOS,
-which cannot be allowed for safety reasons.
-
-Instead, Xen provides the following hypercall:
-
-\begin{quote}
-\hypercall{physdev\_op(void *physdev\_op)}
-
-Set and query IRQ configuration details, set the system IOPL, set the
-TSS IO bitmap.
-
-\end{quote}
-
-
-For examples of using {\tt physdev\_op}, see the
-Xen-specific PCI code in the Linux sparse tree.
-
-\section{Administrative Operations}
-\label{s:dom0ops}
-
-A large number of control operations are available to a sufficiently
-privileged domain (typically domain 0). These allow the creation and
-management of new domains, for example. A complete list is given
-below; for more details on any or all of these, please see
-{\tt xen/include/public/dom0\_ops.h}.
-
-
-\begin{quote}
-\hypercall{dom0\_op(dom0\_op\_t *op)}
-
-Administrative domain operations for domain management. The options are:
-
-\begin{description}
-\item [DOM0\_GETMEMLIST:] get list of pages used by the domain
-
-\item [DOM0\_SCHEDCTL:] control global scheduler parameters
-
-\item [DOM0\_ADJUSTDOM:] adjust scheduling priorities for domain
-
-\item [DOM0\_CREATEDOMAIN:] create a new domain
-
-\item [DOM0\_DESTROYDOMAIN:] deallocate all resources associated
-with a domain
-
-\item [DOM0\_PAUSEDOMAIN:] remove a domain from the scheduler run
-queue.
-
-\item [DOM0\_UNPAUSEDOMAIN:] mark a paused domain as schedulable
- once again.
-
-\item [DOM0\_GETDOMAININFO:] get statistics about the domain
-
-\item [DOM0\_SETDOMAININFO:] set VCPU-related attributes
-
-\item [DOM0\_MSR:] read or write model specific registers
-
-\item [DOM0\_DEBUG:] interactively invoke the debugger
-
-\item [DOM0\_SETTIME:] set system time
-
-\item [DOM0\_GETPAGEFRAMEINFO:] get information about a single page frame
-
-\item [DOM0\_READCONSOLE:] read console content from hypervisor buffer ring
-
-\item [DOM0\_PINCPUDOMAIN:] pin domain to a particular CPU
-
-\item [DOM0\_TBUFCONTROL:] get and set trace buffer attributes
-
-\item [DOM0\_PHYSINFO:] get information about the host machine
-
-\item [DOM0\_SCHED\_ID:] get the ID of the current Xen scheduler
-
-\item [DOM0\_SHADOW\_CONTROL:] switch between shadow page-table modes
-
-\item [DOM0\_SETDOMAINMAXMEM:] set maximum memory allocation of a domain
-
-\item [DOM0\_GETPAGEFRAMEINFO2:] batched interface for getting
-page frame info
-
-\item [DOM0\_ADD\_MEMTYPE:] set MTRRs
-
-\item [DOM0\_DEL\_MEMTYPE:] remove a memory type range
-
-\item [DOM0\_READ\_MEMTYPE:] read MTRR
-
-\item [DOM0\_PERFCCONTROL:] control Xen's software performance
-counters
-
-\item [DOM0\_MICROCODE:] update CPU microcode
-
-\item [DOM0\_IOPORT\_PERMISSION:] modify domain permissions for an
-IO port range (enable / disable a range for a particular domain)
-
-\item [DOM0\_GETVCPUCONTEXT:] get context from a VCPU
-
-\item [DOM0\_GETVCPUINFO:] get current state for a VCPU
-\item [DOM0\_GETDOMAININFOLIST:] batched interface to get domain
-info
-
-\item [DOM0\_PLATFORM\_QUIRK:] inform Xen of a platform quirk it
-needs to handle (e.g. noirqbalance)
-
-\item [DOM0\_PHYSICAL\_MEMORY\_MAP:] get info about dom0's memory
-map
-
-\item [DOM0\_MAX\_VCPUS:] change max number of VCPUs for a domain
-
-\item [DOM0\_SETDOMAINHANDLE:] set the handle for a domain
-
-\end{description}
-\end{quote}
-
-Most of the above are best understood by looking at the code
-implementing them (in {\tt xen/common/dom0\_ops.c}) and in
-the user-space tools that use them (mostly in {\tt tools/libxc}).
-
-\section{Debugging Hypercalls}
-
-A few additional hypercalls are mainly useful for debugging:
-
-\begin{quote}
-\hypercall{console\_io(int cmd, int count, char *str)}
-
-Use Xen to interact with the console; operations are:
-
-{\bf CONSOLEIO\_write}: output {\bf count} characters from buffer {\bf str}.
-
-{\bf CONSOLEIO\_read}: input at most {\bf count} characters into buffer {\bf str}.
-\end{quote}
-
-A pair of hypercalls allows access to the underlying debug registers:
-\begin{quote}
-\hypercall{set\_debugreg(int reg, unsigned long value)}
-
-Set debug register {\bf reg} to {\bf value}.
-
-\hypercall{get\_debugreg(int reg)}
-
-Return the contents of the debug register {\bf reg}.
-\end{quote}
-
-And finally:
-\begin{quote}
-\hypercall{xen\_version(int cmd)}
-
-Request Xen version number.
-\end{quote}
-
-This is useful to ensure that user-space tools are in sync
-with the underlying hypervisor.
-
-
-\end{document}