author     akw27@labyrinth.cl.cam.ac.uk <akw27@labyrinth.cl.cam.ac.uk>  2004-11-03 19:42:40 +0000
committer  akw27@labyrinth.cl.cam.ac.uk <akw27@labyrinth.cl.cam.ac.uk>  2004-11-03 19:42:40 +0000
commit     7f5babb8dec8c9253b735553260ea178a552f5bc (patch)
tree       432e229ad2199d74ce111e3374a92f09fe5dd338
parent     5441bb5f9e08117b6c31bf29f6ffa997f4461740 (diff)
bitkeeper revision 1.1159.157.1 (418934b0-qzq3Mn8ZcEAFEUYycHb2Q)
Device section fixes.
-rw-r--r--  docs/src/interface.tex  |  227
1 file changed, 139 insertions(+), 88 deletions(-)
diff --git a/docs/src/interface.tex b/docs/src/interface.tex
index 1d5ede64bd..69a6d7114a 100644
--- a/docs/src/interface.tex
+++ b/docs/src/interface.tex
@@ -352,120 +352,171 @@ and loading its kernel image at the appropriate virtual address.
-\chapter{Network I/O}
+\chapter{Devices}
+
+Devices such as network and disk are exported to guests using a
+split device driver. The device driver domain, which accesses the
+physical device directly, also runs a {\em backend} driver, serving
+requests to that device from guests. Each guest uses a simple
+{\em frontend} driver to access the backend. Communication between these
+domains is composed of two parts: First, data is placed onto a shared
+memory page between the domains. Second, an event channel between the
+two domains is used to pass notification that data is outstanding.
+This separation of notification from data transfer allows message
+batching, and results in very efficient device access. Specific
+details on inter-domain communication are available in
+Appendix~\ref{s:idc}.
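+
+As a rough illustration of this pattern (the structure and function
+names below are hypothetical, not the actual Xen definitions), a
+split-driver channel pairs a shared page of messages with a single
+event channel used for batched notification:
+
+\begin{verbatim}
+/* Hypothetical sketch: one shared page carries the messages, one
+ * event channel carries notifications that messages are outstanding. */
+extern void notify_via_evtchn(int port);  /* stands in for the real
+                                             notification mechanism */
+
+struct shared_page {
+    unsigned int req_prod;    /* requests placed by the frontend  */
+    unsigned int resp_prod;   /* responses placed by the backend  */
+    char         ring[4096 - 2 * sizeof(unsigned int)];
+};
+
+void frontend_send(struct shared_page *page, int evtchn)
+{
+    /* 1. Write one or more requests into the ring (not shown).    */
+    /* 2. Publish them by advancing the producer index.            */
+    page->req_prod++;
+    /* 3. Send a single notification; because notification is
+     *    decoupled from data placement, many messages can share
+     *    one event.                                               */
+    notify_via_evtchn(evtchn);
+}
+\end{verbatim}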
+
+This chapter provides details on some individual device interfaces
+available to Xen guests.
+
+\section{Network I/O}
Virtual network device services are provided by shared memory
-communications with a `backend' domain. From the point of view of
+communications with a backend domain. From the point of view of
other domains, the backend may be viewed as a virtual ethernet switch
element with each domain having one or more virtual network interfaces
connected to it.
-\section{Backend Packet Handling}
-The backend driver is responsible primarily for {\it data-path} operations.
-In terms of networking this means packet transmission and reception.
+\subsection{Backend Packet Handling}
+
+The backend driver is responsible for a variety of actions relating to
+the transmission and reception of packets from the physical device.
+With regard to transmission, the backend performs these key actions:
-On the transmission side, the backend needs to perform two key actions:
\begin{itemize}
-\item {\tt Validation:} A domain may only be allowed to emit packets
-matching a certain specification; for example, ones in which the
-source IP address matches one assigned to the virtual interface over
-which it is sent. The backend would be responsible for ensuring any
-such requirements are met, either by checking or by stamping outgoing
-packets with prescribed values for certain fields.
-
-Validation functions can be configured using standard firewall rules
-(i.e. IP Tables, in the case of Linux).
-
-\item {\tt Scheduling:} Since a number of domains can share a single
-``real'' network interface, the hypervisor must mediate access when
-several domains each have packets queued for transmission. Of course,
-this general scheduling function subsumes basic shaping or
-rate-limiting schemes.
-
-\item {\tt Logging and Accounting:} The hypervisor can be configured
-with classifier rules that control how packets are accounted or
-logged. For example, {\it domain0} could request that it receives a
-log message or copy of the packet whenever another domain attempts to
-send a TCP packet containg a SYN.
+\item {\bf Validation:} To ensure that domains do not attempt to
+  generate invalid (e.g. spoofed) traffic, the backend driver may
+  validate headers, ensuring that the source MAC and IP addresses
+  match the interface from which they were sent (a minimal sketch of
+  such a check appears after this list).
+
+  Validation functions can be configured using standard firewall rules
+  ({\small{\tt iptables}} in the case of Linux).
+
+\item {\bf Scheduling:} Since a number of domains can share a single
+ physical network interface, the backend must mediate access when
+ several domains each have packets queued for transmission. This
+ general scheduling function subsumes basic shaping or rate-limiting
+ schemes.
+
+\item {\bf Logging and Accounting:} The backend domain can be
+ configured with classifier rules that control how packets are
+ accounted or logged. For example, log messages might be generated
+ whenever a domain attempts to send a TCP packet containing a SYN.
\end{itemize}
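+
+Purely as an illustration of the kind of header check involved (this
+is a hypothetical sketch, not the actual backend code, which would
+normally rely on {\small{\tt iptables}} rules), source-address
+validation might look like:
+
+\begin{verbatim}
+#include <stdbool.h>
+#include <stdint.h>
+#include <string.h>
+
+/* Hypothetical per-interface state: the addresses assigned to a
+ * guest's virtual interface.                                      */
+struct vif {
+    uint8_t  mac[6];   /* MAC address assigned to this interface  */
+    uint32_t ip;       /* IPv4 address assigned to this interface */
+};
+
+/* Reject a frame whose source MAC or source IP does not match the
+ * virtual interface it was sent from.                             */
+bool validate_frame(const struct vif *vif,
+                    const uint8_t *src_mac, uint32_t src_ip)
+{
+    if (memcmp(src_mac, vif->mac, 6) != 0)
+        return false;          /* spoofed source MAC: drop */
+    if (src_ip != vif->ip)
+        return false;          /* spoofed source IP: drop  */
+    return true;
+}
+\end{verbatim}
+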
-On the recive side, the backend's role is relatively straightforward:
-once a packet is received, it just needs to determine the virtual interface(s)
-to which it must be delivered and deliver it via page-flipping.
-
+On receipt of incoming packets, the backend acts as a simple
+demultiplexer: Packets are passed to the appropriate virtual
+interface after any necessary logging and accounting have been carried
+out.
-\section{Data Transfer}
+\subsection{Data Transfer}
Each virtual interface uses two ``descriptor rings'', one for transmit,
the other for receive. Each descriptor identifies a block of contiguous
-physical memory allocated to the domain. There are four cases:
-
-\begin{itemize}
-
-\item The transmit ring carries packets to transmit from the domain to the
-hypervisor.
-
-\item The return path of the transmit ring carries ``empty'' descriptors
-indicating that the contents have been transmitted and the memory can be
-re-used.
-
-\item The receive ring carries empty descriptors from the domain to the
-hypervisor; these provide storage space for that domain's received packets.
-
-\item The return path of the receive ring carries packets that have been
-received.
-\end{itemize}
-
-Real physical addresses are used throughout, with the domain performing
-translation from pseudo-physical addresses if that is necessary.
+physical memory allocated to the domain.
+
+The transmit ring carries packets to transmit from the guest to the
+backend domain. The return path of the transmit ring carries messages
+indicating that the contents have been physically transmitted and the
+backend no longer requires the associated pages of memory.
+
+To receive packets, the guest places descriptors of unused pages on
+the receive ring. The backend will return received packets by
+exchanging these pages in the domain's memory with new pages
+containing the received data, and passing back descriptors regarding
+the new packets on the ring. This zero-copy approach allows the
+backend to maintain a pool of free pages to receive packets into, and
+then deliver them to appropriate domains after examining their
+headers.
+
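+As a rough sketch of the receive path just described (the names are
+illustrative only, not the real interface definitions), a receive
+descriptor and the guest-side act of offering an empty page might
+look like:
+
+\begin{verbatim}
+/* Hypothetical receive descriptor: the guest offers an empty page,
+ * and the backend later returns a descriptor for a (different)
+ * page that holds the received packet.                            */
+struct rx_descriptor {
+    unsigned long frame;    /* page frame offered or returned      */
+    unsigned int  id;       /* guest-chosen request identifier     */
+    unsigned int  length;   /* bytes received (responses only)     */
+};
+
+/* Guest side: keep the receive ring stocked with empty buffers so
+ * that incoming packets are not dropped.                          */
+void post_rx_buffer(struct rx_descriptor *slot,
+                    unsigned long free_frame, unsigned int id)
+{
+    slot->frame  = free_frame;  /* page the backend may exchange   */
+    slot->id     = id;
+    slot->length = 0;
+}
+\end{verbatim}
+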
+%
+%Real physical addresses are used throughout, with the domain performing
+%translation from pseudo-physical addresses if that is necessary.
If a domain does not keep its receive ring stocked with empty buffers then
-packets destined to it may be dropped. This provides some defense against
-receiver-livelock problems because an overload domain will cease to receive
+packets destined to it may be dropped. This provides some defence against
+receive livelock problems because an overloaded domain will cease to receive
further data. Similarly, on the transmit path, it provides the application
with feedback on the rate at which packets are able to leave the system.
-Synchronization between the hypervisor and the domain is achieved using
-counters held in shared memory that is accessible to both. Each ring has
-associated producer and consumer indices indicating the area in the ring
-that holds descriptors that contain data. After receiving {\it n} packets
-or {\t nanoseconds} after receiving the first packet, the hypervisor sends
-an event to the domain.
-\chapter{Block I/O}
-
-\section{Virtual Block Devices (VBDs)}
-
-All guest OS disk access goes through the VBD interface. The VBD
-interface provides the administrator with the ability to selectively
-grant domains access to portions of block storage devices visible to
-the the block backend device (usually domain 0).
-
-VBDs can literally be backed by any block device accessible to the
-backend domain, including network-based block devices (iSCSI, *NBD,
-etc), loopback devices and LVM / MD devices.
+Flow control on rings is achieved by including a pair of producer
+indices on the shared ring page. Each side will maintain a private
+consumer index indicating the next outstanding message. In this
+manner, the domains cooperate to divide the ring into two message
+lists, one in each direction. Notification is decoupled from the
+immediate placement of new messages on the ring; the event channel
+will be used to generate notification when {\em either} a certain
+number of outstanding messages are queued, {\em or} a specified number
+of nanoseconds have elapsed since the oldest message was placed on the
+ring.
+
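+The following sketch illustrates the index arithmetic this implies
+(the names and ring size are illustrative, not the actual ring
+definitions):
+
+\begin{verbatim}
+#define RING_SIZE 256        /* illustrative ring size (power of two) */
+
+/* Shared page: one producer index per direction.                    */
+struct ring_page {
+    unsigned int prod_fwd;   /* messages placed frontend -> backend  */
+    unsigned int prod_bwd;   /* messages placed backend -> frontend  */
+    /* ... message slots ...                                         */
+};
+
+/* Each side keeps a private consumer index for the direction it
+ * reads; indices are free-running counters, so subtraction gives
+ * the number of outstanding messages even after wraparound.         */
+unsigned int outstanding(unsigned int producer, unsigned int consumer)
+{
+    return producer - consumer;
+}
+
+int ring_full(unsigned int producer, unsigned int consumer)
+{
+    return outstanding(producer, consumer) == RING_SIZE;
+}
+\end{verbatim}
+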
+% Not sure if my version is any better -- here is what was here before:
+%% Synchronization between the backend domain and the guest is achieved using
+%% counters held in shared memory that is accessible to both. Each ring has
+%% associated producer and consumer indices indicating the area in the ring
+%% that holds descriptors that contain data. After receiving {\it n} packets
+%% or {\t nanoseconds} after receiving the first packet, the hypervisor sends
+%% an event to the domain.
+
+\section{Block I/O}
+
+All guest OS disk access goes through the virtual block device (VBD)
+interface. This interface allows domains access to portions of block
+storage devices visible to the block backend domain. The VBD
+interface is a split driver, similar to the network interface
+described above. A single shared memory ring is used between the
+frontend and backend drivers, across which read and write messages are
+sent.
+
+Any block device accessible to the backend domain, including
+network-based block (iSCSI, *NBD, etc), loopback and LVM/MD devices,
+can be exported as a VBD. Each VBD is mapped to a device node in the
+guest, specified in the guest's startup configuration.
Old (Xen 1.2) virtual disks are not supported under Xen 2.0, since
similar functionality can be achieved using the (more advanced) LVM
system, which is already in widespread use.
\subsection{Data Transfer}
-Domains which have been granted access to a logical block device are permitted
-to read and write it by shared memory communications with the backend domain.
-
-In overview, the same style of descriptor-ring that is used for
-network packets is used here. Each domain has one ring that carries
-operation requests to the hypervisor and carries the results back
-again.
-
-Rather than copying data, the backend simply maps the domain's buffers
-in order to enable direct DMA to them. The act of mapping the buffers
-also increases the reference counts of the underlying pages, so that
-the unprivileged domain cannot try to return them to the hypervisor,
-install them as page tables, or any other unsafe behaviour.
-%block API here
-
-\chapter{Privileged operations}
+
+The single ring between the guest and the block backend supports three
+messages:
+
+\begin{description}
+\item [{\small {\tt PROBE}}:] Return a list of the VBDs available to this guest
+ from the backend. The request includes a descriptor of a free page
+ into which the reply will be written by the backend.
+
+\item [{\small {\tt READ}}:] Read data from the specified block device. The
+ frontend identifies the device and location to read from and
+ attaches pages for the data to be copied to (typically via DMA from
+ the device). The backend acknowledges completed read requests as
+ they finish.
+
+\item [{\small {\tt WRITE}}:] Write data to the specified block device. This
+ functions essentially as {\small {\tt READ}}, except that the data moves to
+ the device instead of from it.
+\end{description}
+
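+As a sketch of what such messages might contain (the field names here
+are illustrative and do not reproduce the actual block interface
+headers):
+
+\begin{verbatim}
+enum blk_operation { BLK_PROBE, BLK_READ, BLK_WRITE };
+
+/* Hypothetical block request, placed on the single shared ring.   */
+struct blk_request {
+    enum blk_operation op;      /* PROBE, READ or WRITE            */
+    unsigned int       device;  /* which VBD the request refers to */
+    unsigned long      sector;  /* starting sector on that VBD     */
+    unsigned long      frame;   /* page to transfer data to/from   */
+    unsigned int       id;      /* echoed back in the response     */
+};
+
+/* Hypothetical response, returned on the same ring.               */
+struct blk_response {
+    unsigned int id;            /* matches the originating request */
+    int          status;        /* 0 on success, negative on error */
+};
+\end{verbatim}
+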
+% um... some old text
+%% In overview, the same style of descriptor-ring that is used for
+%% network packets is used here. Each domain has one ring that carries
+%% operation requests to the hypervisor and carries the results back
+%% again.
+
+%% Rather than copying data, the backend simply maps the domain's buffers
+%% in order to enable direct DMA to them. The act of mapping the buffers
+%% also increases the reference counts of the underlying pages, so that
+%% the unprivileged domain cannot try to return them to the hypervisor,
+%% install them as page tables, or any other unsafe behaviour.
+%% %block API here
+
+
+% akw: demoting this to a section -- not sure if there is any point
+% though, maybe just remove it.
+\section{Privileged operations}
{\it Domain0} is responsible for building all other domains on the server
and providing control interfaces for managing scheduling, networking, and
blocks.