\chapter{Devices}
\label{c:devices}

Devices such as network and disk are exported to guests using a split
device driver.  The device driver domain, which accesses the physical
device directly, also runs a \emph{backend} driver that serves requests
to that device from guests.  Each guest uses a simple \emph{frontend}
driver to access the backend.  Communication between these domains has
two parts: first, data is placed onto a shared memory page between the
domains; second, an event channel between the two domains is used to
signal that data is outstanding.  This separation of notification from
data transfer allows message batching, and results in very efficient
device access.

Event channels are used extensively in device virtualization; each
domain has a number of end-points, or \emph{ports}, each of which may be
bound to one of the following \emph{event sources}:
\begin{itemize}
  \item a physical interrupt from a real device, 
  \item a virtual interrupt (callback) from Xen, or 
  \item a signal from another domain 
\end{itemize}

Events are lightweight and carry little information beyond the
source of the notification.  Hence, when performing bulk data transfer,
events are typically used as synchronization primitives over a shared
memory transport. Event channels are managed via the {\tt
  event\_channel\_op()} hypercall; for more details see
Section~\ref{s:idc}.
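
By way of illustration, the fragment below sketches how a guest kernel
might bind an interdomain event channel to its backend domain using
this hypercall.  It is a minimal sketch only: the operation structure
and field names follow the general style of the Xen public headers but
are assumptions rather than the exact ABI, and {\tt backend\_domid} and
{\tt local\_port} are assumed to be defined elsewhere.

{\small
\begin{verbatim}
/* Illustrative sketch only: assumes the Xen public headers; the
 * structure and field names approximate, but may not match, the
 * real ABI for this hypercall. */
evtchn_op_t op;

op.cmd = EVTCHNOP_bind_interdomain;
op.u.bind_interdomain.dom1 = DOMID_SELF;     /* this domain          */
op.u.bind_interdomain.dom2 = backend_domid;  /* driver (backend) dom */

if ( HYPERVISOR_event_channel_op(&op) != 0 )
    panic("could not bind event channel");

local_port = op.u.bind_interdomain.port1;    /* our end of the pair  */
/* Notifications sent on local_port now raise an event at the
 * backend's end-point, and vice versa. */
\end{verbatim}
}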

This chapter focuses on some individual device interfaces available to
Xen guests.


\section{Network I/O}

Virtual network device services are provided by shared memory
communication with a backend domain.  From the point of view of other
domains, the backend may be viewed as a virtual Ethernet switch
element, with each domain having one or more virtual network interfaces
connected to it.

\subsection{Backend Packet Handling}

The backend driver is responsible for a variety of actions relating to
the transmission and reception of packets via the physical device.
With regard to transmission, the backend performs these key actions:

\begin{itemize}
\item {\bf Validation:} To ensure that domains do not attempt to
  generate invalid (e.g.\ spoofed) traffic, the backend driver may
  validate headers, ensuring that the source MAC and IP addresses match
  the interface from which they were sent.

  Validation functions can be configured using standard firewall rules
  ({\small{\tt iptables}} in the case of Linux).
  
\item {\bf Scheduling:} Since a number of domains can share a single
  physical network interface, the backend must mediate access when
  several domains each have packets queued for transmission.  This
  general scheduling function subsumes basic shaping or rate-limiting
  schemes.
  
\item {\bf Logging and Accounting:} The backend domain can be
  configured with classifier rules that control how packets are
  accounted or logged.  For example, log messages might be generated
  whenever a domain attempts to send a TCP packet containing a SYN.
\end{itemize}

On receipt of incoming packets, the backend acts as a simple
demultiplexer: Packets are passed to the appropriate virtual interface
after any necessary logging and accounting have been carried out.

\subsection{Data Transfer}

Each virtual interface uses two ``descriptor rings'', one for
transmit, the other for receive.  Each descriptor identifies a block
of contiguous physical memory allocated to the domain.

The transmit ring carries packets to transmit from the guest to the
backend domain.  The return path of the transmit ring carries messages
indicating that the contents have been physically transmitted and the
backend no longer requires the associated pages of memory.

To receive packets, the guest places descriptors of unused pages on
the receive ring.  The backend will return received packets by
exchanging these pages in the domain's memory with new pages
containing the received data, and passing back descriptors regarding
the new packets on the ring.  This zero-copy approach allows the
backend to maintain a pool of free pages to receive packets into, and
then deliver them to appropriate domains after examining their
headers.
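
The sketch below illustrates how a frontend might keep its receive
ring stocked with descriptors of unused pages, as described above.
The descriptor and ring layout, and the {\tt alloc\_unused\_page()}
helper, are hypothetical simplifications for illustration only; the
real shared structures in the Xen network interface differ.

{\small
\begin{verbatim}
/* Hypothetical, simplified receive ring -- illustration only. */
#define RX_RING_SIZE 256

typedef struct {
    unsigned long page_addr;  /* address of an unused page           */
    unsigned int  id;         /* token echoed back with the packet   */
} rx_desc_t;

typedef struct {
    unsigned int req_prod;    /* advanced by the frontend            */
    unsigned int resp_prod;   /* advanced by the backend             */
    rx_desc_t    ring[RX_RING_SIZE];
} rx_ring_t;

/* Post free pages so that incoming packets are not dropped.
 * 'resp_cons' is the frontend's private count of consumed responses,
 * so req_prod - resp_cons is the number of outstanding descriptors. */
static void refill_rx_ring(rx_ring_t *rx, unsigned int resp_cons)
{
    while ( rx->req_prod - resp_cons < RX_RING_SIZE )
    {
        unsigned int i = rx->req_prod % RX_RING_SIZE;
        rx->ring[i].page_addr = alloc_unused_page();  /* hypothetical */
        rx->ring[i].id        = i;
        rx->req_prod++;       /* expose the new descriptor           */
    }
}
\end{verbatim}
}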

% Real physical addresses are used throughout, with the domain
% performing translation from pseudo-physical addresses if that is
% necessary.

If a domain does not keep its receive ring stocked with empty buffers
then packets destined for it may be dropped.  This provides some
defence against receive livelock problems, because an overloaded domain
will cease to receive further data.  Similarly, on the transmit path,
it provides the application with feedback on the rate at which packets
are able to leave the system.

Flow control on rings is achieved by including a pair of producer
indices on the shared ring page.  Each side maintains a private
consumer index indicating the next outstanding message.  In this
manner, the domains cooperate to divide the ring into two message
lists, one in each direction.  Notification is decoupled from the
immediate placement of new messages on the ring; the event channel
will be used to generate notification when {\em either} a certain
number of outstanding messages are queued, {\em or} a specified number
of nanoseconds have elapsed since the oldest message was placed on the
ring.
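
A minimal sketch of this scheme, assuming free-running indices and
showing only one direction of transfer, is given below.  The index
structure, the batching threshold, and the {\tt notify\_via\_evtchn()}
and {\tt handle\_message()} helpers are hypothetical simplifications,
not the real shared definitions; {\tt local\_port} is assumed to have
been bound as in the earlier event channel sketch.

{\small
\begin{verbatim}
/* Hypothetical flow-control sketch -- illustration only. */
typedef struct {
    unsigned int req_prod;   /* advanced by the request producer     */
    unsigned int resp_prod;  /* advanced by the response producer    */
    /* request/response entries occupy the rest of the shared page   */
} ring_indices_t;

#define BATCH_THRESHOLD 8

static unsigned int req_cons;       /* private to the consumer side  */
static unsigned int last_notified;  /* private to the producer side  */

/* Producer: queue a message; notify only once per batch.  A timer
 * (not shown) bounds the latency when batches remain small. */
static void produce(ring_indices_t *ring)
{
    /* ... write the message into slot ring->req_prod ... */
    ring->req_prod++;
    if ( ring->req_prod - last_notified >= BATCH_THRESHOLD )
    {
        notify_via_evtchn(local_port);   /* hypothetical wrapper */
        last_notified = ring->req_prod;
    }
}

/* Consumer: on receiving an event, drain everything between the
 * private consumer index and the shared producer index. */
static void consume(ring_indices_t *ring)
{
    while ( req_cons != ring->req_prod )
        handle_message(req_cons++);      /* hypothetical helper  */
}
\end{verbatim}
}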



\section{Block I/O}

All guest OS disk access goes through the virtual block device (VBD)
interface.  This interface allows domains access to portions of block
storage devices visible to the block backend device.  The VBD
interface is a split driver, similar to the network interface
described above.  A single shared memory ring is used between the
frontend and backend drivers, across which read and write messages are
sent.

Any block device accessible to the backend domain, including
network-based block devices (iSCSI, *NBD, etc.), loopback and LVM/MD
devices, can be exported as a VBD.  Each VBD is mapped to a device node
in the guest, specified in the guest's startup configuration.

Old (Xen 1.2) virtual disks are not supported under Xen 2.0, since
similar functionality can be achieved using the more complete LVM
system, which is already in widespread use.

\subsection{Data Transfer}

The single ring between the guest and the block backend supports three
messages:

\begin{description}
\item [{\small {\tt PROBE}}:] Return a list of the VBDs available to
  this guest from the backend.  The request includes a descriptor of a
  free page into which the reply will be written by the backend.

\item [{\small {\tt READ}}:] Read data from the specified block
  device.  The frontend identifies the device and location to read
  from and attaches pages into which the data will be copied (typically
  via DMA from the device).  The backend acknowledges completed read
  requests as they finish.

\item [{\small {\tt WRITE}}:] Write data to the specified block
  device.  This functions essentially as {\small {\tt READ}}, except
  that the data moves to the device instead of from it.
\end{description}
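
To make the message format concrete, the sketch below shows a
simplified read request being prepared for the block ring.  The
structure, the operation constant and the {\tt next\_request\_id}
counter are hypothetical simplifications; the real block interface
definitions in the Xen headers have a different layout.

{\small
\begin{verbatim}
/* Hypothetical, simplified block request -- illustration only. */
typedef struct {
    unsigned char  operation;      /* read, write or probe          */
    unsigned long  id;             /* echoed back in the response   */
    unsigned short device;         /* which VBD to access           */
    unsigned long  sector_number;  /* starting sector on the VBD    */
    unsigned long  buffer_page;    /* page the data is copied into  */
} blk_request_t;

#define OP_READ 1

static unsigned long next_request_id;

/* Prepare a one-page read from 'sector' on virtual device 'dev'. */
static void prepare_read(blk_request_t *req, unsigned short dev,
                         unsigned long sector, unsigned long page)
{
    req->operation     = OP_READ;
    req->id            = next_request_id++;
    req->device        = dev;
    req->sector_number = sector;
    req->buffer_page   = page;
    /* The request is then placed on the shared ring and the backend
     * notified via the event channel, as for the network rings.    */
}
\end{verbatim}
}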
