\newcommand{\hypercall}[1]{\vspace{2mm}{\sf #1}}

\chapter{Xen Hypercalls}
\label{a:hypercalls}

Hypercalls represent the procedural interface to Xen; this appendix 
categorizes and describes the current set of hypercalls. 

\section{Invoking Hypercalls} 

Hypercalls are invoked in a manner analogous to system calls in a
conventional operating system; a software interrupt is issued which
vectors to an entry point within Xen. On x86\_32 machines the
instruction required is {\tt int \$0x82}; the (real) IDT is set up so
that this may only be issued from within ring 1. The particular 
hypercall to be invoked is contained in {\tt EAX}; a list 
mapping these values to symbolic hypercall names can be found 
in {\tt xen/include/public/xen.h}. 

On some occasions a set of hypercalls will be required to carry
out a higher-level function; a good example is when a guest 
operating system wishes to context switch to a new process, which 
requires updating various privileged CPU state. As an optimization
for these cases, there is a generic mechanism to issue a set of 
hypercalls as a batch: 

\begin{quote}
\hypercall{multicall(void *call\_list, int nr\_calls)}

Execute a series of hypervisor calls; {\tt nr\_calls} is the length of
the array of {\tt multicall\_entry\_t} structures pointed to by {\tt
call\_list}. Each entry contains the hypercall operation code followed
by up to 7 word-sized arguments.
\end{quote}

Note that multicalls are provided purely as an optimization; there is
no requirement to use them when first porting a guest operating
system.
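
As a rough illustration of how a batch might be assembled, the
following C sketch fills an array of entries before it would be handed
to {\tt multicall()}. The structure layout and the helper {\tt
multicall\_fill()} are assumptions made for this example; the
authoritative definition of {\tt multicall\_entry\_t} is in the Xen
public headers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MULTICALL_MAX_ARGS 7   /* "up to 7 word-sized arguments" */

/* Assumed layout: the operation code followed by the argument words. */
typedef struct {
    unsigned long op;
    unsigned long args[MULTICALL_MAX_ARGS];
} multicall_entry_t;

/*
 * Hypothetical helper: fill one slot of a batch.
 * Returns 0 on success, or -1 if more than 7 arguments are supplied.
 */
static int multicall_fill(multicall_entry_t *e, unsigned long op,
                          const unsigned long *args, size_t nr_args)
{
    assert(e != NULL);
    if (nr_args > MULTICALL_MAX_ARGS)
        return -1;

    memset(e, 0, sizeof(*e));          /* unused argument words stay zero */
    e->op = op;
    if (nr_args > 0) {
        assert(args != NULL);
        memcpy(e->args, args, nr_args * sizeof(args[0]));
    }
    return 0;
}
```

A guest would fill one slot per pending operation and then pass the
array and its length to {\tt multicall()}.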


\section{Virtual CPU Setup} 

At start of day, a guest operating system needs to set up the virtual
CPU it is executing on. This includes installing vectors for the
virtual IDT so that the guest OS can handle interrupts, page faults,
etc. However, the very first thing a guest OS must set up is a pair 
of hypervisor callbacks: these are the entry points which Xen will
use when it wishes to notify the guest OS of an occurrence. 

\begin{quote}
\hypercall{set\_callbacks(unsigned long event\_selector, unsigned long
  event\_address, unsigned long failsafe\_selector, unsigned long
  failsafe\_address) }

Register the normal (``event'') and failsafe callbacks for 
event processing. In each case the code segment selector and 
address within that segment are provided. The selectors must
have RPL 1; in XenLinux we simply use the kernel's CS for both 
{\tt event\_selector} and {\tt failsafe\_selector}.

The value {\tt event\_address} specifies the address of the guest OS's
event handling and dispatch routine; the {\tt failsafe\_address}
specifies a separate entry point which is used only if a fault occurs
when Xen attempts to use the normal callback. 
\end{quote} 


After installing the hypervisor callbacks, the guest OS can 
install a `virtual IDT' by using the following hypercall: 

\begin{quote} 
\hypercall{set\_trap\_table(trap\_info\_t *table)} 

Install one or more entries into the per-domain 
trap handler table (essentially a software version of the IDT). 
Each entry in the array pointed to by {\tt table} includes the 
exception vector number with the corresponding segment selector 
and entry point. Most guest OSes can use the same handlers on 
Xen as when running on the real hardware; an exception is the 
page fault handler (exception vector 14) where a modified 
stack-frame layout is used. 


\end{quote} 



\section{Scheduling and Timer}

Domains are preemptively scheduled by Xen according to the 
parameters installed by domain 0 (see Section~\ref{s:dom0ops}). 
In addition, however, a domain may choose to explicitly 
control certain behavior with the following hypercall: 

\begin{quote} 
\hypercall{sched\_op(unsigned long op)} 

Request a scheduling operation from the hypervisor. The options are: {\it
yield}, {\it block}, and {\it shutdown}.  {\it yield} keeps the
calling domain runnable but may cause a reschedule if other domains
are runnable.  {\it block} removes the calling domain from the run
queue and causes it to sleep until an event is delivered to it.  {\it
shutdown} is used to end the domain's execution; the caller can
additionally specify whether the domain should reboot, halt or
suspend.
\end{quote} 

To aid the implementation of a process scheduler within a guest OS,
Xen provides a virtual programmable timer:

\begin{quote}
\hypercall{set\_timer\_op(uint64\_t timeout)} 

Request a timer event to be sent at the specified system time (time 
in nanoseconds since system boot). The hypercall actually passes the 
64-bit timeout value as a pair of 32-bit values. 

\end{quote} 

Note that calling {\tt set\_timer\_op()} prior to {\tt sched\_op} 
allows block-with-timeout semantics. 
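
The block-with-timeout pattern can be sketched as follows. The two
{\tt stub\_} functions are hypothetical stand-ins for a guest's real
hypercall wrappers and only record their arguments; the order in which
the two 32-bit halves are passed is likewise an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for real hypercall wrappers, which exist only inside a
 * guest kernel. They merely record what was requested. */
static uint32_t armed_hi, armed_lo;   /* the two 32-bit halves passed down */
static int      block_requested;

static void stub_set_timer_op(uint32_t hi, uint32_t lo)
{
    armed_hi = hi;
    armed_lo = lo;
}

static void stub_sched_op_block(void)
{
    block_requested = 1;
}

/*
 * Sleep until an event arrives or until timeout_ns nanoseconds have
 * passed. now_ns is the current system time (nanoseconds since boot).
 */
static void block_with_timeout(uint64_t now_ns, uint64_t timeout_ns)
{
    uint64_t deadline = now_ns + timeout_ns;   /* absolute system time */

    assert(deadline >= now_ns);                /* reject wrap-around */

    /* Arm the timer first, then block: the timer event wakes the
     * domain if nothing else does. */
    stub_set_timer_op((uint32_t)(deadline >> 32), (uint32_t)deadline);
    stub_sched_op_block();
}
```

Arming the timer before blocking matters: in the opposite order the
domain would already be asleep and could not request the wake-up.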


\section{Page Table Management} 

Since guest operating systems have read-only access to their page 
tables, Xen must be involved when making any changes. The following
multi-purpose hypercall can be used to modify page-table entries, 
update the machine-to-physical mapping table, flush the TLB, install 
a new page-table base pointer, and more.

\begin{quote} 
\hypercall{mmu\_update(mmu\_update\_t *req, int count, int *success\_count)} 

Update the page table for the domain; a set of {\tt count} updates are
submitted for processing in a batch, with {\tt success\_count} being 
updated to report the number of successful updates.  

Each element of {\tt req[]} contains a pointer (address) and value; 
the two least significant bits of the pointer are used to distinguish 
the type of update requested as follows:
\begin{description} 

\item[\it MMU\_NORMAL\_PT\_UPDATE:] update a page directory entry or
page table entry to the associated value; Xen will check that the
update is safe, as described in Chapter~\ref{c:memory}.

\item[\it MMU\_MACHPHYS\_UPDATE:] update an entry in the
  machine-to-physical table. The calling domain must own the machine
  page in question (or be privileged).

\item[\it MMU\_EXTENDED\_COMMAND:] perform additional MMU operations.
The set of additional MMU operations is considerable, and includes
updating {\tt cr3} (or just re-installing it for a TLB flush),
flushing the cache, installing a new LDT, or pinning \& unpinning
page-table pages (to ensure their reference count doesn't drop to zero
which would require a revalidation of all entries).

Further extended commands are used to deal with granting and 
acquiring page ownership; see Section~\ref{s:idc}. 


\end{description}

More details on the precise format of all commands can be 
found in {\tt xen/include/public/xen.h}. 


\end{quote}
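
The pointer encoding used by {\tt mmu\_update()} can be sketched in C
as follows. The structure layout and the numeric type codes are
placeholders chosen for illustration (the authoritative definitions
are in {\tt xen/include/public/xen.h}), and the helper names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder type codes; consult xen/include/public/xen.h for the
 * values a real guest must use. */
#define MMU_NORMAL_PT_UPDATE 0u
#define MMU_MACHPHYS_UPDATE  2u
#define MMU_EXTENDED_COMMAND 3u
#define MMU_TYPE_MASK        3u   /* the two least significant bits */

/* Assumed layout: one pointer word and one value word per request. */
typedef struct {
    uintptr_t ptr;
    uintptr_t val;
} mmu_update_t;

/* Build a request: the type is folded into the low bits of the address. */
static mmu_update_t mmu_make_request(uintptr_t addr, unsigned int type,
                                     uintptr_t val)
{
    mmu_update_t req;

    assert((addr & MMU_TYPE_MASK) == 0);   /* address must be 4-byte aligned */
    assert(type <= MMU_TYPE_MASK);
    req.ptr = addr | type;
    req.val = val;
    return req;
}

/* Recover the update type from a request. */
static unsigned int mmu_request_type(const mmu_update_t *r)
{
    return (unsigned int)(r->ptr & MMU_TYPE_MASK);
}

/* Recover the address with the type bits masked off. */
static uintptr_t mmu_request_addr(const mmu_update_t *r)
{
    return r->ptr & ~(uintptr_t)MMU_TYPE_MASK;
}
```

The encoding works because page-table entries are word-aligned, so the
two low bits of their addresses are otherwise always zero.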

Explicitly updating batches of page table entries is extremely
efficient, but can require a number of alterations to the guest
OS. Using the writable page table mode (Chapter~\ref{c:memory}) is
recommended for new OS ports.

Regardless of which page table update mode is being used, however,
there are some occasions (notably handling a demand page fault) where
a guest OS will wish to modify exactly one PTE rather than a
batch. This is catered for by the following:

\begin{quote} 
\hypercall{update\_va\_mapping(unsigned long page\_nr, unsigned long
val, \\ unsigned long flags)}

Update the currently installed PTE for the page {\tt page\_nr} to 
{\tt val}. As with {\tt mmu\_update()}, Xen checks the modification 
is safe before applying it. The {\tt flags} determine which kind
of TLB flush, if any, should follow the update. 

\end{quote} 

Finally, sufficiently privileged domains may occasionally wish to manipulate 
the pages of others: 
\begin{quote}

\hypercall{update\_va\_mapping\_otherdomain(unsigned long page\_nr,
unsigned long val, unsigned long flags, uint16\_t domid)}

Identical to {\tt update\_va\_mapping()} save that the pages being
mapped must belong to the domain {\tt domid}. 

\end{quote}

This privileged operation is currently used by backend virtual device
drivers to safely map pages containing I/O data. 



\section{Segmentation Support}

Xen allows guest OSes to install a custom GDT if they require it; 
this is context switched transparently whenever a domain is 
[de]scheduled.  The following hypercall is effectively a 
`safe' version of {\tt lgdt}: 

\begin{quote}
\hypercall{set\_gdt(unsigned long *frame\_list, int entries)} 

Install a global descriptor table for a domain; {\tt frame\_list} is
an array of up to 16 machine page frames within which the GDT resides,
with {\tt entries} being the actual number of descriptor-entry
slots. All page frames must be mapped read-only within the guest's
address space, and the table must be large enough to contain Xen's
reserved entries (see {\tt xen/include/public/arch-x86\_32.h}).

\end{quote}
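
To make the sizing constraint of {\tt set\_gdt()} concrete, the sketch
below computes how many frames a table of a given size occupies. It
assumes 4~KiB pages and 8-byte descriptors, so one frame holds 512
descriptors and 16 frames hold at most 8192; {\tt
gdt\_frames\_needed()} is a hypothetical helper, not part of the Xen
interface.

```c
#include <assert.h>

#define GDT_DESC_BYTES  8u      /* size of one descriptor */
#define GDT_PAGE_BYTES  4096u   /* assumption: 4 KiB pages */
#define GDT_MAX_FRAMES  16u     /* limit on the length of frame_list */
#define GDT_PER_FRAME   (GDT_PAGE_BYTES / GDT_DESC_BYTES)   /* 512 */

/*
 * Number of page frames needed to hold `entries` descriptors, or -1
 * if the table is empty or would not fit in 16 frames.
 */
static int gdt_frames_needed(unsigned int entries)
{
    unsigned int frames;

    if (entries == 0u || entries > GDT_MAX_FRAMES * GDT_PER_FRAME)
        return -1;

    frames = (entries + GDT_PER_FRAME - 1u) / GDT_PER_FRAME;   /* round up */
    assert(frames >= 1u && frames <= GDT_MAX_FRAMES);
    return (int)frames;
}
```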

Many guest OSes will also wish to install LDTs; this is achieved by
using {\tt mmu\_update()} with an extended command, passing the
linear address of the LDT base along with the number of entries. No
special safety checks are required; Xen needs to perform this task
simply because {\tt lldt} requires CPL 0.


Xen also allows guest operating systems to update just an 
individual segment descriptor in the GDT or LDT:  

\begin{quote}
\hypercall{update\_descriptor(unsigned long ma, unsigned long word1,
unsigned long word2)}

Update the GDT/LDT entry at machine address {\tt ma}; the new
8-byte descriptor is stored in {\tt word1} and {\tt word2}.
Xen performs a number of checks to ensure the descriptor is 
valid. 

\end{quote}

Guest OSes can use the above in place of context switching entire 
LDTs (or the GDT) when the number of changing descriptors is small. 

\section{Context Switching} 

When a guest OS wishes to context switch between two processes, 
it can use the page table and segmentation hypercalls described
above to perform the bulk of the privileged work. In addition, 
however, it will need to invoke Xen to switch the kernel (ring 1) 
stack pointer: 

\begin{quote} 
\hypercall{stack\_switch(unsigned long ss, unsigned long esp)} 

Request a kernel stack switch from the hypervisor; {\tt ss} is the new 
stack segment, while {\tt esp} is the new stack pointer. 

\end{quote} 

A final useful hypercall for context switching allows ``lazy'' 
save and restore of floating point state: 

\begin{quote}
\hypercall{fpu\_taskswitch(void)} 

This call instructs Xen to set the {\tt TS} bit in the {\tt cr0}
control register; this means that the next attempt to use floating
point will cause a trap which the guest OS can handle. Typically it will
then save/restore the FP state, and clear the {\tt TS} bit. 
\end{quote} 

This is provided as an optimization only; guest OSes can also choose
to save and restore FP state on all context switches for simplicity. 
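
The lazy scheme can be illustrated with a small user-space model. None
of the names below come from Xen: {\tt model\_context\_switch()}
stands for a guest context switch that calls {\tt fpu\_taskswitch()},
and {\tt model\_fp\_use()} stands for the first floating point
instruction executed afterwards, together with the resulting trap
handler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one virtual CPU; purely illustrative. */
struct vcpu_model {
    bool ts;           /* models the TS bit in cr0 */
    int  fpu_owner;    /* task whose FP state is currently loaded */
    int  current;      /* task currently running */
    int  fp_switches;  /* FP save/restore operations performed so far */
};

/* Context switch: no FP state is touched, the trap is merely armed. */
static void model_context_switch(struct vcpu_model *v, int next)
{
    assert(v != NULL);
    v->current = next;
    v->ts = true;              /* the effect of fpu_taskswitch() */
}

/* An FP instruction: traps only if TS is set. */
static void model_fp_use(struct vcpu_model *v)
{
    assert(v != NULL);
    if (!v->ts)
        return;                /* no trap, the FPU is already usable */

    if (v->fpu_owner != v->current) {
        v->fp_switches++;      /* save old owner's state, restore ours */
        v->fpu_owner = v->current;
    }
    v->ts = false;             /* further FP use proceeds without a trap */
}
```

In this model a task that never touches floating point costs no FP
save or restore at all, which is the saving the optimization is after.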


\section{Physical Memory Management}

As mentioned previously, each domain has a maximum and current 
memory allocation. The maximum allocation, set at domain creation 
time, cannot be modified. However a domain can choose to reduce 
and subsequently grow its current allocation by using the
following call: 

\begin{quote} 
\hypercall{dom\_mem\_op(unsigned int op, unsigned long *extent\_list,
  unsigned long nr\_extents, unsigned int extent\_order)}

Increase or decrease current memory allocation (as determined by 
the value of {\tt op}). Each invocation provides a list of 
extents each of which is $2^s$ pages in size, 
where $s$ is the value of {\tt extent\_order}. 

\end{quote} 

In addition to simply reducing or increasing the current memory
allocation via a `balloon driver', this call is also useful for 
obtaining contiguous regions of machine memory when required (e.g. 
for certain PCI devices, or if using superpages).  
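
The relationship between {\tt extent\_order}, {\tt nr\_extents} and
the amount of memory involved can be sketched as below; the helper
names are hypothetical. For instance, an order of 10 gives extents of
1024 pages, which with 4~KiB pages (an assumption of this example) is
4~MiB.

```c
#include <assert.h>
#include <limits.h>

/* Pages in one extent of the given order: 2^extent_order. */
static unsigned long extent_pages(unsigned int extent_order)
{
    assert(extent_order < sizeof(unsigned long) * CHAR_BIT);
    return 1UL << extent_order;
}

/* Total pages covered by a request of nr_extents extents. */
static unsigned long request_pages(unsigned long nr_extents,
                                   unsigned int extent_order)
{
    unsigned long per_extent = extent_pages(extent_order);

    assert(nr_extents <= ULONG_MAX / per_extent);   /* no overflow */
    return nr_extents * per_extent;
}
```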


\section{Inter-Domain Communication}
\label{s:idc} 

Xen provides a simple asynchronous notification mechanism via
\emph{event channels}. Each domain has a set of end-points (or
\emph{ports}) which may be bound to an event source (e.g. a physical
IRQ, a virtual IRQ, or a port in another domain). When a pair of
end-points in two different domains are bound together, then a `send'
operation on one will cause an event to be received by the destination
domain.

The control and use of event channels involves the following hypercall: 

\begin{quote}
\hypercall{event\_channel\_op(evtchn\_op\_t *op)} 

Inter-domain event-channel management; {\tt op} is a discriminated 
union which allows the following 7 operations: 

\begin{description} 

\item[\it alloc\_unbound:] allocate a free (unbound) local
  port and prepare for connection from a specified domain. 
\item[\it bind\_virq:] bind a local port to a virtual 
IRQ; any particular VIRQ can be bound to at most one port per domain. 
\item[\it bind\_pirq:] bind a local port to a physical IRQ;
once more, a given pIRQ can be bound to at most one port per
domain. Furthermore the calling domain must be sufficiently
privileged.
\item[\it bind\_interdomain:] construct an interdomain event 
channel; in general, the target domain must have previously allocated 
an unbound port for this channel, although this can be bypassed by 
privileged domains during domain setup. 
\item[\it close:] close an interdomain event channel. 
\item[\it send:] send an event to the remote end of an 
interdomain event channel. 
\item[\it status:] determine the current status of a local port. 
\end{description} 

For more details see
{\tt xen/include/public/event\_channel.h}. 

\end{quote} 
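
The idea of a discriminated union as used by {\tt
event\_channel\_op()} can be sketched as follows. Every name, command
code and payload field here is a placeholder invented for the example,
and only three of the seven operations are modelled; the real
definitions are in {\tt xen/include/public/event\_channel.h}.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Placeholder command codes, one per operation. */
enum sketch_evtchn_cmd {
    SKETCH_ALLOC_UNBOUND, SKETCH_BIND_VIRQ, SKETCH_BIND_PIRQ,
    SKETCH_BIND_INTERDOMAIN, SKETCH_CLOSE, SKETCH_SEND, SKETCH_STATUS
};

struct sketch_evtchn_op {
    enum sketch_evtchn_cmd cmd;      /* the discriminant */
    union {                          /* payload selected by cmd */
        struct { uint16_t remote_dom; uint32_t port; } alloc_unbound;
        struct { uint32_t virq;       uint32_t port; } bind_virq;
        struct { uint32_t local_port; }                send;
        /* remaining operations omitted for brevity */
    } u;
};

/* Build a 'send' request for the given local port. */
static struct sketch_evtchn_op sketch_make_send(uint32_t local_port)
{
    struct sketch_evtchn_op op;

    memset(&op, 0, sizeof(op));
    op.cmd = SKETCH_SEND;
    op.u.send.local_port = local_port;
    return op;
}

/* Local port an operation refers to, or -1 if not modelled here. */
static int64_t sketch_local_port(const struct sketch_evtchn_op *op)
{
    assert(op != NULL);
    switch (op->cmd) {
    case SKETCH_ALLOC_UNBOUND: return (int64_t)op->u.alloc_unbound.port;
    case SKETCH_BIND_VIRQ:     return (int64_t)op->u.bind_virq.port;
    case SKETCH_SEND:          return (int64_t)op->u.send.local_port;
    default:                   return -1;
    }
}
```

The consumer must always inspect the discriminant before reading the
union, since only the member matching {\tt cmd} holds meaningful data.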

Event channels are the fundamental communication primitive between 
Xen domains and seamlessly support SMP. However they provide little
bandwidth for communication {\sl per se}, and hence are typically 
married with a piece of shared memory to produce effective and 
high-performance inter-domain communication. 

Safe sharing of memory pages between guest OSes is carried out by
granting access on a per page basis to individual domains. This is
achieved by using the {\tt grant\_table\_op()} hypercall.

\begin{quote}
\hypercall{grant\_table\_op(unsigned int cmd, void *uop, unsigned int count)}

Grant or remove access to a particular page to a particular domain. 

\end{quote} 

This is not yet in wide use by guest operating systems, but 
we intend to integrate support more fully in the near future. 

\section{PCI Configuration} 

Domains with physical device access (i.e.\ driver domains) receive
limited access to certain PCI devices (bus address space and
interrupts). However, many guest operating systems attempt to 
determine the PCI configuration by directly accessing the PCI BIOS, 
which cannot be allowed for safety. 

Instead, Xen provides the following hypercall: 

\begin{quote}
\hypercall{physdev\_op(void *physdev\_op)}

Perform a PCI configuration operation; depending on the value 
of {\tt physdev\_op} this can be a PCI config read, a PCI config 
write, or a small number of other queries. 

\end{quote} 


For examples of using {\tt physdev\_op()}, see the 
Xen-specific PCI code in the Linux sparse tree. 

\section{Administrative Operations}
\label{s:dom0ops}

A large number of control operations are available to a sufficiently
privileged domain (typically domain 0). These allow the creation and
management of new domains, for example. A complete list is given 
below; for more details on any or all of these, please see 
{\tt xen/include/public/dom0\_ops.h}. 


\begin{quote}
\hypercall{dom0\_op(dom0\_op\_t *op)} 

Administrative domain operations for domain management. The options are:

\begin{description} 
\item [\it DOM0\_CREATEDOMAIN:] create a new domain

\item [\it DOM0\_PAUSEDOMAIN:] remove a domain from the scheduler run 
queue. 

\item [\it DOM0\_UNPAUSEDOMAIN:] mark a paused domain as schedulable
  once again. 

\item [\it DOM0\_DESTROYDOMAIN:] deallocate all resources associated
with a domain

\item [\it DOM0\_GETMEMLIST:] get list of pages used by the domain

\item [\it DOM0\_SCHEDCTL:]

\item [\it DOM0\_ADJUSTDOM:] adjust scheduling priorities for domain

\item [\it DOM0\_BUILDDOMAIN:] do final guest OS setup for domain

\item [\it DOM0\_GETDOMAINFO:] get statistics about the domain

\item [\it DOM0\_GETPAGEFRAMEINFO:] 

\item [\it DOM0\_GETPAGEFRAMEINFO2:]

\item [\it DOM0\_IOPL:] set I/O privilege level

\item [\it DOM0\_MSR:] read or write model specific registers

\item [\it DOM0\_DEBUG:] interactively invoke the debugger

\item [\it DOM0\_SETTIME:] set system time

\item [\it DOM0\_READCONSOLE:] read console content from hypervisor buffer ring

\item [\it DOM0\_PINCPUDOMAIN:] pin domain to a particular CPU

\item [\it DOM0\_GETTBUFS:] get information about the size and location of
                      the trace buffers (only on trace-buffer enabled builds)

\item [\it DOM0\_PHYSINFO:] get information about the host machine

\item [\it DOM0\_PCIDEV\_ACCESS:] modify PCI device access permissions

\item [\it DOM0\_SCHED\_ID:] get the ID of the current Xen scheduler

\item [\it DOM0\_SHADOW\_CONTROL:] switch between shadow page-table modes

\item [\it DOM0\_SETDOMAININITIALMEM:] set initial memory allocation of a domain

\item [\it DOM0\_SETDOMAINMAXMEM:] set maximum memory allocation of a domain

\item [\it DOM0\_SETDOMAINVMASSIST:] set domain VM assist options
\end{description} 
\end{quote} 

Most of the above are best understood by looking at the code 
implementing them (in {\tt xen/common/dom0\_ops.c}) and in 
the user-space tools that use them (mostly in {\tt tools/libxc}). 

\section{Debugging Hypercalls} 

A few additional hypercalls are mainly useful for debugging: 

\begin{quote} 
\hypercall{console\_io(int cmd, int count, char *str)}

Use Xen to interact with the console; operations are:

{\it CONSOLEIO\_write}: Output {\tt count} characters from buffer {\tt str}.

{\it CONSOLEIO\_read}: Input at most {\tt count} characters into buffer {\tt str}.
\end{quote} 

A pair of hypercalls allows access to the underlying debug registers: 
\begin{quote}
\hypercall{set\_debugreg(int reg, unsigned long value)}

Set debug register {\tt reg} to {\tt value}.

\hypercall{get\_debugreg(int reg)}

Return the contents of the debug register {\tt reg}.
\end{quote}

And finally: 
\begin{quote}
\hypercall{xen\_version(int cmd)}

Request Xen version number.
\end{quote} 

This is useful to ensure that user-space tools are in sync 
with the underlying hypervisor. 

\section{Deprecated Hypercalls}

Xen is under constant development and refinement; as such there 
are plans to improve the way in which various pieces of functionality 
are exposed to guest OSes. 

\begin{quote} 
\hypercall{vm\_assist(unsigned int cmd, unsigned int type)}

Toggle various memory management modes (in particular writable page
tables and superpage support). 

\end{quote} 

This is likely to be replaced with mode values in the shared 
information page since this is more resilient for resumption 
after migration or checkpoint.