\documentclass[11pt,twoside,final,openright]{xenstyle}
\usepackage{a4,graphicx,setspace,times}
\setstretch{1.15}

\begin{document}

% TITLE PAGE
\pagestyle{empty}
\begin{center}
\vspace*{\fill}
\includegraphics{figs/xenlogo.eps}
\vfill
\vfill
\vfill
\begin{tabular}{l}
{\Huge \bf Interface manual} \\[4mm]
{\huge Xen v2.0 for x86} \\[80mm]

{\Large Xen is Copyright (c) 2004, The Xen Team} \\[3mm]
{\Large University of Cambridge, UK} \\[20mm]
{\large Last updated on 11th March, 2004}
\end{tabular}
\vfill
\end{center}
\cleardoublepage

% TABLE OF CONTENTS
\pagestyle{plain}
\pagenumbering{roman}
{ \parskip 0pt plus 1pt
  \tableofcontents }
\cleardoublepage

% PREPARE FOR MAIN TEXT
\pagenumbering{arabic}
\raggedbottom
\widowpenalty=10000
\clubpenalty=10000
\parindent=0pt
\parskip=5pt
\renewcommand{\topfraction}{.8}
\renewcommand{\bottomfraction}{.8}
\renewcommand{\textfraction}{.2}
\renewcommand{\floatpagefraction}{.8}
\setstretch{1.15}

\chapter{Introduction}
Xen allows the hardware resources of a machine to be virtualized and
dynamically partitioned, allowing multiple different `guest'
operating system images to be run simultaneously.

Virtualizing the machine in this manner provides flexibility allowing
different users to choose their preferred operating system (Windows,
Linux, NetBSD, or a custom operating system).  Furthermore, Xen provides
secure partitioning between these `domains', and enables better resource
accounting and QoS isolation than can be achieved with a conventional
operating system.

The hypervisor runs directly on server hardware and dynamically partitions
it between a number of {\it domains}, each of which hosts an instance
of a {\it guest operating system}.  The hypervisor provides just enough
abstraction of the machine to allow effective isolation and resource 
management between these domains.

Xen essentially takes a virtual machine approach as pioneered by IBM
VM/370.  However, unlike VM/370 or more recent efforts such as VMware
and Virtual PC, Xen does not attempt to completely virtualize the
underlying hardware.  Instead, parts of the hosted guest operating
systems are modified to work with the hypervisor; the operating system
is effectively ported to a new target architecture, typically
requiring changes in just the machine-dependent code.  The user-level
API is unchanged, thus existing binaries and operating system
distributions can work unmodified.

In addition to exporting virtualized instances of CPU, memory, network and
block devices, Xen exposes a control interface to set how these resources
are shared between the running domains.  The control interface is privileged
and may only be accessed by one particular virtual machine: {\it domain0}.
This domain is a required part of any Xen-based server and runs the application
software that manages the control-plane aspects of the platform.  Running the
control software in {\it domain0}, distinct from the hypervisor itself, allows
the Xen framework to separate the notions of {\it mechanism} and {\it policy}
within the system.


\chapter{CPU state}

All privileged state must be handled by Xen.  The guest OS has no
direct access to CR3 and is not permitted to update privileged bits in
EFLAGS.

\chapter{Exceptions}
The IDT is virtualised by submitting a virtual `trap
table' to Xen.  Most trap handlers are identical to native x86
handlers.  The page-fault handler is a notable exception.

\chapter{Interrupts and events}
Interrupts are virtualized by mapping them to events, which are delivered 
asynchronously to the target domain.  A guest OS can map these events onto
its standard interrupt dispatch mechanisms, such as a simple vectoring 
scheme.  Each physical interrupt source controlled by the hypervisor, including
network devices, disks, or the timer subsystem, is responsible for identifying
the target for an incoming interrupt and sending an event to that domain.

This demultiplexing mechanism also provides a device-specific mechanism for 
event coalescing or hold-off.  For example, a guest OS may request to
receive an event only after {\it n} packets are queued ready for delivery
to it, or {\it t} nanoseconds after the first packet arrived (whichever
occurs first).  This allows latency and throughput requirements to be addressed on a
domain-specific basis.

\chapter{Time}
Guest operating systems need to be aware of the passage of real time and their
own ``virtual time'', i.e. the time they have been executing.  Furthermore, a
notion of time is required in the hypervisor itself for scheduling and the
activities that relate to it.  To this end the hypervisor provides four
notions of time: cycle counter time, system time, wall clock time, and
domain virtual time.


\section{Cycle counter time}
This provides the finest-grained, free-running time reference, with the
approximate frequency being publicly accessible.  The cycle counter time is
used to accurately extrapolate the other time references.  On SMP machines
it is currently assumed that the cycle counter time is synchronised between
CPUs.  The current x86-based implementation achieves this within inter-CPU
communication latencies.

\section{System time}
This is a 64-bit value containing the nanoseconds elapsed since boot
time.  Unlike cycle counter time, system time accurately reflects the
passage of real time, i.e. it is adjusted several times a second to correct
for clock drift.  This is done by running an NTP client in {\it domain0} on behalf of
the machine, feeding updates to the hypervisor.  Intermediate values can be
extrapolated using the cycle counter.

\section{Wall clock time}
This is the actual ``time of day'', in Unix-style {\tt struct timeval}
form (i.e. seconds and microseconds since 1 January 1970, adjusted by
leap seconds etc.).  Again, an NTP client hosted by {\it domain0} can
help maintain this value.  Guest operating systems are given this value
in place of the hardware RTC value, and can combine it with the system
time and cycle counter to set and keep an accurate clock.


\section{Domain virtual time}
This progresses at the same pace as cycle counter time, but only while a
domain is executing.  It stops while a domain is de-scheduled.  Therefore the
share of the CPU that a domain receives is indicated by the rate at which
its domain virtual time increases, relative to the rate at which cycle
counter time does so.

\section{Time interface}
Xen exports some timestamps to guest operating systems through their shared
info page.  Timestamps are provided for system time and wall-clock time.  Xen
also provides the cycle counter values at the time of the last update
allowing guests to calculate the current values.  The CPU frequency and a
scaling factor are provided for guests to convert cycle counter values to
real time.  Since all time stamps need to be updated and read
\emph{atomically}, two version numbers are also stored in the shared info
page.

Xen will ensure that the time stamps are updated frequently enough to avoid
an overflow of the cycle counter values.  A guest can check if its notion of
time is up-to-date by comparing the version numbers.
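
As an illustration, a guest can obtain a consistent snapshot with a retry
loop of the following form (a minimal C sketch; the field and helper names
here are illustrative placeholders rather than the exact shared info page
layout):

\begin{verbatim}
/* Sketch: obtain a consistent (system time, cycle counter) pair.
 * Field names are illustrative placeholders. */
void read_time(uint64_t *sys_time, uint64_t *tsc)
{
    uint32_t v1, v2;
    do {
        v1 = shared_info->time_version1;  /* version before reads */
        rmb();                            /* keep reads ordered   */
        *sys_time = shared_info->system_time;
        *tsc      = shared_info->tsc_timestamp;
        rmb();
        v2 = shared_info->time_version2;  /* version after reads  */
    } while (v1 != v2);     /* retry if Xen updated concurrently  */
}
\end{verbatim}

The current system time can then be extrapolated as the snapshot value
plus the scaled difference between the current cycle counter and the
recorded one.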

\section{Timer events}

Xen maintains a periodic timer (currently with a 10ms period) which sends a
timer event to the currently executing domain.  This allows Guest OSes to
keep track of the passing of time when executing.  The scheduler also
arranges for a newly activated domain to receive a timer event when
scheduled so that the Guest OS can adjust to the passage of time while it
has been inactive.

In addition, Xen exports a hypercall interface to each domain which allows
it to request that a timer event be sent to it at a specified system
time.  Guest OSes may use this timer to implement timeout values when they
block.
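
For example, a guest OS that wishes to sleep until a given deadline might
combine these facilities roughly as follows (a sketch only; {\tt
HYPERVISOR\_set\_timer\_op} and {\tt HYPERVISOR\_sched\_op} stand for the
guest's hypercall stubs):

\begin{verbatim}
/* Sketch: block the domain until 'deadline' (system time, in
 * nanoseconds), assuming hypercall stub wrappers in the guest. */
void block_until(uint64_t deadline)
{
    HYPERVISOR_set_timer_op(deadline);   /* request a timer event */
    HYPERVISOR_sched_op(SCHEDOP_block);  /* sleep until an event  */
}
\end{verbatim}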

\chapter{Memory}

The hypervisor is responsible for providing memory to each of the
domains running over it.  However, the Xen hypervisor's duty is
restricted to managing physical memory and to policing page table
updates.  All other memory management functions are handled
externally.  Start-of-day issues such as building initial page tables
for a domain, loading its kernel image and so on are done by the {\it
domain builder} running in user-space in {\it domain0}.  Paging to
disk and swapping is handled by the guest operating systems
themselves, if they need it.

On a Xen-based system, the hypervisor itself runs in {\it ring 0}.  It
has full access to the physical memory available in the system and is
responsible for allocating portions of it to the domains.  Guest
operating systems run in and use {\it rings 1}, {\it 2} and {\it 3} as
they see fit, aside from the fact that segmentation is used to prevent
the guest OS from accessing a portion of the linear address space that
is reserved for use by the hypervisor.  This approach allows
transitions between the guest OS and hypervisor without flushing the
TLB.  We expect most guest operating systems will use ring 1 for their
own operation and place applications (if they support such a notion)
in ring 3.

\section{Physical Memory Allocation}
The hypervisor reserves a small fixed portion of physical memory at
system boot time.  This special memory region is located at the
beginning of physical memory and is mapped at the very top of every
virtual address space.

Any physical memory that is not used directly by the hypervisor is divided into
pages and is available for allocation to domains.  The hypervisor tracks which
pages are free and which pages have been allocated to each domain.  When a new
domain is initialized, the hypervisor allocates it pages drawn from the free 
list.  The amount of memory required by the domain is passed to the hypervisor
as one of the parameters for new domain initialization by the domain builder.

Domains can never be allocated further memory beyond that which was
requested for them on initialization.  However, a domain can return
pages to the hypervisor if it discovers that its memory requirements
have diminished.

% put reasons for why pages might be returned here.
\section{Page Table Updates}
In addition to managing physical memory allocation, the hypervisor is also in
charge of performing page table updates on behalf of the domains.  This is 
necessary to prevent domains from adding arbitrary mappings to their own
page tables or introducing mappings into other domains' page tables.

\section{Writable Page Tables}
A domain can also request write access to its page tables.  In this
mode, Xen notes write attempts to page table pages and makes the page
temporarily writable.  In-use page table pages are also disconnected
from the page directory.  The domain can then update entries in these
page table pages without the assistance of Xen.  As soon as the
writable pages are to be used as page tables again, Xen makes the
pages read-only and revalidates the entries they contain.
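
A guest might enable this mode using the {\tt vm\_assist} hypercall
described in the final chapter; a plausible invocation (with {\tt
HYPERVISOR\_vm\_assist} as the guest's hypercall stub) is:

\begin{verbatim}
/* Sketch: ask Xen to enable the writable page table assist. */
HYPERVISOR_vm_assist(VMASST_CMD_enable,
                     VMASST_TYPE_writable_pagetables);
\end{verbatim}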

\section{Segment Descriptor Tables}

On boot a guest is supplied with a default GDT, which is {\em not}
taken from its own memory allocation.  If the guest wishes to use segments
other than the default `flat' ring-1 and ring-3 segments that this
table provides, it must register a custom GDT and/or LDT with Xen,
allocated from its own memory.

int {\bf set\_gdt}(unsigned long *{\em frame\_list}, int {\em entries})

{\em frame\_list}: An array of up to 16 page frames within which the
GDT resides.  Any frame registered as a GDT frame may only be mapped
read-only within the guest's address space (e.g., no writable
mappings, no use as a page-table page, and so on).

{\em entries}: The number of descriptor-entry slots in the GDT.  Note
that the table must be large enough to contain Xen's reserved entries;
thus we must have `{\em entries $>$ LAST\_RESERVED\_GDT\_ENTRY}'.
Note also that, after registering the GDT, slots {\em FIRST\_} through
{\em LAST\_RESERVED\_GDT\_ENTRY} are no longer usable by the guest and
may be overwritten by Xen.
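
As an illustration, a guest whose custom GDT occupies two of its own pages
might register it as follows (a sketch; {\tt HYPERVISOR\_set\_gdt} denotes
the guest's hypercall stub and the frame numbers are machine frame
numbers):

\begin{verbatim}
/* Sketch: register a custom GDT spanning two machine frames.
 * 'entries' must exceed LAST_RESERVED_GDT_ENTRY. */
void register_gdt(unsigned long gdt_mfn0, unsigned long gdt_mfn1,
                  int entries)
{
    unsigned long frames[16];

    frames[0] = gdt_mfn0;   /* machine frame of first GDT page  */
    frames[1] = gdt_mfn1;   /* machine frame of second GDT page */

    if (HYPERVISOR_set_gdt(frames, entries) != 0)
        panic("GDT registration failed");
}
\end{verbatim}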

\section{Pseudo-Physical Memory}
The usual problem of external fragmentation means that a domain is
unlikely to receive a contiguous stretch of physical memory.  However,
most guest operating systems do not have built-in support for
operating in a fragmented physical address space; for example, Linux
requires a one-to-one mapping for its physical memory.  Therefore a
notion of {\it pseudo physical memory} is introduced.  Xen maintains a {\it
real physical} to {\it pseudo physical} mapping which can be consulted
by every domain.  Additionally, at its start of day, a domain is
supplied a {\it pseudo physical} to {\it real physical} mapping which
it needs to keep updated itself.  From that moment onwards {\it pseudo
physical} addresses are used instead of discontiguous {\it real
physical} addresses.  Thus, the rest of the guest OS code has an
impression of operating in a contiguous address space.  Guest OS page
tables contain real physical addresses.  Mapping {\it pseudo physical}
to {\it real physical} addresses is needed on page table updates and
also when remapping memory regions within the guest OS.
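
In guest code the two mappings typically reduce to a pair of lookup
tables, as in the following sketch (the array and macro names are
illustrative, modelled on the mappings described above):

\begin{verbatim}
/* Sketch: translation between pseudo-physical frame numbers (pfn)
 * and real machine frame numbers (mfn).  Names are illustrative. */
extern unsigned long *phys_to_machine;  /* kept up to date by guest */
extern unsigned long *machine_to_phys;  /* maintained by Xen        */

#define pfn_to_mfn(pfn) (phys_to_machine[(pfn)])
#define mfn_to_pfn(mfn) (machine_to_phys[(mfn)])
\end{verbatim}

A page table update then translates the pseudo-physical address chosen by
the guest into the machine address that Xen expects.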



\chapter{Network I/O}

Virtual network device services are provided by shared memory
communications with a `backend' domain.  From the point of view of
other domains, the backend may be viewed as a virtual ethernet switch
element with each domain having one or more virtual network interfaces
connected to it.

\section{Backend Packet Handling}
The backend driver is responsible primarily for {\it data-path} operations.
In terms of networking this means packet transmission and reception.

On the transmission side, the backend needs to perform two key actions:
\begin{itemize}
\item {\tt Validation:} A domain may only be allowed to emit packets
matching a certain specification; for example, ones in which the
source IP address matches one assigned to the virtual interface over
which it is sent.  The backend would be responsible for ensuring any
such requirements are met, either by checking or by stamping outgoing
packets with prescribed values for certain fields.

Validation functions can be configured using standard firewall rules
(i.e. IP Tables, in the case of Linux).

\item {\tt Scheduling:} Since a number of domains can share a single
``real'' network interface, the hypervisor must mediate access when
several domains each have packets queued for transmission.  Of course,
this general scheduling function subsumes basic shaping or
rate-limiting schemes.

\item {\tt Logging and Accounting:} The hypervisor can be configured
with classifier rules that control how packets are accounted or
logged.  For example, {\it domain0} could request that it receives a
log message or copy of the packet whenever another domain attempts to
send a TCP packet containing a SYN.
\end{itemize}

On the receive side, the backend's role is relatively straightforward:
once a packet is received, it just needs to determine the virtual interface(s)
to which it must be delivered and deliver it via page-flipping. 


\section{Data Transfer}

Each virtual interface uses two ``descriptor rings'', one for transmit,
the other for receive.  Each descriptor identifies a block of contiguous
physical memory allocated to the domain.  There are four cases:

\begin{itemize}

\item The transmit ring carries packets to transmit from the domain to the
hypervisor.

\item The return path of the transmit ring carries ``empty'' descriptors
indicating that the contents have been transmitted and the memory can be
re-used.

\item The receive ring carries empty descriptors from the domain to the 
hypervisor; these provide storage space for that domain's received packets.

\item The return path of the receive ring carries packets that have been
received.
\end{itemize}

Real physical addresses are used throughout, with the domain performing 
translation from pseudo-physical addresses if that is necessary.

If a domain does not keep its receive ring stocked with empty buffers then 
packets destined to it may be dropped.  This provides some defense against 
receiver-livelock problems because an overloaded domain will cease to receive
further data.  Similarly, on the transmit path, it provides the application
with feedback on the rate at which packets are able to leave the system.

Synchronization between the hypervisor and the domain is achieved using 
counters held in shared memory that is accessible to both.  Each ring has
associated producer and consumer indices indicating the area in the ring
that holds descriptors that contain data.  After receiving {\it n} packets,
or {\it t} nanoseconds after receiving the first packet, the hypervisor sends
an event to the domain.
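
The producer side of such a ring can be sketched as follows (the structure
and names are illustrative and deliberately simplified, not the exact
shared ring layout):

\begin{verbatim}
/* Sketch: enqueue a descriptor on a shared ring.  The producer
 * index is advanced only after the descriptor is fully written,
 * so the consumer never sees a half-initialized slot. */
#define RING_SIZE 256
#define wmb() __asm__ __volatile__ ("" ::: "memory")

struct desc { unsigned long addr; unsigned int len; };

struct ring {
    volatile unsigned int prod, cons;    /* shared indices    */
    struct desc slots[RING_SIZE];
};

int ring_put(struct ring *r, struct desc *d)
{
    if (r->prod - r->cons == RING_SIZE)
        return -1;                        /* ring is full      */
    r->slots[r->prod % RING_SIZE] = *d;   /* write descriptor  */
    wmb();                                /* ...then publish   */
    r->prod++;
    return 0;
}
\end{verbatim}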

\chapter{Block I/O}

\section{Virtual Block Devices (VBDs)}

All guest OS disk access goes through the VBD interface.  The VBD
interface provides the administrator with the ability to selectively
grant domains access to portions of block storage devices visible to
the block backend device (usually domain 0).

VBDs can literally be backed by any block device accessible to the
backend domain, including network-based block devices (iSCSI, *NBD,
etc), loopback devices and LVM / MD devices.

Old (Xen 1.2) virtual disks are not supported under Xen 2.0, since
similar functionality can be achieved using the (more advanced) LVM
system, which is already in widespread use.

\subsection{Data Transfer}
Domains which have been granted access to a logical block device are permitted
to read and write it by shared memory communications with the backend domain. 

In overview, the same style of descriptor-ring that is used for
network packets is used here.  Each domain has one ring that carries
operation requests to the hypervisor and carries the results back
again.

Rather than copying data, the backend simply maps the domain's buffers
in order to enable direct DMA to them.  The act of mapping the buffers
also increases the reference counts of the underlying pages, so that
the unprivileged domain cannot try to return them to the hypervisor,
install them as page tables, or engage in any other unsafe behaviour.
%block API here 

\chapter{Privileged operations}
{\it Domain0} is responsible for building all other domains on the server
and providing control interfaces for managing scheduling, networking, and
block devices.

\chapter{CPU Scheduler}

Xen offers a uniform API for CPU schedulers.  It is possible to choose
from a number of schedulers at boot and it should be easy to add more.

\paragraph*{Note: SMP host support}
Xen has always supported SMP host systems.  Domains are statically assigned to
CPUs, either at creation time or by manually pinning them to a particular CPU.
The current schedulers then run locally on each CPU to decide which of the
assigned domains should be run there.

\section{Standard Schedulers}

The BVT, Atropos and Round Robin schedulers are part of the normal
Xen distribution.  BVT provides proportional fair shares of the CPU to
the running domains.  Atropos can be used to reserve absolute shares
of the CPU for each domain.  Round-robin is provided as an example of
Xen's internal scheduler API.

More information on the characteristics and use of these schedulers is
available in { \tt Sched-HOWTO.txt }.

\section{Scheduling API}

The scheduling API is used by both the schedulers described above and should
also be used by any new schedulers.  It provides a generic interface and also
implements much of the ``boilerplate'' code.

Schedulers conforming to this API are described by the following
structure:

\begin{verbatim}
struct scheduler
{
    char *name;             /* full name for this scheduler      */
    char *opt_name;         /* option name for this scheduler    */
    unsigned int sched_id;  /* ID for this scheduler             */

    int          (*init_scheduler) ();
    int          (*alloc_task)     (struct task_struct *);
    void         (*add_task)       (struct task_struct *);
    void         (*free_task)      (struct task_struct *);
    void         (*rem_task)       (struct task_struct *);
    void         (*wake_up)        (struct task_struct *);
    void         (*do_block)       (struct task_struct *);
    task_slice_t (*do_schedule)    (s_time_t);
    int          (*control)        (struct sched_ctl_cmd *);
    int          (*adjdom)         (struct task_struct *,
                                    struct sched_adjdom_cmd *);
    s32          (*reschedule)     (struct task_struct *);
    void         (*dump_settings)  (void);
    void         (*dump_cpu_state) (int);
    void         (*dump_runq_el)   (struct task_struct *);
};
\end{verbatim}

The only method that {\em must} be implemented is
{\tt do\_schedule()}.  However, if the {\tt wake\_up()} method is not
implemented then waking tasks will never be placed on the runqueue!

The fields of the above structure are described in more detail below.

\subsubsection{name}

The name field should point to a descriptive ASCII string.

\subsubsection{opt\_name}

This field is the value of the {\tt sched=} boot-time option that will select
this scheduler.

\subsubsection{sched\_id}

This is an integer that uniquely identifies this scheduler.  There should be a
macro corresponding to this scheduler ID in {\tt <hypervisor-ifs/sched-if.h>}.

\subsubsection{init\_scheduler}

\paragraph*{Purpose}

This is a function for performing any scheduler-specific initialisation.  For
instance, it might allocate memory for per-CPU scheduler data and initialise it
appropriately.

\paragraph*{Call environment}

This function is called after the initialisation performed by the generic
layer.  The function is called exactly once, for the scheduler that has been
selected.

\paragraph*{Return values}

This should return negative on failure --- this will cause an
immediate panic and the system will fail to boot.

\subsubsection{alloc\_task}

\paragraph*{Purpose}
Called when a {\tt task\_struct} is allocated by the generic scheduler
layer.  A particular scheduler implementation may use this method to
allocate per-task data for this task.  It may use the {\tt
sched\_priv} pointer in the {\tt task\_struct} to point to this data.

\paragraph*{Call environment}
The generic layer guarantees that the {\tt sched\_priv} field will
remain intact from the time this method is called until the task is
deallocated (so long as the scheduler implementation does not change
it explicitly!).

\paragraph*{Return values}
Negative on failure.

\subsubsection{add\_task}

\paragraph*{Purpose}

Called when a task is initially added by the generic layer.

\paragraph*{Call environment}

The fields in the {\tt task\_struct} are now filled out and available for use.
Schedulers should implement appropriate initialisation of any per-task private
information in this method.

\subsubsection{free\_task}

\paragraph*{Purpose}

Schedulers should free the space used by any associated private data
structures.

\paragraph*{Call environment}

This is called when a {\tt task\_struct} is about to be deallocated.
The generic layer will have done generic task removal operations and
(if implemented) called the scheduler's {\tt rem\_task} method before
this method is called.

\subsubsection{rem\_task}

\paragraph*{Purpose}

This is called when a task is being removed from scheduling (but is
not yet being freed).

\subsubsection{wake\_up}

\paragraph*{Purpose}

Called when a task is woken up, this method should put the task on the runqueue
(or do the scheduler-specific equivalent action).

\paragraph*{Call environment}

The task is already set to state RUNNING.

\subsubsection{do\_block}

\paragraph*{Purpose}

This function is called when a task is blocked.  This function should
not remove the task from the runqueue.

\paragraph*{Call environment}

The EVENTS\_MASTER\_ENABLE\_BIT is already set and the task state changed to
TASK\_INTERRUPTIBLE on entry to this method.  A call to the {\tt
  do\_schedule} method will be made after this method returns, in
order to select the next task to run.

\subsubsection{do\_schedule}

This method must be implemented.

\paragraph*{Purpose}

The method is called each time a new task must be chosen for scheduling on the
current CPU.  The current time is passed as the single argument (the current
task can be found using the {\tt current} macro).

This method should select the next task to run on this CPU and set its minimum
time to run, as well as returning the data described below.

This method should also take the appropriate action if the previous
task has blocked, e.g. removing it from the runqueue.

\paragraph*{Call environment}

The other fields in the {\tt task\_struct} are updated by the generic layer,
which also performs all Xen-specific tasks and performs the actual task switch
(unless the previous task has been chosen again).

This method is called with the {\tt schedule\_lock} held for the current CPU
and local interrupts disabled.

\paragraph*{Return values}

Must return a {\tt struct task\_slice} describing what task to run and how long
for (at maximum).
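
A trivial implementation might look like the following (a sketch against
the structure above; the {\tt task\_slice\_t} field names and helpers such
as {\tt runq\_head()} are assumptions, not part of the published API):

\begin{verbatim}
/* Sketch: always run the head of a per-CPU runqueue for a fixed
 * 10ms slice.  runq_head() is an assumed helper. */
static task_slice_t trivial_do_schedule(s_time_t now)
{
    task_slice_t ret;

    ret.task = runq_head(smp_processor_id()); /* task to run next */
    ret.time = MILLISECS(10);                 /* maximum slice    */
    return ret;
}
\end{verbatim}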

\subsubsection{control}

\paragraph*{Purpose}

This method is called for global scheduler control operations.  It takes a
pointer to a {\tt struct sched\_ctl\_cmd}, which it should either
source data from or populate with data, depending on the value of the
{\tt direction} field.

\paragraph*{Call environment}

The generic layer guarantees that when this method is called, the
caller selected the correct scheduler ID, hence the scheduler's
implementation does not need to sanity-check these parts of the call.

\paragraph*{Return values}

This function should return the value to be passed back to user space, hence it
should either be 0 or an appropriate errno value.

\subsubsection{sched\_adjdom}

\paragraph*{Purpose}

This method is called to adjust the scheduling parameters of a particular
domain, or to query their current values.  The function should check
the {\tt direction} field of the {\tt sched\_adjdom\_cmd} it receives in
order to determine which of these operations is being performed.

\paragraph*{Call environment}

The generic layer guarantees that the caller has specified the correct
control interface version and scheduler ID and that the supplied {\tt
task\_struct} will not be deallocated during the call (hence it is not
necessary to {\tt get\_task\_struct}).

\paragraph*{Return values}

This function should return the value to be passed back to user space, hence it
should either be 0 or an appropriate errno value.

\subsubsection{reschedule}

\paragraph*{Purpose}

This method is called to determine if a reschedule is required as a result of a
particular task.

\paragraph*{Call environment}
The generic layer will cause a reschedule if the current domain is the idle
task or it has exceeded its minimum time slice before a reschedule.  The
generic layer guarantees that the task passed is not currently running but is
on the runqueue.

\paragraph*{Return values}

Should return a mask of CPUs to cause a reschedule on.

\subsubsection{dump\_settings}

\paragraph*{Purpose}

If implemented, this should dump any private global settings for this
scheduler to the console.

\paragraph*{Call environment}

This function is called with interrupts enabled.

\subsubsection{dump\_cpu\_state}

\paragraph*{Purpose}

This method should dump any private settings for the specified CPU.

\paragraph*{Call environment}

This function is called with interrupts disabled and the {\tt schedule\_lock}
for the specified CPU held.

\subsubsection{dump\_runq\_el}

\paragraph*{Purpose}

This method should dump any private settings for the specified task.

\paragraph*{Call environment}

This function is called with interrupts disabled and the {\tt schedule\_lock}
for the task's CPU held.


\chapter{Debugging}

Xen provides tools for debugging both Xen and guest OSes.  Currently, the
Pervasive Debugger provides a GDB stub, which provides facilities for symbolic
debugging of Xen itself and of OS kernels running on top of Xen.  The Trace
Buffer provides a lightweight means to log data about Xen's internal state and
behaviour at runtime, for later analysis.

\section{Pervasive Debugger}

Information on using the pervasive debugger is available in {\tt pdb.txt}.


\section{Trace Buffer}

The trace buffer provides a means to observe Xen's operation from domain 0.
Trace events, inserted at key points in Xen's code, record data that can be
read by the {\tt xentrace} tool.  Recording these events has a low overhead
and hence the trace buffer may be useful for debugging timing-sensitive
behaviours.

\subsection{Internal API}

To use the trace buffer functionality from within Xen, you must {\tt \#include
<xen/trace.h>}, which contains definitions related to the trace buffer.  Trace
events are inserted into the buffer using the {\tt TRACE\_xD} ({\tt x} = 0, 1,
2, 3, 4 or 5) macros.  These all take an event number, plus {\tt x} additional
(32-bit) data as their arguments.  For trace buffer-enabled builds of Xen these
will insert the event ID and data into the trace buffer, along with the current
value of the CPU cycle-counter.  For builds without the trace buffer enabled,
the macros expand to no-ops and thus can be left in place without incurring
overheads.
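
For example, to record a hypothetical event carrying two data words at
some point in Xen (the event number and values here are purely
illustrative):

\begin{verbatim}
#include <xen/trace.h>

/* Record hypothetical event number 0x1234 with two 32-bit values;
 * this compiles to a no-op if the trace buffer is not enabled. */
TRACE_2D(0x1234, old_domain_id, new_domain_id);
\end{verbatim}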

\subsection{Trace-enabled builds}

By default, the trace buffer is enabled only in debug builds (i.e. {\tt NDEBUG}
is not defined).  It can be enabled separately by defining {\tt TRACE\_BUFFER},
either in {\tt <xen/config.h>} or on the gcc command line.

The size (in pages) of the per-CPU trace buffers can be specified using the
{\tt tbuf\_size=n } boot parameter to Xen.  If the size is set to 0, the trace
buffers will be disabled.

\subsection{Dumping trace data}

When running a trace buffer build of Xen, trace data are written continuously
into the buffer data areas, with newer data overwriting older data.  This data
can be captured using the {\tt xentrace} program in Domain 0.

The {\tt xentrace} tool uses {\tt /dev/mem} in domain 0 to map the trace
buffers into its address space.  It then periodically polls all the buffers for
new data, dumping out any new records from each buffer in turn.  As a result,
for machines with multiple (logical) CPUs, the trace buffer output will not be
in overall chronological order.

The output from {\tt xentrace} can be post-processed using {\tt
xentrace\_cpusplit} (used to split trace data out into per-cpu log files) and
{\tt xentrace\_format} (used to pretty-print trace data).  For the predefined
trace points, there is an example format file in {\tt tools/xentrace/formats }.

For more information, see the manual pages for {\tt xentrace}, {\tt
xentrace\_format} and {\tt xentrace\_cpusplit}.


\chapter{Hypervisor calls}

\section{ set\_trap\_table(trap\_info\_t *table)} 

Install trap handler table.

\section{ mmu\_update(mmu\_update\_t *req, int count, int *success\_count)} 
Update the page table for the domain. Updates can be batched.
success\_count will be updated to report the number of successful
updates.  The update types are:

{\it MMU\_NORMAL\_PT\_UPDATE}: checked update to a page table or page
directory entry.

{\it MMU\_MACHPHYS\_UPDATE}: update an entry in the machine-to-physical
table.

{\it MMU\_EXTENDED\_COMMAND}: extended commands, for example pinning and
unpinning pages as page tables.
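
A batched invocation might look as follows (a sketch; {\tt
HYPERVISOR\_mmu\_update} denotes the guest's hypercall stub, and each
{\tt mmu\_update\_t} is assumed to carry a {\tt ptr}/{\tt val} pair):

\begin{verbatim}
/* Sketch: batch two normal page table updates in one hypercall. */
mmu_update_t req[2];
int ok;

req[0].ptr = pte_machine_addr0;   /* machine address of a PTE */
req[0].val = new_pte_val0;        /* new contents for the PTE */
req[1].ptr = pte_machine_addr1;
req[1].val = new_pte_val1;

if (HYPERVISOR_mmu_update(req, 2, &ok) != 0 || ok != 2)
    /* at least one update was rejected */;
\end{verbatim}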

\section{ set\_gdt(unsigned long *frame\_list, int entries)} 
Set the global descriptor table: the virtualized equivalent of {\tt lgdt}.

\section{ stack\_switch(unsigned long ss, unsigned long esp)} 
Request a kernel stack switch from the hypervisor.

\section{ set\_callbacks(unsigned long event\_selector, unsigned long event\_address,
                        unsigned long failsafe\_selector, unsigned
 long failsafe\_address) } Register OS event processing routine.  In
 Linux both the event\_selector and failsafe\_selector are the
 kernel's CS.  The value event\_address specifies the address for an
 interrupt handler dispatch routine and failsafe\_address specifies a
 handler for application faults.
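
In a Linux guest the registration might look like this (a sketch based on
the description above; the callback symbols are the guest's own entry
points):

\begin{verbatim}
/* Sketch: register event and failsafe callbacks at guest boot. */
HYPERVISOR_set_callbacks(
    __KERNEL_CS, (unsigned long)hypervisor_callback,
    __KERNEL_CS, (unsigned long)failsafe_callback);
\end{verbatim}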

\section{ fpu\_taskswitch(void)} 
Notify the hypervisor that the FPU registers need to be saved on context switch.

\section{ sched\_op(unsigned long op)} 
Request scheduling operation from hypervisor.  The options are: {\it
yield}, {\it block}, and {\it shutdown}.  {\it yield} keeps the
calling domain runnable but may cause a reschedule if other domains
are runnable.  {\it block} removes the calling domain from the run
queue and the domain sleeps until an event is delivered to it.  {\it
shutdown} is used to end the domain's execution and allows the caller
to specify whether the domain should reboot, halt or suspend.

\section{ dom0\_op(dom0\_op\_t *op)} 
Administrative domain operations for domain management. The options are:

{\it DOM0\_CREATEDOMAIN}: create a new domain, specifying its name and memory
usage in kilobytes.

{\it DOM0\_PAUSEDOMAIN}: mark domain as unschedulable

{\it DOM0\_UNPAUSEDOMAIN}: mark domain as schedulable

{\it DOM0\_DESTROYDOMAIN}: deallocate resources associated with the domain

{\it DOM0\_GETMEMLIST}: get list of pages used by the domain

{\it DOM0\_SCHEDCTL}: invoke the global control interface of the CPU scheduler

{\it DOM0\_ADJUSTDOM}: adjust scheduling priorities for domain

{\it DOM0\_BUILDDOMAIN}: do final guest OS setup for domain

{\it DOM0\_GETDOMAINFO}: get statistics about the domain

{\it DOM0\_GETPAGEFRAMEINFO}: get information about a given page frame

{\it DOM0\_IOPL}: set IO privilege level

{\it DOM0\_MSR}: read or write machine-specific registers

{\it DOM0\_DEBUG}: interactively call pervasive debugger

{\it DOM0\_SETTIME}: set system time

{\it DOM0\_READCONSOLE}: read console content from hypervisor buffer ring

{\it DOM0\_PINCPUDOMAIN}: pin domain to a particular CPU

{\it DOM0\_GETTBUFS}: get information about the size and location of
                      the trace buffers (only on trace-buffer enabled builds)

{\it DOM0\_PHYSINFO}: get information about the host machine

{\it DOM0\_PCIDEV\_ACCESS}: modify PCI device access permissions

{\it DOM0\_SCHED\_ID}: get the ID of the current Xen scheduler

{\it DOM0\_SHADOW\_CONTROL}: control the domain's shadow page table mode

{\it DOM0\_SETDOMAINNAME}: set the name of a domain

{\it DOM0\_SETDOMAININITIALMEM}: set initial memory allocation of a domain

{\it DOM0\_SETDOMAINMAXMEM}: set maximum memory allocation of a domain

{\it DOM0\_GETPAGEFRAMEINFO2}: batched version of {\it DOM0\_GETPAGEFRAMEINFO}

{\it DOM0\_SETDOMAINVMASSIST}: set domain VM assist options


\section{ set\_debugreg(int reg, unsigned long value)}
Set debug register {\em reg} to {\em value}.

\section{ get\_debugreg(int reg)}
Get the value of debug register {\em reg}.

\section{ update\_descriptor(unsigned long ma, unsigned long word1, unsigned long word2)} 
Update the segment descriptor (e.g. in the GDT or LDT) at machine address
{\em ma} with the contents given by {\em word1} and {\em word2}.

\section{ set\_fast\_trap(int idx)}
Install a fast trap handler, allowing the guest OS to bypass the
hypervisor for the specified trap.

\section{ dom\_mem\_op(unsigned int op, unsigned long *extent\_list, unsigned long nr\_extents, unsigned int extent\_order)}
Increase or decrease the memory reservation of the guest OS.

\section{ multicall(void *call\_list, int nr\_calls)}
Execute a series of hypervisor calls in a single batch.
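
Each entry in {\em call\_list} names one hypercall and its arguments; a
sketch (assuming a {\tt multicall\_entry\_t} with an {\tt op} field and an
argument array) is:

\begin{verbatim}
/* Sketch: issue two hypercalls with a single trap into Xen. */
multicall_entry_t calls[2];

calls[0].op      = __HYPERVISOR_stack_switch;
calls[0].args[0] = new_ss;
calls[0].args[1] = new_esp;

calls[1].op      = __HYPERVISOR_fpu_taskswitch;  /* no arguments */

HYPERVISOR_multicall(calls, 2);
\end{verbatim}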

\section{ update\_va\_mapping(unsigned long page\_nr, unsigned long val, unsigned long flags)}
Update the page table entry mapping virtual page {\em page\_nr} with the
new value {\em val}; {\em flags} may request a TLB flush.

\section{ set\_timer\_op(uint64\_t timeout)} 
Request a timer event to be sent at the specified system time.

\section{ event\_channel\_op(void *op)} 
Inter-domain event-channel management.

\section{ xen\_version(int cmd)}
Request Xen version number.

\section{ console\_io(int cmd, int count, char *str)}
Interact with the console.  The operations are:

{\it CONSOLEIO\_write}: output {\em count} characters from buffer {\em str}.

{\it CONSOLEIO\_read}: input at most {\em count} characters into buffer {\em str}.
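
For example, a guest might print a message via the hypervisor console as
follows (assuming the usual hypercall stub):

\begin{verbatim}
/* Sketch: write a short message to the Xen console. */
static const char msg[] = "guest booted\n";
HYPERVISOR_console_io(CONSOLEIO_write, sizeof(msg) - 1, (char *)msg);
\end{verbatim}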

\section{ physdev\_op(void *physdev\_op)}

\section{ grant\_table\_op(unsigned int cmd, void *uop, unsigned int count)}
Perform a batch of {\em count} grant table operations, used to grant other
domains access to pages of this domain's memory.

\section{ vm\_assist(unsigned int cmd, unsigned int type)}
Enable or disable ({\em cmd}) the VM assist option given by {\em type},
for example writable page tables.

\section{ update\_va\_mapping\_otherdomain(unsigned long page\_nr, unsigned long val, unsigned long flags, uint16\_t domid)}
As {\tt update\_va\_mapping}, but applied to the address space of the
domain given by {\em domid} (restricted to privileged domains).

\end{document}