\documentclass[11pt,twoside,final,openright,a4paper]{report}
\usepackage{graphicx,html,setspace,times}
\usepackage{parskip}
\setstretch{1.15}

% LIBRARY FUNCTIONS

\newcommand{\hypercall}[1]{\vspace{2mm}{\sf #1}}

\begin{document}

% TITLE PAGE
\pagestyle{empty}
\begin{center}
\vspace*{\fill}
\includegraphics{figs/xenlogo.eps}
\vfill
\vfill
\vfill
\begin{tabular}{l}
{\Huge \bf Interface manual} \\[4mm]
{\huge Xen v3.0 for x86} \\[80mm]

{\Large Xen is Copyright (c) 2002-2005, The Xen Team} \\[3mm]
{\Large University of Cambridge, UK} \\[20mm]
\end{tabular}
\end{center}

{\bf DISCLAIMER: This documentation is always under active development
and as such there may be mistakes and omissions --- watch out for
these and please report any you find to the developers' mailing list.
The latest version is always available on-line.  Contributions of
material, suggestions and corrections are welcome.  }

\vfill
\cleardoublepage

% TABLE OF CONTENTS
\pagestyle{plain}
\pagenumbering{roman}
{ \parskip 0pt plus 1pt
  \tableofcontents }
\cleardoublepage

% PREPARE FOR MAIN TEXT
\pagenumbering{arabic}
\raggedbottom
\widowpenalty=10000
\clubpenalty=10000
\parindent=0pt
\parskip=5pt
\renewcommand{\topfraction}{.8}
\renewcommand{\bottomfraction}{.8}
\renewcommand{\textfraction}{.2}
\renewcommand{\floatpagefraction}{.8}
\setstretch{1.1}

\chapter{Introduction}

Xen allows the hardware resources of a machine to be virtualized and
dynamically partitioned, allowing multiple different {\em guest}
operating system images to be run simultaneously.  Virtualizing the
machine in this manner provides considerable flexibility, for example
allowing different users to choose their preferred operating system
(e.g., Linux, NetBSD, or a custom operating system).  Furthermore, Xen
provides secure partitioning between virtual machines (known as
{\em domains} in Xen terminology), and enables better resource
accounting and QoS isolation than can be achieved with a conventional
operating system. 

Xen essentially takes a `whole machine' virtualization approach as
pioneered by IBM VM/370.  However, unlike VM/370 or more recent
efforts such as VMware and Virtual PC, Xen does not attempt to
completely virtualize the underlying hardware.  Instead, parts of the
hosted guest operating systems are modified to work with the VMM; the
operating system is effectively ported to a new target architecture,
typically requiring changes in just the machine-dependent code.  The
user-level API is unchanged, and so existing binaries and operating
system distributions work without modification.

In addition to exporting virtualized instances of CPU, memory, network
and block devices, Xen exposes a control interface to manage how these
resources are shared between the running domains. Access to the
control interface is restricted: it may only be used by one
specially-privileged VM, known as {\em domain 0}.  This domain is a
required part of any Xen-based server and runs the application software
that manages the control-plane aspects of the platform.  Running the
control software in {\it domain 0}, distinct from the hypervisor
itself, allows the Xen framework to separate the notions of 
mechanism and policy within the system.


\chapter{Virtual Architecture}

In a Xen/x86 system, only the hypervisor runs with full processor
privileges ({\it ring 0} in the x86 four-ring model). It has full
access to the physical memory available in the system and is
responsible for allocating portions of it to running domains.  

On a 32-bit x86 system, guest operating systems may use {\it rings 1},
{\it 2} and {\it 3} as they see fit.  Segmentation is used to prevent
the guest OS from accessing the portion of the address space that is
reserved for Xen.  We expect most guest operating systems will use
ring 1 for their own operation and place applications in ring 3.

On 64-bit systems it is not possible to protect the hypervisor from
untrusted guest code running in rings 1 and 2. Guests are therefore
restricted to run in ring 3 only. The guest kernel is protected from its
applications by context switching between the kernel and currently
running application.

In this chapter we consider the basic virtual architecture provided by
Xen: CPU state, exception and interrupt handling, and time.
Other aspects such as memory and device access are discussed in later
chapters.


\section{CPU state}

All privileged state must be handled by Xen.  The guest OS has no
direct access to CR3 and is not permitted to update privileged bits in
EFLAGS. Guest OSes use \emph{hypercalls} to invoke operations in Xen;
these are analogous to system calls but occur from ring 1 to ring 0.

A list of all hypercalls is given in Appendix~\ref{a:hypercalls}.


\section{Exceptions}

A virtual IDT is provided --- a domain can submit a table of trap
handlers to Xen via the {\bf set\_trap\_table} hypercall.  The
exception stack frame presented to a virtual trap handler is identical
to its native equivalent.


\section{Interrupts and events}

Interrupts are virtualized by mapping them to \emph{event channels},
which are delivered asynchronously to the target domain using a callback
supplied via the {\bf set\_callbacks} hypercall.  A guest OS can map
these events onto its standard interrupt dispatch mechanisms.  Xen is
responsible for determining the target domain that will handle each
physical interrupt source. For more details on the binding of event
sources to event channels, see Chapter~\ref{c:devices}.


\section{Time}

Guest operating systems need to be aware of the passage of both real
(or wallclock) time and their own `virtual time' (the time for which
they have been executing). Furthermore, Xen has a notion of time which
is used for scheduling. The following notions of time are provided:

\begin{description}
\item[Cycle counter time.]

  This provides a fine-grained time reference.  The cycle counter time
  is used to accurately extrapolate the other time references.  On SMP
  machines it is currently assumed that the cycle counter time is
  synchronized between CPUs.  The current x86-based implementation
  achieves this within inter-CPU communication latencies.

\item[System time.]

  This is a 64-bit counter which holds the number of nanoseconds that
  have elapsed since system boot.

\item[Wall clock time.]

  This is the time of day in a Unix-style {\bf struct timeval}
  (seconds and microseconds since 1 January 1970, adjusted by leap
  seconds).  An NTP client hosted by {\it domain 0} can keep this
  value accurate.

\item[Domain virtual time.]

  This progresses at the same pace as system time, but only while a
  domain is executing --- it stops while a domain is de-scheduled.
  Therefore the share of the CPU that a domain receives is indicated
  by the rate at which its virtual time increases.

\end{description}


Xen exports timestamps for system time and wall-clock time to guest
operating systems through a shared page of memory.  Xen also provides
the cycle counter time at the instant the timestamps were calculated,
and the CPU frequency in Hertz.  This allows the guest to extrapolate
system and wall-clock times accurately based on the current cycle
counter time.
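
As a minimal sketch of this extrapolation, assume the guest has already
copied a consistent snapshot of the exported values out of the shared
page.  The structure {\bf time\_snapshot}, its field names and the
function {\bf extrapolate\_system\_time} are illustrative assumptions
and not the layout that Xen actually exports.

\scriptsize
\begin{verbatim}
#include <stdint.h>

/* Hypothetical snapshot of the exported values; names are illustrative. */
struct time_snapshot {
    uint64_t tsc_stamp;    /* cycle counter when the stamps were taken */
    uint64_t system_time;  /* nanoseconds since boot at tsc_stamp      */
    uint64_t cpu_hz;       /* CPU frequency in Hertz                   */
};

/* Extrapolate the current system time (ns) from the current cycle count. */
static uint64_t extrapolate_system_time(const struct time_snapshot *s,
                                        uint64_t tsc_now)
{
    uint64_t delta = tsc_now - s->tsc_stamp;   /* cycles since the stamp */

    /*
     * Split into whole seconds and a remainder so that the multiplication
     * by 10^9 cannot overflow 64 bits (safe for frequencies below roughly
     * 18 GHz, since rem < cpu_hz).
     */
    uint64_t secs = delta / s->cpu_hz;
    uint64_t rem  = delta % s->cpu_hz;

    return s->system_time + secs * 1000000000ULL
                          + (rem * 1000000000ULL) / s->cpu_hz;
}
\end{verbatim}
\normalsize

Wall-clock time could be extrapolated in the same way, by adding the
elapsed nanoseconds to the exported wall-clock stamp.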

Since all time stamps need to be updated and read \emph{atomically},
a version number is also stored in the shared info page, which is
incremented before and after updating the timestamps. A guest can
therefore be sure that it has read a consistent state by reading the
version number before and after reading the timestamps and checking
that the two values are equal and even.
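
The version check can be sketched as a retry loop.  This is only an
illustration under stated assumptions: the structure {\bf
stamped\_time} and the function {\bf read\_time} are hypothetical, and
the {\bf barrier()} macro uses GCC-style inline assembly where a real
guest would use its own barrier primitives.

\scriptsize
\begin{verbatim}
#include <stdint.h>

/* Hypothetical layout: a version counter guarding two timestamps. */
struct stamped_time {
    volatile uint32_t version;      /* odd while an update is in progress */
    volatile uint64_t system_time;
    volatile uint64_t tsc_stamp;
};

/* Compiler barrier (GCC/Clang syntax); keeps the reads in program order. */
#define barrier() __asm__ __volatile__("" ::: "memory")

/* Copy out a consistent pair of timestamps, retrying on a torn read. */
static void read_time(const struct stamped_time *src,
                      uint64_t *system_time, uint64_t *tsc_stamp)
{
    uint32_t before, after;

    do {
        before = src->version;
        barrier();
        *system_time = src->system_time;
        *tsc_stamp   = src->tsc_stamp;
        barrier();
        after = src->version;
        /* Retry if an update was in progress (odd) or completed meanwhile. */
    } while ((before & 1) || before != after);
}
\end{verbatim}
\normalsize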

Xen includes a periodic ticker which sends a timer event to the
currently executing domain every 10ms.  The Xen scheduler also sends a
timer event whenever a domain is scheduled; this allows the guest OS
to adjust for the time that has passed while it has been inactive.  In
addition, Xen allows each domain to request a timer event at a
specified system time by using the {\bf set\_timer\_op} hypercall.
Guest OSes may use this timer to implement timeout values when they
block.


\section{Xen CPU Scheduling}

Xen offers a uniform API for CPU schedulers.  It is possible to choose
from a number of schedulers at boot and it should be easy to add more.
The SEDF and Credit schedulers are part of the normal Xen
distribution.  SEDF will be going away and its use should be
avoided once the credit scheduler has stabilized and become the default.
The Credit scheduler provides proportional fair shares of the
host's CPUs to the running domains. It does this while transparently
load balancing runnable VCPUs across the whole system.

\paragraph*{Note: SMP host support}
Xen has always supported SMP host systems. When using the credit scheduler,
a domain's VCPUs will be dynamically moved across physical CPUs to maximise
domain and system throughput. VCPUs can also be manually restricted to be
mapped only on a subset of the host's physical CPUs, using the pinning
mechanism.


%% More information on the characteristics and use of these schedulers
%% is available in {\bf Sched-HOWTO.txt}.


\section{Privileged operations}

Xen exports an extended interface to privileged domains (viz.\ {\it
  Domain 0}). This allows such domains to build and boot other domains
on the server, and provides control interfaces for managing
scheduling, memory, networking, and block devices.

\chapter{Memory}
\label{c:memory} 

Xen is responsible for managing the allocation of physical memory to
domains, and for ensuring safe use of the paging and segmentation
hardware.


\section{Memory Allocation}

As well as allocating a portion of physical memory for its own private
use, Xen also reserves a small fixed portion of every virtual address
space. This is located in the top 64MB on 32-bit systems, the top
168MB on PAE systems, and a larger portion in the middle of the
address space on 64-bit systems. Unreserved physical memory is
available for allocation to domains at a page granularity.  Xen tracks
the ownership and use of each page, which allows it to enforce secure
partitioning between domains.

Each domain has a maximum and current physical memory allocation.  A
guest OS may run a `balloon driver' to dynamically adjust its current
memory allocation up to its limit.


\section{Pseudo-Physical Memory}

Since physical memory is allocated and freed at page granularity,
there is no guarantee that a domain will receive a contiguous stretch
of physical memory. However, most operating systems do not have good
support for operating in a fragmented physical address space. To aid
porting such operating systems to run on top of Xen, we make a
distinction between \emph{machine memory} and \emph{pseudo-physical
  memory}.

Put simply, machine memory refers to the entire amount of memory
installed in the machine, including that reserved by Xen, in use by
various domains, or currently unallocated. We consider machine memory
to comprise a set of 4kB \emph{machine page frames} numbered
consecutively starting from 0. Machine frame numbers mean the same
within Xen or any domain.

Pseudo-physical memory, on the other hand, is a per-domain
abstraction. It allows a guest operating system to consider its memory
allocation to consist of a contiguous range of physical page frames
starting at physical frame 0, despite the fact that the underlying
machine page frames may be sparsely allocated and in any order.

To achieve this, Xen maintains a globally readable {\it
  machine-to-physical} table which records the mapping from machine
page frames to pseudo-physical ones. In addition, each domain is
supplied with a {\it physical-to-machine} table which performs the
inverse mapping. Clearly the machine-to-physical table has size
proportional to the amount of RAM installed in the machine, while each
physical-to-machine table has size proportional to the memory
allocation of the given domain.

Architecture dependent code in guest operating systems can then use
the two tables to provide the abstraction of pseudo-physical memory.
In general, only certain specialized parts of the operating system
(such as page table management) need to understand the difference
between machine and pseudo-physical addresses.
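
The two translations can be sketched with a pair of toy tables.  The
table contents, their sizes and the helper names below are invented for
illustration; a real guest would use the tables supplied by Xen and
would validate frame numbers before indexing.

\scriptsize
\begin{verbatim}
#include <stdint.h>

#define PAGE_SHIFT   12                        /* 4kB page frames */
#define OFFSET_MASK  ((1UL << PAGE_SHIFT) - 1)
#define NR_PFNS      4                         /* toy domain: 4 pages  */
#define NR_MFNS      10                        /* toy machine: 10 pages */

/* physical-to-machine: pseudo-physical frame -> machine frame (invented). */
static const unsigned long p2m[NR_PFNS] = { 7, 2, 9, 4 };
/* machine-to-physical: filled in below as the inverse of p2m. */
static unsigned long m2p[NR_MFNS];

static void build_m2p(void)
{
    unsigned long pfn;
    for (pfn = 0; pfn < NR_PFNS; pfn++)
        m2p[p2m[pfn]] = pfn;
}

/* Translate a pseudo-physical address to a machine address. */
static uint64_t phys_to_machine(uint64_t paddr)
{
    unsigned long pfn = (unsigned long)(paddr >> PAGE_SHIFT);
    return ((uint64_t)p2m[pfn] << PAGE_SHIFT) | (paddr & OFFSET_MASK);
}

/* Translate a machine address back to a pseudo-physical address. */
static uint64_t machine_to_phys(uint64_t maddr)
{
    unsigned long mfn = (unsigned long)(maddr >> PAGE_SHIFT);
    return ((uint64_t)m2p[mfn] << PAGE_SHIFT) | (maddr & OFFSET_MASK);
}
\end{verbatim}
\normalsize

In this toy layout pseudo-physical frame 1 is backed by machine frame 2,
so pseudo-physical address 0x1234 corresponds to machine address 0x2234.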


\section{Page Table Updates}

In the default mode of operation, Xen enforces read-only access to
page tables and requires guest operating systems to explicitly request
any modifications.  Xen validates all such requests and only applies
updates that it deems safe.  This is necessary to prevent domains from
adding arbitrary mappings to their page tables.

To aid validation, Xen associates a type and reference count with each
memory page. A page has one of the following mutually-exclusive types
at any point in time: page directory ({\sf PD}), page table ({\sf
  PT}), local descriptor table ({\sf LDT}), global descriptor table
({\sf GDT}), or writable ({\sf RW}). Note that a guest OS may always
create readable mappings of its own memory regardless of its current
type.

%%% XXX: possibly explain more about ref count 'lifecyle' here?
This mechanism is used to maintain the invariants required for safety;
for example, a domain cannot have a writable mapping to any part of a
page table as this would require the page concerned to simultaneously
be of types {\sf PT} and {\sf RW}.
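
A much simplified model of this bookkeeping is sketched below.  It is
not Xen's implementation: the names {\bf page\_info}, {\bf
page\_get\_type} and {\bf page\_put\_type} are hypothetical, and a
single counter stands in for the separate type and reference counts.

\scriptsize
\begin{verbatim}
/* The mutually-exclusive page types listed in the text, plus "no type". */
enum page_type { PGT_NONE, PGT_PD, PGT_PT, PGT_LDT, PGT_GDT, PGT_RW };

struct page_info {
    enum page_type type;        /* meaningful while type_count > 0 */
    unsigned int   type_count;  /* number of uses under this type  */
};

/* Take a typed reference; returns 1 on success, 0 on a type conflict. */
static int page_get_type(struct page_info *pg, enum page_type want)
{
    if (pg->type_count == 0)
        pg->type = want;          /* an unused page may take any type */
    else if (pg->type != want)
        return 0;                 /* e.g. PT page requested as RW     */
    pg->type_count++;
    return 1;
}

/* Drop a typed reference; the page becomes untyped when none remain. */
static void page_put_type(struct page_info *pg)
{
    if (pg->type_count > 0 && --pg->type_count == 0)
        pg->type = PGT_NONE;
}
\end{verbatim}
\normalsize

Under this model a page in use as a page table cannot gain a writable
mapping until every page-table use of it has been dropped.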

\hypercall{mmu\_update(mmu\_update\_t *req, int count, int *success\_count, domid\_t domid)}

This hypercall is used to update either the domain's pagetables or
the machine-to-physical mapping table.  It supports
submitting a queue of updates, allowing batching for maximal
performance.  Explicitly queuing updates using this interface will
cause any outstanding writable pagetable state to be flushed from the
system.
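
The batching idea can be sketched as a small guest-side queue that is
flushed in a single call.  Everything here is an assumption made for
illustration: {\bf mmu\_req\_t} is a stand-in for {\bf
mmu\_update\_t}, {\bf stub\_mmu\_update} merely counts requests in
place of the real hypercall (and omits the {\bf domid} argument), and
the queue length is arbitrary.

\scriptsize
\begin{verbatim}
#include <stdint.h>

#define QUEUE_LEN 8   /* arbitrary batch size for illustration */

/* Stand-in for mmu_update_t: where to write, and the value to write. */
typedef struct { uint64_t ptr; uint64_t val; } mmu_req_t;

static mmu_req_t queue[QUEUE_LEN];
static int queued;             /* requests waiting to be submitted */
static int hypercalls_issued;  /* bookkeeping for the stub         */
static int updates_applied;

/* Stub standing in for the mmu_update hypercall; it only counts. */
static int stub_mmu_update(mmu_req_t *req, int count, int *success_count)
{
    (void)req;
    hypercalls_issued++;
    updates_applied += count;
    *success_count = count;
    return 0;
}

/* Submit everything queued so far in one call. */
static int flush_updates(void)
{
    int done = 0, rc = 0;
    if (queued > 0) {
        rc = stub_mmu_update(queue, queued, &done);
        queued = 0;
    }
    return rc;
}

/* Queue one update, flushing automatically when the queue fills. */
static int queue_update(uint64_t pte_machine_addr, uint64_t new_val)
{
    queue[queued].ptr = pte_machine_addr;
    queue[queued].val = new_val;
    if (++queued == QUEUE_LEN)
        return flush_updates();
    return 0;
}
\end{verbatim}
\normalsize

With this arrangement ten queued updates cost two calls instead of ten:
one when the queue fills and one explicit flush for the remainder.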

\section{Writable Page Tables}

Xen also provides an alternative mode of operation in which guests
have the illusion that their page tables are directly writable.  Of
course this is not really the case, since Xen must still validate
modifications to ensure secure partitioning. To this end, Xen traps
any write attempt to a memory page of type {\sf PT} (i.e., that is
currently part of a page table).  If such an access occurs, Xen
temporarily allows write access to that page while at the same time
\emph{disconnecting} it from the page table that is currently in use.
This allows the guest to safely make updates to the page because the
newly-updated entries cannot be used by the MMU until Xen revalidates
and reconnects the page.  Reconnection occurs automatically in a
number of situations: for example, when the guest modifies a different
page-table page, when the domain is preempted, or whenever the guest
uses Xen's explicit page-table update interfaces.

Writable pagetable functionality is enabled when the guest requests
it, using a {\bf vm\_assist} hypercall.  Writable pagetables do {\em
not} provide full virtualisation of the MMU, so the memory management
code of the guest still needs to be aware that it is running on Xen.
Since the guest's page tables are used directly, it must translate
pseudo-physical addresses to real machine addresses when building page
table entries.  The guest may not attempt to map its own pagetables
writably, since this would violate the memory type invariants; page
tables will automatically be made writable by the hypervisor, as
necessary.

\section{Shadow Page Tables}

Finally, Xen also supports a form of \emph{shadow page tables} in
which the guest OS uses an independent copy of page tables which are
unknown to the hardware (i.e.\ which are never pointed to by {\tt
  cr3}). Instead Xen propagates changes made to the guest's tables to
the real ones, and vice versa. This is useful for logging page writes
(e.g.\ for live migration or checkpoint). A full version of the shadow
page tables also allows guest OS porting with less effort.


\section{Segment Descriptor Tables}

At start of day a guest is supplied with a default GDT, which does not reside
within its own memory allocation.  If the guest wishes to use other
than the default `flat' ring-1 and ring-3 segments that this GDT
provides, it must register a custom GDT and/or LDT with Xen, allocated
from its own memory.

The following hypercall is used to specify a new GDT:

\begin{quote}
  int {\bf set\_gdt}(unsigned long *{\em frame\_list}, int {\em
    entries})

  \emph{frame\_list}: An array of up to 14 machine page frames within
  which the GDT resides.  Any frame registered as a GDT frame may only
  be mapped read-only within the guest's address space (e.g., no
  writable mappings, no use as a page-table page, and so on). Only 14
  pages may be specified because pages 15 and 16 are reserved for
  the hypervisor's GDT entries.

  \emph{entries}: The number of descriptor-entry slots in the GDT.
\end{quote}
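
The relationship between \emph{entries} and the length of
\emph{frame\_list} can be worked through as follows.  The sketch
assumes 8-byte x86 segment descriptors and 4kB frames, giving 512
descriptors per frame; the helper name {\bf gdt\_frames\_needed} is
hypothetical.

\scriptsize
\begin{verbatim}
#define GDT_PAGE_SIZE   4096u  /* 4kB machine page frames              */
#define GDT_DESC_SIZE   8u     /* bytes per x86 segment descriptor     */
#define MAX_GDT_FRAMES  14u    /* limit on frame_list stated above     */

/* Frames needed to hold `entries` descriptors, or -1 if over the limit. */
static int gdt_frames_needed(unsigned int entries)
{
    unsigned int per_frame = GDT_PAGE_SIZE / GDT_DESC_SIZE;   /* 512 */
    unsigned int frames = (entries + per_frame - 1) / per_frame;

    return frames > MAX_GDT_FRAMES ? -1 : (int)frames;
}
\end{verbatim}
\normalsize

Under these assumptions the largest GDT a guest could register holds
$14 \times 512 = 7168$ descriptor slots.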

The LDT is updated via the generic MMU update mechanism (i.e., via the
{\bf mmu\_update} hypercall).

\section{Start of Day}

The start-of-day environment for guest operating systems is rather
different to that provided by the underlying hardware. In particular,
the processor is already executing in protected mode with paging
enabled.

{\it Domain 0} is created and booted by Xen itself. For all subsequent
domains, the analogue of the boot-loader is the {\it domain builder},
user-space software running in {\it domain 0}. The domain builder is
responsible for building the initial page tables for a domain and
loading its kernel image at the appropriate virtual address.

\section{VM assists}

Xen provides a number of ``assists'' for guest memory management.
These are available on an ``opt-in'' basis to provide commonly-used
extra functionality to a guest.

\hypercall{vm\_assist(unsigned int cmd, unsigned int type)}

The {\bf cmd} parameter describes the action to be taken, whilst the
{\bf type} parameter describes the kind of assist that is being
referred to.  Available commands are as follows:

\begin{description}
\item[VMASST\_CMD\_enable] Enable a particular assist type
\item[VMASST\_CMD\_disable] Disable a particular assist type
\end{description}

And the available types are:

\begin{description}
\item[VMASST\_TYPE\_4gb\_segments] Provide emulated support for
  instructions that rely on 4GB segments (such as the techniques used
  by some TLS solutions).
\item[VMASST\_TYPE\_4gb\_segments\_notify] Provide a callback (via trap number
  15) to the guest if the above segment fixups are used: allows the guest to
  display a warning message during boot.
\item[VMASST\_TYPE\_writable\_pagetables] Enable writable pagetable
  mode (described above).
\end{description}
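
The opt-in semantics of the command and type parameters can be modelled
as a per-domain set of flags.  This is a toy model only: the enumerator
names and values below are placeholders rather than the definitions in
Xen's public headers, and {\bf model\_vm\_assist} is a hypothetical
function that stands in for the hypervisor side of the call.

\scriptsize
\begin{verbatim}
/* Placeholder commands and types; real values come from Xen's headers. */
enum { CMD_ENABLE, CMD_DISABLE };
enum { TYPE_4GB_SEGMENTS, TYPE_4GB_SEGMENTS_NOTIFY,
       TYPE_WRITABLE_PAGETABLES, TYPE_COUNT };

static unsigned long assists;   /* one bit per currently enabled assist */

/* Toy model of vm_assist(cmd, type): 0 on success, -1 on bad arguments. */
static int model_vm_assist(unsigned int cmd, unsigned int type)
{
    if (type >= TYPE_COUNT)
        return -1;

    switch (cmd) {
    case CMD_ENABLE:
        assists |= 1UL << type;
        return 0;
    case CMD_DISABLE:
        assists &= ~(1UL << type);
        return 0;
    default:
        return -1;
    }
}

/* Query helper: non-zero if the given assist is currently enabled. */
static int assist_enabled(unsigned int type)
{
    return type < TYPE_COUNT && ((assists >> type) & 1UL);
}
\end{verbatim}
\normalsize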


\chapter{Xen Info Pages}

The {\bf Shared info page} is used to share various CPU-related state
between the guest OS and the hypervisor.  This information includes VCPU
status, time information and event channel (virtual interrupt) state.
The {\bf Start info page} is used to pass build-time information to
the guest when it boots and when it is resumed from a suspended state.
This chapter documents the fields included in the {\bf
shared\_info\_t} and {\bf start\_info\_t} structures for use by the
guest OS.

\section{Shared info page}

The {\bf shared\_info\_t} is accessed at run time by both Xen and the
guest OS.  It is used to pass information relating to the
virtual CPU and virtual machine state between the OS and the
hypervisor.

The structure is declared in {\bf xen/include/public/xen.h}:

\scriptsize
\begin{verbatim}
typedef struct shared_info {
    vcpu_info_t vcpu_info[XEN_LEGACY_MAX_VCPUS];

    /*
     * A domain can create "event channels" on which it can send and receive
     * asynchronous event notifications. There are three classes of event that
     * are delivered by this mechanism:
     *  1. Bi-directional inter- and intra-domain connections. Domains must
     *     arrange out-of-band to set up a connection (usually by allocating
     *     an unbound 'listener' port and advertising that via a storage service
     *     such as xenstore).
     *  2. Physical interrupts. A domain with suitable hardware-access
     *     privileges can bind an event-channel port to a physical interrupt
     *     source.
     *  3. Virtual interrupts ('events'). A domain can bind an event-channel
     *     port to a virtual interrupt source, such as the virtual-timer
     *     device or the emergency console.
     * 
     * Event channels are addressed by a "port index". Each channel is
     * associated with two bits of information:
     *  1. PENDING -- notifies the domain that there is a pending notification
     *     to be processed. This bit is cleared by the guest.
     *  2. MASK -- if this bit is clear then a 0->1 transition of PENDING
     *     will cause an asynchronous upcall to be scheduled. This bit is only
     *     updated by the guest. It is read-only within Xen. If a channel
     *     becomes pending while the channel is masked then the 'edge' is lost
     *     (i.e., when the channel is unmasked, the guest must manually handle
     *     pending notifications as no upcall will be scheduled by Xen).
     * 
     * To expedite scanning of pending notifications, any 0->1 pending
     * transition on an unmasked channel causes a corresponding bit in a
     * per-vcpu selector word to be set. Each bit in the selector covers a
     * 'C long' in the PENDING bitfield array.
     */
    unsigned long evtchn_pending[sizeof(unsigned long) * 8];
    unsigned long evtchn_mask[sizeof(unsigned long) * 8];

    /*
     * Wallclock time: updated only by control software. Guests should base
     * their gettimeofday() syscall on this wallclock-base value.
     */
    uint32_t wc_version;      /* Version counter: see vcpu_time_info_t. */
    uint32_t wc_sec;          /* Secs  00:00:00 UTC, Jan 1, 1970.  */
    uint32_t wc_nsec;         /* Nsecs 00:00:00 UTC, Jan 1, 1970.  */

    arch_shared_info_t arch;

} shared_info_t;
\end{verbatim}
\normalsize

\begin{description}
\item[vcpu\_info] An array of {\bf vcpu\_info\_t} structures, each of
  which either holds runtime information about a virtual CPU or is
  ``empty'' if the corresponding VCPU does not exist.
\item[evtchn\_pending] Guest-global array, with one bit per event
  channel.  Bits are set if an event is currently pending on that
  channel.
\item[evtchn\_mask] Guest-global array for masking notifications on
  event channels.
\item[wc\_version] Version counter for current wallclock time.
\item[wc\_sec] Whole seconds component of current wallclock time.
\item[wc\_nsec] Nanoseconds component of current wallclock time.
\item[arch] Host architecture-dependent portion of the shared info
  structure.
\end{description}

\subsection{vcpu\_info\_t}

\scriptsize
\begin{verbatim}
typedef struct vcpu_info {
    /*
     * 'evtchn_upcall_pending' is written non-zero by Xen to indicate
     * a pending notification for a particular VCPU. It is then cleared 
     * by the guest OS /before/ checking for pending work, thus avoiding
     * a set-and-check race. Note that the mask is only accessed by Xen
     * on the CPU that is currently hosting the VCPU. This means that the
     * pending and mask flags can be updated by the guest without special
     * synchronisation (i.e., no need for the x86 LOCK prefix).
     * This may seem suboptimal because if the pending flag is set by
     * a different CPU then an IPI may be scheduled even when the mask
     * is set. However, note:
     *  1. The task of 'interrupt holdoff' is covered by the per-event-
     *     channel mask bits. A 'noisy' event that is continually being
     *     triggered can be masked at source at this very precise
     *     granularity.
     *  2. The main purpose of the per-VCPU mask is therefore to restrict
     *     reentrant execution: whether for concurrency control, or to
     *     prevent unbounded stack usage. Whatever the purpose, we expect
     *     that the mask will be asserted only for short periods at a time,
     *     and so the likelihood of a 'spurious' IPI is suitably small.
     * The mask is read before making an event upcall to the guest: a
     * non-zero mask therefore guarantees that the VCPU will not receive
     * an upcall activation. The mask is cleared when the VCPU requests
     * to block: this avoids wakeup-waiting races.
     */
    uint8_t evtchn_upcall_pending;
    uint8_t evtchn_upcall_mask;
    unsigned long evtchn_pending_sel;
    arch_vcpu_info_t arch;
    vcpu_time_info_t time;
} vcpu_info_t; /* 64 bytes (x86) */
\end{verbatim}
\normalsize

\begin{description}
\item[evtchn\_upcall\_pending] This is set non-zero by Xen to indicate
  that there are pending events to be received.
\item[evtchn\_upcall\_mask] This is set non-zero to disable all
  interrupts for this CPU for short periods of time.  If individual
  event channels need to be masked, the {\bf evtchn\_mask} in the {\bf
  shared\_info\_t} is used instead.
\item[evtchn\_pending\_sel] When an event is delivered to this VCPU, a
  bit is set in this selector to indicate which word of the {\bf
  evtchn\_pending} array in the {\bf shared\_info\_t} contains the
  event in question.
\item[arch] Architecture-specific VCPU info. On x86 this contains the
  virtualized CR2 register (page fault linear address) for this VCPU.
\item[time] Time values for this VCPU.
\end{description}
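
The interaction between the selector word and the pending and mask
arrays can be illustrated by a minimal sketch of the scan a guest
might perform on an upcall.  The {\tt demo\_} structures and function
are hypothetical, cut-down stand-ins rather than the definitions from
the public headers, and the plain loads and stores stand in for the
atomic operations a real guest would need.

\scriptsize
\begin{verbatim}
#include <stddef.h>

#define DEMO_BITS (sizeof(unsigned long) * 8)

/* Hypothetical, cut-down stand-ins for shared_info_t and vcpu_info_t. */
struct demo_shared {
    unsigned long evtchn_pending[DEMO_BITS];
    unsigned long evtchn_mask[DEMO_BITS];
};

struct demo_vcpu {
    unsigned char evtchn_upcall_pending;
    unsigned long evtchn_pending_sel;
};

/*
 * Collect every pending, unmasked port into ports[] (at most max
 * entries) and return how many were collected.  If ports[] fills up,
 * the remaining events stay pending and the selector is re-armed so
 * that a second call can pick them up.
 */
size_t demo_scan_events(struct demo_shared *s, struct demo_vcpu *v,
                        unsigned int *ports, size_t max)
{
    size_t n = 0;
    unsigned long sel;

    v->evtchn_upcall_pending = 0;   /* clear before checking for work */
    sel = v->evtchn_pending_sel;    /* real guests: atomic exchange   */
    v->evtchn_pending_sel = 0;

    for (size_t w = 0; w < DEMO_BITS; w++) {
        unsigned long ready;

        if (!(sel & (1UL << w)))
            continue;               /* selector says: nothing in word w */
        ready = s->evtchn_pending[w] & ~s->evtchn_mask[w];
        for (size_t b = 0; b < DEMO_BITS; b++) {
            if (!(ready & (1UL << b)))
                continue;
            if (n == max) {
                /* Re-arm selector bits for word w and above. */
                v->evtchn_pending_sel |= sel & ~((1UL << w) - 1);
                return n;
            }
            s->evtchn_pending[w] &= ~(1UL << b);  /* guest clears PENDING */
            ports[n++] = (unsigned int)(w * DEMO_BITS + b);
        }
    }
    return n;
}
\end{verbatim}
\normalsize

Masked channels are skipped and left pending, which matches the rule
that the guest must handle them itself once it unmasks the channel.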

\subsection{vcpu\_time\_info}

\scriptsize
\begin{verbatim}
typedef struct vcpu_time_info {
    /*
     * Updates to the following values are preceded and followed by an
     * increment of 'version'. The guest can therefore detect updates by
     * looking for changes to 'version'. If the least-significant bit of
     * the version number is set then an update is in progress and the guest
     * must wait to read a consistent set of values.
     * The correct way to interact with the version number is similar to
     * Linux's seqlock: see the implementations of read_seqbegin/read_seqretry.
     */
    uint32_t version;
    uint32_t pad0;
    uint64_t tsc_timestamp;   /* TSC at last update of time vals.  */
    uint64_t system_time;     /* Time, in nanosecs, since boot.    */
    /*
     * Current system time:
     *   system_time + ((tsc - tsc_timestamp) << tsc_shift) * tsc_to_system_mul
     * CPU frequency (Hz):
     *   ((10^9 << 32) / tsc_to_system_mul) >> tsc_shift
     */
    uint32_t tsc_to_system_mul;
    int8_t   tsc_shift;
    int8_t   pad1[3];
} vcpu_time_info_t; /* 32 bytes */
\end{verbatim}
\normalsize

\begin{description}
\item[version] Used to ensure the guest gets consistent time updates.
\item[tsc\_timestamp] Cycle counter timestamp of last time value;
  could be used to extrapolate in between updates, for instance.
\item[system\_time] Time since boot (nanoseconds).
\item[tsc\_to\_system\_mul] Cycle counter to nanoseconds multiplier
(used in extrapolating current time).
\item[tsc\_shift] Cycle counter to nanoseconds shift (used in
extrapolating current time).
\end{description}
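
The version protocol and the extrapolation formula above can be
combined into a short sketch of how a guest might read a consistent
time value.  This is an illustration under stated assumptions, not the
code used by any particular guest: {\tt struct demo\_time\_info} is a
local copy of the layout, {\tt \_\_sync\_synchronize()} is a GCC/Clang
builtin standing in for the guest's own read barrier, a negative
{\tt tsc\_shift} is assumed to mean a right shift, and the product is
assumed to be scaled down by $2^{32}$ (treating
{\tt tsc\_to\_system\_mul} as a 32-bit binary fraction, as the CPU
frequency formula suggests).

\scriptsize
\begin{verbatim}
#include <stdint.h>

/* Hypothetical local copy of the vcpu_time_info layout shown above. */
struct demo_time_info {
    volatile uint32_t version;
    uint32_t pad0;
    uint64_t tsc_timestamp;
    uint64_t system_time;
    uint32_t tsc_to_system_mul;
    int8_t   tsc_shift;
    int8_t   pad1[3];
};

/* Convert a TSC delta to nanoseconds: ((delta << shift) * mul) / 2^32. */
uint64_t demo_scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
    if (shift < 0)
        delta >>= -shift;           /* assumption: negative = right shift */
    else
        delta <<= shift;
    /* 64x32-bit multiply and >> 32, split so no 128-bit type is needed. */
    return (delta >> 32) * mul + (((delta & 0xffffffffu) * mul) >> 32);
}

/* Seqlock-style read: retry while an update is in progress or raced us. */
uint64_t demo_system_time(const struct demo_time_info *t, uint64_t tsc_now)
{
    uint32_t v, mul;
    uint64_t stamp, base;
    int8_t shift;

    do {
        v = t->version;
        __sync_synchronize();       /* read version before the payload */
        stamp = t->tsc_timestamp;
        base  = t->system_time;
        mul   = t->tsc_to_system_mul;
        shift = t->tsc_shift;
        __sync_synchronize();       /* read payload before the re-check */
    } while ((v & 1) || v != t->version);

    return base + demo_scale_delta(tsc_now - stamp, mul, shift);
}
\end{verbatim}
\normalsize

For instance, with a multiplier of {\tt 0x80000000} (one half) and a
shift of zero, 1000 elapsed cycles add 500 nanoseconds to
{\tt system\_time}.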

\subsection{arch\_shared\_info\_t}

On x86, the {\bf arch\_shared\_info\_t} is defined as follows (from
xen/public/arch-x86\_32.h):

\scriptsize
\begin{verbatim}
typedef struct arch_shared_info {
    unsigned long max_pfn;                  /* max pfn that appears in table */
    /* Frame containing list of mfns containing list of mfns containing p2m. */
    unsigned long pfn_to_mfn_frame_list_list; 
} arch_shared_info_t;
\end{verbatim}
\normalsize

\begin{description}
\item[max\_pfn] The maximum PFN listed in the physical-to-machine
  mapping table (P2M table).
\item[pfn\_to\_mfn\_frame\_list\_list] Machine address of the frame
  that contains the machine addresses of the P2M table frames.
\end{description}

\section{Start info page}

The start info structure is declared as follows (in {\bf
xen/include/public/xen.h}):

\scriptsize
\begin{verbatim}
#define MAX_GUEST_CMDLINE 1024
typedef struct start_info {
    /* THE FOLLOWING ARE FILLED IN BOTH ON INITIAL BOOT AND ON RESUME.    */
    char magic[32];             /* "Xen-<version>.<subversion>". */
    unsigned long nr_pages;     /* Total pages allocated to this domain.  */
    unsigned long shared_info;  /* MACHINE address of shared info struct. */
    uint32_t flags;             /* SIF_xxx flags.                         */
    unsigned long store_mfn;    /* MACHINE page number of shared page.    */
    uint32_t store_evtchn;      /* Event channel for store communication. */
    unsigned long console_mfn;  /* MACHINE address of console page.       */
    uint32_t console_evtchn;    /* Event channel for console messages.    */
    /* THE FOLLOWING ARE ONLY FILLED IN ON INITIAL BOOT (NOT RESUME).     */
    unsigned long pt_base;      /* VIRTUAL address of page directory.     */
    unsigned long nr_pt_frames; /* Number of bootstrap p.t. frames.       */
    unsigned long mfn_list;     /* VIRTUAL address of page-frame list.    */
    unsigned long mod_start;    /* VIRTUAL address of pre-loaded module.  */
    unsigned long mod_len;      /* Size (bytes) of pre-loaded module.     */
    int8_t cmd_line[MAX_GUEST_CMDLINE];
} start_info_t;
\end{verbatim}
\normalsize

The fields are in two groups: the first group is always filled in
when a domain is booted or resumed, whereas the second group is only
filled in at boot time.

The always-available group is as follows:

\begin{description}
\item[magic] A text string identifying the Xen version to the guest.
\item[nr\_pages] The number of real machine pages available to the
  guest.
\item[shared\_info] Machine address of the shared info structure,
  allowing the guest to map it during initialisation.
\item[flags] Flags for describing optional extra settings to the
  guest.
\item[store\_mfn] Machine page number of the Xenstore communications page.
\item[store\_evtchn] Event channel to communicate with the store.
\item[console\_mfn] Machine address of the console data page.
\item[console\_evtchn] Event channel to notify the console backend.
\end{description}

The boot-only group may only be safely referred to during system boot:

\begin{description}
\item[pt\_base] Virtual address of the page directory created for us
  by the domain builder.
\item[nr\_pt\_frames] Number of frames used by the builder's bootstrap
  pagetables.
\item[mfn\_list] Virtual address of the list of machine frames this
  domain owns.
\item[mod\_start] Virtual address of any pre-loaded modules
  (e.g. ramdisk).
\item[mod\_len] Size of pre-loaded module (if any).
\item[cmd\_line] Kernel command line passed by the domain builder.
\end{description}


% by Mark Williamson <mark.williamson@cl.cam.ac.uk>

\chapter{Event Channels}
\label{c:eventchannels}

Event channels are the basic primitive provided by Xen for event
notifications.  An event is the Xen equivalent of a hardware
interrupt.  Each channel essentially stores one bit of information;
the event of interest is signalled by transitioning this bit from 0 to 1.

Notifications are received by a guest via an upcall from Xen,
indicating when an event arrives (setting the bit).  Further
notifications are masked until the bit is cleared again (therefore,
guests must check the value of the bit after re-enabling event
delivery to ensure no missed notifications).

Event notifications can be masked by setting a flag; this is
equivalent to disabling interrupts and can be used to ensure atomicity
of certain operations in the guest kernel.
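
The need to re-check for pending events after re-enabling delivery can
be sketched as follows.  The structure and function names are
hypothetical; the two flags mirror the per-VCPU
{\bf evtchn\_upcall\_pending} and {\bf evtchn\_upcall\_mask} fields,
and {\tt \_\_sync\_synchronize()} is a GCC/Clang builtin used here in
place of the guest's own memory barrier.

\scriptsize
\begin{verbatim}
/* Hypothetical cut-down copy of the per-VCPU event flags. */
struct demo_vcpu_flags {
    volatile unsigned char evtchn_upcall_pending;
    volatile unsigned char evtchn_upcall_mask;
};

/* Equivalent of "disable interrupts": no upcalls while the mask is set. */
void demo_disable_events(struct demo_vcpu_flags *v)
{
    v->evtchn_upcall_mask = 1;
    __sync_synchronize();   /* mask must be set before the critical section */
}

/*
 * Equivalent of "enable interrupts".  Returns 1 if an event arrived
 * while delivery was masked; the caller must then process pending
 * events itself, because no upcall was made for them.
 */
int demo_enable_events(struct demo_vcpu_flags *v)
{
    v->evtchn_upcall_mask = 0;
    __sync_synchronize();   /* unmask must be visible before the check */
    return v->evtchn_upcall_pending != 0;
}
\end{verbatim}
\normalsize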

\section{Hypercall interface}

\hypercall{event\_channel\_op(evtchn\_op\_t *op)}

The event channel operation hypercall is used for all operations on
event channels / ports.  Operations are distinguished by the value of
the {\bf cmd} field of the {\bf op} structure.  The possible commands
are described below:

\begin{description}

\item[EVTCHNOP\_alloc\_unbound]
  Allocate a new event channel port, ready to be connected to by a
  remote domain.
  \begin{itemize}
  \item Specified domain must exist.
  \item A free port must exist in that domain.
  \end{itemize}
  Unprivileged domains may only allocate their own ports; privileged
  domains may also allocate ports in other domains.
\item[EVTCHNOP\_bind\_interdomain]
  Bind an event channel for interdomain communications.
  \begin{itemize}
  \item Caller domain must have a free port to bind.
  \item Remote domain must exist.
  \item Remote port must be allocated and currently unbound.
  \item Remote port must be expecting the caller domain as the ``remote''.
  \end{itemize}
\item[EVTCHNOP\_bind\_virq]
  Allocate a port and bind a VIRQ to it.
  \begin{itemize}
  \item Caller domain must have a free port to bind.
  \item VIRQ must be valid.
  \item VCPU must exist.
  \item VIRQ must not currently be bound to an event channel.
  \end{itemize}
\item[EVTCHNOP\_bind\_ipi]
  Allocate and bind a port for notifying other virtual CPUs.
  \begin{itemize}
  \item Caller domain must have a free port to bind.
  \item VCPU must exist.
  \end{itemize}
\item[EVTCHNOP\_bind\_pirq]
  Allocate and bind a port to a real IRQ.
  \begin{itemize}
  \item Caller domain must have a free port to bind.
  \item PIRQ must be within the valid range.
  \item Another binding for this PIRQ must not exist for this domain.
  \end{itemize}
\item[EVTCHNOP\_close]
  Close an event channel (no more events will be received).
  \begin{itemize}
  \item Port must be valid (currently allocated).
  \end{itemize}
\item[EVTCHNOP\_send] Send a notification on an event channel attached
  to a port.
  \begin{itemize}
  \item Port must be valid.
  \item Only valid for Interdomain, IPI or Allocated Unbound ports.
  \end{itemize}
\item[EVTCHNOP\_status] Query the status of a port; what kind of port,
  whether it is bound, what remote domain is expected, what PIRQ or
  VIRQ it is bound to, what VCPU will be notified, etc.
  Unprivileged domains may only query the state of their own ports.
  Privileged domains may query any port.
\item[EVTCHNOP\_bind\_vcpu] Bind an event channel to a particular VCPU,
  so that notification upcalls are received only on that VCPU.
  \begin{itemize}
  \item VCPU must exist.
  \item Port must be valid.
  \item Event channel must be one of: allocated but unbound, bound to
  an interdomain event channel, or bound to a PIRQ.
  \end{itemize}

\end{description}
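
The preconditions listed for {\bf EVTCHNOP\_alloc\_unbound} and
{\bf EVTCHNOP\_bind\_interdomain} can be made concrete with a toy model
of a per-domain port table.  This is only a sketch of the checks
described above: every type and function here is hypothetical, it does
not use the real {\bf evtchn\_op\_t} layout, and it ignores details
such as privilege checks and reserved ports.

\scriptsize
\begin{verbatim}
#include <stddef.h>

#define DEMO_NR_PORTS 8

enum demo_state { DEMO_FREE, DEMO_UNBOUND, DEMO_INTERDOMAIN };

struct demo_port {
    enum demo_state state;
    int remote_dom;     /* domain expected (or connected) at the far end */
    int remote_port;    /* far-end port once connected, otherwise -1     */
};

struct demo_domain {
    int id;
    struct demo_port ports[DEMO_NR_PORTS];
};

static int demo_find_free(const struct demo_domain *d)
{
    for (int p = 0; p < DEMO_NR_PORTS; p++)
        if (d->ports[p].state == DEMO_FREE)
            return p;
    return -1;
}

/* alloc_unbound: reserve a port in d for remote_dom to connect to. */
int demo_alloc_unbound(struct demo_domain *d, int remote_dom)
{
    int p;

    if (d == NULL || (p = demo_find_free(d)) < 0)
        return -1;      /* domain must exist and have a free port */
    d->ports[p] = (struct demo_port){ DEMO_UNBOUND, remote_dom, -1 };
    return p;
}

/* bind_interdomain: connect a fresh local port to remote's unbound port. */
int demo_bind_interdomain(struct demo_domain *caller,
                          struct demo_domain *remote, int remote_port)
{
    struct demo_port *rp;
    int lp;

    if (remote == NULL)
        return -1;      /* remote domain must exist */
    if (remote_port < 0 || remote_port >= DEMO_NR_PORTS)
        return -1;
    rp = &remote->ports[remote_port];
    if (rp->state != DEMO_UNBOUND)
        return -1;      /* must be allocated and currently unbound */
    if (rp->remote_dom != caller->id)
        return -1;      /* must be expecting the caller as "remote" */
    if ((lp = demo_find_free(caller)) < 0)
        return -1;      /* caller needs a free port to bind */

    caller->ports[lp] =
        (struct demo_port){ DEMO_INTERDOMAIN, remote->id, remote_port };
    rp->state = DEMO_INTERDOMAIN;
    rp->remote_port = lp;
    return lp;
}
\end{verbatim}
\normalsize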

%%
%% grant_tables.tex
%% 
%% Made by Mark Williamson
%% Login   <mark@maw48>
%%

\chapter{Grant tables}
\label{c:granttables}

Xen's grant tables provide a generic mechanism for memory sharing
between domains.  This shared memory interface underpins the split
device drivers for block and network IO.

Each domain has its own {\bf grant table}.  This is a data structure
that is shared with Xen; it allows the domain to tell Xen what kind of
permissions other domains have on its pages.  Entries in the grant
table are identified by {\bf grant references}.  A grant reference is
an integer, which indexes into the grant table.  It acts as a
capability which the grantee can use to perform operations on the
granter's memory.

This capability-based system allows shared-memory communications
between unprivileged domains.  A grant reference also encapsulates the
details of a shared page, removing the need for a domain to know the
real machine address of a page it is sharing.  This makes it possible
to share memory correctly with domains running in fully virtualised
memory.

\section{Interface}

\subsection{Grant table manipulation}

Creating and destroying grant references is done by direct access to
the grant table.  This removes the need to involve Xen when creating
grant references, modifying access permissions, etc.  The grantee
domain will invoke hypercalls to use the grant references.  Four main
operations can be accomplished by directly manipulating the table:

\begin{description}
\item[Grant foreign access] allocate a new entry in the grant table
  and fill out the access permissions accordingly.  The access
  permissions will be looked up by Xen when the grantee attempts to
  use the reference to map the granted frame.
\item[End foreign access] check that the grant reference is not
  currently in use, then remove the mapping permissions for the frame.
  This prevents further mappings from taking place but does not allow
  forced revocations of existing mappings.
\item[Grant foreign transfer] allocate a new entry in the table
  specifying transfer permissions for the grantee.  Xen will look up
  this entry when the grantee attempts to transfer a frame to the
  granter.
\item[End foreign transfer] remove permissions to prevent a transfer
  occurring in future.  If the transfer is already committed,
  modifying the grant table cannot prevent it from completing.
\end{description}
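
The first two operations can be sketched as direct manipulation of a
table entry.  The entry layout and flag values below are assumptions
made for illustration (they are modelled on, but not copied from, the
public grant table header), and the function names are hypothetical.
The sketch uses C11 atomics to show the two ordering points that
matter: the flags are published last when granting, and are only
cleared when the entry is not in use.

\scriptsize
\begin{verbatim}
#include <stdatomic.h>
#include <stdint.h>

/* Assumed flag values, for illustration only. */
#define DEMO_GTF_permit_access 1u
#define DEMO_GTF_readonly      (1u << 2)
#define DEMO_GTF_reading       (1u << 3)   /* set while grantee reads  */
#define DEMO_GTF_writing       (1u << 4)   /* set while grantee writes */

/* Assumed entry layout: flags, grantee domain id, frame number. */
struct demo_grant_entry {
    _Atomic uint16_t flags;
    uint16_t domid;
    uint32_t frame;
};

/* Grant foreign access: fill in the entry, then publish the flags. */
void demo_grant_access(struct demo_grant_entry *e, uint16_t domid,
                       uint32_t frame, int readonly)
{
    uint16_t f = DEMO_GTF_permit_access;

    if (readonly)
        f |= DEMO_GTF_readonly;
    e->domid = domid;
    e->frame = frame;
    /* Release store: domid and frame are visible before the flags. */
    atomic_store_explicit(&e->flags, f, memory_order_release);
}

/*
 * End foreign access: returns 1 if the permissions were removed, or 0
 * if the grant is still in use and therefore cannot be revoked yet.
 */
int demo_end_access(struct demo_grant_entry *e)
{
    uint16_t f = atomic_load(&e->flags);

    do {
        if (f & (DEMO_GTF_reading | DEMO_GTF_writing))
            return 0;
    } while (!atomic_compare_exchange_weak(&e->flags, &f, 0));
    return 1;
}
\end{verbatim}
\normalsize

The compare-and-exchange closes the window in which the grantee could
start using the reference between the check and the clearing of the
flags.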

\subsection{Hypercalls}

Use of grant references is accomplished via a hypercall.  The grant
table op hypercall takes three arguments:

\hypercall{grant\_table\_op(unsigned int cmd, void *uop, unsigned int count)}

{\bf cmd} indicates the grant table operation of interest.  {\bf uop}
is a pointer to a structure (or an array of structures) describing the
operation to be performed.  The {\bf count} argument describes how many
grant table operations are being batched together.

The core logic is situated in {\bf xen/common/grant\_table.c}.  The
grant table operation hypercall can be used to perform the following
actions:

\begin{description}
\item[GNTTABOP\_map\_grant\_ref] Given a grant reference from another
  domain, map the referred page into the caller's address space.
\item[GNTTABOP\_unmap\_grant\_ref] Remove a mapping to a granted frame
  from the caller's address space.  This is used to voluntarily
  relinquish a mapping to a granted page.
\item[GNTTABOP\_setup\_table] Set up the grant table for the caller domain.
\item[GNTTABOP\_dump\_table] Debugging operation.
\item[GNTTABOP\_transfer] Given a transfer reference from another
  domain, transfer ownership of a page frame to that domain.
\end{description}

%%
%% xenstore.tex
%% 
%% Made by Mark Williamson
%% Login   <mark@maw48>
%% 

\chapter{Xenstore}

Xenstore is the mechanism by which control-plane activities occur.
These activities include:

\begin{itemize}
\item Setting up shared memory regions and event channels for use with
  the split device drivers.
\item Notifying the guest of control events (e.g. balloon driver
  requests).
\item Reporting back status information from the guest
  (e.g. performance-related statistics).
\end{itemize}

The store is arranged as a hierarchical collection of key-value pairs.
Each domain has a directory hierarchy containing data related to its
configuration.  Domains are permitted to register for notifications
about changes in subtrees of the store, and to apply changes to the
store transactionally.

\section{Guidelines}

A few principles govern the operation of the store:

\begin{itemize}
\item Domains should only modify the contents of their own
  directories.
\item The setup protocol for a device channel should simply consist of
  entering the configuration data into the store.
\item The store should allow device discovery without requiring the
  relevant device drivers to be loaded: a Xen ``bus'' should be
  visible to probing code in the guest.
\item The store should be usable for inter-tool communications,
  allowing the tools themselves to be decomposed into a number of
  smaller utilities, rather than a single monolithic entity.  This
  also facilitates the development of alternate user interfaces to the
  same functionality.
\end{itemize}

\section{Store layout}

There are three main paths in Xenstore:

\begin{description}
\item[/vm] stores configuration information about a domain
\item[/local/domain] stores information about the domain on the local node (domid, etc.)
\item[/tool] stores information for the various tools
\end{description}

The {\bf /vm} path stores configuration information for a domain.
This information doesn't change and is indexed by the domain's UUID.
A {\bf /vm} entry contains the following information:

\begin{description}
\item[uuid] uuid of the domain (somewhat redundant)
\item[on\_reboot] the action to take on a domain reboot request (destroy or restart)
\item[on\_poweroff] the action to take on a domain halt request (destroy or restart)
\item[on\_crash] the action to take on a domain crash (destroy or restart)
\item[vcpus] the number of allocated vcpus for the domain
\item[memory] the amount of memory (in megabytes) for the domain.  Note: appears to sometimes be empty for domain-0.
\item[vcpu\_avail] the number of active vcpus for the domain (vcpus - number of disabled vcpus)
\item[name] the name of the domain
\end{description}


{\bf /vm/$<$uuid$>$/image/}

The image path is only available for Domain-Us and contains:
\begin{description}
\item[ostype] identifies the builder type (linux or vmx)
\item[kernel] path to kernel on domain-0
\item[cmdline] command line to pass to domain-U kernel
\item[ramdisk] path to ramdisk on domain-0
\end{description}

{\bf /local}

The {\tt /local} path currently contains only one directory, {\tt
/local/domain}, which is indexed by domain id.  It contains the running
domain information.  Two storage areas exist because, during
migration, the uuid doesn't change but the domain id does.  The
{\tt /local/domain} directory can be created and populated before
finalizing the migration, which enables localhost-to-localhost migration.

{\bf /local/domain/$<$domid$>$}

This path contains:

\begin{description}
\item[cpu\_time] xend start time (this is only around for domain-0)
\item[handle] private handle for xend
\item[name] see /vm
\item[on\_reboot] see /vm
\item[on\_poweroff] see /vm
\item[on\_crash] see /vm
\item[vm] the path to the VM directory for the domain
\item[domid] the domain id (somewhat redundant)
\item[running] indicates that the domain is currently running
\item[memory] the current memory in megabytes for the domain (empty for domain-0?)
\item[maxmem\_KiB] the maximum memory for the domain (in kilobytes)
\item[memory\_KiB] the memory allocated to the domain (in kilobytes)
\item[cpu] the current CPU the domain is pinned to (empty for domain-0?)
\item[cpu\_weight] the weight assigned to the domain
\item[vcpu\_avail] a bitmap telling the domain whether it may use a given VCPU
\item[online\_vcpus] how many vcpus are currently online
\item[vcpus] the total number of vcpus allocated to the domain
\item[console/] a directory for console information
  \begin{description}
  \item[ring-ref] the grant table reference of the console ring queue
  \item[port] the event channel being used for the console ring queue (local port)
  \item[tty] the current tty the console data is being exposed on
  \item[limit] the limit (in bytes) of console data to buffer
  \end{description}
\item[backend/] a directory containing all backends the domain hosts
  \begin{description}
  \item[vbd/] a directory containing vbd backends
    \begin{description}
    \item[$<$domid$>$/] a directory containing vbd's for domid
      \begin{description}
      \item[$<$virtual-device$>$/] a directory for a particular
	virtual-device on domid
	\begin{description}
	\item[frontend-id] domain id of frontend
	\item[frontend] the path to the frontend domain
	\item[physical-device] backend device number
	\item[sector-size] backend sector size
	\item[info] 0 read/write, 1 read-only (is this right?)
	\item[domain] name of frontend domain
	\item[params] parameters for device
	\item[type] the type of the device
	\item[dev] the virtual device (as given by the user)
	\item[node] output from block creation script
	\end{description}
      \end{description}
    \end{description}
  
  \item[vif/] a directory containing vif backends
    \begin{description}
    \item[$<$domid$>$/] a directory containing vif's for domid
      \begin{description}
      \item[$<$vif number$>$/] a directory for each vif
      \item[frontend-id] the domain id of the frontend
      \item[frontend] the path to the frontend
      \item[mac] the mac address of the vif
      \item[bridge] the bridge the vif is connected to
      \item[handle] the handle of the vif
      \item[script] the script used to create/stop the vif
      \item[domain] the name of the frontend
      \end{description}
    \end{description}

  \item[vtpm/] a directory containing vtpm backends
    \begin{description}
    \item[$<$domid$>$/] a directory containing vtpm's for domid
      \begin{description}
      \item[$<$vtpm number$>$/] a directory for each vtpm
      \item[frontend-id] the domain id of the frontend
      \item[frontend] the path to the frontend
      \item[instance] the instance of the virtual TPM that is used
      \item[pref{\textunderscore}instance] the instance number as given in the VM configuration file;
           may be different from {\bf instance}
      \item[domain] the name of the domain of the frontend
      \end{description}
    \end{description}

  \end{description}

  \item[device/] a directory containing the frontend devices for the
    domain
    \begin{description}
    \item[vbd/] a directory containing vbd frontend devices for the
      domain
      \begin{description}
      \item[$<$virtual-device$>$/] a directory containing the vbd frontend for
	virtual-device
	\begin{description}
	\item[virtual-device] the device number of the frontend device
	\item[backend-id] the domain id of the backend
	\item[backend] the path of the backend in the store (/local/domain
	  path)
	\item[ring-ref] the grant table reference for the block request
	  ring queue
	\item[event-channel] the event channel used for the block request
	  ring queue
	\end{description}
	
      \item[vif/] a directory containing vif frontend devices for the
	domain
	\begin{description}
	\item[$<$id$>$/] a directory for vif id frontend device for the domain
	  \begin{description}
	  \item[backend-id] the backend domain id
	  \item[mac] the mac address of the vif
	  \item[handle] the internal vif handle
	  \item[backend] a path to the backend's store entry
	  \item[tx-ring-ref] the grant table reference for the transmission ring queue 
	  \item[rx-ring-ref] the grant table reference for the receiving ring queue 
	  \item[event-channel] the event channel used for the two ring queues 
	  \end{description}
	\end{description}

      \item[vtpm/] a directory containing the vtpm frontend device for the
        domain
        \begin{description}
        \item[$<$id$>$] a directory for vtpm id frontend device for the domain
          \begin{description}
	  \item[backend-id] the backend domain id
          \item[backend] a path to the backend's store entry
          \item[ring-ref] the grant table reference for the tx/rx ring
          \item[event-channel] the event channel used for the ring
          \end{description}
        \end{description}
	
      \item[device-misc/] miscellaneous information for devices 
	\begin{description}
	\item[vif/] miscellaneous information for vif devices
	  \begin{description}
	  \item[nextDeviceID] the next device id to use 
	  \end{description}
	\end{description}
      \end{description}
    \end{description}

  \item[security/] access control information for the domain
    \begin{description}
    \item[ssidref] security reference identifier used inside the hypervisor
    \item[access\_control/] security label used by management tools
      \begin{description}
       \item[label] security label name
       \item[policy] security policy name
      \end{description}
    \end{description}

  \item[store/] per-domain information for the store
    \begin{description}
    \item[port] the event channel used for the store ring queue 
    \item[ring-ref] the grant table reference used for the store's
      communication channel 
    \end{description}
    
  \item[image] private xend information 
\end{description}
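
Because the store is a plain hierarchy of keys, a frontend key from the
layout above can be addressed by composing its path.  The helper below
is a hypothetical illustration of that composition only; it is not part
of any Xenstore client library and performs no store access.

\scriptsize
\begin{verbatim}
#include <stdio.h>

/*
 * Build "/local/domain/<domid>/device/<devtype>/<devid>/<key>" in buf.
 * Returns 0 on success, or -1 if buf is too small for the whole path.
 */
int demo_frontend_key(char *buf, size_t len, unsigned int domid,
                      const char *devtype, unsigned int devid,
                      const char *key)
{
    int n = snprintf(buf, len, "/local/domain/%u/device/%s/%u/%s",
                     domid, devtype, devid, key);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
\end{verbatim}
\normalsize

For example, the transmit ring reference of vif 0 in domain 3 would be
addressed as {\tt /local/domain/3/device/vif/0/tx-ring-ref}.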


\chapter{Devices}
\label{c:devices}

Virtual devices under Xen are provided by a {\bf split device driver}
architecture.  The illusion of the virtual device is provided by two
co-operating drivers: the {\bf frontend}, which runs in the
unprivileged domain, and the {\bf backend}, which runs in a domain with
access to the real device hardware (often called a {\bf driver
domain}; in practice domain 0 usually fulfills this function).

The frontend driver appears to the unprivileged guest as if it were a
real device, for instance a block or network device.  It receives IO
requests from its kernel as usual; however, since it does not have
access to the physical hardware of the system it must then issue
requests to the backend.  The backend driver is responsible for
receiving these IO requests, verifying that they are safe and then
issuing them to the real device hardware.  The backend driver appears
to its kernel as a normal user of in-kernel IO functionality.  When
the IO completes the backend notifies the frontend that the data is
ready for use; the frontend is then able to report IO completion to
its own kernel.

Frontend drivers are designed to be simple; most of the complexity is
in the backend, which has responsibility for translating device
addresses, verifying that requests are well-formed and do not violate
isolation guarantees, etc.

Split drivers exchange requests and responses in shared memory, with
an event channel for asynchronous notifications of activity.  When the
frontend driver comes up, it uses Xenstore to set up a shared memory
frame and an interdomain event channel for communications with the
backend.  Once this connection is established, the two can communicate
directly by placing requests / responses into shared memory and then
sending notifications on the event channel.  This separation of
notification from data transfer allows message batching, and results
in very efficient device access.

This chapter focuses on some individual split device interfaces
available to Xen guests.

        
\section{Network I/O}

Virtual network device services are provided by shared memory
communication with a backend domain.  From the point of view of other
domains, the backend may be viewed as a virtual ethernet switch
element with each domain having one or more virtual network interfaces
connected to it.

From the point of view of the backend domain itself, the network
backend driver consists of a number of ethernet devices.  Each of
these has a logical direct connection to a virtual network device in
another domain.  This allows the backend domain to route, bridge,
firewall, etc.\ the traffic to / from the other domains using normal
operating system mechanisms.

\subsection{Backend Packet Handling}

The backend driver is responsible for a variety of actions relating to
the transmission and reception of packets from the physical device.
With regard to transmission, the backend performs these key actions:

\begin{itemize}
\item {\bf Validation:} To ensure that domains do not attempt to
  generate invalid (e.g. spoofed) traffic, the backend driver may
  validate headers ensuring that source MAC and IP addresses match the
  interface that they have been sent from.

  Validation functions can be configured using standard firewall rules
  ({\small{\tt iptables}} in the case of Linux).
  
\item {\bf Scheduling:} Since a number of domains can share a single
  physical network interface, the backend must mediate access when
  several domains each have packets queued for transmission.  This
  general scheduling function subsumes basic shaping or rate-limiting
  schemes.
  
\item {\bf Logging and Accounting:} The backend domain can be
  configured with classifier rules that control how packets are
  accounted or logged.  For example, log messages might be generated
  whenever a domain attempts to send a TCP packet containing a SYN.
\end{itemize}

On receipt of incoming packets, the backend acts as a simple
demultiplexer: packets are passed to the appropriate virtual interface
after any necessary logging and accounting have been carried out.

\subsection{Data Transfer}

Each virtual interface uses two ``descriptor rings'', one for
transmit, the other for receive.  Each descriptor identifies a block
of contiguous machine memory allocated to the domain.

The transmit ring carries packets to transmit from the guest to the
backend domain.  The return path of the transmit ring carries messages
indicating that the contents have been physically transmitted and the
backend no longer requires the associated pages of memory.

To receive packets, the guest places descriptors of unused pages on
the receive ring.  The backend will return received packets by
exchanging these pages in the domain's memory with new pages
containing the received data, and passing back descriptors regarding
the new packets on the ring.  This zero-copy approach allows the
backend to maintain a pool of free pages to receive packets into, and
then deliver them to appropriate domains after examining their
headers.

% Real physical addresses are used throughout, with the domain
% performing translation from pseudo-physical addresses if that is
% necessary.

If a domain does not keep its receive ring stocked with empty buffers
then packets destined to it may be dropped.  This provides some
defence against receive livelock problems because an overloaded domain
will cease to receive further data.  Similarly, on the transmit path,
it provides the application with feedback on the rate at which packets
are able to leave the system.

Flow control on rings is achieved by including a pair of producer
indexes on the shared ring page.  Each side will maintain a private
consumer index indicating the next outstanding message.  In this
manner, the domains cooperate to divide the ring into two message
lists, one in each direction.  Notification is decoupled from the
immediate placement of new messages on the ring; the event channel
will be used to generate notification when {\em either} a certain
number of outstanding messages are queued, {\em or} a specified number
of nanoseconds have elapsed since the oldest message was placed on the
ring.
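
The index scheme can be sketched with a small ring in which the two
producer indexes live in shared memory and each side keeps its own
private consumer index.  All names, the ring size and the 16-bit
message type are hypothetical, the ``work'' done by the backend is a
placeholder, and event-channel notification is deliberately left out.

\scriptsize
\begin{verbatim}
#include <stdint.h>

#define DEMO_RING_SIZE 8u                     /* must be a power of two */
#define DEMO_MASK(i)   ((i) & (DEMO_RING_SIZE - 1))

/* Shared page: both producer indexes plus the message slots. */
struct demo_shared_ring {
    volatile uint32_t req_prod;               /* written by frontend */
    volatile uint32_t rsp_prod;               /* written by backend  */
    uint16_t slot[DEMO_RING_SIZE];
};

/* Each side keeps its consumer index private. */
struct demo_front { struct demo_shared_ring *ring; uint32_t rsp_cons; };
struct demo_back  { struct demo_shared_ring *ring; uint32_t req_cons; };

/* Frontend: queue a request; returns 0 if the ring is full. */
int demo_put_request(struct demo_front *f, uint16_t msg)
{
    struct demo_shared_ring *r = f->ring;

    if (r->req_prod - f->rsp_cons >= DEMO_RING_SIZE)
        return 0;       /* every slot holds a request or unread response */
    r->slot[DEMO_MASK(r->req_prod)] = msg;
    __sync_synchronize();                     /* slot before index */
    r->req_prod = r->req_prod + 1;
    return 1;
}

/* Backend: consume one request and reuse its slot for the response. */
int demo_serve_one(struct demo_back *b)
{
    struct demo_shared_ring *r = b->ring;
    uint16_t req;

    if (b->req_cons == r->req_prod)
        return 0;                             /* nothing outstanding */
    __sync_synchronize();                     /* index before slot */
    req = r->slot[DEMO_MASK(b->req_cons)];
    b->req_cons++;
    r->slot[DEMO_MASK(r->rsp_prod)] = (uint16_t)(req + 1000); /* placeholder */
    __sync_synchronize();
    r->rsp_prod = r->rsp_prod + 1;
    return 1;
}

/* Frontend: fetch the next response; returns 0 if none is available. */
int demo_get_response(struct demo_front *f, uint16_t *out)
{
    struct demo_shared_ring *r = f->ring;

    if (f->rsp_cons == r->rsp_prod)
        return 0;
    __sync_synchronize();
    *out = r->slot[DEMO_MASK(f->rsp_cons)];
    f->rsp_cons++;
    return 1;
}
\end{verbatim}
\normalsize

The indexes run freely and are masked only when a slot is addressed, so
the difference between a producer and a consumer index is always the
number of outstanding messages in that direction.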

%% Not sure if my version is any better -- here is what was here
%% before: Synchronization between the backend domain and the guest is
%% achieved using counters held in shared memory that is accessible to
%% both.  Each ring has associated producer and consumer indices
%% indicating the area in the ring that holds descriptors that contain
%% data.  After receiving {\it n} packets or {\t nanoseconds} after
%% receiving the first packet, the hypervisor sends an event to the
%% domain.


\subsection{Network ring interface}

The network device uses two shared memory rings for communication: one
for transmit, one for receive.

Transmit requests are described by the following structure:

\scriptsize
\begin{verbatim}
typedef struct netif_tx_request {
    grant_ref_t gref;      /* Reference to buffer page */
    uint16_t offset;       /* Offset within buffer page */
    uint16_t flags;        /* NETTXF_* */
    uint16_t id;           /* Echoed in response message. */
    uint16_t size;         /* Packet size in bytes.       */
} netif_tx_request_t;
\end{verbatim}
\normalsize

\begin{description}
\item[gref] Grant reference for the network buffer
\item[offset] Offset to data
\item[flags] Transmit flags (currently only NETTXF\_csum\_blank is
  supported, to indicate that the protocol checksum field is
  incomplete).
\item[id] Echoed to guest by the backend in the ring-level response so
  that the guest can match it to this request
\item[size] Packet size in bytes
\end{description}

Each transmit request is followed by a transmit response at some later
date.  This is part of the shared-memory communication protocol and
allows the guest to (potentially) retire internal structures related
to the request.  It does not imply a network-level response.  This
structure is as follows:

\scriptsize
\begin{verbatim}
typedef struct netif_tx_response {
    uint16_t id;
    int16_t  status;
} netif_tx_response_t;
\end{verbatim}
\normalsize

\begin{description}
\item[id] Echo of the ID field in the corresponding transmit request.
\item[status] Success / failure status of the transmit request.
\end{description}
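
One way a frontend might use the echoed {\tt id} field is sketched
below.  The two structures are copied from above; everything prefixed
{\tt demo\_} (the pending table, its size and the helper functions) is
hypothetical, and {\tt grant\_ref\_t} is assumed here to be a 32-bit
integer.

\scriptsize
\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

typedef uint32_t grant_ref_t;   /* assumption: stand-in typedef */

typedef struct netif_tx_request {
    grant_ref_t gref;
    uint16_t offset;
    uint16_t flags;
    uint16_t id;
    uint16_t size;
} netif_tx_request_t;

typedef struct netif_tx_response {
    uint16_t id;
    int16_t  status;
} netif_tx_response_t;

#define DEMO_MAX_PENDING 16u    /* hypothetical table size */

/* Hypothetical bookkeeping: one in-flight buffer per request id.
 * A NULL entry means that the id is free. */
struct demo_tx_state {
    void *pending[DEMO_MAX_PENDING];
};

/* Fill in a request for a (non-NULL) buffer that has already been
 * granted to the backend.  Returns 0 if the id cannot be used. */
static int demo_tx_prepare(struct demo_tx_state *s, netif_tx_request_t *req,
                           uint16_t id, grant_ref_t gref,
                           uint16_t offset, uint16_t size, void *buf)
{
    if (id >= DEMO_MAX_PENDING || s->pending[id] != NULL)
        return 0;
    req->gref   = gref;
    req->offset = offset;
    req->flags  = 0;
    req->id     = id;
    req->size   = size;
    s->pending[id] = buf;
    return 1;
}

/* Retire the buffer named by a response.  Returns the buffer, or
 * NULL if the id does not match an outstanding request. */
static void *demo_tx_complete(struct demo_tx_state *s,
                              const netif_tx_response_t *rsp)
{
    void *buf;

    if (rsp->id >= DEMO_MAX_PENDING)
        return NULL;
    buf = s->pending[rsp->id];
    s->pending[rsp->id] = NULL;
    return buf;
}
\end{verbatim}
\normalsize

Because responses are matched by {\tt id} rather than by ring
position, this scheme does not depend on the backend completing
requests in the order in which they were queued.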

Receive requests must be queued by the frontend, accompanied by a
donation of page frames to the backend.  The backend transfers page
frames full of data back to the guest.  Receive requests are described
by the following structure:

\scriptsize
\begin{verbatim}
typedef struct {
    uint16_t    id;        /* Echoed in response message.        */
    grant_ref_t gref;      /* Reference to incoming granted frame */
} netif_rx_request_t;
\end{verbatim}
\normalsize

\begin{description}
\item[id] Echoed by the backend in the response so that the frontend
  can match the response to this request.
\item[gref] Transfer reference: the backend will use this reference
  to transfer a frame of network data to the guest.
\end{description}

Receive response descriptors are queued for each received frame.  Note
that these may only be queued in reply to an existing receive request,
providing an in-built form of traffic throttling.

\scriptsize
\begin{verbatim}
typedef struct {
    uint16_t id;
    uint16_t offset;       /* Offset in page of start of received packet  */
    uint16_t flags;        /* NETRXF_* */
    int16_t  status;       /* -ve: NETIF_RSP_* ; +ve: Rx'ed pkt size. */
} netif_rx_response_t;
\end{verbatim}
\normalsize

\begin{description}
\item[id] ID echoed from the original request, used by the guest to
  match this response to the original request.
\item[offset] Offset to data within the transferred frame.
\item[flags] Receive flags (currently only NETRXF\_csum\_valid is
  supported, to indicate that the protocol checksum field has already
  been validated).
\item[status] Success / error status for this operation.
\end{description}
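
The dual use of the {\tt status} field (a negative error code or a
positive packet length) could be decoded along the following lines.
This is a sketch only: the function name, the 4096-byte frame size
and the choice to treat a zero status as ``no packet'' are
assumptions made for the example.

\scriptsize
\begin{verbatim}
#include <stdint.h>

typedef struct {
    uint16_t id;
    uint16_t offset;
    uint16_t flags;
    int16_t  status;
} netif_rx_response_t;

#define DEMO_PAGE_SIZE 4096u    /* assumption: 4 KB frames */

/* Returns the received packet length in bytes, or -1 if the response
 * reports an error or describes data outside the transferred frame. */
static int demo_rx_length(const netif_rx_response_t *rsp)
{
    if (rsp->status <= 0)
        return -1;              /* negative values are error codes */
    if ((unsigned)rsp->offset + (unsigned)rsp->status > DEMO_PAGE_SIZE)
        return -1;              /* packet would overrun the frame  */
    return rsp->status;
}
\end{verbatim}
\normalsize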

Note that the receive protocol includes a mechanism for guests to
receive incoming memory frames but there is no explicit transfer of
frames in the other direction.  Guests are expected to return memory
to the hypervisor in order to use the network interface.  They {\em
must} do this or they will exceed their maximum memory reservation and
will not be able to receive incoming frame transfers.  When necessary,
the backend is able to replenish its pool of free network buffers by
claiming some of this free memory from the hypervisor.

\section{Block I/O}

All guest OS disk access goes through the virtual block device (VBD)
interface.  This interface allows domains access to portions of block
storage devices visible to the block backend device.  The VBD
interface is a split driver, similar to the network interface
described above.  A single shared memory ring is used between the
frontend and backend drivers for each virtual device, across which
IO requests and responses are sent.

Any block device accessible to the backend domain, including
network-based block (iSCSI, *NBD, etc.), loopback and LVM/MD devices,
can be exported as a VBD.  Each VBD is mapped to a device node in the
guest, specified in the guest's startup configuration.

\subsection{Data Transfer}

The per-(virtual)-device ring between the guest and the block backend
supports two messages:

\begin{description}
\item [{\small {\tt READ}}:] Read data from the specified block
  device.  The front end identifies the device and location to read
  from and attaches pages for the data to be copied to (typically via
  DMA from the device).  The backend acknowledges completed read
  requests as they finish.

\item [{\small {\tt WRITE}}:] Write data to the specified block
  device.  This functions essentially as {\small {\tt READ}}, except
  that the data moves to the device instead of from it.
\end{description}

%% Rather than copying data, the backend simply maps the domain's
%% buffers in order to enable direct DMA to them.  The act of mapping
%% the buffers also increases the reference counts of the underlying
%% pages, so that the unprivileged domain cannot try to return them to
%% the hypervisor, install them as page tables, or any other unsafe
%% behaviour.
%%
%% % block API here

\subsection{Block ring interface}

The block interface is defined by the structures passed over the
shared memory interface.  These structures are either requests (from
the frontend to the backend) or responses (from the backend to the
frontend).

The request structure is defined as follows:

\scriptsize
\begin{verbatim}
typedef struct blkif_request {
    uint8_t        operation;    /* BLKIF_OP_???                         */
    uint8_t        nr_segments;  /* number of segments                   */
    blkif_vdev_t   handle;       /* only for read/write requests         */
    uint64_t       id;           /* private guest value, echoed in resp  */
    blkif_sector_t sector_number;/* start sector idx on disk (r/w only)  */
    struct blkif_request_segment {
        grant_ref_t gref;        /* reference to I/O buffer frame        */
        /* @first_sect: first sector in frame to transfer (inclusive).   */
        /* @last_sect: last sector in frame to transfer (inclusive).     */
        uint8_t     first_sect, last_sect;
    } seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
} blkif_request_t;
\end{verbatim}
\normalsize

The fields are as follows:

\begin{description}
\item[operation] operation ID: one of the operations described above
\item[nr\_segments] number of segments for scatter / gather IO
  described by this request
\item[handle] identifier for a particular virtual device on this
  interface
\item[id] this value is echoed in the response message for this IO;
  the guest may use it to identify the original request
\item[sector\_number] start sector on the virtual device for this
  request
\item[seg] This array contains structures encoding
  scatter-gather IO to be performed:
  \begin{description}
  \item[gref] The grant reference for the foreign I/O buffer page.
  \item[first\_sect] First sector to access within the buffer page (0 to 7).
  \item[last\_sect] Last sector to access within the buffer page (0 to 7).
  \end{description}
  Data will be transferred into frames at an offset determined by the
  value of {\tt first\_sect}.
\end{description}
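
Since both sector bounds are inclusive, the amount of data described
by a request can be derived as in the sketch below.  The names are
hypothetical, and the 512-byte sector size is an assumption that
follows from eight sectors (0 to 7) fitting in a 4 KB buffer page.

\scriptsize
\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

#define DEMO_SECTOR_SIZE 512u   /* assumption: 8 sectors per 4 KB page */

/* Hypothetical reduced segment: only the two sector bounds. */
struct demo_segment {
    uint8_t first_sect, last_sect;
};

/* Total bytes described by a request, or 0 if a segment is malformed. */
static size_t demo_request_bytes(const struct demo_segment *seg,
                                 unsigned nr_segments)
{
    size_t total = 0;
    unsigned i;

    for (i = 0; i < nr_segments; i++) {
        if (seg[i].last_sect > 7 || seg[i].first_sect > seg[i].last_sect)
            return 0;
        /* Both bounds are inclusive, hence the "+ 1". */
        total += (size_t)(seg[i].last_sect - seg[i].first_sect + 1)
                 * DEMO_SECTOR_SIZE;
    }
    return total;
}

/* Byte offset within the buffer page at which the transfer starts. */
static size_t demo_segment_offset(const struct demo_segment *seg)
{
    return (size_t)seg->first_sect * DEMO_SECTOR_SIZE;
}
\end{verbatim}
\normalsize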

\section{Virtual TPM}

Virtual TPM (VTPM) support provides TPM functionality to each virtual
machine that requests this functionality in its configuration file.
The interface enables domains to access their own private TPM as if it
were a hardware TPM built into the machine.

The virtual TPM interface is implemented as a split driver,
similar to the network and block interfaces described above.
The user domain hosting the frontend exports a character device /dev/tpm0
to user-level applications for communicating with the virtual TPM.
This is the same device interface that is also offered if a hardware TPM
is available in the system. The backend provides a single interface
/dev/vtpm where the virtual TPM is waiting for commands from all domains
that have located their backend in a given domain.

\subsection{Data Transfer}

A single shared memory ring is used between the frontend and backend
drivers. TPM requests and responses are sent in pages where a pointer
to those pages and other information is placed into the ring such that
the backend can map the pages into its memory space using the grant
table mechanism.

The backend driver has been implemented to only accept well-formed
TPM requests. To meet this requirement, the length indicator in the
TPM request must correctly indicate the length of the request.
Otherwise an error message is automatically sent back by the device driver.

The virtual TPM implementation listens for TPM requests on /dev/vtpm. Since
it must be able to apply the TPM request packet to the virtual TPM instance
associated with the virtual machine, a 4-byte virtual TPM instance
identifier is prepended to each packet by the backend driver (in network
byte order) for internal routing of the request.
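
The framing of a request with its instance identifier might look
roughly like the following sketch.  The function name and buffer
handling are hypothetical and are not taken from the backend driver;
the identifier is written one byte at a time so that the result is in
network (big-endian) byte order regardless of the host.

\scriptsize
\begin{verbatim}
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Write a 4-byte big-endian instance identifier followed by the TPM
 * request into out.  Returns the framed length, or 0 if out is too
 * small to hold it. */
static size_t demo_vtpm_frame(uint32_t instance,
                              const uint8_t *req, size_t req_len,
                              uint8_t *out, size_t out_cap)
{
    if (out_cap < 4 || req_len > out_cap - 4)
        return 0;
    out[0] = (uint8_t)(instance >> 24);   /* most significant first */
    out[1] = (uint8_t)(instance >> 16);
    out[2] = (uint8_t)(instance >> 8);
    out[3] = (uint8_t)instance;
    memcpy(out + 4, req, req_len);
    return req_len + 4;
}
\end{verbatim}
\normalsize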

\subsection{Virtual TPM ring interface}

The TPM protocol is a strict request/response protocol and therefore
only one ring is used to send requests from the frontend to the backend
and responses on the reverse path.

The request/response structure is defined as follows:

\scriptsize
\begin{verbatim}
typedef struct {
    unsigned long addr;     /* Machine address of packet.     */
    grant_ref_t ref;        /* grant table access reference.  */
    uint16_t unused;        /* unused                         */
    uint16_t size;          /* Packet size in bytes.          */
} tpmif_tx_request_t;
\end{verbatim}
\normalsize

The fields are as follows:

\begin{description}
\item[addr] The machine address of the page associated with the TPM
            request/response; a request/response may span multiple
            pages
\item[ref]  The grant table reference associated with the address.
\item[size] The size of the remaining packet; up to
            PAGE{\textunderscore}SIZE bytes can be found in the
            page referenced by 'addr'
\end{description}

The frontend initially allocates several pages whose addresses
are stored in the ring. Only these pages are used for exchange of
requests and responses.


\chapter{Further Information}

If you have questions that are not answered by this manual, the
sources of information listed below may be of interest to you.  Note
that bug reports, suggestions and contributions related to the
software (or the documentation) should be sent to the Xen developers'
mailing list (address below).


\section{Other documentation}

If you are mainly interested in using (rather than developing for)
Xen, the \emph{Xen Users' Manual} is distributed in the {\tt docs/}
directory of the Xen source distribution.

% Various HOWTOs are also available in {\tt docs/HOWTOS}.


\section{Online references}

The official Xen web site can be found at:
\begin{quote} {\tt http://www.xensource.com}
\end{quote}


This contains links to the latest versions of all online
documentation, including the latest version of the FAQ.

Information regarding Xen is also available at the Xen Wiki at
\begin{quote} {\tt http://wiki.xen.org/wiki/}\end{quote}
The Xen project uses Bugzilla as its bug tracking system. You'll find
the Xen Bugzilla at {\tt http://bugzilla.xensource.com/bugzilla/}.


\section{Mailing lists}

There are several mailing lists that are used to discuss Xen related
topics. The most widely relevant are listed below. An official page of
mailing lists and subscription information can be found at \begin{quote}
  {\tt http://lists.xensource.com/} \end{quote}

\begin{description}
\item[xen-devel@lists.xensource.com] Used for development
  discussions and bug reports.  Subscribe at: \\
  {\small {\tt http://lists.xensource.com/xen-devel}}
\item[xen-users@lists.xensource.com] Used for installation and usage
  discussions and requests for help.  Subscribe at: \\
  {\small {\tt http://lists.xensource.com/xen-users}}
\item[xen-announce@lists.xensource.com] Used for announcements only.
  Subscribe at: \\
  {\small {\tt http://lists.xensource.com/xen-announce}}
\item[xen-changelog@lists.xensource.com] Changelog feed
  from the unstable and 2.0 trees - developer oriented.  Subscribe at: \\
  {\small {\tt http://lists.xensource.com/xen-changelog}}
\end{description}

\appendix


\chapter{Xen Hypercalls}
\label{a:hypercalls}

Hypercalls represent the procedural interface to Xen; this appendix 
categorizes and describes the current set of hypercalls. 

\section{Invoking Hypercalls} 

Hypercalls are invoked in a manner analogous to system calls in a
conventional operating system; a software interrupt is issued which
vectors to an entry point within Xen. On x86/32 machines the
instruction required is {\tt int \$0x82}; the (real) IDT is set up so
that this may only be issued from within ring 1. The particular
hypercall to be invoked is contained in {\tt EAX} --- a list 
mapping these values to symbolic hypercall names can be found 
in {\tt xen/include/public/xen.h}. 

On some occasions a set of hypercalls will be required to carry
out a higher-level function; a good example is when a guest
operating system wishes to context switch to a new process, which
requires updating various privileged CPU state. As an optimization
for these cases, there is a generic mechanism to issue a set of 
hypercalls as a batch: 

\begin{quote}
\hypercall{multicall(void *call\_list, int nr\_calls)}

Execute a series of hypervisor calls; {\tt nr\_calls} is the length of
the array of {\tt multicall\_entry\_t} structures pointed to by {\tt
call\_list}. Each entry contains the hypercall operation code followed
by up to 7 word-sized arguments.
\end{quote}

Note that multicalls are provided purely as an optimization; there is
no requirement to use them when first porting a guest operating
system.
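
Assembling such a batch can be sketched as below.  The entry layout
and operation codes are hypothetical stand-ins chosen to follow the
description above (an operation code followed by up to seven
arguments); the real {\tt multicall\_entry\_t} is defined in
{\tt xen/include/public/xen.h} and may differ.

\scriptsize
\begin{verbatim}
#include <stddef.h>

#define DEMO_MAX_ARGS 7     /* "up to 7 word-sized arguments" */

/* Hypothetical stand-in for multicall_entry_t. */
typedef struct {
    unsigned long op;
    unsigned long args[DEMO_MAX_ARGS];
} demo_multicall_entry_t;

/* Hypothetical operation codes, for illustration only. */
enum { DEMO_OP_STACK_SWITCH = 1, DEMO_OP_FPU_TASKSWITCH = 2 };

/* Append one call to a batch.  Returns the new number of entries, or
 * 0 if the batch is full or too many arguments were supplied. */
static size_t demo_batch_add(demo_multicall_entry_t *list,
                             size_t used, size_t cap, unsigned long op,
                             const unsigned long *args, size_t nargs)
{
    size_t i;

    if (used >= cap || nargs > DEMO_MAX_ARGS)
        return 0;
    list[used].op = op;
    for (i = 0; i < DEMO_MAX_ARGS; i++)
        list[used].args[i] = (i < nargs) ? args[i] : 0;
    return used + 1;
}
\end{verbatim}
\normalsize

A context switch path could add its stack switch and FPU operations
to one list in this way and then pass the list and the final entry
count to a single {\tt multicall}.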


\section{Virtual CPU Setup} 

At start of day, a guest operating system needs to set up the virtual
CPU it is executing on. This includes installing vectors for the
virtual IDT so that the guest OS can handle interrupts, page faults,
etc. However, the very first thing a guest OS must set up is a pair
of hypervisor callbacks: these are the entry points which Xen will
use when it wishes to notify the guest OS of an occurrence.

\begin{quote}
\hypercall{set\_callbacks(unsigned long event\_selector, unsigned long
  event\_address, unsigned long failsafe\_selector, unsigned long
  failsafe\_address) }

Register the normal (``event'') and failsafe callbacks for 
event processing. In each case the code segment selector and 
address within that segment are provided. The selectors must
have RPL 1; in XenLinux we simply use the kernel's CS for both 
{\bf event\_selector} and {\bf failsafe\_selector}.

The value {\bf event\_address} specifies the address of the guest OS's
event handling and dispatch routine; the {\bf failsafe\_address}
specifies a separate entry point which is used only if a fault occurs
when Xen attempts to use the normal callback. 

\end{quote} 

On x86/64 systems the hypercall takes slightly different
arguments. This is because the callback CS does not need to be specified
(since the callbacks are entered via SYSRET), and also because an
entry address needs to be specified for SYSCALLs from guest user
space:

\begin{quote}
\hypercall{set\_callbacks(unsigned long event\_address, unsigned long
  failsafe\_address, unsigned long syscall\_address)}
\end{quote} 


After installing the hypervisor callbacks, the guest OS can 
install a `virtual IDT' by using the following hypercall: 

\begin{quote} 
\hypercall{set\_trap\_table(trap\_info\_t *table)} 

Install one or more entries into the per-domain 
trap handler table (essentially a software version of the IDT). 
Each entry in the array pointed to by {\bf table} includes the 
exception vector number with the corresponding segment selector 
and entry point. Most guest OSes can use the same handlers on 
Xen as when running on the real hardware.


\end{quote} 

A further hypercall is provided for the management of virtual CPUs:

\begin{quote}
\hypercall{vcpu\_op(int cmd, int vcpuid, void *extra\_args)}

This hypercall can be used to bootstrap VCPUs, to bring them up and
down and to test their current status.

\end{quote}

\section{Scheduling and Timer}

Domains are preemptively scheduled by Xen according to the 
parameters installed by domain 0 (see Section~\ref{s:dom0ops}). 
In addition, however, a domain may choose to explicitly 
control certain behavior with the following hypercall: 

\begin{quote} 
\hypercall{sched\_op\_new(int cmd, void *extra\_args)}

Request scheduling operation from hypervisor. The following
sub-commands are available:

\begin{description}
\item[SCHEDOP\_yield] voluntarily yields the CPU, but leaves the
caller marked as runnable. No extra arguments are passed to this
command. 
\item[SCHEDOP\_block] removes the calling domain from the run queue
and causes it to sleep until an event is delivered to it. No extra 
arguments are passed to this command. 
\item[SCHEDOP\_shutdown] is used to end the calling domain's
execution. The extra argument is a {\bf sched\_shutdown} structure
which indicates the reason why the domain suspended (e.g., for reboot,
halt, power-off).
\item[SCHEDOP\_poll] allows a VCPU to wait on a set of event channels
with an optional timeout (all of which are specified in the {\bf
sched\_poll} extra argument). The semantics are similar to the UNIX
{\bf poll} system call. The caller must have event-channel upcalls
masked when executing this command.
\end{description}
\end{quote} 

{\bf sched\_op\_new}  was not available prior to Xen 3.0.2. Older versions
provide only the following hypercall:

\begin{quote} 
\hypercall{sched\_op(int cmd, unsigned long extra\_arg)}

This hypercall supports the following subset of {\bf sched\_op\_new} commands:

\begin{description}
\item[SCHEDOP\_yield] (extra argument is 0).
\item[SCHEDOP\_block] (extra argument is 0).
\item[SCHEDOP\_shutdown] (extra argument is numeric reason code).
\end{description}
\end{quote}

To aid the implementation of a process scheduler within a guest OS,
Xen provides a virtual programmable timer:

\begin{quote}
\hypercall{set\_timer\_op(uint64\_t timeout)} 

Request a timer event to be sent at the specified system time (time 
in nanoseconds since system boot).

\end{quote} 

Note that calling {\bf set\_timer\_op} prior to {\bf sched\_op} 
allows block-with-timeout semantics. 
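
The ordering can be sketched as follows.  The two {\tt demo\_}
functions are hypothetical stubs that merely record their arguments
in place of real hypercall wrappers, and the numeric value given to
the block command is likewise an assumption.  Note that the timer
takes an absolute system time, so the relative timeout is added to
the current time.

\scriptsize
\begin{verbatim}
#include <stdint.h>

enum { DEMO_SCHEDOP_BLOCK = 1 };    /* hypothetical command value */

/* State recorded by the stubs below, for illustration only. */
static uint64_t demo_timer_deadline;
static int demo_last_sched_cmd = -1;

/* Stub standing in for a set_timer_op hypercall wrapper. */
static void demo_set_timer_op(uint64_t timeout)
{
    demo_timer_deadline = timeout;
}

/* Stub standing in for a sched_op hypercall wrapper. */
static void demo_sched_op(int cmd, unsigned long extra_arg)
{
    (void)extra_arg;
    demo_last_sched_cmd = cmd;
}

/* Sleep until an event arrives or timeout_ns nanoseconds have passed,
 * whichever comes first.  now_ns is the current system time. */
static void demo_block_with_timeout(uint64_t now_ns, uint64_t timeout_ns)
{
    demo_set_timer_op(now_ns + timeout_ns);   /* arm the timer first */
    demo_sched_op(DEMO_SCHEDOP_BLOCK, 0);     /* then block          */
}
\end{verbatim}
\normalsize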


\section{Page Table Management} 

Since guest operating systems have read-only access to their page 
tables, Xen must be involved when making any changes. The following
multi-purpose hypercall can be used to modify page-table entries, 
update the machine-to-physical mapping table, flush the TLB, install 
a new page-table base pointer, and more.

\begin{quote} 
\hypercall{mmu\_update(mmu\_update\_t *req, int count, int *success\_count)} 

Update the page table for the domain; a set of {\bf count} updates are
submitted for processing in a batch, with {\bf success\_count} being 
updated to report the number of successful updates.  

Each element of {\bf req[]} contains a pointer (address) and value;
the two least significant bits of the pointer are used to distinguish
the type of update requested as follows:
\begin{description} 

\item[MMU\_NORMAL\_PT\_UPDATE:] update a page directory entry or
page table entry to the associated value; Xen will check that the
update is safe, as described in Chapter~\ref{c:memory}.

\item[MMU\_MACHPHYS\_UPDATE:] update an entry in the
  machine-to-physical table. The calling domain must own the machine
  page in question (or be privileged).
\end{description}

\end{quote}
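
Encoding the update type in the low bits of the pointer could be
done as in the sketch below.  The structure is a hypothetical
stand-in for {\tt mmu\_update\_t}, and the numeric values given to
the two update types are assumptions; the authoritative definitions
are in the Xen public headers.

\scriptsize
\begin{verbatim}
#include <stdint.h>

#define DEMO_MMU_NORMAL_PT_UPDATE 0u    /* assumed value */
#define DEMO_MMU_MACHPHYS_UPDATE  1u    /* assumed value */

/* Hypothetical stand-in for mmu_update_t. */
typedef struct {
    uint64_t ptr;   /* address, with the update type in bits 0-1 */
    uint64_t val;   /* new contents                              */
} demo_mmu_update_t;

/* Build one request.  The address must be 4-byte aligned so that its
 * two low bits are free to carry the type.  Returns 0 on bad input. */
static int demo_make_update(demo_mmu_update_t *u, uint64_t addr,
                            uint64_t val, unsigned type)
{
    if ((addr & 3u) != 0 || type > 3u)
        return 0;
    u->ptr = addr | type;
    u->val = val;
    return 1;
}

/* Recover the update type and the address from an encoded request. */
static unsigned demo_update_type(const demo_mmu_update_t *u)
{
    return (unsigned)(u->ptr & 3u);
}

static uint64_t demo_update_addr(const demo_mmu_update_t *u)
{
    return u->ptr & ~(uint64_t)3;
}
\end{verbatim}
\normalsize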

Explicitly updating batches of page table entries is extremely
efficient, but can require a number of alterations to the guest
OS. Using the writable page table mode (Chapter~\ref{c:memory}) is
recommended for new OS ports.

Regardless of which page table update mode is being used, however,
there are some occasions (notably handling a demand page fault) where
a guest OS will wish to modify exactly one PTE rather than a
batch, and where that PTE is mapped into the current address space.
This is catered for by the following:

\begin{quote} 
\hypercall{update\_va\_mapping(unsigned long va, uint64\_t val,
                         unsigned long flags)}

Update the currently installed PTE that maps virtual address {\bf va}
to new value {\bf val}. As with {\bf mmu\_update}, Xen checks that the
modification is safe before applying it. The {\bf flags} determine
which kind of TLB flush, if any, should follow the update. 

\end{quote} 

Finally, sufficiently privileged domains may occasionally wish to manipulate 
the pages of others: 

\begin{quote}
\hypercall{update\_va\_mapping\_otherdomain(unsigned long va, uint64\_t val,
                         unsigned long flags, domid\_t domid)}

Identical to {\bf update\_va\_mapping} save that the pages being
mapped must belong to the domain {\bf domid}. 

\end{quote}

An additional MMU hypercall provides an ``extended command''
interface.  This provides additional functionality beyond the basic
table updating commands:

\begin{quote}

\hypercall{mmuext\_op(struct mmuext\_op *op, int count, int *success\_count, domid\_t domid)}

This hypercall is used to perform additional MMU operations.  These
include updating {\tt cr3} (or just re-installing it for a TLB flush),
requesting various kinds of TLB flush, flushing the cache, installing
a new LDT, or pinning \& unpinning page-table pages (to ensure their
reference count doesn't drop to zero which would require a
revalidation of all entries).  Some of the operations available are
restricted to domains with sufficient system privileges.

It is also possible for privileged domains to reassign page ownership
via an extended MMU operation, although grant tables are used instead
of this where possible; see Section~\ref{s:idc}.

\end{quote}

Finally, a hypercall interface is exposed to activate and deactivate
various optional facilities provided by Xen for memory management.

\begin{quote} 
\hypercall{vm\_assist(unsigned int cmd, unsigned int type)}

Toggle various memory management modes (in particular writable page
tables).

\end{quote} 

\section{Segmentation Support}

Xen allows guest OSes to install a custom GDT if they require it; 
this is context switched transparently whenever a domain is 
[de]scheduled.  The following hypercall is effectively a 
`safe' version of {\tt lgdt}: 

\begin{quote}
\hypercall{set\_gdt(unsigned long *frame\_list, int entries)} 

Install a global descriptor table for a domain; {\bf frame\_list} is
an array of up to 16 machine page frames within which the GDT resides,
with {\bf entries} being the actual number of descriptor-entry
slots. All page frames must be mapped read-only within the guest's
address space, and the table must be large enough to contain Xen's
reserved entries (see {\bf xen/include/public/arch-x86\_32.h}).

\end{quote}
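
The relationship between {\bf entries} and the length of
{\bf frame\_list} can be illustrated with a short calculation.  The
sketch assumes 4 KB pages and the 8-byte x86 descriptor size, giving
512 descriptors per frame; the function name is hypothetical.

\scriptsize
\begin{verbatim}
#define DEMO_PAGE_SIZE      4096u   /* assumption: 4 KB page frames  */
#define DEMO_DESC_SIZE      8u      /* x86 descriptors are 8 bytes   */
#define DEMO_MAX_GDT_FRAMES 16u     /* limit stated for set_gdt      */

/* Number of page frames needed to hold the given number of
 * descriptor slots, or 0 if the count is zero or does not fit in
 * the 16 frames that set_gdt accepts. */
static unsigned demo_gdt_frames(unsigned entries)
{
    unsigned per_frame = DEMO_PAGE_SIZE / DEMO_DESC_SIZE;   /* 512 */
    unsigned frames = (entries + per_frame - 1) / per_frame;

    if (entries == 0 || frames > DEMO_MAX_GDT_FRAMES)
        return 0;
    return frames;
}
\end{verbatim}
\normalsize

Under these assumptions, 16 frames hold at most 8192 descriptor
slots.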

Many guest OSes will also wish to install LDTs; this is achieved by
using the {\bf mmuext\_op} extended command interface, passing the
linear address of the LDT base along with the number of entries. No
special safety checks are required; Xen needs to perform this task
simply since {\tt lldt} requires CPL 0.


Xen also allows guest operating systems to update just an 
individual segment descriptor in the GDT or LDT:  

\begin{quote}
\hypercall{update\_descriptor(uint64\_t ma, uint64\_t desc)}

Update the GDT/LDT entry at machine address {\bf ma}; the new
8-byte descriptor is stored in {\bf desc}.
Xen performs a number of checks to ensure the descriptor is 
valid. 

\end{quote}
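
A value for {\bf desc} can be assembled from its parts as in the
following sketch, which follows the standard x86 segment descriptor
layout.  The helper name is hypothetical, and choosing access and
flag bits that Xen's validity checks will accept is left to the
caller.

\scriptsize
\begin{verbatim}
#include <stdint.h>

/* Pack an x86 segment descriptor into its 8-byte form.
 *   base   : 32-bit segment base
 *   limit  : 20-bit segment limit
 *   access : access byte (present bit, DPL, type)
 *   flags  : upper flag nibble (granularity, default size, ...) */
static uint64_t demo_pack_descriptor(uint32_t base, uint32_t limit,
                                     uint8_t access, uint8_t flags)
{
    uint64_t d = 0;

    d |= (uint64_t)(limit & 0xFFFFu);               /* limit 15:0  */
    d |= (uint64_t)(base & 0xFFFFFFu) << 16;        /* base 23:0   */
    d |= (uint64_t)access << 40;                    /* access byte */
    d |= (uint64_t)((limit >> 16) & 0xFu) << 48;    /* limit 19:16 */
    d |= (uint64_t)(flags & 0xFu) << 52;            /* flags       */
    d |= (uint64_t)(base >> 24) << 56;              /* base 31:24  */
    return d;
}
\end{verbatim}
\normalsize

For example, a flat 4 GB code segment with base 0, limit
{\tt 0xFFFFF}, access byte {\tt 0x9A} and flags {\tt 0xC} packs to
{\tt 0x00CF9A000000FFFF}.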

Guest OSes can use the above in place of context switching entire 
LDTs (or the GDT) when the number of changing descriptors is small. 

\section{Context Switching} 

When a guest OS wishes to context switch between two processes,
it can use the page table and segmentation hypercalls described
above to perform the bulk of the privileged work. In addition,
however, it will need to invoke Xen to switch the kernel (ring 1)
stack pointer:

\begin{quote} 
\hypercall{stack\_switch(unsigned long ss, unsigned long esp)} 

Request kernel stack switch from hypervisor; {\bf ss} is the new
stack segment, while {\bf esp} is the new stack pointer.

\end{quote} 

A useful hypercall for context switching allows ``lazy'' save and
restore of floating point state:

\begin{quote}
\hypercall{fpu\_taskswitch(int set)} 

This call instructs Xen to set the {\tt TS} bit in the {\tt cr0}
control register; this means that the next attempt to use floating
point will cause a trap which the guest OS can handle. Typically it will
then save/restore the FP state, and clear the {\tt TS} bit, using the
same call.
\end{quote} 

This is provided as an optimization only; guest OSes can also choose
to save and restore FP state on all context switches for simplicity. 

Finally, a hypercall is provided for entering vm86 mode:

\begin{quote}
\hypercall{switch\_vm86}

This allows the guest to run code in vm86 mode, which is needed for
some legacy software.
\end{quote}

\section{Physical Memory Management}

As mentioned previously, each domain has a maximum and current 
memory allocation. The maximum allocation, set at domain creation 
time, cannot be modified. However, a domain can choose to reduce
and subsequently grow its current allocation by using the
following call: 

\begin{quote} 
\hypercall{memory\_op(unsigned int op, void *arg)}

Increase or decrease current memory allocation (as determined by 
the value of {\bf op}).  The available operations are:

\begin{description}
\item[XENMEM\_increase\_reservation] Request an increase in machine
  memory allocation; {\bf arg} must point to a {\bf
  xen\_memory\_reservation} structure.
\item[XENMEM\_decrease\_reservation] Request a decrease in machine
  memory allocation; {\bf arg} must point to a {\bf
  xen\_memory\_reservation} structure.
\item[XENMEM\_maximum\_ram\_page] Request the frame number of the
  highest-addressed frame of machine memory in the system.  {\bf arg}
  must point to an {\bf unsigned long} where this value will be
  stored.
\item[XENMEM\_current\_reservation] Returns current memory reservation
  of the specified domain.
\item[XENMEM\_maximum\_reservation] Returns maximum memory reservation
  of the specified domain.
\end{description}

\end{quote} 

In addition to simply reducing or increasing the current memory
allocation via a `balloon driver', this call is also useful for 
obtaining contiguous regions of machine memory when required (e.g. 
for certain PCI devices, or if using superpages).  


\section{Inter-Domain Communication}
\label{s:idc} 

Xen provides a simple asynchronous notification mechanism via
\emph{event channels}. Each domain has a set of end-points (or
\emph{ports}) which may be bound to an event source (e.g. a physical
IRQ, a virtual IRQ, or a port in another domain). When a pair of
end-points in two different domains are bound together, then a `send'
operation on one will cause an event to be received by the destination
domain.

The control and use of event channels involves the following hypercall: 

\begin{quote}
\hypercall{event\_channel\_op(evtchn\_op\_t *op)} 

Inter-domain event-channel management; {\bf op} is a discriminated 
union which allows the following 7 operations: 

\begin{description} 

\item[alloc\_unbound:] allocate a free (unbound) local
  port and prepare for connection from a specified domain. 
\item[bind\_virq:] bind a local port to a virtual 
IRQ; any particular VIRQ can be bound to at most one port per domain. 
\item[bind\_pirq:] bind a local port to a physical IRQ;
once more, a given pIRQ can be bound to at most one port per
domain. Furthermore the calling domain must be sufficiently
privileged.
\item[bind\_interdomain:] construct an interdomain event 
channel; in general, the target domain must have previously allocated 
an unbound port for this channel, although this can be bypassed by 
privileged domains during domain setup. 
\item[close:] close an interdomain event channel. 
\item[send:] send an event to the remote end of an
interdomain event channel.
\item[status:] determine the current status of a local port. 
\end{description} 

For more details see
{\bf xen/include/public/event\_channel.h}. 

\end{quote} 

Event channels are the fundamental communication primitive between 
Xen domains and seamlessly support SMP. However they provide little
bandwidth for communication {\sl per se}, and hence are typically 
married with a piece of shared memory to produce effective and 
high-performance inter-domain communication. 

Safe sharing of memory pages between guest OSes is carried out by
granting access on a per page basis to individual domains. This is
achieved by using the {\tt grant\_table\_op} hypercall.

\begin{quote}
\hypercall{grant\_table\_op(unsigned int cmd, void *uop, unsigned int count)}

Used to invoke operations on a grant reference, to setup the grant
table and to dump the tables' contents for debugging.

\end{quote} 

\section{IO Configuration} 

Domains with physical device access (i.e.\ driver domains) receive
limited access to certain PCI devices (bus address space and
interrupts). However, many guest operating systems attempt to
determine the PCI configuration by directly accessing the PCI BIOS,
which cannot be allowed for safety.

Instead, Xen provides the following hypercall: 

\begin{quote}
\hypercall{physdev\_op(void *physdev\_op)}

Set and query IRQ configuration details, set the system IOPL, set the
TSS IO bitmap.

\end{quote} 


For examples of using {\tt physdev\_op}, see the
Xen-specific PCI code in the Linux sparse tree.

\section{Administrative Operations}
\label{s:dom0ops}

A large number of control operations are available to a sufficiently
privileged domain (typically domain 0). These allow the creation and
management of new domains, for example. A complete list is given
below; for more details on any or all of these, please see
{\tt xen/include/public/dom0\_ops.h}.


\begin{quote}
\hypercall{dom0\_op(dom0\_op\_t *op)} 

Administrative domain operations for domain management. The options are:

\begin{description} 
\item [DOM0\_GETMEMLIST:] get list of pages used by the domain

\item [DOM0\_SCHEDCTL:]

\item [DOM0\_ADJUSTDOM:] adjust scheduling priorities for domain

\item [DOM0\_CREATEDOMAIN:] create a new domain

\item [DOM0\_DESTROYDOMAIN:] deallocate all resources associated
with a domain

\item [DOM0\_PAUSEDOMAIN:] remove a domain from the scheduler run 
queue. 

\item [DOM0\_UNPAUSEDOMAIN:] mark a paused domain as schedulable
  once again. 

\item [DOM0\_GETDOMAININFO:] get statistics about the domain

\item [DOM0\_SETDOMAININFO:] set VCPU-related attributes

\item [DOM0\_MSR:] read or write model specific registers

\item [DOM0\_DEBUG:] interactively invoke the debugger

\item [DOM0\_SETTIME:] set system time

\item [DOM0\_GETPAGEFRAMEINFO:] 

\item [DOM0\_READCONSOLE:] read console content from hypervisor buffer ring

\item [DOM0\_PINCPUDOMAIN:] pin domain to a particular CPU

\item [DOM0\_TBUFCONTROL:] get and set trace buffer attributes

\item [DOM0\_PHYSINFO:] get information about the host machine

\item [DOM0\_SCHED\_ID:] get the ID of the current Xen scheduler

\item [DOM0\_SHADOW\_CONTROL:] switch between shadow page-table modes

\item [DOM0\_SETDOMAINMAXMEM:] set maximum memory allocation of a domain

\item [DOM0\_GETPAGEFRAMEINFO2:] batched interface for getting
page frame info

\item [DOM0\_ADD\_MEMTYPE:] set MTRRs

\item [DOM0\_DEL\_MEMTYPE:] remove a memory type range

\item [DOM0\_READ\_MEMTYPE:] read MTRR

\item [DOM0\_PERFCCONTROL:] control Xen's software performance
counters

\item [DOM0\_MICROCODE:] update CPU microcode

\item [DOM0\_IOPORT\_PERMISSION:] modify domain permissions for an
IO port range (enable / disable a range for a particular domain)

\item [DOM0\_GETVCPUCONTEXT:] get context from a VCPU

\item [DOM0\_GETVCPUINFO:] get current state for a VCPU
\item [DOM0\_GETDOMAININFOLIST:] batched interface to get domain
info

\item [DOM0\_PLATFORM\_QUIRK:] inform Xen of a platform quirk it
needs to handle (e.g. noirqbalance)

\item [DOM0\_PHYSICAL\_MEMORY\_MAP:] get info about dom0's memory
map

\item [DOM0\_MAX\_VCPUS:] change max number of VCPUs for a domain

\item [DOM0\_SETDOMAINHANDLE:] set the handle for a domain

\end{description} 
\end{quote} 

Most of the above are best understood by looking at the code 
implementing them (in {\tt xen/common/dom0\_ops.c}) and in 
the user-space tools that use them (mostly in {\tt tools/libxc}). 

\section{Debugging Hypercalls} 

A few additional hypercalls are mainly useful for debugging: 

\begin{quote} 
\hypercall{console\_io(int cmd, int count, char *str)}

Use Xen to interact with the console; operations are:

\begin{description}
\item[CONSOLEIO\_write:] Output {\bf count} characters from buffer
  {\bf str}.
\item[CONSOLEIO\_read:] Input at most {\bf count} characters into
  buffer {\bf str}.
\end{description}
\end{quote} 

A pair of hypercalls allows access to the underlying debug registers: 
\begin{quote}
\hypercall{set\_debugreg(int reg, unsigned long value)}

Set debug register {\bf reg} to {\bf value}.

\hypercall{get\_debugreg(int reg)}

Return the contents of the debug register {\bf reg}.
\end{quote}

And finally: 
\begin{quote}
\hypercall{xen\_version(int cmd)}

Request Xen version number.
\end{quote} 

This is useful to ensure that user-space tools are in sync 
with the underlying hypervisor. 


\end{document}