Xen Scheduler HOWTO
===================

by Mark Williamson
(c) 2004 Intel Research Cambridge


Introduction
------------

Xen offers a choice of CPU schedulers. All available schedulers are
included in Xen at compile time and the administrator may select a
particular scheduler using a boot-time parameter to Xen. It is expected
that administrators will choose the scheduler most appropriate to their
application and configure the machine to boot with that scheduler.

Note: the default scheduler is the Borrowed Virtual Time (BVT) scheduler
which was also used in previous releases of Xen. No configuration changes
are required to keep using this scheduler.

This file provides a brief description of the CPU schedulers available in
Xen, what they are useful for and the parameters that are used to configure
them. This information is necessarily fairly technical at the moment. The
recommended way to fully understand the scheduling algorithms is to read
the relevant research papers.

The interface to the schedulers is basically "raw" at the moment, without
sanity checking - administrators should be careful when setting the
parameters since it is possible for a mistake to hang domains, or the
entire system (in particular, double check parameters for sanity and make
sure that DOM0 will get enough CPU time to remain usable). Note that
xc_dom_control.py takes time values in nanoseconds.

Future tools will implement friendlier control interfaces.


Borrowed Virtual Time (BVT)
---------------------------

All releases of Xen have featured the BVT scheduler, which is used to
provide proportional fair shares of the CPU based on weights assigned to
domains. BVT is "work conserving" - the CPU will never be left idle if
there are runnable tasks.

BVT uses "virtual time" to make decisions on which domain should be
scheduled on the processor. Each time a scheduling decision is required,
BVT evaluates the "Effective Virtual Time" of all domains and then
schedules the domain with the least EVT.

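
The selection rule just described can be illustrated with a small Python
sketch. It is not Xen's implementation: it assumes that a domain's virtual
time advances, each time the domain runs, by an amount inversely
proportional to its weight, and it ignores any adjustment to the EVT. The
names Domain, SCALE, pick_next and simulate are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Domain:
    name: str
    weight: int            # larger weight means a larger CPU share
    evt: int = 0           # effective virtual time, arbitrary units
    runnable: bool = True


# Hypothetical constant, chosen so that SCALE // weight is an integer
# for the small weights used in examples.
SCALE = 60


def pick_next(domains):
    """Return the runnable domain with the least EVT, or None if idle."""
    runnable = [d for d in domains if d.runnable]
    if not runnable:
        return None
    # min() returns the first of several equal candidates, so ties are
    # broken by list order in this sketch.
    return min(runnable, key=lambda d: d.evt)


def simulate(domains, ticks):
    """Run the selection rule for a number of ticks; count ticks per domain."""
    ran = {d.name: 0 for d in domains}
    for _ in range(ticks):
        dom = pick_next(domains)
        if dom is None:
            continue  # nothing runnable: the CPU idles for this tick
        ran[dom.name] += 1
        # Assumption: virtual time advances inversely to the weight.
        dom.evt += SCALE // dom.weight
    return ran
```

With two always-runnable domains of weight 2 and 1, the heavier domain
receives twice as many ticks over time, which is the proportional share
behaviour described above.
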
Domains are allowed to "borrow" virtual time by "time warping", which
reduces their EVT by a certain amount, so that they may be scheduled
sooner. In order to maintain long term fairness, there are limits on when
a domain can time warp and for how long. [ For more details read the
SOSP'99 paper by Duda and Cheriton ]

In the Xen implementation, domains time warp when they unblock, so that
domain wakeup latencies are reduced.

The BVT algorithm uses the following per-domain parameters (set using
xc_dom_control.py cpu_bvtset):

* mcuadv - the MCU (Minimum Charging Unit) advance determines the
           proportional share of the CPU that a domain receives. It is
           set in inverse proportion to a domain's sharing weight.
* warp   - the amount of "virtual time" the domain is allowed to warp
           backwards
* warpl  - the warp limit is the maximum time a domain can run warped for
* warpu  - the unwarp requirement is the minimum time a domain must run
           unwarped for before it can warp again

BVT also has the following global parameter (set using
xc_dom_control.py cpu_bvtslice):

* ctx_allow - the context switch allowance is similar to the "quantum"
              in traditional schedulers. It is the minimum time that a
              scheduled domain will be allowed to run before being
              pre-empted. This prevents thrashing of the CPU.

BVT can now be selected by passing the 'sched=bvt' argument to Xen at
boot-time and is the default scheduler if no 'sched' argument is supplied.


Atropos
-------

Atropos is a scheduler originally developed for the Nemesis multimedia
operating system. Atropos can be used to reserve absolute shares of the
CPU. It also includes some features to improve the efficiency of domains
that block for I/O and to allow spare CPU time to be shared out.

The Atropos algorithm has the following parameters for each domain (set
using xc_dom_control.py cpu_atropos_set):

* slice    - The length of time per period that a domain is guaranteed.
* period   - The period over which a domain is guaranteed to receive
             its slice of CPU time.
* latency  - The latency hint is used to control how soon after waking
             up a domain should be scheduled.
* xtratime - This is a true (1) / false (0) flag that specifies whether
             a domain should be allowed a share of the system slack time.

Every domain has an associated period and slice. The domain should
receive 'slice' nanoseconds every 'period' nanoseconds. This allows the
administrator to configure both the absolute share of the CPU a domain
receives and the frequency with which it is scheduled. When domains
unblock, their period is reduced to the value of the latency hint (the
slice is scaled accordingly so that they still get the same proportion of
the CPU). For each subsequent period, the slice and period times are
doubled until they reach their original values.

Atropos is selected by adding 'sched=atropos' to Xen's boot-time
arguments.

Note: don't overcommit the CPU when using Atropos (i.e. don't reserve
more CPU than is available - the utilisation should be kept to slightly
less than 100% in order to ensure predictable behaviour).


Round-Robin
-----------

The Round-Robin scheduler is provided as a simple example of Xen's
internal scheduler API. For production systems, one of the other
schedulers should be used, since they are more flexible and more
efficient.

The Round-Robin scheduler has one global parameter (set using
xc_dom_control.py cpu_rrobin_slice):

* rr_slice - The time for which each domain runs before the next
             scheduling decision is made.

The Round-Robin scheduler can be selected by adding 'sched=rrobin' to
Xen's boot-time arguments.
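
Since the control interface does no sanity checking, it can help to check
Atropos reservations by hand before applying them. The Python sketch below
shows the arithmetic: converting milliseconds to the nanosecond values the
tools expect, summing slice/period across domains to detect overcommit,
and listing the (slice, period) pairs a domain would step through after
unblocking. All function names and the 95% threshold are illustrative
assumptions, not part of Xen or xc_dom_control.py. The integer rounding of
the scaled slice, and capping the last doubling step at the original
period, are also assumptions.

```python
NS_PER_MS = 1_000_000  # xc_dom_control.py takes time values in nanoseconds


def ms_to_ns(ms):
    """Convert milliseconds to whole nanoseconds."""
    return round(ms * NS_PER_MS)


def total_utilisation(reservations):
    """Sum slice/period over a list of (slice_ns, period_ns) pairs."""
    total = 0.0
    for slice_ns, period_ns in reservations:
        if period_ns <= 0 or not 0 <= slice_ns <= period_ns:
            raise ValueError("need period > 0 and 0 <= slice <= period")
        total += slice_ns / period_ns
    return total


def is_overcommitted(reservations, limit=0.95):
    """True if the reserved share exceeds 'limit' (hypothetical threshold)."""
    return total_utilisation(reservations) > limit


def unblock_ramp(slice_ns, period_ns, latency_ns):
    """List the (slice, period) pairs used after a domain unblocks.

    The period starts at the latency hint and doubles each step until it
    is back at its original value; the slice is scaled so that the
    slice/period proportion stays the same.
    """
    if latency_ns <= 0:
        raise ValueError("latency hint must be positive")
    steps = []
    cur_period = min(latency_ns, period_ns)
    while cur_period < period_ns:
        # Scale the slice to keep the same proportion of the CPU.
        steps.append((slice_ns * cur_period // period_ns, cur_period))
        # Assumption: the final doubling is capped at the original period.
        cur_period = min(cur_period * 2, period_ns)
    steps.append((slice_ns, period_ns))
    return steps
```

For example, a domain with a 10 ms slice every 40 ms and a 5 ms latency
hint steps through periods of 5, 10, 20 and 40 ms after unblocking, with
the slice kept at a quarter of the period each time.
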