path: root/xen/common/sched_credit.c
Commit message | Author | Date | Files | Lines

* scheduler: adjust internal locking interface | Jan Beulich | 2013-10-14 | 1 | -4/+7

    Make the locking functions return the lock pointers, so they can be
    passed to the unlocking functions (which in turn can check that the lock
    is still actually providing the intended protection, i.e. the parameters
    determining which lock is the right one didn't change).

    Further, use proper spin lock primitives rather than open-coded
    local_irq_...() constructs, so that interrupts can be re-enabled as
    appropriate while spinning.

    Signed-off-by: Jan Beulich <jbeulich@suse.com>
    Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
    Acked-by: Keir Fraser <keir@xen.org>

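    A minimal sketch of the idea, using pthreads as a stand-in for Xen's
    spinlocks (all names here are illustrative, not the actual Xen interface):

        #include <assert.h>
        #include <pthread.h>

        /* The lock protecting a vcpu can change over time (e.g. when its
         * CPU moves between cpupools), so callers must not cache it. */
        struct sched_pcpu { pthread_mutex_t lock; };
        struct vcpu { struct sched_pcpu *pcpu; };

        /* The lock function returns the lock it actually acquired... */
        static pthread_mutex_t *vcpu_schedule_lock(struct vcpu *v)
        {
            pthread_mutex_t *lock = &v->pcpu->lock;
            pthread_mutex_lock(lock);
            return lock;
        }

        /* ...so the unlock function can verify it is still the lock that
         * protects the vcpu, i.e. the intended protection didn't change. */
        static void vcpu_schedule_unlock(pthread_mutex_t *lock, struct vcpu *v)
        {
            assert(lock == &v->pcpu->lock);
            pthread_mutex_unlock(lock);
        }
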
* sched_credit: filter node-affinity mask against online cpus | Dario Faggioli | 2013-09-20 | 1 | -11/+46

    Do this in _csched_cpu_pick(), as not doing so may result in the
    domain's node-affinity mask (as retrieved by csched_balance_cpumask())
    and online mask (as retrieved by cpupool_scheduler_cpumask()) having an
    empty intersection. Therefore, when attempting a node-affinity load
    balancing step and running this:

        ...
        /* Pick an online CPU from the proper affinity mask */
        csched_balance_cpumask(vc, balance_step, &cpus);
        cpumask_and(&cpus, &cpus, online);
        ...

    we end up with an empty cpumask (in cpus). At this point, in the
    following code:

        ....
        /* If present, prefer vc's current processor */
        cpu = cpumask_test_cpu(vc->processor, &cpus)
                ? vc->processor
                : cpumask_cycle(vc->processor, &cpus);
        ....

    an ASSERT (from inside cpumask_cycle()) triggers like this:

        (XEN) Xen call trace:
        (XEN)    [<ffff82d08011b124>] _csched_cpu_pick+0x1d2/0x652
        (XEN)    [<ffff82d08011b5b2>] csched_cpu_pick+0xe/0x10
        (XEN)    [<ffff82d0801232de>] vcpu_migrate+0x167/0x31e
        (XEN)    [<ffff82d0801238cc>] cpu_disable_scheduler+0x1c8/0x287
        (XEN)    [<ffff82d080101b3f>] cpupool_unassign_cpu_helper+0x20/0xb4
        (XEN)    [<ffff82d08010544f>] continue_hypercall_tasklet_handler+0x4a/0xb1
        (XEN)    [<ffff82d080127793>] do_tasklet_work+0x78/0xab
        (XEN)    [<ffff82d080127a70>] do_tasklet+0x5f/0x8b
        (XEN)    [<ffff82d080158985>] idle_loop+0x57/0x5e
        (XEN)
        (XEN)
        (XEN) ****************************************
        (XEN) Panic on CPU 1:
        (XEN) Assertion 'cpu < nr_cpu_ids' failed at /home/dario/Sources/xen/xen/xen.git/xen/include/xe:16481

    It is, for example, sufficient to have a domain with node-affinity to
    NUMA node 1 running, and issuing an `xl cpupool-numa-split' will make
    the above happen. That is because, by default, all the existing domains
    remain assigned to the first cpupool, and it now (after the
    cpupool-numa-split) only includes NUMA node 0.

    This change prevents that by generalizing the function used for figuring
    out whether a node-affinity load balancing step is legitimate or not.
    This way we can, in _csched_cpu_pick(), figure out early enough that the
    mask would end up empty, skip the step altogether and avoid the splat.

    Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
    Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>

* sched/credit: Remove redundant assignments from alloc_* functions | Andrew Cooper | 2013-09-13 | 1 | -4/+0

    Noticed because Coverity was complaining about the atomic_set(); because
    of the use of xzalloc(), these assignments of 0 are completely
    redundant.

    Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
    Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

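    For illustration, the same pattern in plain C, with calloc() standing in
    for Xen's xzalloc() (which likewise returns zero-filled memory; the
    struct here is made up):

        #include <stdlib.h>

        struct pcpu_stats { long runq_sort_last; int nr_runnable; };

        static struct pcpu_stats *alloc_stats(void)
        {
            struct pcpu_stats *s = calloc(1, sizeof(*s));
            if ( s == NULL )
                return NULL;
            /* Redundant: calloc/xzalloc already zeroed everything.
             * s->runq_sort_last = 0;
             * s->nr_runnable = 0;
             */
            return s;
        }
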
* credit1: replace cpumask_empty() uses | Jan Beulich | 2013-08-23 | 1 | -4/+3

    In one case it was redundant with the operation it got combined with,
    and in the other it could easily be replaced by range checking the
    result of a subsequent operation. (When running on big systems,
    operations on CPU masks aren't cheap enough to use them carelessly.)

    Signed-off-by: Jan Beulich <jbeulich@suse.com>
    Acked-by: Keir Fraser <keir@xen.org>
    Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>

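    A sketch of the range-checking pattern (not a literal excerpt from the
    patch; it relies on cpumask_first() returning a value >= nr_cpu_ids for
    an empty mask):

        /* Before: two full scans of a potentially large mask. */
        if ( cpumask_empty(&mask) )
            return -1;
        cpu = cpumask_first(&mask);

        /* After: one scan; range-check the result instead. */
        cpu = cpumask_first(&mask);
        if ( cpu >= nr_cpu_ids )
            return -1;
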
* xen: allow for explicitly specifying node-affinity | Dario Faggioli | 2013-04-17 | 1 | -4/+44

    Make it possible to pass the node-affinity of a domain to the hypervisor
    from the upper layers, instead of always being computed automatically.

    Note that this also required generalizing the Flask hooks for setting
    and getting the affinity, so that they now deal with both vcpu and node
    affinity.

    Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
    Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
    Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
    Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
    Acked-by: Keir Fraser <keir@xen.org>

* xen: sched_credit: let the scheduler know about node-affinity | Dario Faggioli | 2013-04-17 | 1 | -153/+307

    As vcpu-affinity tells where VCPUs must run, node-affinity tells where
    they prefer to. While respecting vcpu-affinity remains mandatory,
    node-affinity is not that strict; it only expresses a preference,
    although honouring it will bring significant performance benefits
    (especially as compared to not having any affinity at all).

    This change modifies the VCPU load balancing algorithm (for the credit
    scheduler only), introducing a two-step logic. During the first step, we
    use both the vcpu-affinity and the node-affinity masks (by looking at
    their intersection). The aim is to give precedence to the PCPUs where
    the domain prefers to run, as expressed by its node-affinity (with the
    intersection with the vcpu-affinity being necessary in order to avoid
    running a VCPU where it never should). If that fails to find a valid
    PCPU, the node-affinity is just ignored and, in the second step, we fall
    back to using cpu-affinity only.

    Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
    Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
    Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
    Acked-by: Keir Fraser <keir@xen.org>

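    A rough sketch of that two-step loop (csched_balance_cpumask() appears
    in the commits above; the step constants and pick_cpu_from() are made-up
    stand-ins):

        for ( step = STEP_NODE_AFFINITY; step <= STEP_CPU_AFFINITY; step++ )
        {
            /* Step 1: vcpu-affinity AND node-affinity;
             * Step 2: vcpu-affinity only. */
            csched_balance_cpumask(vc, step, &cpus);
            cpumask_and(&cpus, &cpus, online);

            cpu = pick_cpu_from(vc, &cpus);
            if ( cpu < nr_cpu_ids )
                break;      /* found a suitable PCPU at this step */
            /* Otherwise fall through to the plain cpu-affinity step. */
        }
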
* xen: sched_credit: when picking, make sure we get an idle one, if any | Dario Faggioli | 2013-04-17 | 1 | -0/+15

    The pcpu picking algorithm treats two threads of an SMT core the same.
    More specifically, if one is idle and the other one is busy, they both
    will be assigned a weight of 1. Therefore, when picking begins, if the
    first target pcpu is the busy thread (and if there are no idle pcpus
    other than its sibling), that will never change.

    This change fixes this by ensuring that, before entering the core of the
    picking algorithm, the target pcpu is an idle one (if there is an idle
    pcpu at all, of course).

    Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
    Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
    Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
    Acked-by: Keir Fraser <keir@xen.org>

* credit1: Use atomic bit operations for the flags structure | George Dunlap | 2013-03-04 | 1 | -13/+10

    The flags structure is not protected by locks (or more precisely, it is
    protected using an inconsistent set of locks); we therefore need to make
    sure that all accesses are atomic-safe. This is particularly important
    in the case of the PARKED flag, which if clobbered while changing the
    YIELD bit will leave a vcpu wedged in an offline state.

    Using the atomic bitops also requires us to change the size of the
    "flags" element.

    Spotted-by: Igor Pavlikevich <ipavlikevich@gmail.com>
    Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

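    A sketch of the failure mode and the fix, using C11 atomics as a
    stand-in for Xen's atomic bitops (flag values illustrative):

        #include <stdatomic.h>

        #define FLAG_PARKED  0x1u
        #define FLAG_YIELD   0x2u

        struct csched_vcpu { atomic_uint flags; };

        /* A plain read-modify-write ("svc->flags |= FLAG_YIELD;") can
         * lose a concurrent update of PARKED. An atomic RMW cannot: */
        static void set_yield(struct csched_vcpu *svc)
        {
            atomic_fetch_or(&svc->flags, FLAG_YIELD);
        }

        static void clear_parked(struct csched_vcpu *svc)
        {
            atomic_fetch_and(&svc->flags, ~FLAG_PARKED);
        }
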
* credit1: track residual from divisions done during accounting | Jan Beulich | 2013-02-28 | 1 | -1/+7

    This should help with under-accounting of vCPUs running for extremely
    short periods of time, but becoming runnable again at a high frequency.

    Signed-off-by: Jan Beulich <jbeulich@suse.com>
    Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

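    The arithmetic behind it, as a self-contained sketch (names are
    hypothetical): integer division discards up to one credit's worth of
    runtime per pass, so carry the remainder into the next accounting pass
    instead of losing it:

        #include <stdint.h>

        struct acct { uint64_t residual; };  /* leftover ns from last pass */

        static uint64_t credits_to_burn(struct acct *a, uint64_t ns_run,
                                        uint64_t ns_per_credit)
        {
            uint64_t total = ns_run + a->residual;

            a->residual = total % ns_per_credit;  /* keep for next pass */
            return total / ns_per_credit;
        }
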
* xen: sched_credit: add some tracing | Dario Faggioli | 2012-12-18 | 1 | -1/+34

    Adds tracing around tickling and PCPU selection.

    Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
    Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
    Committed-by: Keir Fraser <keir@xen.org>

* xen: sched_credit: improve tickling of idle CPUs | Dario Faggioli | 2012-12-18 | 1 | -36/+51

    Right now, when a VCPU wakes up, we check whether it should preempt what
    is running on the PCPU, and whether or not the waking VCPU can be
    migrated (by tickling some idlers). However, this can result in
    suboptimal or even wrong behaviour, as explained here:

    http://lists.xen.org/archives/html/xen-devel/2012-10/msg01732.html

    This change, instead, when deciding which PCPU(s) to tickle upon VCPU
    wake-up, considers both what is likely to happen on the PCPU where the
    wakeup occurs, and whether or not there are idlers where the woken-up
    VCPU can run. In fact, if there are, we can avoid interrupting the
    running VCPU. Only if there are no such PCPUs are preemption and
    migration the way to go.

    This has been tested (on top of the previous change) by running the
    following benchmarks inside 2, 6 and 10 VMs, concurrently, on a shared
    host, each with 2 VCPUs and 960 MB of memory (the host had 16 ways and
    12 GB RAM).

    1) All VMs had 'cpus="all"' in their config file.

    $ sysbench --test=cpu ... (time, lower is better)
    | VMs |     w/o this change      |      w/ this change      |
    |   2 | 50.078467 +/- 1.6676162  | 49.673667 +/- 0.0094321  |
    |   6 | 63.259472 +/- 0.1137586  | 61.680011 +/- 1.0208723  |
    |  10 | 91.246797 +/- 0.1154008  | 90.396720 +/- 1.5900423  |

    $ sysbench --test=memory ... (throughput, higher is better)
    | VMs |     w/o this change      |      w/ this change      |
    |   2 | 485.56333 +/- 6.0527356  | 487.83167 +/- 0.7602850  |
    |   6 | 401.36278 +/- 1.9745916  | 409.96778 +/- 3.6761092  |
    |  10 | 294.43933 +/- 0.8064945  | 302.49033 +/- 0.2343978  |

    $ specjbb2005 ... (throughput, higher is better)
    | VMs |     w/o this change      |      w/ this change      |
    |   2 | 43150.63  +/- 1359.5616  | 43275.427 +/- 606.28185  |
    |   6 | 29274.29  +/- 1024.4042  | 29716.189 +/- 1290.1878  |
    |  10 | 19061.28  +/- 512.88561  | 19192.599 +/- 605.66058  |

    2) All VMs had their VCPUs statically pinned to the host's PCPUs.

    $ sysbench --test=cpu ... (time, lower is better)
    | VMs |     w/o this change      |      w/ this change      |
    |   2 | 47.8211   +/- 0.0215504  | 47.826900 +/- 0.0077872  |
    |   6 | 62.689122 +/- 0.0877173  | 62.764539 +/- 0.3882493  |
    |  10 | 90.321097 +/- 1.4803867  | 89.974570 +/- 1.1437566  |

    $ sysbench --test=memory ... (throughput, higher is better)
    | VMs |     w/o this change      |      w/ this change      |
    |   2 | 550.97667 +/- 2.3512355  | 550.87000 +/- 0.8140792  |
    |   6 | 443.15000 +/- 5.7471797  | 454.01056 +/- 8.4373466  |
    |  10 | 313.89233 +/- 1.3237493  | 321.81167 +/- 0.3528418  |

    $ specjbb2005 ... (throughput, higher is better)
    | VMs |     w/o this change      |      w/ this change      |
    |   2 | 49591.057 +/- 952.93384  | 49594.195 +/- 799.57976  |
    |   6 | 33538.247 +/- 1089.2115  | 33671.758 +/- 1077.6806  |
    |  10 | 21927.870 +/- 831.88742  | 21891.131 +/- 563.37929  |

    The numbers show that the change has either no or very limited impact
    (the specjbb2005 case) or, when it does have some impact, that it is a
    real improvement in performance (the sysbench-memory case).

    Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
    Acked-by: George Dunlap <george.dunlap@citrix.com>
    Committed-by: Keir Fraser <keir@xen.org>

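    The decision it describes, reduced to a sketch (mask and helper names
    are made up; the real logic also weighs priorities more carefully):

        /* idlers_mask: idle PCPUs intersected with the woken vcpu's
         * affinity. */
        if ( !cpumask_empty(&idlers_mask) )
        {
            /* Somebody suitable is idle: tickle them and leave the vcpu
             * currently running on this PCPU undisturbed. */
            cpumask_set_cpu(cpumask_any(&idlers_mask), &tickle_mask);
        }
        else if ( new->pri > cur->pri )
        {
            /* No suitable idlers: fall back to preempting this PCPU. */
            cpumask_set_cpu(cpu, &tickle_mask);
        }
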
* xen: sched_credit: improve picking up the idle CPU for a VCPU | Dario Faggioli | 2012-12-18 | 1 | -1/+9

    In _csched_cpu_pick() we try to select the best possible CPU for running
    a VCPU, considering the characteristics of the underlying hardware
    (i.e., how many threads, cores and sockets there are, and how busy they
    are). What we want is "the idle execution vehicle with the most idling
    neighbours in its grouping".

    In order to achieve that, we select a CPU from the VCPU's affinity,
    giving preference to its current processor if possible, as the basis for
    the comparison with all the other CPUs. The problem is, to discount the
    VCPU itself when computing this "idleness" (in an attempt to be fair wrt
    its current processor), we arbitrarily and unconditionally consider that
    selected CPU as idle, even when it is not the case, for instance:

     1. if the CPU is not the one where the VCPU is running (perhaps due to
        the affinity being changed);
     2. if the CPU is where the VCPU is running, but it has other VCPUs in
        its runq, so it won't go idle even if the VCPU in question does.

    This is exemplified in the trace below:

    ]  3.466115364 x|------|------| d10v1 22005(2:2:5) 3 [ a 1 8 ]
       ...
       3.466122856 x|------|------| d10v1 runstate_change d10v1 running->offline
       3.466123046 x|------|------| d?v? runstate_change d32767v0 runnable->running
       ...
    ]  3.466126887 x|------|------| d32767v0 28004(2:8:4) 3 [ a 1 8 ]

    The 22005(...) line (the first line) means _csched_cpu_pick() was called
    on VCPU 1 of domain 10, while it was running on CPU 0, and it chose CPU
    8, which is busy ('|'), even though there are plenty of idle CPUs. That
    is because, as a consequence of changing the VCPU affinity, CPU 8 was
    chosen as the basis for the comparison, and was therefore considered
    idle (its bit gets unconditionally set in the bitmask representing the
    idle CPUs). The 28004(...) line means the VCPU was woken up and queued
    on CPU 8's runq, where it waits for a context switch or a migration in
    order to be able to execute.

    This change fixes things by only considering the "guessed" CPU idle if
    the VCPU in question is both running there and is its only runnable
    VCPU.

    Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
    Acked-by: George Dunlap <george.dunlap@citrix.com>
    Committed-by: Keir Fraser <keir@xen.org>

* xen: sched_credit: define and use curr_on_cpu(cpu) | Dario Faggioli | 2012-12-18 | 1 | -8/+7

    To fetch `per_cpu(schedule_data,cpu).curr' in a more readable way. It's
    in sched-if.h as that is where `struct schedule_data' is declared.

    Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
    Acked-by: George Dunlap <george.dunlap@citrix.com>
    Committed-by: Keir Fraser <keir@xen.org>

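    Going by the description, the macro amounts to (spelling may differ
    slightly from the actual header):

        /* In sched-if.h, next to the declaration of struct schedule_data: */
        #define curr_on_cpu(c)  (per_cpu(schedule_data, (c)).curr)
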
* scheduler: fix rate limit range checking | Jan Beulich | 2012-12-10 | 1 | -13/+3

    For one, neither of the two checks permitted the documented value of
    zero (which disables the functionality altogether). Second, the range
    checking of the command line parameter was done by the credit
    scheduler's initialization code, despite it being a generic scheduler
    option.

    Signed-off-by: Jan Beulich <jbeulich@suse.com>
    Acked-by: Keir Fraser <keir@xen.org>

* xen: sched: generalize scheduling related perfcounter macros | Dario Faggioli | 2012-10-23 | 1 | -66/+57

    Move some of them from sched_credit.c to generic scheduler code. This
    also allows the other schedulers to use perf counters equally easily.

    This change is mainly preparatory work for the above. In fact, it mostly
    does s/CSCHED_STAT/SCHED_STAT/ and, in general, implies no functional
    changes.

    Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
    Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
    Committed-by: Keir Fraser <keir@xen.org>

* xen: Remove sched_credit_default_yield option | George Dunlap | 2012-10-01 | 1 | -7/+2

    The sched_credit_default_yield option was added when the behavior of
    "SCHEDOP_yield" was changed in 4.1, to allow any users who had problems
    to revert to the old behavior. The new behavior has been in Xen.org xen
    since 4.1, and in XenServer even longer, and there is no evidence of
    anyone having trouble with it. Remove the option.

    Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
    Committed-by: Keir Fraser <keir@xen.org>

* scheduler: Implement SCHEDOP sysctl for credit scheduler | George Dunlap | 2012-02-23 | 1 | -0/+53

    Allow tslice_ms and ratelimit_us to be modified.

    Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
    Committed-by: Keir Fraser <keir@xen.org>

* scheduler: Print ratelimit in scheduler debug key | George Dunlap | 2012-02-23 | 1 | -0/+2

    Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
    Committed-by: Keir Fraser <keir@xen.org>

* introduce and use common macros for selecting cpupool based cpumasks | Juergen Gross | 2012-01-24 | 1 | -4/+2

    There are several instances of the same construct finding the cpumask
    for a cpupool. Use macros instead.

    Signed-off-by: juergen.gross@ts.fujitsu.com
    Committed-by: Keir Fraser <keir@xen.org>

* sched_credit: Use delay to control scheduling frequency | Hui Lv | 2012-01-17 | 1 | -1/+46

    This patch can improve Xen performance:

    1. Basically, the "delay method" achieves an 11% overall performance
       boost for SPECvirt over the original credit scheduler.
    2. We have tried a 1ms delay and a 10ms delay; there is no big
       difference between these two configurations (1ms is enough to
       achieve good performance).
    3. We have compared response time/latency at different load levels
       (low, high, peak); the "delay method" didn't bring much of a
       response time increase.
    4. A 1ms delay can reduce context switches by 30% at peak performance,
       which is where the benefit comes from. (int sched_ratelimit_us =
       1000 is the recommended setting.)

    Signed-off-by: Hui Lv <hui.lv@intel.com>
    Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
    Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
    Committed-by: Keir Fraser <keir@xen.org>

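    A sketch of what such a ratelimit check looks like at schedule time
    (field and helper names are hypothetical; sched_ratelimit_us is the
    knob quoted above):

        /* If the current vcpu is still runnable and has run for less than
         * the ratelimit, don't switch; re-evaluate when the limit expires. */
        s_time_t ran_for = now - scurr->start_time;

        if ( vcpu_runnable(current) &&
             ran_for < MICROSECS(sched_ratelimit_us) )
        {
            snext = scurr;                               /* keep running */
            tslice = MICROSECS(sched_ratelimit_us) - ran_for;
        }
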
* Rework locking for sched_adjust. | Dario Faggioli | 2012-01-04 | 1 | -3/+7

    The main idea is to move (as much as possible) locking logic from
    generic code to the various pluggable schedulers. While at it, the
    following is also accomplished:

    - pausing all the non-current VCPUs of a domain while changing its
      scheduling parameters is not effective in avoiding races and it is
      prone to deadlock, so that is removed;
    - sedf needs a global lock for preventing races while adjusting
      domains' scheduling parameters (as it is for credit and credit2), so
      that is added.

    Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
    Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
    Committed-by: Keir Fraser <keir@xen.org>

* eliminate cpus_xyz() | Jan Beulich | 2011-11-08 | 1 | -1/+1

    Signed-off-by: Jan Beulich <jbeulich@suse.com>
    Acked-by: Keir Fraser <keir@xen.org>
    Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

* credit: allocate CPU masks dynamically | Jan Beulich | 2011-10-21 | 1 | -16/+27

    Signed-off-by: Jan Beulich <jbeulich@suse.com>
    Acked-by: Keir Fraser <keir@xen.org>

* cpupools: allocate CPU masks dynamically | Jan Beulich | 2011-10-21 | 1 | -1/+1

    Signed-off-by: Jan Beulich <jbeulich@suse.com>
    Acked-by: Keir Fraser <keir@xen.org>

* allocate CPU sibling and core maps dynamically | Jan Beulich | 2011-10-21 | 1 | -11/+11

    ... thus reducing the per-CPU data area size back to one page even when
    building for large NR_CPUS. At once, eliminate the old
    __cpu{mask,list}_scnprintf() helpers.

    Signed-off-by: Jan Beulich <jbeulich@suse.com>
    Acked-by: Keir Fraser <keir@xen.org>

* eliminate cpumask accessors referencing NR_CPUS | Jan Beulich | 2011-10-21 | 1 | -36/+36

    ... in favor of using the new, nr_cpumask_bits-based ones.

    Signed-off-by: Jan Beulich <jbeulich@suse.com>
    Acked-by: Keir Fraser <keir@xen.org>

* introduce and use nr_cpu_ids and nr_cpumask_bits | Jan Beulich | 2011-10-21 | 1 | -1/+1

    The former is the runtime equivalent of NR_CPUS (and users of NR_CPUS,
    where necessary, get adjusted accordingly), while the latter is for the
    sole use of determining the allocation size when dynamically allocating
    CPU masks (done later in this series).

    Adjust accessors to use either of the two to bound their bitmap
    operations - which one gets used depends on whether accessing the bits
    in the gap between nr_cpu_ids and nr_cpumask_bits is benign but more
    efficient.

    Signed-off-by: Jan Beulich <jbeulich@suse.com>
    Acked-by: Keir Fraser <keir@xen.org>

* use xzalloc in common code | Jan Beulich | 2011-10-04 | 1 | -8/+4

    Signed-off-by: Jan Beulich <jbeulich@suse.com>
    Acked-by: Keir Fraser <keir@xen.org>

* convert more literal uses of cpumask_t to pointers | Jan Beulich | 2011-09-18 | 1 | -1/+1

    This is particularly relevant as the number of CPUs to be supported
    increases (as recently happened for the default thereof).

    Signed-off-by: Jan Beulich <jbeulich@suse.com>

* xen,credit1: Add variable timeslice | George Dunlap | 2011-09-13 | 1 | -29/+36

    Add a Xen command-line parameter, sched_credit_tslice_ms, to set the
    timeslice of the credit1 scheduler.

    Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

* Remove direct cpumask_t members from struct vcpu and struct domain | Jan Beulich | 2011-04-05 | 1 | -4/+4

    The CPU masks embedded in these structures prevent NR_CPUS-independent
    sizing of these structures. The basic concept (in xen/include/cpumask.h)
    is taken from recent Linux.

    For scalability purposes, many other uses of cpumask_t should be
    replaced by cpumask_var_t, particularly local variables of functions.
    This implies that no functions should have by-value cpumask_t
    parameters, and that the whole old cpumask interface (cpus_...()) should
    go away in favor of the new (cpumask_...()) one.

    Signed-off-by: Jan Beulich <jbeulich@novell.com>

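    The replacement pattern, sketched with the Linux-style helpers the
    series adopts (error handling abbreviated; not a literal excerpt):

        /* Old: an NR_CPUS-sized mask embedded by value. */
        cpumask_t mask;

        /* New: a dynamically allocated mask, sized to nr_cpumask_bits. */
        cpumask_var_t mask;

        if ( !zalloc_cpumask_var(&mask) )
            return -ENOMEM;
        cpumask_set_cpu(cpu, mask);
        /* ... use mask ... */
        free_cpumask_var(mask);
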
* _csched_cpu_pick(): simplify sched_smt_power_savings dependent condition | Jan Beulich | 2011-03-14 | 1 | -4/+3

    At least to me, using ?: instead of the (a && ...) || (!a && ...)
    construct is far easier to grok with a single look.

    Signed-off-by: Jan Beulich <jbeulich@novell.com>

* _csched_cpu_pick(): don't write idle bias more than once | Jan Beulich | 2011-03-14 | 1 | -3/+6

    For the bias to be really meaningful, it should be updated only when the
    CPU selected will indeed be returned (and hence used for placing the
    vCPU in question).

    Signed-off-by: Jan Beulich <jbeulich@novell.com>

* _csched_cpu_pick(): don't return CPUs outside vCPU's affinity mask | Jan Beulich | 2011-03-14 | 1 | -0/+1

    This fixes a fairly blatant bug I introduced in c/s 20377:cff23354d026 -
    I wonder how this went unnoticed for so long.

    Signed-off-by: Jan Beulich <jbeulich@novell.com>

* sched_credit: Hold lock while dumping scheduler info | Keir Fraser | 2011-03-05 | 1 | -0/+6

    Dumping the runq with debug key 'r' may cause an endless loop like the
    one below:

    (XEN) active vcpus:
    (XEN)     1: [1.0] pri=0 flags=0 cpu=0 credit=263 [w=256]
    (XEN)     2: [0.2] pri=0 flags=0 cpu=5 credit=284 [w=256]
    (XEN)     3: [0.2] pri=0 flags=0 cpu=5 credit=282 [w=256]
    ...
    (XEN) xxxxx: [0.2] pri=0 flags=0 cpu=2 credit=54 [w=256]
    ...
    (XEN) xxxxx: [0.2] pri=0 flags=0 cpu=3 credit=-48 [w=256]
    ...

    This means the active vcpu 0.2 became non-active, with the active list
    element empty, just after it was accessed in the loop iteration '2:'.

    We should always hold a lock before accessing scheduler-related lists,
    even in debug-purpose dump code.

    Signed-off-by: Wei Gang <gang.wei@intel.com>

* cpupool: Avoid race when moving cpu between cpupools | Juergen Gross | 2011-02-25 | 1 | -1/+2

    Moving cpus between cpupools is done under the schedule lock of the
    moved cpu. When checking whether a cpu is a member of a cpupool, this
    must be done with the lock of that cpu held.

    Hot-unplugging of physical cpus might encounter the same problems, but
    this should happen only very rarely.

    Signed-off-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
    Acked-by: Andre Przywara <andre.przywara@amd.com>
    Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

* Use bool_t for various boolean variables | Keir Fraser | 2010-12-24 | 1 | -2/+2

    ... decreasing cache footprint. As a prerequisite this requires making
    cmdline_parse() a little more flexible.

    Also remove a few variables altogether, and adjust section annotations
    for several others.

    Signed-off-by: Jan Beulich <jbeulich@novell.com>
    Signed-off-by: Keir Fraser <keir@xen.org>

* scheduler: Introduce pcpu_schedule_lock | Keir Fraser | 2010-12-24 | 1 | -4/+4

    Many places in Xen, particularly schedule.c, grab the per-cpu spinlock
    directly, rather than through vcpu_schedule_lock(). Since the lock
    pointer may change between the time it's read and the time the lock is
    successfully acquired, we need to check after acquiring the lock to make
    sure that the pcpu's lock hasn't changed, due to cpu initialization or
    cpupool activity.

    Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

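    The acquire-and-recheck pattern it describes, sketched (close in spirit
    to the real helper, but not a verbatim copy):

        static spinlock_t *pcpu_schedule_lock(unsigned int cpu)
        {
            spinlock_t *lock;

            for ( ; ; )
            {
                lock = per_cpu(schedule_data, cpu).schedule_lock;
                spin_lock(lock);
                /* Re-read: did the pointer change while we spun? */
                if ( lock == per_cpu(schedule_data, cpu).schedule_lock )
                    return lock;        /* still the right lock: done */
                spin_unlock(lock);      /* changed under us: retry */
            }
        }
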
* cpupools: Make interface more consistent | Keir Fraser | 2010-10-24 | 1 | -15/+15

    The current cpupools code interface is a bit inconsistent. This patch
    addresses this by making the interaction for each vcpu in a pool look
    like this:

      alloc_vdata()  -- allocates and sets up vcpu data
      insert_vcpu()  -- the vcpu is ready to run in this pool
      remove_vcpu()  -- take the vcpu out of the pool
      free_vdata()   -- delete allocated vcpu data

    (Previously, remove_vcpu and free_vdata were combined into a "destroy
    vcpu", and insert_vcpu was only called for idle vcpus.)

    This also addresses a bug in credit2 which was caused by a
    misunderstanding of the cpupools interface.

    Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
    Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>

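    As a hook table, the lifecycle above might look roughly like this (the
    signatures are guesses from the description, not the actual struct
    scheduler definition):

        struct vcpu;

        struct sched_vcpu_ops {
            void *(*alloc_vdata)(struct vcpu *v);   /* allocate + set up */
            void  (*insert_vcpu)(struct vcpu *v);   /* ready to run here */
            void  (*remove_vcpu)(struct vcpu *v);   /* leave the pool    */
            void  (*free_vdata)(void *priv);        /* release vcpu data */
        };
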
* sched_credit: Raise bar for inter-socket migrations on mostly-idle systems | Keir Fraser | 2010-09-20 | 1 | -2/+12

    The credit scheduler tries to keep work balanced, even on a mostly idle
    system. Unfortunately, if you have one VM burning cpu and another VM
    idle, the effect is that the busy VM will flip back and forth between
    sockets.

    This patch addresses this by only migrating to a different socket if the
    number of idle processors there is twice that of the socket the vcpu is
    currently on. This will only affect mostly-idle systems; as the system
    becomes busier, other load-balancing code comes into effect.

    Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

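    The threshold as a condition sketch (helper names hypothetical):

        /* Only bounce a vcpu to another socket when the destination is
         * clearly better: at least twice as many idle processors there. */
        if ( crosses_socket(cpu, new_cpu) &&
             nr_idlers_on_socket(new_cpu) < 2 * nr_idlers_on_socket(cpu) )
            continue;   /* not worth an inter-socket migration */
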
* credit1: Make weight per-vcpu | Keir Fraser | 2010-08-13 | 1 | -9/+16

    Change the meaning of credit1's "weight" parameter to be per-vcpu,
    rather than per-VM.

    At the moment, the "weight" parameter for a VM is set on a per-VM basis.
    This means that when cpu time is scarce, two VMs with the same weight
    will be given the same amount of total cpu time, no matter how many
    vcpus they have. I.e., if a VM has 1 vcpu, that vcpu will get x% of cpu
    time; if a VM has 2 vcpus, each vcpu will get (x/2)% of the cpu time.

    I believe this is a counter-intuitive interface. Users often choose to
    add vcpus; when they do so, it's with the expectation that the VM will
    need and use more cpu time. In my experience, however, users rarely
    change the weight parameter. So the normal course of events is for a
    user to decide a VM needs more processing power and add more vcpus, but
    not change the weight. The VM still gets the same amount of cpu time,
    just less efficiently allocated (because it's divided).

    This patch changes the meaning of the "weight" parameter to be
    per-vcpu. Each vcpu is given the weight. So if you add an extra vcpu,
    your VM will get more cpu time as well.

    Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

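    A quick worked example (numbers illustrative): give a 1-vcpu VM and a
    2-vcpu VM a weight of 256 each on a saturated host. With the old per-VM
    meaning, each VM gets half the machine, so the second VM's vcpus run at
    25% each. With the per-vcpu meaning, all three vcpus compete at weight
    256, so each gets roughly a third of the machine, and adding a vcpu
    actually buys more cpu time.
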
* Credit1: Tweak reset condition | Keir Fraser | 2010-08-09 | 1 | -1/+3

    VMs that don't use their full timeslice are guaranteed to flip back and
    forth between "active" and "inactive". If we set credit to 0 when
    setting "inactive", then when the VM comes back to "active" again, it
    will effectively be behind most other vcpus in credit. This causes
    credit1 to effectively discriminate *against* VMs which use less than
    their full timeslice.

    Instead of setting credit to 0, divide it in half. This gets rid of some
    of the system credit while allowing non-cpu-bound VMs to keep some
    priority advantage.

    Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

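    In code terms, the tweak at the active-to-inactive transition is
    essentially (a sketch, not the literal diff):

        /* Was: svc->credit = 0;  -- which penalised light vcpus. */
        svc->credit /= 2;   /* shed system credit, keep some advantage */
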
* scheduler: Implement yield for credit1 | Keir Fraser | 2010-08-09 | 1 | -1/+39

    This patch implements 'yield' for credit1. It does this by attempting to
    put the yielding vcpu behind a single lower-priority vcpu on the
    runqueue. If no lower-priority vcpus are in the queue, it will go at the
    back (which, if the queue is empty, will also be the front).

    Runqueues are sorted every 30ms, so that's the longest this priority
    inversion can last.

    For workloads with heavy concurrency hazard, and guests which implement
    yield-on-spinlock, this patch significantly increases performance and
    total system throughput.

    Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

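    The insertion step, sketched with Linux/Xen-style list helpers (struct
    and field names assumed, not taken from the patch):

        struct list_head *iter;

        /* Find the first vcpu on the runq with lower priority... */
        list_for_each ( iter, runq )
        {
            const struct csched_vcpu *cur =
                list_entry(iter, struct csched_vcpu, runq_elem);
            if ( cur->pri < svc->pri )
                break;
        }

        /* ...and queue the yielding vcpu right behind it; if none was
         * found, iter == runq and this degenerates to the queue tail. */
        list_add(&svc->runq_elem, iter == runq ? runq->prev : iter);
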
* x86: IRQ affinity should track vCPU affinity | Keir Fraser | 2010-06-17 | 1 | -2/+6

    With IRQs getting bound to the CPU the binding vCPU currently runs on,
    quite a bit of extra cross-CPU traffic can result as soon as that vCPU
    moves to a different pCPU. Likewise, when a domain re-binds an event
    channel associated with a pIRQ, that IRQ's affinity should also be
    adjusted.

    The open issue is how to break ties for interrupts shared by multiple
    domains - currently, the last request (at any point in time) is being
    honored.

    Signed-off-by: Jan Beulich <jbeulich@novell.com>

* x86: Dynamically allocate percpu data area when a CPU comes online. | Keir Fraser | 2010-05-18 | 1 | -1/+1

    At the same time, the data area starts life zeroed.

    Signed-off-by: Keir Fraser <keir.fraser@citrix.com>

* cpupool: Fix CPU hotplug after recent changes. | Keir Fraser | 2010-05-17 | 1 | -66/+8

    Signed-off-by: Keir Fraser <keir.fraser@citrix.com>

* tasklet: Improve scheduler interaction. | Keir Fraser | 2010-05-11 | 1 | -2/+3

    Signed-off-by: Keir Fraser <keir.fraser@citrix.com>

* scheduler: const-ify references to 'struct scheduler' where possible. | Keir Fraser | 2010-05-04 | 1 | -23/+23

    Signed-off-by: Keir Fraser <keir.fraser@citrix.com>

* cpupools [1/6]: hypervisor changes | Keir Fraser | 2010-04-21 | 1 | -137/+274

    Signed-off-by: Juergen Gross <juergen.gross@ts.fujitsu.com>

* Implement tasklets as running in VCPU context (specifically, idle-VCPU context) | Keir Fraser | 2010-04-19 | 1 | -1/+13

    ... rather than in softirq context. This is expected to avoid a lot of
    subtle deadlocks relating to the fact that softirqs can interrupt a
    scheduled vcpu.

    Signed-off-by: Keir Fraser <keir.fraser@citrix.com>