| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
| |
Page fault is specially handled not only with exception bitmaps,
but also with consideration of page fault error code mask/match
values. Therefore in nested virtualization case, the two values
need to be synchronized from virtual VMCS to shadow VMCS.
Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com>
Committed-by: Jan Beulich <jbeulich@suse.com>
|
|
|
|
|
|
|
|
|
| |
Use the host value to emulate IA32_VMX_MISC MSR for L1 VMM.
For CR3-target value, we don't support this feature currently and
set the number to zero.
Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com>
Committed-by: Jan Beulich <jbeulich@suse.com>
|
|
|
|
|
|
|
|
|
|
| |
Instead of using a hardcoded domain 0 as the endpoint for the event
channels created in hvm_vcpu_initialise, use the domain ID of the
building domain so that a domain builder in a domain other than dom0 has
the expected access to the event channels.
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Committed-by: Jan Beulich <jbeulich@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
c/s 22998:e9fab50d7b61 (and immediately following ones) made it
possible that __get_page_type() returns other than -EINVAL, in
particular -EBUSY. Consequently, the assertion in get_page_type()
should check for only the return values we absolutely don't expect to
see there.
This is XSA-37 / CVE-2013-0154.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
|
|
|
|
|
|
|
|
| |
Re-using "addr" here was a mistake, as it is a 32-bit quantity.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Keir Fraser <keir@xen.org>
|
|
|
|
|
|
|
|
|
| |
At least certain Marvell SATA controllers are known to issue bus master
requests with a non-zero function as origin, despite themselves being
single function devices.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: "Zhang, Xiantao" <xiantao.zhang@intel.com>
|
|
|
|
|
|
|
|
|
| |
With ordinary requests allowed to come from phantom functions, the
remapping tables ought to be set up to allow for MSI triggers to come
from other than the "real" device too.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: "Zhang, Xiantao" <xiantao.zhang@intel.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Apart from generating device context entries for the base function,
all phantom functions also need context entries to be generated for
them.
In order to distinguish different use cases, a variant of
pci_get_pdev() is being introduced that, even when passed a phantom
function number, would return the underlying actual device.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: "Zhang, Xiantao" <xiantao.zhang@intel.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add an "unknown" device types as well as one for PCI-to-PCIe bridges
(the latter of which other IOMMU code with or without this patch
doesn't appear to handle properly).
Make sure we don't mistake a device for which we can't access its
config space as a legacy PCI device (after all we in fact don't know
how to deal with such a device, and hence shouldn't try to).
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: "Zhang, Xiantao" <xiantao.zhang@intel.com>
|
|
|
|
|
|
|
| |
... to use a (struct pci_dev *, devfn) pair.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: "Zhang, Xiantao" <xiantao.zhang@intel.com>
|
|
|
|
|
|
|
| |
... to use a (struct pci_dev *, devfn) pair.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: "Zhang, Xiantao" <xiantao.zhang@intel.com>
|
|
|
|
|
|
|
| |
... to use a (struct pci_dev *, devfn) pair.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: "Zhang, Xiantao" <xiantao.zhang@intel.com>
|
|
|
|
|
|
|
| |
... to use a (struct pci_dev *, devfn) pair.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: "Zhang, Xiantao" <xiantao.zhang@intel.com>
|
|\ |
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This appears to be a copy paste error from c/s 23861:ec7c81fbe0de.
It is safe, functionally speaking, as both the xen_domctl_assign_device
and xen_domctl_get_device_group structure start with a 'uint32_t
machine_sbdf'. We should however use the correct union structure.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Committed-by: Jan Beulich <jbeulich@suse.com>
|
| |
| |
| |
| |
| |
| |
| |
| | |
MCE injection and x86_emulator are clearly x86 specific.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Committed-by: Ian Jackson <ian.jackson@eu.citrix.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
We weren't taking the guest mode (CPSR) into account and would always
access the user version of the registers.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
|
|/
|
|
|
|
| |
Signed-off-by: Tim Deegan <tim@xen.org>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
|
|
|
|
|
|
|
| |
Do so by simply re-using _show_registers().
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
|
|
|
|
|
|
|
| |
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The concept is X86 specific.
AFAICT the generic concept here is the number of static physical IRQs
which the current hardware has, so call this nr_static_irqs.
Also using "defined NR_IRQS" as a standin for x86 might have made
sense at one point but its just cleaner to push the necessary
definitions into asm/irq.h.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Keir Fraser <keir@xen.org>
Acked-by: Jan Beulich <jbeulich@suse.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
|
|
|
|
|
|
| |
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Otherwise gcc complains about variables being used when not
initialised when in fact that point is never reached.
There aren't any instances of this in tree right now, I noticed this
while developing another patch.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
- move 32-bit specific files into subarch specific arm32 subdirectory.
- move gic.h to xen/include/asm-arm (it is needed from both subarch
and generic code).
- make the appropriate build and config file changes to support
XEN_TARGET_ARCH=arm32.
This prepares us for an eventual 64-bit subarch.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Primarily this is so that they are ordered in the same way as the
mapping from arm64 x0..x31 registers to the arm32 registers, which is
just less confusing for everyone going forward.
It also makes the implementation of select_user_regs in the next patch
slightly simpler.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
|
|
|
|
|
|
|
|
|
|
|
| |
Run expand(1) over xen/arch/arm/.../*.S
Add emacs local vars block.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
[ijc -- stripped trailing whitespace caught by git apply]
Committed-by: Ian Campbell <ian.campbell@citrix.com>
|
|
|
|
|
|
|
|
| |
This is a purely whitespace change.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
|
|
|
|
|
|
|
| |
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
|
|
|
|
|
|
|
| |
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
|
|
|
|
|
|
|
| |
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
|
|
|
|
|
|
|
|
|
| |
Currently unimplemented. Domain teardown in general needs looking at.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
|
|
|
|
|
|
|
|
|
|
| |
It currently has no callers, so return ENOSYS until such a time as one
arrives.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Callers are VT-d (so x86 specific) and various bits of page offlining
support, which although it looks generic (and is in xen/common) does
things like diving into page_info->count_info which is not generic.
In any case on this is only reachable via XEN_SYSCTL_page_offline_op,
which clearly shouldn't be called on ARM just yet.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
|
|
|
|
|
|
|
|
|
|
| |
Callers handle the failure gracefully, can be called by
GNTTABOP_transfer, XENMEM_exchange or tmem.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We don't currently have much concept of wallclock time on ARM (for
either the hypervisor, dom0 or guests). For now just stub everything
out. Specifically domain_set_time_offset, update_vcpu_system_time and
wallclock_time.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
`
Committed-by: Ian Campbell <ian.campbell@citrix.com>
|
|
|
|
|
|
|
|
|
|
| |
On ARM we use GIC functionality to inject virtualised real interrupts
for h/w devices rather than evtchn-pirqs.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
|
|
|
|
|
|
|
|
|
| |
Untested.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
|
|
|
|
|
|
|
|
|
| |
Untested, but basically the inverse of arch_set_info_guest.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
|
|
|
|
|
|
|
|
|
| |
It still doesn't do anything useful, but at least it isn't in dummy.S!
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
|
|
|
|
|
|
|
|
|
|
| |
For now just initialise it as a single online node, which is what
asm-arm/numa.h assumes anyway.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
|
|
|
|
|
|
|
|
|
|
| |
If we panic before calling init_xen_time then the "Rebooting in 5
seconds" delay ends up calling udelay which uses cntfrq before it has
been initialised resulting in a divide by zero.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
|
|
|
|
|
|
|
|
|
| |
We don't need to manually set the P2M for the vGIC in construct_dom0,
because we have already done it generally for every guest in gicv_setup.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The Way Access Filter in recent AMD CPUs may hurt the performance of
some workloads, caused by aliasing issues in the L1 cache.
This patch disables it on the affected CPUs.
The issue is similar to that one of last year:
http://lkml.indiana.edu/hypermail/linux/kernel/1107.3/00041.html
This new patch does not replace the old one, we just need another
quirk for newer CPUs.
The performance penalty without the patch depends on the
circumstances, but is a bit less than the last year's 3%.
The workloads affected would be those that access code from the same
physical page under different virtual addresses, so different
processes using the same libraries with ASLR or multiple instances of
PIE-binaries. The code needs to be accessed simultaneously from both
cores of the same compute unit.
More details can be found here:
http://developer.amd.com/Assets/SharedL1InstructionCacheonAMD15hCPU.pdf
CPUs affected are anything with the core known as Piledriver.
That includes the new parts of the AMD A-Series (aka Trinity) and the
just released new CPUs of the FX-Series (aka Vishera).
The model numbering is a bit odd here: FX CPUs have model 2,
A-Series has model 10h, with possible extensions to 1Fh. Hence the
range of model ids.
Signed-off-by: Andre Przywara <osp@andrep.de>
Add and use MSR_AMD64_IC_CFG. Update the value whenever it is found to
not have all bits set, rather than just when it's zero.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
Committed-by: Jan Beulich <jbeulich@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since the arch-independent do_domctl function now RCU locks the domain
specified by op->domain, pass the struct domain to the arch-specific
domctl function and remove the duplicate per-subfunction locking.
This also removes two get_domain/put_domain call pairs (in
XEN_DOMCTL_assign_device and XEN_DOMCTL_deassign_device), replacing
them with RCU locking.
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: Jan Beulich <jbeulich@suse.com>
Committed-by: Keir Fraser <keir@xen.org>
|
|
|
|
|
|
|
|
|
|
|
| |
Because almost all domctls need to lock the target domain, do this by
default instead of repeating it in each domctl. This is not currently
extended to the arch-specific domctls, but RCU locks are safe to take
recursively so this only causes duplicate but correct locking.
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Acked-by: Jan Beulich <jbeulich@suse.com>
Committed-by: Keir Fraser <keir@xen.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
TPR shadow/threshold feature is important to speedup the boot time
for Windows guest. Besides, it is a must feature for certain VMM.
We map virtual APIC page address and TPR threshold from L1 VMCS,
and synch it into shadow VMCS in virtual vmentry.
If TPR_BELOW_THRESHOLD VM exit is triggered by L2 guest, we
inject it into L1 VMM for handling.
Besides, this commit fixes an issue for apic access page, if L1
VMM didn't enable this feature, we need to fill zero into the
shadow VMCS.
Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com>
Committed-by: Keir Fraser <keir@xen.org>
|
|
|
|
|
|
|
|
| |
About tickling, and PCPU selection.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
Committed-by: Keir Fraser <keir@xen.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
So that it becomes possible to create scheduler specific trace
records, within each scheduler, without worrying about the
overlapping, and also without giving up being able to recognise them
univocally. The latter is deemed as useful, since we can have more
than one scheduler running at the same time, thanks to cpupools.
The event ID is 12 bits, and this change uses the upper 3 of them for
the 'scheduler ID'. This means we're limited to 8 schedulers and to
512 scheduler specific tracing events. Both seem reasonable
limitations as of now.
This also converts the existing credit2 tracing (the only scheduler
generating tracing events up to now) to the new system.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
Committed-by: Keir Fraser <keir@xen.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Right now, when a VCPU wakes-up, we check whether it should preempt
what is running on the PCPU, and whether or not the waking VCPU can
be migrated (by tickling some idlers). However, this can result in
suboptimal or even wrong behaviour, as explained here:
http://lists.xen.org/archives/html/xen-devel/2012-10/msg01732.html
This change, instead, when deciding which PCPU(s) to tickle, upon
VCPU wake-up, considers both what it is likely to happen on the PCPU
where the wakeup occurs,and whether or not there are idlers where
the woken-up VCPU can run. In fact, if there are, we can avoid
interrupting the running VCPU. Only in case there aren't any of
these PCPUs, preemption and migration are the way to go.
This has been tested (on top of the previous change) by running
the following benchmarks inside 2, 6 and 10 VMs, concurrently, on
a shared host, each with 2 VCPUs and 960 MB of memory (host had 16
ways and 12 GB RAM).
1) All VMs had 'cpus="all"' in their config file.
$ sysbench --test=cpu ... (time, lower is better)
| VMs | w/o this change | w/ this change |
| 2 | 50.078467 +/- 1.6676162 | 49.673667 +/- 0.0094321 |
| 6 | 63.259472 +/- 0.1137586 | 61.680011 +/- 1.0208723 |
| 10 | 91.246797 +/- 0.1154008 | 90.396720 +/- 1.5900423 |
$ sysbench --test=memory ... (throughput, higher is better)
| VMs | w/o this change | w/ this change |
| 2 | 485.56333 +/- 6.0527356 | 487.83167 +/- 0.7602850 |
| 6 | 401.36278 +/- 1.9745916 | 409.96778 +/- 3.6761092 |
| 10 | 294.43933 +/- 0.8064945 | 302.49033 +/- 0.2343978 |
$ specjbb2005 ... (throughput, higher is better)
| VMs | w/o this change | w/ this change |
| 2 | 43150.63 +/- 1359.5616 | 43275.427 +/- 606.28185 |
| 6 | 29274.29 +/- 1024.4042 | 29716.189 +/- 1290.1878 |
| 10 | 19061.28 +/- 512.88561 | 19192.599 +/- 605.66058 |
2) All VMs had their VCPUs statically pinned to the host's PCPUs.
$ sysbench --test=cpu ... (time, lower is better)
| VMs | w/o this change | w/ this change |
| 2 | 47.8211 +/- 0.0215504 | 47.826900 +/- 0.0077872 |
| 6 | 62.689122 +/- 0.0877173 | 62.764539 +/- 0.3882493 |
| 10 | 90.321097 +/- 1.4803867 | 89.974570 +/- 1.1437566 |
$ sysbench --test=memory ... (throughput, higher is better)
| VMs | w/o this change | w/ this change |
| 2 | 550.97667 +/- 2.3512355 | 550.87000 +/- 0.8140792 |
| 6 | 443.15000 +/- 5.7471797 | 454.01056 +/- 8.4373466 |
| 10 | 313.89233 +/- 1.3237493 | 321.81167 +/- 0.3528418 |
$ specjbb2005 ... (throughput, higher is better)
| 2 | 49591.057 +/- 952.93384 | 49594.195 +/- 799.57976 |
| 6 | 33538.247 +/- 1089.2115 | 33671.758 +/- 1077.6806 |
| 10 | 21927.870 +/- 831.88742 | 21891.131 +/- 563.37929 |
Numbers show how the change has either no or very limited impact
(specjbb2005 case) or, when it does have some impact, that is a
real improvement in performances (sysbench-memory case).
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Committed-by: Keir Fraser <keir@xen.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In _csched_cpu_pick() we try to select the best possible CPU for
running a VCPU, considering the characteristics of the underlying
hardware (i.e., how many threads, core, sockets, and how busy they
are). What we want is "the idle execution vehicle with the most
idling neighbours in its grouping".
In order to achieve it, we select a CPU from the VCPU's affinity,
giving preference to its current processor if possible, as the basis
for the comparison with all the other CPUs. Problem is, to discount
the VCPU itself when computing this "idleness" (in an attempt to be
fair wrt its current processor), we arbitrarily and unconditionally
consider that selected CPU as idle, even when it is not the case,
for instance:
1. If the CPU is not the one where the VCPU is running (perhaps due
to the affinity being changed);
2. The CPU is where the VCPU is running, but it has other VCPUs in
its runq, so it won't go idle even if the VCPU in question goes.
This is exemplified in the trace below:
] 3.466115364 x|------|------| d10v1 22005(2:2:5) 3 [ a 1 8 ]
... ... ...
3.466122856 x|------|------| d10v1 runstate_change d10v1
running->offline
3.466123046 x|------|------| d?v? runstate_change d32767v0
runnable->running
... ... ...
] 3.466126887 x|------|------| d32767v0 28004(2:8:4) 3 [ a 1 8 ]
22005(...) line (the first line) means _csched_cpu_pick() was called
on VCPU 1 of domain 10, while it is running on CPU 0, and it choose
CPU 8, which is busy ('|'), even if there are plenty of idle
CPUs. That is because, as a consequence of changing the VCPU affinity,
CPU 8 was chosen as the basis for the comparison, and therefore
considered idle (its bit gets unconditionally set in the bitmask
representing the idle CPUs). 28004(...) line means the VCPU is woken
up and queued on CPU 8's runq, where it waits for a context switch or
a migration, in order to be able to execute.
This change fixes things by only considering the "guessed" CPU idle if
the VCPU in question is both running there and is its only runnable
VCPU.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Committed-by: Keir Fraser <keir@xen.org>
|