xen/xen - xen

	Commit message	Author	Age	Files	Lines
*	nested vmx: synchronize page fault error code match and mask	Dongxiao Xu	2013-01-08	1	-0/+12
\| \| \| \| \| \| \| \| \| \|	Page fault is specially handled not only with exception bitmaps, but also with consideration of page fault error code mask/match values. Therefore in nested virtualization case, the two values need to be synchronized from virtual VMCS to shadow VMCS. Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com> Committed-by: Jan Beulich <jbeulich@suse.com>
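The rule this commit message relies on can be sketched as follows. This is a minimal illustration of how VT-x combines bit 14 of the exception bitmap with the page fault error code mask and match fields, as I understand the Intel SDM; the helper name and the macro are hypothetical and are not taken from the Xen tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Vector 14 is #PF; its bit in the exception bitmap selects the polarity. */
#define PF_VECTOR 14u

/*
 * Hypothetical helper (not a Xen function): returns true if a page fault
 * with error code `pfec` would cause a VM exit, given the exception bitmap
 * and the page fault error code mask/match values.
 */
static bool pf_causes_vmexit(uint32_t exception_bitmap, uint32_t pfec,
                             uint32_t pfec_mask, uint32_t pfec_match)
{
    bool bit14 = (exception_bitmap >> PF_VECTOR) & 1u;
    bool matches = (pfec & pfec_mask) == pfec_match;

    /* The fault exits when the match result equals the bitmap bit. */
    return matches == bit14;
}
```

Because the outcome depends on all three values, copying only the exception bitmap from the virtual VMCS would leave the shadow VMCS filtering page faults differently from what the L1 VMM asked for.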
*	nested vmx: emulate IA32_VMX_MISC MSR	Dongxiao Xu	2013-01-08	2	-1/+4
\| \| \| \| \| \| \| \| \|	Use the host value to emulate IA32_VMX_MISC MSR for L1 VMM. For CR3-target value, we don't support this feature currently and set the number to zero. Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com> Committed-by: Jan Beulich <jbeulich@suse.com>
*	x86/hvm: Bind xen-created event channels to building domain	Daniel De Graaf	2013-01-08	1	-2/+2
\| \| \| \| \| \| \| \| \| \|	Instead of using a hardcoded domain 0 as the endpoint for the event channels created in hvm_vcpu_initialise, use the domain ID of the building domain so that a domain builder in a domain other than dom0 has the expected access to the event channels. Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Committed-by: Jan Beulich <jbeulich@suse.com>
*	x86: fix assertion in get_page_type()	Jan Beulich	2013-01-07	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \|	c/s 22998:e9fab50d7b61 (and immediately following ones) made it possible that __get_page_type() returns other than -EINVAL, in particular -EBUSY. Consequently, the assertion in get_page_type() should check for only the return values we absolutely don't expect to see there. This is XSA-37 / CVE-2013-0154. Signed-off-by: Jan Beulich <jbeulich@suse.com>
*	x86: compat_show_guest_stack() should not truncate MFN	Jan Beulich	2013-01-07	1	-2/+3
\| \| \| \| \| \| \| \|	Re-using "addr" here was a mistake, as it is a 32-bit quantity. Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Keir Fraser <keir@xen.org>
*	IOMMU: add option to specify devices behaving like ones using phantom functions	Jan Beulich	2013-01-07	2	-0/+67
\| \| \| \| \| \| \| \| \|	At least certain Marvell SATA controllers are known to issue bus master requests with a non-zero function as origin, despite themselves being single function devices. Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: "Zhang, Xiantao" <xiantao.zhang@intel.com>
*	VT-d: relax source qualifier for MSI of phantom functions	Jan Beulich	2013-01-07	1	-1/+10
\| \| \| \| \| \| \| \| \|	With ordinary requests allowed to come from phantom functions, the remapping tables ought to be set up to allow for MSI triggers to come from other than the "real" device too. Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: "Zhang, Xiantao" <xiantao.zhang@intel.com>
*	IOMMU: add phantom function support	Jan Beulich	2013-01-07	7	-19/+167
\| \| \| \| \| \| \| \| \| \| \| \| \|	Apart from generating device context entries for the base function, all phantom functions also need context entries to be generated for them. In order to distinguish different use cases, a variant of pci_get_pdev() is being introduced that, even when passed a phantom function number, would return the underlying actual device. Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: "Zhang, Xiantao" <xiantao.zhang@intel.com>
*	IOMMU/PCI: consolidate pdev_type() and cache its result for a given device	Jan Beulich	2013-01-07	5	-27/+40
\| \| \| \| \| \| \| \| \| \| \| \| \|	Add an "unknown" device type as well as one for PCI-to-PCIe bridges (the latter of which other IOMMU code with or without this patch doesn't appear to handle properly). Make sure we don't mistake a device for which we can't access its config space as a legacy PCI device (after all we in fact don't know how to deal with such a device, and hence shouldn't try to). Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: "Zhang, Xiantao" <xiantao.zhang@intel.com>
*	AMD IOMMU: adjust flush function parameters	Jan Beulich	2013-01-07	3	-9/+9
\| \| \| \| \| \| \|	... to use a (struct pci_dev *, devfn) pair. Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: "Zhang, Xiantao" <xiantao.zhang@intel.com>
*	VT-d: adjust context map/unmap parameters	Jan Beulich	2013-01-07	3	-25/+23
\| \| \| \| \| \| \|	... to use a (struct pci_dev *, devfn) pair. Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: "Zhang, Xiantao" <xiantao.zhang@intel.com>
*	IOMMU: adjust add/remove operation parameters	Jan Beulich	2013-01-07	6	-56/+59
\| \| \| \| \| \| \|	... to use a (struct pci_dev *, devfn) pair. Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: "Zhang, Xiantao" <xiantao.zhang@intel.com>
*	IOMMU: adjust (re)assign operation parameters	Jan Beulich	2013-01-07	4	-70/+46
\| \| \| \| \| \| \|	... to use a (struct pci_dev *, devfn) pair. Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: "Zhang, Xiantao" <xiantao.zhang@intel.com>
*	merge	Ian Campbell	2013-01-04	2	-5/+5
\|\
\| *	passthrough/domctl: use correct struct in union	Andrew Cooper	2013-01-04	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This appears to be a copy paste error from c/s 23861:ec7c81fbe0de. It is safe, functionally speaking, as both the xen_domctl_assign_device and xen_domctl_get_device_group structure start with a 'uint32_t machine_sbdf'. We should however use the correct union structure. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Committed-by: Jan Beulich <jbeulich@suse.com>
\| *	tools/tests: Restrict some tests to x86 only	Ian Campbell	2012-12-21	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	MCE injection and x86_emulator are clearly x86 specific. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Committed-by: Ian Jackson <ian.jackson@eu.citrix.com>
* \|	xen: arm: fix guest register access.	Ian Campbell	2012-12-20	5	-9/+74
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We weren't taking the guest mode (CPSR) into account and would always access the user version of the registers. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Tim Deegan <tim@xen.org> Committed-by: Ian Campbell <ian.campbell@citrix.com>
* \|	arm: trim pagetable flag definitions to fit in 80 characters	Tim Deegan	2012-12-20	1	-4/+4
\|/ \| \| \| \| \|	Signed-off-by: Tim Deegan <tim@xen.org> Acked-by: Ian Campbell <ian.campbell@citrix.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
*	x86: also print CRn register values upon double fault	Jan Beulich	2012-12-20	1	-16/+13
\| \| \| \| \| \| \|	Do so by simply re-using _show_registers(). Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org>
*	xen: arm: remove now empty dummy.S	Ian Campbell	2012-12-19	2	-9/+0
\| \| \| \| \| \| \|	Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Tim Deegan <tim@xen.org> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
*	xen: remove nr_irqs_gsi from generic code	Ian Campbell	2012-12-19	6	-14/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The concept is x86 specific. AFAICT the generic concept here is the number of static physical IRQs which the current hardware has, so call this nr_static_irqs. Also, using "defined NR_IRQS" as a stand-in for x86 might have made sense at one point, but it's just cleaner to push the necessary definitions into asm/irq.h. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Keir Fraser <keir@xen.org> Acked-by: Jan Beulich <jbeulich@suse.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
*	libxl: move definition of libxl_domain_config into the IDL	Ian Campbell	2012-12-19	5	-209/+18
\| \| \| \| \| \|	Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
*	xen: arm: mark early_panic as a noreturn function	Ian Campbell	2012-12-19	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \|	Otherwise gcc complains about variables being used when not initialised when in fact that point is never reached. There aren't any instances of this in tree right now, I noticed this while developing another patch. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Tim Deegan <tim@xen.org> Committed-by: Ian Campbell <ian.campbell@citrix.com>
*	xen: arm: introduce arm32 as a subarch of arm.	Ian Campbell	2012-12-19	40	-28/+33
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	- move 32-bit specific files into subarch specific arm32 subdirectory. - move gic.h to xen/include/asm-arm (it is needed from both subarch and generic code). - make the appropriate build and config file changes to support XEN_TARGET_ARCH=arm32. This prepares us for an eventual 64-bit subarch. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
*	xen: arm: reorder registers in struct cpu_user_regs.	Ian Campbell	2012-12-19	4	-7/+11
\| \| \| \| \| \| \| \| \| \| \| \| \|	Primarily this is so that they are ordered in the same way as the mapping from arm64 x0..x31 registers to the arm32 registers, which is just less confusing for everyone going forward. It also makes the implementation of select_user_regs in the next patch slightly simpler. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
*	xen: arm: remove hard tabs from asm code.	Ian Campbell	2012-12-19	4	-471/+498
\| \| \| \| \| \| \| \| \| \| \|	Run expand(1) over xen/arch/arm/.../*.S Add emacs local vars block. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> [ijc -- stripped trailing whitespace caught by git apply] Committed-by: Ian Campbell <ian.campbell@citrix.com>
*	xen: arm: fix long lines in entry.S	Ian Campbell	2012-12-19	1	-33/+33
\| \| \| \| \| \| \| \|	This is a purely whitespace change. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
*	xen: arm: implement share_xen_page_with_privileged_guests	Ian Campbell	2012-12-19	2	-3/+6
\| \| \| \| \| \| \|	Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Tim Deegan <tim@xen.org> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
*	xen: arm: implement send_timer_event.	Ian Campbell	2012-12-19	2	-1/+7
\| \| \| \| \| \| \|	Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Tim Deegan <tim@xen.org> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
*	xen: arm: initialise dom_{xen,io,cow}	Ian Campbell	2012-12-19	4	-2/+31
\| \| \| \| \| \| \|	Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Tim Deegan <tim@xen.org> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
*	xen: arm: stub domain_relinquish_resources.	Ian Campbell	2012-12-19	2	-1/+7
\| \| \| \| \| \| \| \| \|	Currently unimplemented. Domain teardown in general needs looking at. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Tim Deegan <tim@xen.org> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
*	xen: arm: stub out domain_get_maximum_gpfn	Ian Campbell	2012-12-19	2	-1/+5
\| \| \| \| \| \| \| \| \| \|	It currently has no callers, so return ENOSYS until such a time as one arrives. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Tim Deegan <tim@xen.org> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
*	xen: arm: stub page_is_ram_type.	Ian Campbell	2012-12-19	2	-3/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Callers are VT-d (so x86 specific) and various bits of page offlining support, which although it looks generic (and is in xen/common) does things like diving into page_info->count_info which is not generic. In any case this is only reachable via XEN_SYSCTL_page_offline_op, which clearly shouldn't be called on ARM just yet. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Tim Deegan <tim@xen.org> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
*	xen: arm: stub out steal_page.	Ian Campbell	2012-12-19	2	-3/+6
\| \| \| \| \| \| \| \| \| \|	Callers handle the failure gracefully, can be called by GNTTABOP_transfer, XENMEM_exchange or tmem. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Tim Deegan <tim@xen.org> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
*	xen: arm: stub out wallclock time.	Ian Campbell	2012-12-19	2	-5/+18
\| \| \| \| \| \| \| \| \| \| \| \| \|	We don't currently have much concept of wallclock time on ARM (for either the hypervisor, dom0 or guests). For now just stub everything out. Specifically domain_set_time_offset, update_vcpu_system_time and wallclock_time. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Tim Deegan <tim@xen.org> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
*	xen: arm: stub out pirq related functions.	Ian Campbell	2012-12-19	2	-4/+29
\| \| \| \| \| \| \| \| \| \|	On ARM we use GIC functionality to inject virtualised real interrupts for h/w devices rather than evtchn-pirqs. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Tim Deegan <tim@xen.org> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
*	xen: arm: implement arch_vcpu_reset.	Ian Campbell	2012-12-19	2	-1/+5
\| \| \| \| \| \| \| \| \|	Untested. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Tim Deegan <tim@xen.org> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
*	xen: arm: implement arch_get_info_guest	Ian Campbell	2012-12-19	2	-1/+17
\| \| \| \| \| \| \| \| \|	Untested, but basically the inverse of arch_set_info_guest. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Tim Deegan <tim@xen.org> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
*	xen: arm: make smp_send_state_dump a real function	Ian Campbell	2012-12-19	2	-3/+6
\| \| \| \| \| \| \| \| \|	It still doesn't do anything useful, but at least it isn't in dummy.S! Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Tim Deegan <tim@xen.org> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
*	xen: arm: define node_online_map.	Ian Campbell	2012-12-19	3	-2/+4
\| \| \| \| \| \| \| \| \| \|	For now just initialise it as a single online node, which is what asm-arm/numa.h assumes anyway. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Tim Deegan <tim@xen.org> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
*	xen: arm: Call init_xen_time earlier	Ian Campbell	2012-12-19	1	-2/+2
\| \| \| \| \| \| \| \| \| \|	If we panic before calling init_xen_time then the "Rebooting in 5 seconds" delay ends up calling udelay which uses cntfrq before it has been initialised resulting in a divide by zero. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
*	xen/arm: do not map vGIC twice for dom0	Stefano Stabellini	2012-12-19	1	-2/+0
\| \| \| \| \| \| \| \| \|	We don't need to manually set the P2M for the vGIC in construct_dom0, because we have already done it generally for every guest in gicv_setup. Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
*	x86, amd: Disable way access filter on Piledriver CPUs	Andre Przywara	2012-12-19	2	-0/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The Way Access Filter in recent AMD CPUs may hurt the performance of some workloads, caused by aliasing issues in the L1 cache. This patch disables it on the affected CPUs. The issue is similar to that one of last year: http://lkml.indiana.edu/hypermail/linux/kernel/1107.3/00041.html This new patch does not replace the old one, we just need another quirk for newer CPUs. The performance penalty without the patch depends on the circumstances, but is a bit less than the last year's 3%. The workloads affected would be those that access code from the same physical page under different virtual addresses, so different processes using the same libraries with ASLR or multiple instances of PIE-binaries. The code needs to be accessed simultaneously from both cores of the same compute unit. More details can be found here: http://developer.amd.com/Assets/SharedL1InstructionCacheonAMD15hCPU.pdf CPUs affected are anything with the core known as Piledriver. That includes the new parts of the AMD A-Series (aka Trinity) and the just released new CPUs of the FX-Series (aka Vishera). The model numbering is a bit odd here: FX CPUs have model 2, A-Series has model 10h, with possible extensions to 1Fh. Hence the range of model ids. Signed-off-by: Andre Przywara <osp@andrep.de> Add and use MSR_AMD64_IC_CFG. Update the value whenever it is found to not have all bits set, rather than just when it's zero. Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org> Committed-by: Jan Beulich <jbeulich@suse.com>
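The two conditions described in this message can be sketched as plain predicates. This is only an illustration: family 15h is an assumption drawn from the title of the linked AMD document, the model range 02h to 1Fh follows the text above, and both function names and the `quirk_bits` parameter are hypothetical rather than taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical predicate: does the quirk apply to this CPU?
 * Assumes family 15h; models 02h (FX) and 10h..1Fh (A-Series) are
 * covered by a single 02h..1Fh range, as the message suggests.
 */
static bool way_access_filter_quirk_applies(unsigned int family,
                                            unsigned int model)
{
    return family == 0x15 && model >= 0x02 && model <= 0x1f;
}

/*
 * Hypothetical predicate for the follow-up tweak: rewrite the MSR
 * whenever any of the wanted bits is clear, not only when the masked
 * value is zero. `quirk_bits` stands in for the real bit mask.
 */
static bool ic_cfg_needs_update(uint64_t value, uint64_t quirk_bits)
{
    return (value & quirk_bits) != quirk_bits;
}
```

The second predicate differs from a "masked value is zero" check exactly when some, but not all, of the wanted bits are already set.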
*	xen/arch/*: add struct domain parameter to arch_do_domctl	Daniel De Graaf	2012-12-18	6	-411/+84
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since the arch-independent do_domctl function now RCU locks the domain specified by op->domain, pass the struct domain to the arch-specific domctl function and remove the duplicate per-subfunction locking. This also removes two get_domain/put_domain call pairs (in XEN_DOMCTL_assign_device and XEN_DOMCTL_deassign_device), replacing them with RCU locking. Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Acked-by: Tim Deegan <tim@xen.org> Acked-by: Jan Beulich <jbeulich@suse.com> Committed-by: Keir Fraser <keir@xen.org>
*	xen: lock target domain in do_domctl common code	Daniel De Graaf	2012-12-18	1	-209/+59
\| \| \| \| \| \| \| \| \| \| \|	Because almost all domctls need to lock the target domain, do this by default instead of repeating it in each domctl. This is not currently extended to the arch-specific domctls, but RCU locks are safe to take recursively so this only causes duplicate but correct locking. Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Acked-by: Jan Beulich <jbeulich@suse.com> Committed-by: Keir Fraser <keir@xen.org>
*	nested vmx: nested TPR shadow/threshold emulation	Dongxiao Xu	2012-12-18	2	-3/+44
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The TPR shadow/threshold feature is important for speeding up the boot time of Windows guests. Besides, it is a must-have feature for certain VMMs. We map the virtual APIC page address and TPR threshold from the L1 VMCS, and sync them into the shadow VMCS on virtual vmentry. If a TPR_BELOW_THRESHOLD VM exit is triggered by the L2 guest, we inject it into the L1 VMM for handling. Besides, this commit fixes an issue with the APIC access page: if the L1 VMM didn't enable this feature, we need to fill zero into the shadow VMCS. Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com> Committed-by: Keir Fraser <keir@xen.org>
*	xen: sched_credit: add some tracing	Dario Faggioli	2012-12-18	1	-1/+34
\| \| \| \| \| \| \| \|	About tickling, and PCPU selection. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: George Dunlap <george.dunlap@eu.citrix.com> Committed-by: Keir Fraser <keir@xen.org>
*	xen: tracing: introduce per-scheduler trace event IDs	Dario Faggioli	2012-12-18	2	-12/+42
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	So that it becomes possible to create scheduler specific trace records, within each scheduler, without worrying about the overlapping, and also without giving up being able to recognise them univocally. The latter is deemed as useful, since we can have more than one scheduler running at the same time, thanks to cpupools. The event ID is 12 bits, and this change uses the upper 3 of them for the 'scheduler ID'. This means we're limited to 8 schedulers and to 512 scheduler specific tracing events. Both seem reasonable limitations as of now. This also converts the existing credit2 tracing (the only scheduler generating tracing events up to now) to the new system. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: George Dunlap <george.dunlap@eu.citrix.com> Committed-by: Keir Fraser <keir@xen.org>
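The bit layout described above (a 12-bit event ID whose upper 3 bits carry the scheduler ID) can be sketched like this. The macro and function names are hypothetical and chosen for illustration only; they are not the identifiers used in the Xen headers.

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_ID_BITS   3u                    /* upper bits: scheduler ID */
#define SCHED_EVT_BITS  9u                    /* 12 - 3 bits: per-scheduler event */
#define SCHED_EVT_MASK  ((1u << SCHED_EVT_BITS) - 1u)

/* Pack a scheduler ID and a scheduler-specific event into a 12-bit ID. */
static uint16_t sched_trace_id(unsigned int sched_id, unsigned int event)
{
    assert(sched_id < (1u << SCHED_ID_BITS));   /* at most 8 schedulers */
    assert(event <= SCHED_EVT_MASK);            /* at most 512 events each */
    return (uint16_t)((sched_id << SCHED_EVT_BITS) | event);
}

/* Recover the scheduler ID from a packed 12-bit event ID. */
static unsigned int sched_trace_sched(uint16_t id)
{
    return id >> SCHED_EVT_BITS;
}

/* Recover the scheduler-specific event number. */
static unsigned int sched_trace_event(uint16_t id)
{
    return id & SCHED_EVT_MASK;
}
```

With this split, two schedulers can both use event number 1 without colliding, since the packed IDs differ in their upper 3 bits.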
*	xen: sched_credit: improve tickling of idle CPUs	Dario Faggioli	2012-12-18	2	-40/+52
	Right now, when a VCPU wakes up, we check whether it should preempt what is running on the PCPU, and whether or not the waking VCPU can be migrated (by tickling some idlers). However, this can result in suboptimal or even wrong behaviour, as explained here:

	http://lists.xen.org/archives/html/xen-devel/2012-10/msg01732.html

	This change, instead, when deciding which PCPU(s) to tickle upon VCPU wake-up, considers both what is likely to happen on the PCPU where the wakeup occurs, and whether or not there are idlers where the woken-up VCPU can run. In fact, if there are, we can avoid interrupting the running VCPU. Only in case there aren't any of these PCPUs, preemption and migration are the way to go.

	This has been tested (on top of the previous change) by running the following benchmarks inside 2, 6 and 10 VMs, concurrently, on a shared host, each with 2 VCPUs and 960 MB of memory (host had 16 ways and 12 GB RAM).

	1) All VMs had 'cpus="all"' in their config file.

	$ sysbench --test=cpu ... (time, lower is better)
	\| VMs \| w/o this change         \| w/ this change          \|
	\| 2   \| 50.078467 +/- 1.6676162 \| 49.673667 +/- 0.0094321 \|
	\| 6   \| 63.259472 +/- 0.1137586 \| 61.680011 +/- 1.0208723 \|
	\| 10  \| 91.246797 +/- 0.1154008 \| 90.396720 +/- 1.5900423 \|

	$ sysbench --test=memory ... (throughput, higher is better)
	\| VMs \| w/o this change         \| w/ this change          \|
	\| 2   \| 485.56333 +/- 6.0527356 \| 487.83167 +/- 0.7602850 \|
	\| 6   \| 401.36278 +/- 1.9745916 \| 409.96778 +/- 3.6761092 \|
	\| 10  \| 294.43933 +/- 0.8064945 \| 302.49033 +/- 0.2343978 \|

	$ specjbb2005 ... (throughput, higher is better)
	\| VMs \| w/o this change         \| w/ this change          \|
	\| 2   \| 43150.63 +/- 1359.5616  \| 43275.427 +/- 606.28185 \|
	\| 6   \| 29274.29 +/- 1024.4042  \| 29716.189 +/- 1290.1878 \|
	\| 10  \| 19061.28 +/- 512.88561  \| 19192.599 +/- 605.66058 \|

	2) All VMs had their VCPUs statically pinned to the host's PCPUs.

	$ sysbench --test=cpu ... (time, lower is better)
	\| VMs \| w/o this change         \| w/ this change          \|
	\| 2   \| 47.8211 +/- 0.0215504   \| 47.826900 +/- 0.0077872 \|
	\| 6   \| 62.689122 +/- 0.0877173 \| 62.764539 +/- 0.3882493 \|
	\| 10  \| 90.321097 +/- 1.4803867 \| 89.974570 +/- 1.1437566 \|

	$ sysbench --test=memory ... (throughput, higher is better)
	\| VMs \| w/o this change         \| w/ this change          \|
	\| 2   \| 550.97667 +/- 2.3512355 \| 550.87000 +/- 0.8140792 \|
	\| 6   \| 443.15000 +/- 5.7471797 \| 454.01056 +/- 8.4373466 \|
	\| 10  \| 313.89233 +/- 1.3237493 \| 321.81167 +/- 0.3528418 \|

	$ specjbb2005 ... (throughput, higher is better)
	\| VMs \| w/o this change         \| w/ this change          \|
	\| 2   \| 49591.057 +/- 952.93384 \| 49594.195 +/- 799.57976 \|
	\| 6   \| 33538.247 +/- 1089.2115 \| 33671.758 +/- 1077.6806 \|
	\| 10  \| 21927.870 +/- 831.88742 \| 21891.131 +/- 563.37929 \|

	Numbers show how the change has either no or very limited impact (specjbb2005 case) or, when it does have some impact, that is a real improvement in performance (sysbench-memory case).

	Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
	Acked-by: George Dunlap <george.dunlap@citrix.com>
	Committed-by: Keir Fraser <keir@xen.org>
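The decision described in this message can be sketched very roughly as below. This is a simplification under stated assumptions, not the logic of the actual patch: CPUs are modelled as bits in a 64-bit mask, priority comparison is reduced to a single flag, and the struct, field and function names are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type: which idlers to tickle, and whether to preempt. */
struct tickle_decision {
    uint64_t tickle_mask;      /* idle PCPUs to wake up for the new VCPU */
    bool     preempt_current;  /* interrupt the VCPU running on this PCPU */
};

/*
 * Simplified sketch: prefer idle PCPUs the woken VCPU is allowed to run on;
 * fall back to preempting the current VCPU only when no such idler exists
 * and the woken VCPU has higher priority.
 */
static struct tickle_decision decide_tickle(uint64_t idle_cpus,
                                            uint64_t vcpu_affinity,
                                            bool woken_has_higher_prio)
{
    struct tickle_decision d = { 0, false };
    uint64_t usable_idlers = idle_cpus & vcpu_affinity;

    if (usable_idlers != 0)
        d.tickle_mask = usable_idlers;   /* running VCPU is left alone */
    else if (woken_has_higher_prio)
        d.preempt_current = true;        /* no idler can take the VCPU */

    return d;
}
```

For example, with idle CPUs 1 and 2 (mask 0x6) and an affinity limited to CPU 2 (mask 0x4), only CPU 2 is tickled and the running VCPU is not interrupted.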
*	xen: sched_credit: improve picking up the idle CPU for a VCPU	Dario Faggioli	2012-12-18	1	-1/+9
	In _csched_cpu_pick() we try to select the best possible CPU for running a VCPU, considering the characteristics of the underlying hardware (i.e., how many threads, cores, sockets, and how busy they are). What we want is "the idle execution vehicle with the most idling neighbours in its grouping".

	In order to achieve it, we select a CPU from the VCPU's affinity, giving preference to its current processor if possible, as the basis for the comparison with all the other CPUs. Problem is, to discount the VCPU itself when computing this "idleness" (in an attempt to be fair wrt its current processor), we arbitrarily and unconditionally consider that selected CPU as idle, even when it is not the case, for instance:
	1. If the CPU is not the one where the VCPU is running (perhaps due to the affinity being changed);
	2. The CPU is where the VCPU is running, but it has other VCPUs in its runq, so it won't go idle even if the VCPU in question goes.

	This is exemplified in the trace below:

	]  3.466115364 x\|------\|------\| d10v1   22005(2:2:5) 3 [ a 1 8 ]
	   ...         ...         ...
	   3.466122856 x\|------\|------\| d10v1 runstate_change d10v1 running->offline
	   3.466123046 x\|------\|------\| d?v? runstate_change d32767v0 runnable->running
	   ...         ...         ...
	]  3.466126887 x\|------\|------\| d32767v0 28004(2:8:4) 3 [ a 1 8 ]

	The 22005(...) line (the first line) means _csched_cpu_pick() was called on VCPU 1 of domain 10, while it is running on CPU 0, and it chose CPU 8, which is busy ('\|'), even if there are plenty of idle CPUs. That is because, as a consequence of changing the VCPU affinity, CPU 8 was chosen as the basis for the comparison, and therefore considered idle (its bit gets unconditionally set in the bitmask representing the idle CPUs). The 28004(...) line means the VCPU is woken up and queued on CPU 8's runq, where it waits for a context switch or a migration, in order to be able to execute.

	This change fixes things by only considering the "guessed" CPU idle if the VCPU in question is both running there and is its only runnable VCPU.

	Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
	Acked-by: George Dunlap <george.dunlap@citrix.com>
	Committed-by: Keir Fraser <keir@xen.org>
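The condition introduced by this fix can be sketched as a single predicate. The struct and its fields are hypothetical stand-ins for the scheduler's per-CPU state, and `nr_runnable` is assumed to count the running VCPU as well; none of these names come from sched_credit.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-PCPU snapshot, for illustration only. */
struct pcpu_info {
    int          running_vcpu;  /* ID of the VCPU currently running here */
    unsigned int nr_runnable;   /* runnable VCPUs, including the running one */
};

/*
 * Treat the candidate PCPU as idle for comparison purposes only if the
 * VCPU being placed is running there and nothing else is runnable on it,
 * i.e. the PCPU really would go idle if that VCPU moved away.
 */
static bool treat_as_idle(const struct pcpu_info *cpu, int vcpu_id)
{
    return cpu->running_vcpu == vcpu_id && cpu->nr_runnable == 1;
}
```

Under this check, both failure cases listed in the message (the VCPU is not running on the candidate CPU, or the CPU has other runnable VCPUs) leave the candidate counted as busy.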