* xl: allow specifying a default gatewaydev in xl.conf  (Roger Pau Monne, 2013-03-15; 5 files, -0/+29)

  This adds a new global option in the xl configuration file called
  "vif.default.gatewaydev", that is used to specify the default gatewaydev
  to use when none is passed in the vif specification.

  Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
  Acked-by: Ian Campbell <ian.campbell@citrix.com>
  Tested-by: Ulf Kreutzberg <ulf.kreutzberg@hosteurope.de>
  Cc: Ulf Kreutzberg <ulf.kreutzberg@hosteurope.de>
  Cc: Ian Campbell <ian.campbell@citrix.com>
  Cc: George Dunlap <george.dunlap@citrix.com>
  Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

* xl/libxl: add gatewaydev/netdev to vif specification  (Roger Pau Monne, 2013-03-15; 5 files, -2/+38)

  This option is used by the vif-route hotplug script. A new more descriptive
  name is used, "gatewaydev", but "netdev" is also supported as a deprecated
  backwards compatible option. This option was supported in the past, according
  to http://wiki.xen.org/wiki/Vif-route, so we should also support it in libxl.

  Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
  Acked-by: Ian Campbell <ian.campbell@citrix.com>
  Tested-by: Ulf Kreutzberg <ulf.kreutzberg@hosteurope.de>
  Cc: Ulf Kreutzberg <ulf.kreutzberg@hosteurope.de>
  Cc: Ian Campbell <ian.campbell@citrix.com>
  Cc: George Dunlap <george.dunlap@citrix.com>
  Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

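  For illustration, a hedged sketch of how the two options above fit together;
  the device names, MAC and IP values are made up:

      # /etc/xen/xl.conf -- global default used when a vif does not name one
      vif.default.gatewaydev="eth0"

      # guest config -- a per-vif "gatewaydev" (or the deprecated "netdev")
      # overrides the global default for the vif-route hotplug script
      vif = [ 'mac=00:16:3e:aa:bb:cc,ip=10.0.0.2,gatewaydev=eth1,script=vif-route' ]
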
* x86/mm: avoid undefined behavior in IS_NIL()  (Xi Wang, 2013-03-15; 1 file, -2/+2)

  Since pointer overflow is undefined behavior in C, some compilers such as
  clang optimize away the check !((ptr) + 1) in the macro IS_NIL(). This patch
  fixes the issue by casting the pointer type to uintptr_t, the operations of
  which are well-defined.

  Signed-off-by: Xi Wang <xi@mit.edu>

  With that, we also need to avoid the overflow in NIL(). Note that either part
  of the change results in the respective macros becoming unsuitable for use
  with "void".

  Signed-off-by: Jan Beulich <jbeulich@suse.com>

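  The kind of rewrite being described might look roughly like this; these are
  illustrative macros, not the exact Xen definitions:

      #include <stdint.h>

      /* Before: if ptr is a sentinel near the top of the address space,
       * (ptr) + 1 overflows a pointer, which is undefined behaviour, so the
       * compiler may delete the test entirely. */
      #define IS_NIL_BAD(ptr)  (!((ptr) + 1))

      /* After: do the arithmetic on uintptr_t, where wraparound is well
       * defined.  NIL() is adjusted to match, so IS_NIL(NIL(type)) still
       * holds.  Both macros now use sizeof(), hence are unusable with void. */
      #define NIL(type)        ((type *)-(intptr_t)sizeof(type))
      #define IS_NIL(ptr)      (!((uintptr_t)(ptr) + sizeof(*(ptr))))
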
* tools: libxl: unbreak build after ec41430ef6a7  (Ian Campbell, 2013-03-14; 1 file, -2/+2)

  libxl_create.c: In function ‘libxl__domain_build_info_setdefault’:
  libxl_create.c:109: error: ‘info’ undeclared (first use in this function)
  libxl_create.c:109: error: (Each undeclared identifier is reported only once
  libxl_create.c:109: error: for each function it appears in.)
  cc1: warnings being treated as errors
  libxl_create.c:108: error: suggest explicit braces to avoid ambiguous ‘else’
  libxl_create.c: At top level:
  libxl_create.c:141: error: expected identifier or ‘(’ before ‘if’
  ...

  Fix is to insert the missing opening brace and s/info/b_info/ in one spot.

  Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
  Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
  Committed-by: Ian Jackson <ian.jackson@eu.citrix.com>

* x86: extend diagnostics for "No irq handler for vector" messages  (Jan Beulich, 2013-03-14; 2 files, -9/+23)

  By storing the inverted IRQ number in vector_irq[], we may be able to spot
  which IRQ a vector was used for most recently, thus hopefully permitting us
  to understand why these messages trigger on certain systems.

  Signed-off-by: Jan Beulich <jbeulich@suse.com>
  Acked-by: Keir Fraser <keir@xen.org>

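  A rough sketch of the idea, with hypothetical names (the real code
  manipulates Xen's per-CPU vector_irq table):

      /* Sketch: a free vector normally holds a plain negative marker.  Keeping
       * the bitwise complement of the last owning IRQ instead lets a later
       * "No irq handler for vector" report recover it as ~vector_irq[v]. */
      #define NR_VECTORS 256
      static int vector_irq[NR_VECTORS];

      static void release_vector(unsigned int vector, int irq)
      {
          vector_irq[vector] = ~irq;   /* still negative, but not information-free */
      }
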
* x86/mem_access: check for errors in p2m->set_entry().  (Tim Deegan, 2013-03-14; 1 file, -7/+18)

  These calls ought always to succeed. Assert that they do rather than ignoring
  the return value.

  Signed-off-by: Tim Deegan <tim@xen.org>
  Acked-by: Aravindh Puthiyaparambil <aravindh@virtuata.com>

* x86/mem_sharing: check for errors in p2m->set_entry().  (Tim Deegan, 2013-03-14; 1 file, -4/+8)

  This call ought always to succeed. Assert that it does rather than ignoring
  the return value.

  Signed-off-by: Tim Deegan <tim@xen.org>
  Acked-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
  Acked-by: Jan Beulich <jbeulich@suse.com>

* x86/ept: check for errors in a few callers of ept_set_entry.  (Tim Deegan, 2013-03-14; 1 file, -5/+15)

  AFAICT in all these cases we have the p2m lock and have just checked that the
  p2m trie is populated, so the call should succeed. Make it explicit with
  ASSERT() rather than just ignoring the result.

  Signed-off-by: Tim Deegan <tim@xen.org>
  Acked-by: Jan Beulich <jbeulich@suse.com>

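  The three patches above share one pattern; a hedged sketch of it follows.
  The argument list and the success convention are abbreviated here, since the
  exact set_entry signature and return values differ between trees:

      /* Before: the return value of the p2m update was silently dropped. */
      p2m->set_entry(p2m, gfn, mfn, order, p2m_ram_rw, access);

      /* After: the caller holds the p2m lock and has just verified that the
       * trie is populated, so a failure here indicates a bug -- make it loud.
       * (Success is written as rc == 0 for illustration only.) */
      rc = p2m->set_entry(p2m, gfn, mfn, order, p2m_ram_rw, access);
      ASSERT(rc == 0);
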
* x86/mm: warn if we ever run out of shadow/hap pool for p2m/lgd ops.  (Tim Deegan, 2013-03-14; 3 files, -1/+15)

  Even if the error propagates up through the p2m ops to the caller, it'll look
  like ENOMEM, which won't obviously be a shadow-pool problem. Warn on the
  console, once per domain.

  Reported-by: Jan Beulich <jbeulich@suse.com>
  Signed-off-by: Tim Deegan <tim@xen.org>
  Acked-by: Jan Beulich <jbeulich@suse.com>

* x86/mm: use bool_t for flags in shadow-pagetable structs  (Tim Deegan, 2013-03-14; 1 file, -11/+11)

  ... and reshuffle the domain struct to pack a little better.

  Signed-off-by: Tim Deegan <tim@xen.org>
  Acked-by: Jan Beulich <jbeulich@suse.com>

* libxl: use qemu-xen (upstream QEMU) as device model by default  (Stefano Stabellini, 2013-03-13; 5 files, -9/+25)

  Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
  Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
  Acked-by: Ian Campbell <ian.campbell@citrix.com>

* libxl: move check for existence of qemuu device model  (Ian Jackson, 2013-03-13; 1 file, -5/+8)

  The stat in libxl__domain_build_info_setdefault's default device model logic
  works to fall back to qemu-xen-traditional whenever the executable for
  qemu-xen is not found. We are going to use qemu-xen-traditional in more
  cases, so break this check out into its own if statement.

  Also add a pair of braces to make the if() statement symmetrical.

  Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
  Acked-by: Ian Campbell <ian.campbell@citrix.com>

* libxl: move libxl_device_action to idl  (Roger Pau Monne, 2013-03-13; 6 files, -24/+25)

  Move to idl for ease of expansion and auto-generated functions.

  Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
  Cc: Ian Campbell <ian.campbell@citrix.com>
  Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

* libxl: remove double check in NetBSD hotplug  (Roger Pau Monne, 2013-03-13; 1 file, -4/+0)

  Remove a duplicated check performed in libxl__get_hotplug_script_info for
  NetBSD.

  Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
  Cc: Ian Campbell <ian.campbell@citrix.com>
  Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

* libxl: don't launch more than one tapdisk process for each disk  (Roger Pau Monne, 2013-03-13; 1 file, -7/+10)

  When adding a disk, don't launch multiple tapdisk instances for the same
  disk: if the transaction in device_disk_add fails, reuse the same tapdisk for
  further tries instead of creating a new instance each time the transaction
  fails.

  Reported-by: Darren Shepherd <darren.s.shepherd@gmail.com>
  Signed-off-by: Roger Pau Monne <roger.pau@citrix.com>
  Tested-by: Darren Shepherd <darren.s.shepherd@gmail.com>

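  A generic, hedged sketch of the shape of such a fix, using the public
  xenstore client API rather than libxl's internal wrappers; the function and
  path names here are illustrative:

      #include <stdbool.h>
      #include <errno.h>
      #include <xenstore.h>

      /* Do the expensive, non-idempotent work (spawning tapdisk) once, outside
       * the transaction retry loop, instead of once per EAGAIN retry. */
      static int add_disk_sketch(struct xs_handle *xs, const char *be_path)
      {
          bool tapdisk_started = false;

          for ( ;; )
          {
              xs_transaction_t t = xs_transaction_start(xs);

              if ( t == XBT_NULL )
                  return -1;

              if ( !tapdisk_started )
              {
                  /* stand-in for the real tapdisk setup */
                  tapdisk_started = true;
              }

              /* ... write the backend/frontend nodes under be_path ... */
              (void)be_path;

              if ( xs_transaction_end(xs, t, false) )
                  return 0;              /* committed */
              if ( errno != EAGAIN )
                  return -1;             /* real error */
              /* conflict: retry the xenstore writes, but keep the tapdisk */
          }
      }
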
* xen: arm: create dom0 DTB /hypervisor/ node dynamically.  (Ian Campbell, 2013-03-13; 1 file, -2/+55)

  I initially added hypervisor-new and confirmed via /proc/device-tree that the
  content is the same, before changing it to drop and replace an existing node.

  NB: There is an ambiguity in the compatibility property.
  linux/arch/arm/boot/dts/xenvm-4.2.dts says "xen,xen-4.2" while
  Documentation/devicetree/bindings/arm/xen.txt says "xen,xen-4.3". I have used
  the actual hypervisor version, as discussed in
  http://marc.info/?l=xen-devel&m=135963416631423

  Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
  Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

* xen: strip xen,multiboot-module nodes from dom0 device tree  (Ian Campbell, 2013-03-13; 1 file, -2/+30)

  These nodes are used by Xen to find the initial modules.

  Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
  Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

* xen: arm: parse modules from DT during early boot.  (Ian Campbell, 2013-03-13; 2 files, -1/+78)

  The bootloader should populate /chosen/modules/module@<N>/ for each module it
  wishes to pass to the hypervisor. The content of these nodes is described in
  docs/misc/arm/device-tree/booting.txt.

  Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
  Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

* dtb: correct handling of #address-cells and #size-cells.  (Ian Campbell, 2013-03-13; 3 files, -7/+17)

  If a node does not have #*-cells then the parent's value should be used.
  Currently we were assuming zero, which is useless.

  Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
  Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

* xen: correct BITS_PER_EVTCHN_WORD on arm  (Ian Campbell, 2013-03-12; 3 files, -2/+7)

  This is always 64-bit on ARM, not BITS_PER_LONG.

  Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
  Acked-by: Keir Fraser <keir@xen.org>

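  For context, a sketch of what such a per-architecture definition can look
  like (illustrative, not the literal Xen headers):

      /* ARM guests always use 64-bit event channel words, regardless of the
       * hypervisor's own word size; on x86 it depends on whether the guest
       * uses the 32-bit shared-info layout. */
      #if defined(CONFIG_ARM)
      #define BITS_PER_EVTCHN_WORD(d) 64
      #else   /* x86 */
      #define BITS_PER_EVTCHN_WORD(d) (has_32bit_shinfo(d) ? 32 : BITS_PER_LONG)
      #endif
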
* coverage: fix on ARM  (Ian Campbell, 2013-03-12; 3 files, -18/+11)

  Use a list of pointers to simplify the handling of 32- vs 64-bit. Also on ARM
  the section name is ".init_array" and not ".ctors".

  Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
  Acked-by: Keir Fraser <keir@xen.org>
  [ ijc -- tweak whitespace per Frediano's comment ]

* x86/MCA: suppress bank clearing for certain injected events  (Jan Beulich, 2013-03-12; 2 files, -9/+11)

  The bits indicating validity of the ADDR and MISC bank MSRs may be injected
  in a way that isn't consistent with what the underlying hardware implements
  (while the bank must be valid for injection to work, the auxiliary MSRs may
  not be implemented - and hence cause #GP upon access - if the hardware never
  sets the corresponding valid bits). Consequently we need to do the clearing
  writes only if no value was interposed for the respective MSR (which also
  makes sense the other way around: there's no point in clearing a hardware
  register when all data read came from software).

  Of course this all requires the injection tool to do things in a consistent
  way (but that had been a requirement before already).

  Signed-off-by: Jan Beulich <jbeulich@suse.com>
  Tested-by: Ren Yongjie <yongjie.ren@intel.com>
  Acked-by: Liu Jinsong <jinsong.liu@intel.com>

* vpmu intel: pass through cpuid bits when BTS is enabled  (Dietmar Hahn, 2013-03-12; 1 file, -0/+4)

  This patch passes the original cpuid bits for X86_FEATURE_DTES64 (64-bit DS
  Area) and X86_FEATURE_DSCPL (CPL Qualified Debug Store) to the guest when the
  BTS feature is switched on. I forgot this when I did the BTS emulation.

  Signed-off-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>

* powernow: add fixups for AMD P-state figures  (Konrad Rzeszutek Wilk, 2013-03-12; 1 file, -6/+50)

  In the Linux kernel, these two git commits:
   - f594065faf4f9067c2283a34619fc0714e79a98d
     ACPI: Add fixups for AMD P-state figures
   - 9855d8ce41a7801548a05d844db2f46c3e810166
     ACPI: Check MSR valid bit before using P-state frequencies
  try to fix the issue that "some AMD systems may round the frequencies in ACPI
  tables to 100MHz boundaries. We can obtain the real frequencies from MSRs, so
  add a quirk to fix these frequencies up on AMD systems." (from f594065..)

  In discussion (around 9855d8..) "it turned out that indeed real HW/BIOSes may
  choose to not set the valid bit and thus mark the P-state as invalid. So this
  could be considered a fix for broken BIOSes." (from 9855d8..)

  which is great for Linux. Unfortunately the Linux kernel, when it tries to do
  the RDMSR under Xen, fails to get the right value (it gets zero) as Xen traps
  it and returns zero. Hence when dom0 uploads the P-states they will be
  unmodified and we should take care of updating the frequencies with the right
  values.

  I've tested it under Dell Inc. PowerEdge T105 /0RR825, BIOS 1.3.2 08/20/2008
  where this quirk can be observed (x86 == 0x10, model == 2). Also on other AMD
  (x86 == 0x12, A8-3850; x86 == 0x14, AMD E-350) to make sure the quirk is not
  applied there.

  Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
  Acked-by: stefan.bader@canonical.com

  Do the MSR access here (and while at it, also the one reading
  MSR_PSTATE_CUR_LIMIT) on the target CPU, and bound the loop over
  amd_fixup_frequency() by max_hw_pstate (matching the one in
  powernow_cpufreq_cpu_init()).

  Signed-off-by: Jan Beulich <jbeulich@suse.com>

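  A rough sketch of the kind of fixup being described, mirroring the Linux
  quirk referenced above; the MSR bit layout (CpuFid/CpuDid) is family
  specific, so treat the constants here as illustrative rather than
  authoritative:

      #include <stdint.h>

      /* The ACPI _PSS table may report 100 MHz-rounded frequencies, while the
       * P-state MSR's frequency ID/divisor ID fields give the real ones. */
      static uint32_t amd_fixup_frequency_sketch(uint32_t acpi_mhz,
                                                 uint64_t pstate_msr,
                                                 unsigned int family)
      {
          uint32_t fid = pstate_msr & 0x3f;         /* CpuFid (illustrative) */
          uint32_t did = (pstate_msr >> 6) & 0x7;   /* CpuDid (illustrative) */

          if ( !(pstate_msr & (1ULL << 63)) )       /* P-state entry not valid */
              return acpi_mhz;

          if ( family == 0x10 )                     /* e.g. the PowerEdge T105 */
              return (100 * (fid + 0x10)) >> did;
          if ( family == 0x11 )
              return (100 * (fid + 8)) >> did;

          return acpi_mhz;                          /* other families untouched */
      }
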
* mmu: Introduce XENMEM_claim_pages (subop of memory ops)  (Dan Magenheimer, 2013-03-11; 8 files, -5/+188)

  When guests' memory consumption is volatile (multiple guests ballooning
  up/down) we are presented with the problem of being able to determine exactly
  how much memory there is for allocation of new guests without negatively
  impacting existing guests. Note that the existing models (xapi, xend) drive
  the memory consumption from the tool-stack and assume that the guest will
  eventually hit the memory target. Other models, such as the dynamic memory
  utilized by tmem, do this differently - the guest drives the memory
  consumption (up to the d->max_pages ceiling). With the dynamic memory model,
  the guest frequently can balloon up and down as it sees fit. This presents
  the problem to the toolstack that it does not know atomically how much free
  memory there is (as the information gets stale the moment the d->tot_pages
  information is provided to the tool-stack), and hence starting a guest can
  fail during the memory creation process, especially if the process is done
  in parallel.

  In a nutshell what we need is an atomic value of all domains' tot_pages
  during the allocation of guests. Naturally holding a lock for such a long
  time is unacceptable. Hence the goal of this hypercall is to attempt to
  atomically and very quickly determine if there are sufficient pages
  available in the system and, if so, "set aside" that quantity of pages for
  future allocations by that domain. Unlike an existing hypercall such as
  increase_reservation or populate_physmap, specific physical pageframes are
  not assigned to the domain because this cannot be done sufficiently quickly
  (especially for very large allocations in an arbitrarily fragmented system)
  and so the existing mechanisms result in classic
  time-of-check-time-of-use (TOCTOU) races. One can think of claiming as
  similar to a "lazy" allocation, but subsequent hypercalls are required to do
  the actual physical pageframe allocation.

  Note that one of the effects of this hypercall is that, from the perspective
  of other running guests, suddenly there is a new guest occupying X amount of
  pages. This means that when they try to balloon up they will hit the
  system-wide ceiling of available free memory (if the total sum of the
  existing d->max_pages >= host memory). This is OK - as that is part of the
  overcommit. What we DO NOT want to do is dictate what their ceiling should
  be (d->max_pages), as that is risky and can lead to guests OOM-ing. It is
  something the guest needs to figure out.

  In order for a toolstack to "get" information about whether a domain has a
  claim and, if so, how large, and also for the toolstack to measure the total
  system-wide claim, a second subop has been added and exposed through domctl
  and libxl (see "xen: XENMEM_claim_pages: xc").

  == Alternative solutions ==

  There has been a variety of discussion whether the problem the hypercall is
  solving can be done in user-space, such as:

   - For all the existing guests, set their d->max_pages temporarily to
     d->tot_pages and create the domain. This forces those domains to stay at
     their current consumption level (fyi, this is what the tmem freeze call
     is doing). The disadvantage of this is that it needlessly forces the
     guests to stay at their current memory usage instead of allowing them to
     decide the optimal target.

   - Account only using d->max_pages of how much free memory there is. This
     ignores ballooning changes and any over-commit scenario. This is similar
     to the scenario where the sum of all d->max_pages (and the one to be
     allocated now) on the host is smaller than the available free memory. As
     such it ignores the over-commit problem.

   - Provide a ring/FIFO along with an event channel to notify a userspace
     daemon of guests' memory consumption. This daemon can then provide
     up-to-date information to the toolstack of how much free memory there is.
     This duplicates what the hypervisor is already doing and introduces
     latency issues, and leaves the toolstack catching its breath, as there
     might be millions of these updates on a heavily used machine. There might
     not be any quiescent state ever, and the toolstack will heavily consume
     CPU cycles and never provide up-to-date information.

  It has been noted that this claim mechanism solves the underlying problem
  (slow failure of domain creation) for a large class of domains but not all,
  specifically not handling (but also not making the problem worse for) PV
  domains that specify the "superpages" flag, and 32-bit PV domains on large
  RAM systems. These will be addressed at a later time.

  Code overview:

  Though the hypercall simply does arithmetic within locks, some of the
  semantics in the code may be a bit subtle. The key variables
  (d->unclaimed_pages and total_unclaimed_pages) start at zero if no claim has
  yet been staked for any domain. (Perhaps a better name is
  "claimed_but_not_yet_possessed", but that's a bit unwieldy.) If no claim
  hypercalls are executed, there should be no impact on existing usage.

  When a claim is successfully staked by a domain, it is like a watermark, but
  there is no record kept of the size of the claim. Instead,
  d->unclaimed_pages is set to the difference between d->tot_pages and the
  claim. When d->tot_pages increases or decreases, d->unclaimed_pages
  atomically decreases or increases. Once d->unclaimed_pages reaches zero, the
  claim is satisfied and d->unclaimed_pages stays at zero -- unless a new
  claim is subsequently staked.

  The systemwide variable total_unclaimed_pages is always the sum of
  d->unclaimed_pages across all domains. A non-domain-specific heap allocation
  will fail if total_unclaimed_pages exceeds free (plus, on tmem enabled
  systems, freeable) pages.

  Claim semantics could be modified by flags. The initial implementation had
  three flags, which discerned whether the caller would like tmem freeable
  pages to be considered in determining whether or not the claim can be
  successfully staked. This was removed in later patches and there are no
  flags.

  A claim can be cancelled by requesting a claim with the number of pages
  being zero.

  A second subop returns the total outstanding claimed pages systemwide.

  Note: Save/restore/migrate may need to be modified, else it can be
  documented that all claims are cancelled.

  This patch of the proposed XENMEM_claim_pages hypercall/subop takes into
  account review feedback from Jan and Keir and IanC and Matthew Daley, plus
  some fixes found via runtime debugging.

  Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
  Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
  Acked-by: Tim Deegan <tim@xen.org>
  Acked-by: Keir Fraser <keir@xen.org>

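  A heavily simplified sketch of the claim arithmetic described above; the
  variable names follow the commit text, while locking details, tmem handling
  and the hypercall plumbing are omitted, and the symbols are assumed to be the
  Xen-internal ones:

      /* Stake a claim of "pages" total pages for domain d.  Succeeds only if
       * the not-yet-possessed part still fits into free memory after all other
       * outstanding claims are accounted for. */
      static long claim_pages_sketch(struct domain *d, unsigned long pages)
      {
          long rc = -ENOMEM;

          spin_lock(&heap_lock);

          /* Cancel any prior claim; a request of zero pages is just a cancel. */
          total_unclaimed_pages -= d->unclaimed_pages;
          d->unclaimed_pages = 0;

          if ( pages == 0 )
              rc = 0;
          else if ( pages > d->tot_pages &&
                    pages - d->tot_pages <=
                    total_avail_pages - total_unclaimed_pages )
          {
              /* Record only the difference between the claim and tot_pages. */
              d->unclaimed_pages = pages - d->tot_pages;
              total_unclaimed_pages += d->unclaimed_pages;
              rc = 0;
          }

          spin_unlock(&heap_lock);
          return rc;
      }
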
* credit2: Reset until the front of the runqueue is positive  (George Dunlap, 2013-03-11; 1 file, -8/+40)

  Under normal circumstances, snext->credit should never be less than
  -CSCHED_MIN_TIMER. However, under some circumstances, a vcpu with low credits
  may be allowed to run long enough that its credits are actually less than
  -CSCHED_CREDIT_INIT. (Instances have been observed, for example, where a vcpu
  with 200us of credit was allowed to run for 11ms, giving it -10.8ms of
  credit. Thus it was still negative even after the reset.)

  If this is the case for snext, we simply want to keep moving everyone up
  until it is in the black again. This is fair because none of the other vcpus
  want to run at the moment.

  Rather than loop, just detect how many times we want to add
  CSCHED_CREDIT_INIT. Try to avoid integer divides and multiplies in the common
  case.

  Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

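  A hedged sketch of the "how many multiples" idea (not the literal credit2
  code; the value of CSCHED_CREDIT_INIT below is made up):

      #define CSCHED_CREDIT_INIT  (10 * 1000 * 1000)   /* illustrative value only */

      /* How many multiples of CSCHED_CREDIT_INIT to add during a reset so the
       * vcpu at the front of the runqueue ("snext") ends up in the black. */
      static int reset_multiplier(int snext_credit)
      {
          int m = 1;                                   /* common case: no divide */

          if ( snext_credit < -CSCHED_CREDIT_INIT )
              m += (-snext_credit) / CSCHED_CREDIT_INIT;  /* rare deep-negative case */

          return m;   /* every vcpu on the runqueue then gains m * CSCHED_CREDIT_INIT */
      }
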
* credit2: Fix erroneous ASSERT  (George Dunlap, 2013-03-11; 1 file, -24/+17)

  In order to avoid high-frequency cpu migration, vcpus may in fact be
  scheduled slightly out-of-order. Account for this situation properly.

  Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

* x86/vPMU: change Intel model numbers from decimal to hex  (Konrad Rzeszutek Wilk, 2013-03-08; 1 file, -14/+14)

  Suggested-by: "Nakajima, Jun" <jun.nakajima@intel.com>
  Suggested-by: Jan Beulich <JBeulich@suse.com>
  Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

* x86/vPMU: add missing Merom, Westmere, and Nehalem models  (Konrad Rzeszutek Wilk, 2013-03-08; 1 file, -2/+13)

  Mainly 22 (Merom-L); 30 (Nehalem); and 37, 44 (Westmere). A comprehensive
  list is available at:
  http://software.intel.com/en-us/articles/intel-architecture-and-processor-identification-with-cpuid-model-and-family-numbers

  Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
  Acked-by: Jun Nakajima <jun.nakajima@intel.com>

* x86/vPMU: provide comments for which Intel model is what  (Konrad Rzeszutek Wilk, 2013-03-08; 1 file, -10/+10)

  Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
  Acked-by: Tim Deegan <tim@xen.org>

* MAINTAINERS: Update my mail address  (Christoph Egger, 2013-03-08; 1 file, -2/+1)

  Remove myself as AMD SVM maintainer.

  Signed-off-by: Christoph Egger <chegger@amazon.de>

* remove Andre from the SVM maintainers list  (Andre Przywara, 2013-03-08; 1 file, -1/+0)

  Signed-off-by: Andre Przywara <andre.przywara@calxeda.com>

* AMD: update MAINTAINERS file  (Suravee Suthikulpanit, 2013-03-08; 1 file, -1/+5)

  Adding AMD engineers to the list of AMD-specific components' maintainers.

  Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>

* x86/MSI: add mechanism to fully protect MSI-X table from PV guest accesses  (Jan Beulich, 2013-03-08; 5 files, -69/+153)

  This adds two new physdev operations for Dom0 to invoke when resource
  allocation for devices is known to be complete, so that the hypervisor can
  arrange for the respective MMIO ranges to be marked read-only before an
  eventual guest getting such a device assigned even gets started, such that it
  won't be able to set up writable mappings for these MMIO ranges before Xen
  has a chance to protect them.

  This also addresses another issue with the code being modified here, in that
  so far write protection for the address ranges in question got set up only
  once during the lifetime of a device (i.e. until either system shutdown or
  device hot removal), while teardown happened when the last interrupt was
  disposed of by the guest (which at least allowed the tables to be writable
  when the device got assigned to a second guest [instance] after the first
  terminated).

  Signed-off-by: Jan Beulich <jbeulich@suse.com>

* sched: always ask the scheduler to re-place the vcpu when the affinity changes  (George Dunlap, 2013-03-08; 1 file, -3/+4)

  It's probably a good idea to re-evaluate placement whenever the affinity
  changes. This additionally has the benefit of removing scheduler-specific
  exceptions introduced in git c/s e6a6fd63.

  The conditionals surrounding vcpu_migrate() are left pending a re-work of the
  logic to avoid the common case calling vcpu_migrate() twice (once here, and
  once in context_saved()).

  Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

* tools/xenconsoled: Initialise pointers before trying to use them  (Andrew Cooper, 2013-03-07; 1 file, -11/+1)

  This is a regression introduced by "Switch from select() to poll() in
  xenconsoled's IO loop."
    hg c/s 26405:7359c3122c5d
    git cc5434c933153c4b8812d1df901f8915c22830a8
  which results in reliable segfaults during VM power operations.

  Switch to calloc(3) in an effort to avoid similar problems with changes in
  the future.

  Signed-off-by: Marcus Granado <marcus.granado@citrix.com>
  Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

* fix domain unlocking in some xsm error paths  (Matthew Daley, 2013-03-06; 2 files, -2/+2)

  A couple of xsm error/access-denied code paths in hypercalls neglect to
  unlock a previously locked domain. Fix by ensuring the domains are unlocked
  correctly.

  Signed-off-by: Matthew Daley <mattjd@gmail.com>
  Reviewed-by: Jan Beulich <jbeulich@suse.com>
  Acked-by: Keir Fraser <keir@xen.org>

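  The class of bug being fixed is the usual lock-then-goto-out pattern; a
  generic sketch follows, with the XSM hook and hypercall names invented for
  illustration:

      static long do_some_hypercall_sketch(domid_t domid)
      {
          long rc;
          struct domain *d = rcu_lock_domain_by_id(domid);

          if ( d == NULL )
              return -ESRCH;

          rc = xsm_check_sketch(d);      /* stand-in for the real XSM hook */
          if ( rc )
              goto out;                  /* buggy version: "return rc;" leaked the lock */

          /* ... actual work on d ... */
          rc = 0;

       out:
          rcu_unlock_domain(d);
          return rc;
      }
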
* change arguments of do_kexec_op and compat_set_timer_op prototypes  (Robbie VanVossen, 2013-03-06; 3 files, -3/+5)

  ... to match the actual functions.

  Signed-off-by: Robbie VanVossen <robert.vanvossen@dornerworks.com>

  Also make sure the source files defining these symbols include the header
  declaring them (had we done so, the problem would have been noticed long
  ago).

  Signed-off-by: Jan Beulich <jbeulich@suse.com>
  Acked-by: Keir Fraser <keir@xen.org>

* x86/shadow: don't use PV LDT area for cross-pages access emulation  (Jan Beulich, 2013-03-05; 1 file, -19/+8)

  As of 703ac3a ("x86: introduce create_perdomain_mapping()"), the page tables
  for this range don't get set up anymore for non-PV guests. And the way this
  was done was marked as a hack rather than a proper mechanism anyway. Use
  vmap() instead.

  Signed-off-by: Jan Beulich <jbeulich@suse.com>
  Acked-by: Tim Deegan <tim@xen.org>

* xentrace: fix off-by-one in calculate_tbuf_size  (Olaf Hering, 2013-03-04; 1 file, -1/+1)

  Commit "xentrace: reduce trace buffer size to something mfn_offset can reach"
  contains an off-by-one bug. max_mfn_offset needs to be reduced by exactly the
  value of t_info_first_offset.

  If the system has two cpus and the number of requested trace pages is very
  large, the final number of trace pages + the offset will not fit into a
  short. As a result the variable offset in alloc_trace_bufs() will wrap while
  allocating buffers for the second cpu. Later
  share_xen_page_with_privileged_guests() will be called with a wrong page and
  the ASSERT in this function triggers. If the ASSERT is ignored by running a
  non-dbg hypervisor, the asserts in xentrace itself trigger because "cons" is
  not aligned, because the very last trace page for the second cpu is a random
  mfn.

  Thanks to Jan for the quick analysis.

  Signed-off-by: Olaf Hering <olaf@aepfle.de>
  Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

* credit2: track residual from divisions done during accounting  (George Dunlap, 2013-03-04; 1 file, -7/+15)

  This should help with under-accounting of vCPU-s running for extremely short
  periods of time, but becoming runnable again at a high frequency.

  Don't bother subtracting the residual from the runtime, as it can only ever
  add up to one nanosecond, and will end up being debited during the next reset
  interval anyway.

  Original-patch-by: Jan Beulich <jbeulich@suse.com>
  Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

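  A generic sketch of residual tracking for a repeated integer division
  (illustrative only; the real change lives in credit2's credit-burning path):

      #include <stdint.h>

      /* Accumulate time->credit conversions without losing the remainder that
       * each individual integer division would otherwise throw away. */
      struct acct_sketch {
          uint64_t residual;   /* ns not yet converted into credit */
      };

      static uint64_t burn_sketch(struct acct_sketch *a, uint64_t delta_ns,
                                  uint64_t ns_per_credit)
      {
          uint64_t total = delta_ns + a->residual;

          a->residual = total % ns_per_credit;   /* carry forward, don't discard */
          return total / ns_per_credit;          /* credits to debit this time */
      }
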
* credit2: Avoid extra c2t calculation in csched_runtime  (George Dunlap, 2013-03-04; 1 file, -11/+37)

  csched_runtime() needs to call the c2t() function to change credits into
  time. The c2t() function, however, is expensive, as it requires an integer
  division.

  c2t() was being called twice, once for the main vcpu's credit and once for
  the difference between its credit and the next in the queue. But this is
  unnecessary; by calculating in "credit" first, we can make it so that we just
  do one conversion later in the algorithm.

  This also adds more documentation describing the intended algorithm, along
  with a relevant assertion.

  The effect of the new code should be the same as the old code.

  Spotted-by: Jan Beulich <JBeulich@suse.com>
  Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

* credit1: Use atomic bit operations for the flags structure  (George Dunlap, 2013-03-04; 1 file, -13/+10)

  The flags structure is not protected by locks (or more precisely, it is
  protected using an inconsistent set of locks); we therefore need to make sure
  that all accesses are atomic-safe. This is particularly important in the case
  of the PARKED flag, which if clobbered while changing the YIELD bit will
  leave a vcpu wedged in an offline state.

  Using the atomic bitops also requires us to change the size of the "flags"
  element.

  Spotted-by: Igor Pavlikevich <ipavlikevich@gmail.com>
  Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

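  The failure mode is a classic lost update on a shared flags word; a hedged
  before/after sketch, with the flag and struct names invented for
  illustration:

      /* Illustrative names; credit1 has its own flag definitions. */
      #define FLAG_VCPU_PARKED_BIT  0
      #define FLAG_VCPU_YIELD_BIT   1

      struct csched_vcpu_sketch {
          unsigned long flags;     /* widened so the atomic bitops can be used */
      };

      static void yield_sketch(struct csched_vcpu_sketch *svc)
      {
          /* Was: svc->flags |= FLAG_VCPU_YIELD; -- a plain read-modify-write
           * that can silently drop a concurrent update of the PARKED bit. */
          set_bit(FLAG_VCPU_YIELD_BIT, &svc->flags);       /* atomic */
      }
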
* x86: make x86_mcinfo_reserve() clear its result buffer  (Jan Beulich, 2013-03-04; 4 files, -12/+6)

  ... instead of all but one of its callers. Also adjust the corresponding
  sizeof() expressions to specify the pointed-to type of the result variable
  rather than the literal type (so that a type change of the variable will
  imply the size to get adjusted too).

  Suggested-by: Ian Campbell <Ian.Campbell@citrix.com>
  Signed-off-by: Jan Beulich <jbeulich@suse.com>

* x86: reduce irq_cpustat_t's __softirq_pending to 32 bits  (Jan Beulich, 2013-03-04; 1 file, -1/+1)

  Assembly code was already only accessing the low 32 bits of it, and we're far
  away from using all 32 bits of it.

  Noticed-by: Andrew Cooper <andrew.cooper3@citrix.com>
  Signed-off-by: Jan Beulich <jbeulich@suse.com>
  Acked-by: Keir Fraser <keir@xen.org>

* x86: don't rely on __softirq_pending to be the first field in irq_cpustat_t  (Jan Beulich, 2013-03-04; 5 files, -13/+14)

  This is even more so as the field doesn't have a comment to that effect in
  the structure definition.

  Once modifying the respective assembly code, also convert the IRQSTAT_shift
  users to do a 32-bit shift only (as we won't support 48M CPUs any time soon)
  and use "cmpl" instead of "testl" when checking the field (both reducing code
  size).

  Signed-off-by: Jan Beulich <jbeulich@suse.com>
  Acked-by: Keir Fraser <keir@xen.org>

* x86: defer processing events on the NMI exit path  (Jan Beulich, 2013-03-04; 2 files, -6/+23)

  Otherwise, we may end up in the scheduler, keeping NMIs masked for a possibly
  unbounded period of time (until whenever the next IRET gets executed).
  Enforce timely event processing by sending a self IPI.

  Of course it's open for discussion whether to always use the straight exit
  path from handle_ist_exception.

  Signed-off-by: Jan Beulich <jbeulich@suse.com>
  Acked-by: Keir Fraser <keir@xen.org>

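  Conceptually, and with a hypothetical helper name (the real change is on the
  assembly exit path):

      /* With NMIs still masked until the final IRET, don't run softirq work
       * directly; instead poke ourselves with an ordinary interrupt, which is
       * delivered -- and the pending work processed -- right after the IRET
       * re-enables NMIs. */
      static void nmi_exit_sketch(void)
      {
          if ( softirq_pending(smp_processor_id()) )
              send_self_event_check_ipi();   /* hypothetical self-IPI wrapper */
      }
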
* SEDF: avoid gathering vCPU-s on pCPU0  (Jan Beulich, 2013-03-04; 2 files, -2/+4)

  The introduction of vcpu_force_reschedule() in 14320:215b799fa181 was
  incompatible with the SEDF scheduler: Any vCPU using
  VCPUOP_stop_periodic_timer (e.g. any vCPU of half way modern PV Linux guests)
  ends up on pCPU0 after that call. Obviously, running all PV guests' (and
  namely Dom0's) vCPU-s on pCPU0 causes problems for those guests rather sooner
  than later.

  So the main thing that was clearly wrong (and bogus from the beginning) was
  the use of cpumask_first() in sedf_pick_cpu(). It is being replaced by a
  construct that prefers to put back the vCPU on the pCPU that it got launched
  on.

  However, there's one more glitch: When reducing the affinity of a vCPU
  temporarily, and then widening it again to a set that includes the pCPU that
  the vCPU was last running on, the generic scheduler code would not force a
  migration of that vCPU, and hence it would forever stay on the pCPU it last
  ran on. Since that can again create a load imbalance, the SEDF scheduler
  wants a migration to happen regardless of it being apparently unnecessary.

  Of course, an alternative to checking for SEDF explicitly in
  vcpu_set_affinity() would be to introduce a flags field in struct scheduler,
  and have SEDF set an "always-migrate-on-affinity-change" flag.

  Signed-off-by: Jan Beulich <jbeulich@suse.com>
  Acked-by: Keir Fraser <keir@xen.org>

* x86: make certain memory sub-ops return valid values  (Jan Beulich, 2013-03-04; 3 files, -6/+12)

  When a domain's shared info field "max_pfn" is zero,
  domain_get_maximum_gpfn() so far returned ULONG_MAX, which do_memory_op() in
  turn converted to -1 (i.e. -EPERM). Make the former always return a sensible
  number (i.e. zero if the field was zero) and have the latter no longer
  truncate return values.

  Signed-off-by: Jan Beulich <jbeulich@suse.com>
  Acked-by: Tim Deegan <tim@xen.org>

* fix compat memory exchange op splitting  (Jan Beulich, 2013-03-01; 1 file, -1/+1)

  A shift with a negative count was erroneously used here, yielding undefined
  behavior.

  Reported-by: Xi Wang <xi@mit.edu>
  Signed-off-by: Jan Beulich <jbeulich@suse.com>
  Acked-by: Keir Fraser <keir@xen.org>

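  For readers unfamiliar with this class of bug, a minimal standalone
  illustration (not the actual compat code):

      #include <stdio.h>

      /* Shifting by a negative count is undefined behaviour in C.  When a
       * size conversion can go in either direction, compute the direction
       * explicitly and always shift by a non-negative amount. */
      static unsigned long convert(unsigned long n, int out_order, int in_order)
      {
          return (out_order >= in_order) ? n << (out_order - in_order)
                                         : n >> (in_order - out_order);
      }

      int main(void)
      {
          printf("%lu\n", convert(8, 0, 2));   /* prints 2, i.e. 8 >> 2 */
          return 0;
      }
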