xen/xen - xen

	Commit message	Author	Age	Files	Lines
*	update Xen version to 4.2.2 (tag: RELEASE-4.2.2)	Jan Beulich	2013-04-23	2	-3/+3
*	libxl: Fix SEGV in network-attach	Ian Jackson	2013-04-18	1	-1/+2
\| \| \| \| \| \| \| \| \|	When "device/vif" directory exists but is empty l!=NULL, but nb==0, so l[nb-1] is invalid. Add missing check. Signed-off-by: Marek Marczykowski <marmarek@invisiblethingslab.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
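The failure mode described above (a non-NULL array with zero entries, then an access to `l[nb-1]`) can be sketched as follows. This is a minimal illustration, not the libxl code: `last_entry`, `entries` and `count` are hypothetical names standing in for the `l` and `nb` values from the commit message.

```c
#include <stddef.h>

/*
 * Return the last entry of a directory listing, or NULL when there is none.
 * `entries` and `count` are hypothetical stand-ins for `l` and `nb`.
 */
const char *last_entry(char **entries, unsigned int count)
{
    /*
     * A non-NULL array can still be empty. With count == 0 the index
     * count - 1 wraps to UINT_MAX and the read lands far out of bounds,
     * so both conditions have to be checked before indexing.
     */
    if (entries == NULL || count == 0)
        return NULL;
    return entries[count - 1];
}
```

Callers then treat a NULL result as "no devices present" instead of dereferencing an invalid element.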
*	x86: fix various issues with handling guest IRQs	Jan Beulich	2013-04-18	9	-30/+93
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	- properly revoke IRQ access in map_domain_pirq() error path - don't permit replacing an in use IRQ - don't accept inputs in the GSI range for MAP_PIRQ_TYPE_MSI - track IRQ access permission in host IRQ terms, not guest IRQ ones (and with that, also disallow Dom0 access to IRQ0) This is CVE-2013-1919 / XSA-46. Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> master commit: 545607eb3cfeb2abf5742d1bb869734f317fcfe5 master date: 2013-04-18 16:11:23 +0200
*	x86: clear EFLAGS.NT in SYSENTER entry path	Jan Beulich	2013-04-18	3	-3/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	... as it causes problems if we happen to exit back via IRET: In the course of trying to handle the fault, the hypervisor creates a stack frame by hand, and uses PUSHFQ to set the respective EFLAGS field, but expects to be able to IRET through that stack frame to the second portion of the fixup code (which causes a #GP due to the stored EFLAGS having NT set). And even if this worked (e.g if we cleared NT in that path), it would then (through the fail safe callback) cause a #GP in the guest with the SYSENTER handler's first instruction as the source, which in turn would allow guest user mode code to crash the guest kernel. Inject a #GP on the fake (NULL) address of the SYSENTER instruction instead, just like in the case where the guest kernel didn't register a corresponding entry point. This is CVE-2013-1917 / XSA-44. Reported-by: Andrew Cooper <andrew.cooper3@citirx.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> master commit: fdac9515607b757c044e7ef0d61b1453ef999b08 master date: 2013-04-18 16:00:35 +0200
*	iommu/crash: Interrupt remapping is also disabled on crash	Andrew Cooper	2013-04-18	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This fixes a regression side-effect caused by: IOMMU: properly check whether interrupt remapping is enabled git: fae0372140befb88d890a30704a8ec058c902af8 hg: 26742:e1ec14bad0cb On the crash path in nmi_shootdown_cpus(), we shut down the IOMMU, then disable the IOAPIC. On systems which support interrupt remapping, the variable iommu_intremap remains set, meaning that disable_IO_APIC() issues interrupt remapping invalidate requests. IOAPIC interrupt remapping used to be conditional on iommu_enabled, but is now conditional on iommu_intremap, following the above changeset. This behaviour can be fixed by also indicating that interrupt remapping is not enabled after shutting down the IOMMU. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> master commit: 53fd1d8458de01169dfb56feb315f02c2b521a86 master date: 2013-04-16 10:34:32 +0200
*	x86: don't pass negative time to gtime_to_gtsc()	Jan Beulich	2013-04-18	1	-0/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	scale_delta(), which is being called by that function, doesn't cope with that. Also print a warning message, so hopefully we can eventually figure why occasionally a negative value results from the calculation in the first place. Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org> master commit: eb60be3dd870aecfa47bed1118069680389c15f7 master date: 2013-04-11 12:07:55 +0200
*	tools/blktap2: Handle read/write interrupts in blktap2 control plane.	Dr. Greg Wettstein	2013-04-15	1	-2/+8
	The patch "tools: Retry blktap2 tapdisk message on interrupt." addressed a long standing regression with the blktap2 control plane. An interruption of the select system call would prematurely terminate the message sequence needed to properly shut down a blktap2 tapdisk instance. Ian Jackson correctly noted that the read and write system calls responsible for receiving and sending the control messages could also return EINTR, resulting in similar effects. While this regression was not noted in field testing, this patch adds support to re-start the calls to provide a technically complete implementation of control plane management in the presence of signals. Signed-off-by: Dr. Greg Wettstein <xen@wind.enjellic.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> (cherry picked from commit a5c800142cfc82159fcb85b47116cf296caebcc5)
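Restarting an interrupted read can be sketched roughly as below. This is an assumption-laden illustration rather than the blktap2 patch itself: `read_exact` is a hypothetical helper name, and the real control-plane code may structure its loop differently.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Read exactly `len` bytes from `fd`, restarting the call whenever a signal
 * interrupts it. Returns 0 on success and -1 on error or premature EOF.
 * read_exact() is an illustrative name, not a blktap2 function.
 */
int read_exact(int fd, void *buf, size_t len)
{
    char *p = buf;

    while (len > 0) {
        ssize_t n = read(fd, p, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted before any data arrived: retry */
            return -1;      /* genuine error */
        }
        if (n == 0)
            return -1;      /* peer closed before the full message arrived */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}
```

The write side would follow the same pattern with `write()` in place of `read()`.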
*	libxl: don't launch more than one tapdisk process for each disk	Ian Jackson	2013-04-15	1	-7/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When adding a disk don't launch multiple tapdisk instances for the same disk, if transaction fails in device_disk_add reuse the same tapdisk for further tries instead of creating a new instance each time a transaction fails. Reported-by: Darren Shepherd <darren.s.shepherd@gmail.com> Signed-off-by: Roger Pau Monne <roger.pau@citrix.com> Tested-by: Darren Shepherd <darren.s.shepherd@gmail.com> Backport-requested-by: Pasi Karkkainen <pasik@iki.fi> (cherry picked from commit ec398660e89ca18bb8d061d5047d682bd383778a) Conflicts: tools/libxl/libxl.c Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
*	update Xen version to 4.2.2-rc2 (tag: 4.2.2-rc2)	Jan Beulich	2013-04-12	2	-3/+3
*	tools: Retry blktap2 tapdisk message on interrupt.	Dr. Greg Wettstein	2013-04-11	1	-2/+8
	Re-start blktap2 IPC select call on interrupt. We hunted this miserable bug for a long time. The teardown of a blktap2 tapdisk instance is being carried out inconsistently up to and including the 4.2.1 release. The problem appears to be a classic 'Heisenbug' which disappears if a single function call is added to the tapdisk shutdown path. It is likely this bug has been in existence for the life of the blktap2 code. Control messages to manipulate a tapdisk instance are sent over a UNIX domain socket. A select call is used on both the read and write paths to wait on I/O and to set a timeout for the transmission and reception of the control plane messages. The existing code fails receipt or transmission of the control message on any type of error return from the select call. The xl control process receives an interrupt while waiting in the select call, which in turn causes an error return with errno set to EINTR. This prematurely terminates the teardown of the tapdisk instance, leaving it in various states of shutdown. Since multiple messages are needed to implement a full teardown, the tapdisk instance can be left in various states ranging from fully connected to only the minor being left allocated. The fix is straightforward. Check the return code from the select call and re-try read or write of the control message if errno is set to EINTR. The problem manifests itself in the read path but there appears to be little reason to not add the fix to the write path as well. Both paths appear to be cut-and-paste copies of each other. Signed-off-by: Dr. Greg Wettstein <greg@enjellic.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> (cherry picked from commit 6cffb2b469a55032a2900ccb8776c0082f346758)
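The retry described in this message can be sketched as follows, assuming a plain POSIX environment. `wait_readable` is a hypothetical helper, not the tapdisk control code, and the timeout handling is a deliberate simplification.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>

/*
 * Wait until `fd` is readable or `timeout_sec` seconds elapse.
 * Returns 1 if readable, 0 on timeout, -1 on a real error.
 * wait_readable() is a hypothetical helper name.
 */
int wait_readable(int fd, int timeout_sec)
{
    /* FD_SET is only defined for descriptors in this range. */
    assert(fd >= 0 && fd < FD_SETSIZE);

    for (;;) {
        fd_set readfds;
        struct timeval tv;
        int ret;

        /* select() may modify both arguments, so rebuild them per attempt. */
        FD_ZERO(&readfds);
        FD_SET(fd, &readfds);
        tv.tv_sec = timeout_sec;
        tv.tv_usec = 0;

        ret = select(fd + 1, &readfds, NULL, NULL, &tv);
        if (ret < 0 && errno == EINTR)
            continue;               /* interrupted by a signal: try again */
        if (ret < 0)
            return -1;              /* any other failure is reported */
        return ret > 0 ? 1 : 0;
    }
}
```

One design caveat of this sketch: every retry restarts the full timeout, so a steady stream of signals could extend the total wait. Tracking the remaining time would avoid that at the cost of more code.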
*	libxl: run libxl__arch_domain_create() much earlier.	Ian Jackson	2013-04-09	4	-11/+16
	Among other things, arch_domain_create() sets the shadow(/hap/p2m) memory allocation, which must happen after vcpus are assigned (or the shadow op will fail) but before memory is allocated (or we might run out of p2m memory). libxl__build_pre(), which already sets similar things like maxmem, seems like a reasonable spot for it. That needed a bit of plumbing to get the right data structure from the caller. As a side-effect, the return code from libxl__arch_domain_create() is no longer ignored. This bug was analysed in: From: "Jan Beulich" <JBeulich@xxxxxxxx> "Re: [Xen-devel] [xen-unstable test] 16788: regressions - FAIL" Date: Mon, 04 Mar 2013 16:34:53 +0000 http://lists.xen.org/archives/html/xen-devel/2013-03/msg00191.html Reported-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: Tim Deegan <tim@xen.org> Cc: Ian Jackson <ian.jackson@eu.citrix.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com> (Cherry-picked from 650354dbc2626b643c12873275ca67782f1382c8.) Conflicts: tools/libxl/libxl_dom.c Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
*	x86/mm/shadow: spurious warning when unmapping xenheap pages.	Tim Deegan	2013-04-09	2	-3/+6
\| \| \| \| \| \| \| \| \| \| \| \|	Xenheap pages will always have an extra typecount, taken in share_xen_page_with_guest(), which doesn't come from a shadow PTE. Adjust the warning in sh_remove_all_mappings() to account for it. Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Tim Deegan <tim@xen.org> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com> master commit: cfc515dabe91e3d6c690c68c6a669d6d77fb7ac4 master date: 2013-04-04 10:14:30 +0100
*	VMX: Always disable SMEP when guest is in non-paging mode	Stefan Bader	2013-04-09	1	-2/+5
	commit e7dda8ec9fc9020e4f53345cdbb18a2e82e54a65 VMX: disable SMEP feature when guest is in non-paging mode disabled the SMEP bit if a guest VCPU was using HAP and was not in paging mode. However I could observe VCPUs getting stuck in the trampoline after the following patch in the Linux kernel changed the way CR4 gets set up: x86, realmode: read cr4 and EFER from kernel for 64-bit trampoline The change will set CR4 from already set flags, which include the SMEP bit. On bare metal this does not matter as the CPU is in non-paging mode at that time. But Xen seems to use the emulated non-paging mode regardless of HAP (I verified that on the guests I was seeing the issue, HAP was not used). Therefore it seems right to unset the SMEP bit for a VCPU that is not in paging mode, regardless of its HAP usage. Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Keir Fraser <keir@xen.org> Acked-by: Dongxiao Xu <dongxiao.xu@intel.com> master commit: 0d2e673a763bc7c2ddf97fed074eb691d325ecc5 master date: 2013-04-04 10:37:19 +0200
*	x86/S3: Restore broken vcpu affinity on resume	Ben Guthro	2013-04-09	4	-3/+53
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When in SYS_STATE_suspend, and going through the cpu_disable_scheduler path, save a copy of the current cpu affinity, and mark a flag to restore it later. Later, in the resume process, when enabling nonboot cpus restore these affinities. Signed-off-by: Ben Guthro <benjamin.guthro@citrix.com> Acked-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Keir Fraser <keir@xen.org> master commit: 41e71c2607e036f1ac00df898b8f4acb2d4df7ee master date: 2013-04-02 09:52:32 +0200
*	x86: irq_move_cleanup_interrupt() must ignore legacy vectors	Jan Beulich	2013-04-09	2	-1/+8
	Since the main loop in the function includes legacy vectors, and since vector_irq[] gets set up for legacy vectors regardless of whether those get handled through the IO-APIC, it must not do anything on this vector range. In fact, we should never get past the move_cleanup_count check for IRQs not handled through the IO-APIC. Adding a respective assertion would make those iterations more expensive (due to the lock acquire). For such an assertion to not have false positives we however ought to suppress setting up IRQ2 as an 8259A interrupt (which wasn't correct anyway), which is being done here despite the assertion not actually getting added. Furthermore, there's no point iterating over the vectors past LAST_HIPRIORITY_VECTOR, so terminate the loop accordingly. Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Keir Fraser <keir@xen.org> master commit: af699220ad6d111ba76fc3040342184e423cc9a1 master date: 2013-04-02 08:30:03 +0200
*	defer event channel bucket pointer store until after XSM checks	Jan Beulich	2013-04-05	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Otherwise a dangling pointer can be left, which would cause subsequent memory corruption as soon as the space got re-allocated for some other purpose. This is CVE-2013-1920 / XSA-47. Reported-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org> master commit: 99b9ab0b3e7f0e7e5786116773cb7b746f3fab87 master date: 2013-04-05 09:59:03 +0200
*	hvm: Clean up vlapic_reg_write() error propagation.	Keir Fraser	2013-04-02	1	-12/+10
\| \| \| \| \| \| \| \| \|	In particular, correctly propagate errors through vlapic_apicv_write() and hvm_x2apic_msr_write(). Signed-off-by: Keir Fraser <keir@xen.org> master changeset: 5082cc19524b6687ef1bc0a717538d75aae7cd00 master date: 2013-03-28 20:16:37 +0000
*	x86/EFI: permit setting variable with non-zero attributes	Jan Beulich	2013-04-02	1	-3/+0
\| \| \| \| \| \| \| \| \| \| \| \|	This must have been a copy-and-paste mistake - get_variable uses op->misc as output only, and wants to make sure it's zero for future extensibility. For set_variable, this is an input though, and hence the check is wrong. Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org> master changeset: 78380c34dfeb27da3d0222bcb7232c5d8e2f5b30 master date: 2013-03-27 08:46:28 +0100
*	x86: reserve pages when SandyBridge integrated graphics	Xudong Hao	2013-04-02	5	-1/+50
	SNB graphics devices have a bug that prevents them from accessing certain memory ranges, namely anything below 1MB and the pages listed in the table. Xen does not add memory below 1MB to the heap, i.e. pages below 1MB are never allocated, so it's unnecessary to reserve memory below the 1MB mark that has not already been reserved. So reserve the pages listed in the table at Xen boot if a SNB gfx device is detected on the CPU, to avoid GPU hangs. Signed-off-by: Xudong Hao <xudong.hao@intel.com> Acked-by: Keir Fraser <keir@xen.org> master changeset: db537fe3023bf157b85c8246782cb72a6f989b31 master date: 2013-03-26 14:22:07 +0100
*	ACPI: fix APEI related table size checking	Huang Ying	2013-04-02	1	-3/+14
	On Huang Ying's machine: erst_tab->header_length == sizeof(struct acpi_table_einj) but Yinghai reported that on his machine, erst_tab->header_length == sizeof(struct acpi_table_einj) - sizeof(struct acpi_table_header) To make the erst table size checking code work on all systems, both results are treated as PASS. The same situation applies to einj_tab->header_length, so the corresponding table size checking is changed in a similar way too. Originally-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Huang Ying <ying.huang@intel.com> - use switch() for better readability - add comment explaining why a formally invalid size is also being accepted - check erst_tab->header.length before even looking at erst_tab->header_length - prefer sizeof(*erst_tab) over sizeof(struct acpi_table_erst) Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org> master changeset: 915ef37d7cc8fcac5b37eb0b40c693754fcd12ab master date: 2012-10-16 17:26:36 +0200
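The "accept both sizes" check with `switch()` might look roughly like the sketch below. The structure layouts here are simplified, hypothetical stand-ins (the real ACPI table definitions carry more fields), and `erst_header_length_ok` is an invented name for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified layouts; not the real ACPI definitions. */
struct table_header { uint8_t bytes[36]; };
struct erst_table {
    struct table_header header;
    uint32_t header_length;
    uint32_t reserved;
    uint32_t entries;
};

/*
 * Accept both interpretations of header_length seen in the field: the size
 * of the whole fixed table, or that size minus the common table header.
 * The second form is formally invalid but is tolerated for compatibility.
 */
bool erst_header_length_ok(uint32_t header_length)
{
    switch (header_length) {
    case sizeof(struct erst_table):
    case sizeof(struct erst_table) - sizeof(struct table_header):
        return true;
    default:
        return false;
    }
}
```

Using `switch()` keeps the two accepted values side by side, which makes it easy to attach a comment to the tolerated-but-invalid case.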
*	ACPI, APEI: Add apei_exec_run_optional	Huang Ying	2013-04-02	2	-4/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Some actions in APEI ERST and EINJ tables are optional, for example, ACPI_EINJ_BEGIN_OPERATION action is used to do some preparation for error injection, and firmware may choose to do nothing here. While some other actions are mandatory, for example, firmware must provide ACPI_EINJ_GET_ERROR_TYPE implementation. Original implementation treats all actions as optional (that is, can have no instructions), that may cause issue if firmware does not provide some mandatory actions. To fix this, this patch adds apei_exec_run_optional, which should be used for optional actions. The original apei_exec_run should be used for mandatory actions. Signed-off-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com> master changeset: 72af01bf6f7489e54ad59270222a29d3e8c501d1 master date: 2013-03-22 12:46:25 +0100
*	ACPI/APEI: Unlock apei_iomaps_lock on error path	Andrew Cooper	2013-04-02	1	-4/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This causes deadlocks during early boot on hardware with broken/buggy APEI implementations, such as a Dell Poweredge 2950 with the latest currently available BIOS. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Don't use goto or another special error path, as handling the error case in normal flow is quite simple. Signed-off-by: Jan Beulich <jbeulich@suse.com> master changeset: 0611689d9153227831979c7bafe594214b8505a3 master date: 2013-03-22 09:43:38 +0100
*	ACPI/ERST: Name table in otherwise opaque error messages	Andrew Cooper	2013-04-02	1	-2/+2
\| \| \| \| \| \| \| \| \| \|	Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Fix spelling and lower severities. Signed-off-by: Jan Beulich <jbeulich@suse.com> master changeset: 759847e44401176401e86e7c55b644cb9f93c781 master date: 2013-03-20 10:02:52 +0100
*	ACPI/APEI: fix ERST MOVE_DATA instruction implementation	Huang Ying	2013-04-02	1	-4/+53
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The src_base and dst_base fields in apei_exec_context are physical address, so they should be ioremaped before being used in ERST MOVE_DATA instruction. Reported-by: Javier Martinez Canillas <martinez.javier@gmail.com> Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Huang Ying <ying.huang@intel.com> Replace use of ioremap() by __acpi_map_table()/set_fixmap(). Fix error handling. Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org> master changeset: df2cf6a726b815fafa12e503c9a36707c3962f22 master date: 2012-10-17 14:12:06 +0200
*	AMD IOMMU: allow disabling only interrupt remapping when certain IVRS consistency checks fail	Jan Beulich	2013-04-02	2	-3/+7
	After some more thought on the XSA-36 and specifically the comments we got regarding disabling the IOMMU in this situation altogether making things worse instead of better, I came to the conclusion that we can actually restrict the action in affected cases to just disabling interrupt remapping. That doesn't make the situation worse than prior to the XSA-36 fixes (where interrupt remapping didn't really protect domains from one another), but allows at least DMA isolation to still be utilized. To do so, disabling of interrupt remapping must be explicitly requested on the command line - respective checks will then be skipped. Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Suravee Suthikulanit <suravee.suthikulpanit@amd.com> master changeset: 92b8bc03bd4b582cb524db51494d0dba7607e7ac master date: 2013-03-25 16:55:22 +0100
*	VT-d: deal with 5500/5520/X58 errata	Malcolm Crossley	2013-04-02	2	-0/+32
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	http://www.intel.com/content/www/us/en/chipsets/5520-and-5500-chipset-ioh-specification-update.html Stepping B-3 has two errata (#47 and #53) related to Interrupt remapping, to which the workaround is for the BIOS to completely disable interrupt remapping. These errata are fixed in stepping C-2. Unfortunately this chipset stepping is very common and many BIOSes are not disabling interrupt remapping on this stepping . We can detect this in Xen and prevent Xen from using the problematic interrupt remapping feature. The Intel 5500/5520/X58 chipset does not support VT-d Extended Interrupt Mode(EIM). This means the iommu_supports_eim() check always fails and so x2apic mode cannot be enabled in Xen before this quirk disables the interrupt remapping feature. Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Gate the function call to check the quirk on interrupt remapping being requested to get enabled, and upon failure disable the IOMMU to be in line with what the changes for XSA-36 (plus follow-ups) did. Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: "Zhang, Xiantao" <xiantao.zhang@intel.com> master changeset: 6890cebc6a987d0e896f5d23a8de11a3934101cf master date: 2013-03-25 14:31:27 +0100
*	IOMMU: properly check whether interrupt remapping is enabled	Jan Beulich	2013-04-02	4	-4/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	... rather than the IOMMU as a whole. That in turn required to make sure iommu_intremap gets properly cleared when the respective initialization fails (or isn't being done at all). Along with making sure interrupt remapping doesn't get inconsistently enabled on some IOMMUs and not on others in the VT-d code, this in turn allowed quite a bit of cleanup on the VT-d side (removed from the backport). Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: "Zhang, Xiantao" <xiantao.zhang@intel.com> master changeset: fae0372140befb88d890a30704a8ec058c902af8 master date: 2013-03-25 14:28:31 +0100
*	VT-d: Enumerate IOMMUs when listing capabilities	Andrew Cooper	2013-04-02	1	-1/+2
\| \| \| \| \| \| \| \|	This saves N identical console log lines on a multi-iommu server. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> master changeset: 32861c537781ac94bf403fb778505c3679b85f67 master date: 2013-03-20 10:02:26 +0100
*	AMD/IOMMU: Process softirqs while building dom0 iommu mappings	Andrew Cooper	2013-04-02	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Recent changes which have made their way into xen-4.2 stable have pushed the runtime of construct_dom0() over 5 seconds, which has caused regressions in XenServer testing because of our 5 second watchdog. The root cause is that amd_iommu_dom0_init() does not process softirqs and in particular the nmi_timer which causes the watchdog to decide that no useful progress is being made. This patch adds periodic calls to process_pending_softirqs() at the same interval as the Intel variant of this function. The server which was failing with the watchdog test now boots reliably with a timeout of 1 second. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> master changeset: 0f7b6f91ac1bbfd33b23c291b14874b9561909d2 master date: 2013-03-20 10:00:01 +0100
*	x86/MCA: suppress bank clearing for certain injected events	Jan Beulich	2013-04-02	2	-9/+11
	The bits indicating validity of the ADDR and MISC bank MSRs may be injected in a way that isn't consistent with what the underlying hardware implements (while the bank must be valid for injection to work, the auxiliary MSRs may not be implemented - and hence cause #GP upon access - if the hardware never sets the corresponding valid bits). Consequently we need to do the clearing writes only if no value was interposed for the respective MSR (which also makes sense the other way around: there's no point in clearing a hardware register when all data read came from software). Of course this all requires the injection tool to do things in a consistent way (but that had been a requirement before already). Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Ren Yongjie <yongjie.ren@intel.com> Acked-by: Liu Jinsong <jinsong.liu@intel.com> master changeset: b0583c0e64cc8bb6229c95c3304fdac2051f79b3 master date: 2013-03-12 15:53:30 +0100
*	powernow: add fixups for AMD P-state figures	Konrad Rzeszutek Wilk	2013-04-02	1	-6/+50
	In the Linux kernel, these two git commits: - f594065faf4f9067c2283a34619fc0714e79a98d ACPI: Add fixups for AMD P-state figures - 9855d8ce41a7801548a05d844db2f46c3e810166 ACPI: Check MSR valid bit before using P-state frequencies try to fix the issue that "some AMD systems may round the frequencies in ACPI tables to 100MHz boundaries. We can obtain the real frequencies from MSRs, so add a quirk to fix these frequencies up on AMD systems." (from f594065..) In discussion (around 9855d8..) "it turned out that indeed real HW/BIOSes may choose to not set the valid bit and thus mark the P-state as invalid. So this could be considered a fix for broken BIOSes." (from 9855d8..) which is great for Linux. Unfortunately, when the Linux kernel tries to do the RDMSR under Xen it fails to get the right value (it gets zero) as Xen traps it and returns zero. Hence when dom0 uploads the P-states they will be unmodified and we should take care of updating the frequencies with the right values. I've tested it under Dell Inc. PowerEdge T105 /0RR825, BIOS 1.3.2 08/20/2008 where this quirk can be observed (x86 == 0x10, model == 2). Also on other AMD (x86 == 0x12, A8-3850; x86 = 0x14, AMD E-350) to make sure the quirk is not applied there. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: stefan.bader@canonical.com Do the MSR access here (and while at it, also the one reading MSR_PSTATE_CUR_LIMIT) on the target CPU, and bound the loop over amd_fixup_frequency() by max_hw_pstate (matching the one in powernow_cpufreq_cpu_init()). Signed-off-by: Jan Beulich <jbeulich@suse.com> master changeset: 1d80765b504b34b63a42a63aff4291e07e29f0c5 master date: 2013-03-12 15:34:22 +0100
*	update Xen version to 4.2.2-rc1 (tag: 4.2.2-rc1)	Jan Beulich	2013-03-20	2	-3/+3
*	x86/MSI: add mechanism to fully protect MSI-X table from PV guest accesses	Jan Beulich	2013-03-12	5	-69/+153
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This adds two new physdev operations for Dom0 to invoke when resource allocation for devices is known to be complete, so that the hypervisor can arrange for the respective MMIO ranges to be marked read-only before an eventual guest getting such a device assigned even gets started, such that it won't be able to set up writable mappings for these MMIO ranges before Xen has a chance to protect them. This also addresses another issue with the code being modified here, in that so far write protection for the address ranges in question got set up only once during the lifetime of a device (i.e. until either system shutdown or device hot removal), while teardown happened when the last interrupt was disposed of by the guest (which at least allowed the tables to be writable when the device got assigned to a second guest [instance] after the first terminated). Signed-off-by: Jan Beulich <jbeulich@suse.com> master changeset: 4245d331e0e75de8d1bddbbb518f3a8ce6d0bb7e master date: 2013-03-08 14:05:34 +0100
*	fix domain unlocking in some xsm error paths	Matthew Daley	2013-03-12	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \|	A couple of xsm error/access-denied code paths in hypercalls neglect to unlock a previously locked domain. Fix by ensuring the domains are unlocked correctly. Signed-off-by: Matthew Daley <mattjd@gmail.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org> master changeset: 9581c4f9a55372a21e759cd449cb676d0e8feddb master date: 2013-03-06 17:10:26 +0100
*	xentrace: fix off-by-one in calculate_tbuf_size	Olaf Hering	2013-03-12	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit "xentrace: reduce trace buffer size to something mfn_offset can reach" contains an off-by-one bug. max_mfn_offset needs to be reduced by exactly the value of t_info_first_offset. If the system has two cpus and the number of requested trace pages is very large, the final number of trace pages + the offset will not fit into a short. As a result the variable offset in alloc_trace_bufs() will wrap while allocating buffers for the second cpu. Later share_xen_page_with_privileged_guests() will be called with a wrong page and the ASSERT in this function triggers. If the ASSERT is ignored by running a non-dbg hypervisor the asserts in xentrace itself trigger because "cons" is not aligned because the very last trace page for the second cpu is a random mfn. Thanks to Jan for the quick analysis. Signed-off-by: Olaf Hering <olaf@aepfle.de> Acked-by: George Dunlap <george.dunlap@eu.citrix.com> master changeset: d9fb28ae6d41c8201482948660e52889481830dd master date: 2013-03-04 13:42:17 +0100
*	credit1: Use atomic bit operations for the flags structure	George Dunlap	2013-03-12	1	-13/+10
	The flags structure is not protected by locks (or more precisely, it is protected using an inconsistent set of locks); we therefore need to make sure that all accesses are atomic-safe. This is particularly important in the case of the PARKED flag, which if clobbered while changing the YIELD bit will leave a vcpu wedged in an offline state. Using the atomic bitops also requires us to change the size of the "flags" element. Spotted-by: Igor Pavlikevich <ipavlikevich@gmail.com> Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> master changeset: be6507509454adf3bb5a50b9406c88504e996d5a master date: 2013-03-04 13:37:39 +0100
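Why atomic bit operations matter here can be illustrated with C11 atomics. This is only a sketch: Xen uses its own bitop primitives rather than `<stdatomic.h>`, and the flag numbers and helper names below are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit numbers standing in for the PARKED and YIELD flags. */
enum { FLAG_PARKED = 0, FLAG_YIELD = 1 };

/* Set one bit without disturbing concurrent updates to the other bits. */
void flag_set(atomic_uint *flags, unsigned int bit)
{
    atomic_fetch_or(flags, 1u << bit);
}

/* Clear one bit; a plain read-modify-write here could lose another CPU's update. */
void flag_clear(atomic_uint *flags, unsigned int bit)
{
    atomic_fetch_and(flags, ~(1u << bit));
}

/* Report whether a bit is currently set. */
bool flag_test(atomic_uint *flags, unsigned int bit)
{
    return (atomic_load(flags) >> bit) & 1u;
}

/* Atomically clear a bit and report whether it was previously set. */
bool flag_test_and_clear(atomic_uint *flags, unsigned int bit)
{
    return (atomic_fetch_and(flags, ~(1u << bit)) >> bit) & 1u;
}
```

With non-atomic code such as `flags &= ~YIELD`, a concurrent setter of PARKED can have its write overwritten by the stale value, which is the clobbering the commit message warns about.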
*	x86: defer processing events on the NMI exit path	Jan Beulich	2013-03-12	3	-11/+44
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Otherwise, we may end up in the scheduler, keeping NMIs masked for a possibly unbounded period of time (until whenever the next IRET gets executed). Enforce timely event processing by sending a self IPI. Of course it's open for discussion whether to always use the straight exit path from handle_ist_exception. Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org> master changeset: d463b005bbd6475ed930a302821efe239e1b2cf9 master date: 2013-03-04 10:19:34 +0100
*	SEDF: avoid gathering vCPU-s on pCPU0	Jan Beulich	2013-03-12	2	-2/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The introduction of vcpu_force_reschedule() in 14320:215b799fa181 was incompatible with the SEDF scheduler: Any vCPU using VCPUOP_stop_periodic_timer (e.g. any vCPU of half way modern PV Linux guests) ends up on pCPU0 after that call. Obviously, running all PV guests' (and namely Dom0's) vCPU-s on pCPU0 causes problems for those guests rather sooner than later. So the main thing that was clearly wrong (and bogus from the beginning) was the use of cpumask_first() in sedf_pick_cpu(). It is being replaced by a construct that prefers to put back the vCPU on the pCPU that it got launched on. However, there's one more glitch: When reducing the affinity of a vCPU temporarily, and then widening it again to a set that includes the pCPU that the vCPU was last running on, the generic scheduler code would not force a migration of that vCPU, and hence it would forever stay on the pCPU it last ran on. Since that can again create a load imbalance, the SEDF scheduler wants a migration to happen regardless of it being apparently unnecessary. Of course, an alternative to checking for SEDF explicitly in vcpu_set_affinity() would be to introduce a flags field in struct scheduler, and have SEDF set a "always-migrate-on-affinity-change" flag. Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org>
*	x86: make certain memory sub-ops return valid values	Jan Beulich	2013-03-12	3	-6/+12
\| \| \| \| \| \| \| \| \| \| \| \| \|	When a domain's shared info field "max_pfn" is zero, domain_get_maximum_gpfn() so far returned ULONG_MAX, which do_memory_op() in turn converted to -1 (i.e. -EPERM). Make the former always return a sensible number (i.e. zero if the field was zero) and have the latter no longer truncate return values. Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org> master changeset: 7ffc9779aa5120c5098d938cb88f69a1dda9a0fe master date: 2013-03-04 10:16:04 +0100
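The wrap-around described above can be illustrated with a small sketch. It assumes the maximum frame number is derived as `max_pfn - 1`, which is an assumption made for illustration; the function names are hypothetical and this is not the hypervisor's code.

```c
#include <assert.h>

/* Assumed old behaviour: unsigned subtraction wraps to ULONG_MAX for 0. */
unsigned long naive_max_gpfn(unsigned long max_pfn)
{
    return max_pfn - 1;
}

/* Sketch of a guarded variant: a zero field yields zero, never a wrap. */
unsigned long safe_max_gpfn(unsigned long max_pfn)
{
    return (max_pfn ? max_pfn : 1) - 1;
}

/* Self-check showing the values each variant produces. */
void check_max_gpfn(void)
{
    assert(naive_max_gpfn(0) == ~0UL);  /* wrapped value */
    assert(safe_max_gpfn(0) == 0);      /* sensible value */
    assert(safe_max_gpfn(10) == 9);
}
```

A caller that stores the wrapped value in a narrower signed type would typically see -1, which is how a valid query could end up looking like an error code.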
*	fix compat memory exchange op splitting	Jan Beulich	2013-03-12	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \|	A shift with a negative count was erroneously used here, yielding undefined behavior. Reported-by: Xi Wang <xi@mit.edu> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org> master changeset: 53decd322157e922cac2988e07da6d39538c8033 master date: 2013-03-01 16:59:49 +0100
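In C, shifting by a negative count is undefined behavior, so a shift amount computed as a difference of two orders has to be split by sign. The following is a minimal sketch under that assumption; `scale_extents` and its parameters are hypothetical names, not the code touched by the patch.

```c
#include <assert.h>
#include <limits.h>

/*
 * Hypothetical sketch: convert a count of extents of size 2^in_order
 * into a count of extents of size 2^out_order. The shift direction is
 * chosen explicitly so the shift count is never negative.
 */
unsigned long scale_extents(unsigned long nr,
                            unsigned int in_order,
                            unsigned int out_order)
{
    const unsigned int width = sizeof(unsigned long) * CHAR_BIT;
    unsigned int shift;

    if (in_order >= out_order) {
        shift = in_order - out_order;
        assert(shift < width);      /* shifting by >= width is also undefined */
        return nr << shift;
    }

    shift = out_order - in_order;
    assert(shift < width);
    return nr >> shift;
}
```

Writing `nr << (in_order - out_order)` unconditionally would be incorrect whenever `out_order` is the larger value, which is the class of bug the commit message describes.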
*	Avoid stale pointer when moving domain to another cpupool	Juergen Gross	2013-03-12	1	-6/+14
\| \| \| \| \| \| \| \| \| \| \| \|	When a domain is moved to another cpupool the scheduler private data pointers in vcpu and domain structures must never point to an already freed memory area. While at it, simplify sched_init_vcpu() by using DOM2OP instead of VCPU2OP. Signed-off-by: Juergen Gross <juergen.gross@ts.fujitsu.com> master changeset: 482300def7d08e773ccd2a0d978bcb9469fdd810 master date: 2013-02-28 14:56:45 +0000
*	vmx: fix handling of NMI VMEXIT.	Tim Deegan	2013-03-12	4	-2/+35
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Call do_nmi() directly and explicitly re-enable NMIs rather than raising an NMI through the APIC. Since NMIs are disabled after the VMEXIT, the raised NMI would be blocked until the next IRET instruction (i.e. the next real interrupt, or after scheduling a PV guest) and in the meantime the guest will spin taking NMI VMEXITS. Also, handle NMIs before re-enabling interrupts, since if we handle an interrupt (and therefore IRET) before calling do_nmi(), we may end up running the NMI handler with NMIs enabled. Signed-off-by: Tim Deegan <tim@xen.org> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> master changeset: 7dd3b06ff031c9a8c727df16c5def2afb382101c master date: 2013-02-28 14:00:18 +0000
*	QEMU_UPSTREAM_REVISION update	Ian Jackson	2013-03-08	1	-1/+1
\|
*	x86/setup: don't relocate the VGA hole.	Tim Deegan	2013-03-08	1	-5/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Copying the contents of the VGA hole is at best pointless and at worst dangerous. Booting Xen on Xen, it causes a very long delay as each byte is referred to qemu. Since we were already discarding the first 1MB of the relocated area, just avoid copying it in the first place. Reported-by: Jon Ludlam <jonathan.ludlam@eu.citrix.com> Signed-off-by: Tim Deegan <tim@xen.org> master changeset: 0b76ce20de85ad7c23c47ee3275020859b91d46b master date: 2013-02-14 12:20:58 +0000
*	x86: fix CMCI injection	Jan Beulich	2013-03-07	6	-12/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This fixes the wrong use of literal vector 0xF7 with an "int" instruction (invalidated by 25113:14609be41f36) and the fact that doing the injection via a software interrupt was never valid anyway (because cmci_interrupt() acks the LAPIC, which does the wrong thing if the interrupt didn't get delivered through it). In order to do the latter, the patch introduces send_IPI_self(), at once removing two open-coded uses of "genapic" in the IRQ handling code. Reported-by: Yongjie Ren <yongjie.ren@intel.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Yongjie Ren <yongjie.ren@intel.com> Acked-by: Keir Fraser <keir@xen.org> master changeset: 2f8c55ccefe49bb526df0eaf5fa9b7b788422208 master date: 2013-02-26 10:15:56 +0100
*	IOMMU, AMD Family15h Model10-1Fh erratum 746 Workaround	Suravee Suthikulpanit	2013-03-07	1	-0/+38
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The IOMMU may stop processing page translations due to a perceived lack of credits for writing upstream peripheral page service request (PPR) or event logs. If the L2B miscellaneous clock gating feature is enabled the IOMMU does not properly register credits after the log request has completed, leading to a potential system hang. BIOSes are supposed to disable L2B miscellaneous clock gating by setting L2_L2B_CK_GATE_CONTROL[CKGateL2BMiscDisable](D0F2xF4_x90[2]) = 1b. This patch corrects that for those which do not enable this workaround. Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> master changeset: 0f8adcb2a7183bea5063f6fffba7d7e1aa14fc84 master date: 2013-02-26 10:14:53 +0100
*	x86: fix null pointer dereference in intel_get_extended_msrs()	Xi Wang	2013-03-07	1	-1/+1
\| \| \| \| \| \| \| \| \|	`memset(&mc_ext, 0, ...)' leads to a buffer overflow and a subsequent null pointer dereference. Replace `&mc_ext' with `mc_ext'. Signed-off-by: Xi Wang <xi@mit.edu> master changeset: c40e24a8ef74f9d0ee59dd9b8ca890be08b0b874 master date: 2013-02-25 12:44:25 +0100
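The `&mc_ext` versus `mc_ext` distinction matters when `mc_ext` is a pointer: taking its address makes `memset()` start at the pointer variable itself, zeroing the pointer and then writing past it. A minimal sketch of the correct form follows; the structure layout and the function name are invented for illustration and do not reflect the real definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the real structure; fields are invented. */
struct ext_msrs {
    unsigned int count;
    unsigned long values[32];
};

/* Clear the object that 'p' points to. */
void clear_ext_msrs(struct ext_msrs *p)
{
    assert(p != NULL);

    /*
     * The buggy form would be memset(&p, 0, sizeof(*p)): it targets the
     * local pointer variable, sets it to NULL, and overruns the storage
     * that follows it. Passing 'p' itself clears the pointed-to object.
     */
    memset(p, 0, sizeof(*p));
}
```

After the buggy call the pointer is NULL, so the next access through it faults, which matches the "buffer overflow and a subsequent null pointer dereference" wording in the message.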
*	honor ACPI v4 FADT flags	Jan Beulich	2013-03-07	6	-6/+44
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	- force use of physical APIC mode if indicated so (as we don't support xAPIC cluster mode, the respective flag is taken to force physical mode too) - don't use MSI if indicated so (implies no IOMMU) Both can be overridden on the command line, for the MSI case this at once adds a new command line option allowing to turn off PCI MSI (IOMMU and HPET are unaffected by this). Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org> master changeset: 992fdf6f46252a459c6b1b8d971b2c71f01460f8 master date: 2013-02-22 11:56:54 +0100
*	x86/nhvm: properly clean up after failure to set up all vCPU-s	Jan Beulich	2013-03-07	2	-4/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Otherwise we may leak memory when setting up nHVM fails halfway. This implies that the individual destroy functions will have to remain capable (in the VMX case they first need to be made so, following 26486:7648ef657fe7 and 26489:83a3fa9c8434) of being called for a vCPU that the corresponding init function was never run on. While at it, also remove a redundant check from the corresponding parameter validation code. Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org> Tested-by: Olaf Hering <olaf@aepfle.de> master changeset: 17281aea1a9a10f1ee165c6e6a2921a67b7b1df2 master date: 2013-02-22 11:21:38 +0100
*	x86/mm: Take the p2m lock even in shadow mode.	Tim Deegan	2013-03-07	1	-4/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The reworking of p2m lookups to use get_gfn()/put_gfn() left the shadow code not taking the p2m lock, even in cases where the p2m would be updated (i.e. PoD). In many cases, shadow code doesn't need the exclusion that get_gfn()/put_gfn() provides, as it has its own interlocks against p2m updates, but this is taking things too far, and can lead to crashes in the PoD code. Now that most shadow-code p2m lookups are done with explicitly unlocked accessors, or with the get_page_from_gfn() accessor, which is often lock-free, we can just turn this locking on. The remaining locked lookups are in sh_page_fault() (in a path that's almost always already serializing on the paging lock), and in emulate_map_dest() (which can probably be updated to use get_page_from_gfn()). They're not addressed here but may be in a follow-up patch. Signed-off-by: Tim Deegan <tim@xen.org> Acked-by: Andres Lagar-Cavilla <andres@lagarcavilla.org> master changeset: a15d87475ed95840dba693ab0a56d0b48a215cbc master date: 2013-02-21 15:16:20 +0000