As it currently stands, the string "domain_crash_sync called from entry.S" is
not helpful in identifying why the domain was crashed, and a debug build of
Xen doesn't help the matter.

This patch improves the information printed by pointing to where the crash
decision was made. Specific improvements include:

* Moving the ASCII string "domain_crash_sync called from entry.S\n" away from
  some semi-hot code cache lines.
* Moving the printk into C code (especially as this_cpu() is miserable to use
  in assembly code).
* Undoing the previous confusing situation of having domain_crash_synchronous()
  be a macro in C code, yet a global symbol in assembly code.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Keir Fraser <keir@xen.org>
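
A C-side landing point of the kind this change describes might look roughly
like the sketch below; the function name, message format, and the use of
__domain_crash()/do_softirq() are assumptions for illustration, not a copy of
the actual patch.

    /*
     * Hedged sketch: a C helper the assembly stubs can call with the address
     * of the failing check, so the log pinpoints where the crash decision
     * was made. Name and internals are assumed, not taken from the patch.
     */
    void __attribute__((noreturn))
    asm_domain_crash_synchronous(unsigned long addr)
    {
        printk("domain_crash_sync called from entry.S: fault at %p\n",
               (void *)addr);

        __domain_crash(current->domain);  /* mark the current domain as crashed */

        for ( ; ; )
            do_softirq();                 /* spin until the vCPU is descheduled */
    }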

Also clean up some cases of misused/open-coded ENTRY().

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

... as it causes problems if we happen to exit back via IRET: in the course
of trying to handle the fault, the hypervisor creates a stack frame by hand
and uses PUSHFQ to set the respective EFLAGS field, but expects to be able to
IRET through that stack frame to the second portion of the fixup code (which
causes a #GP due to the stored EFLAGS having NT set).

And even if this worked (e.g. if we cleared NT in that path), it would then
(through the failsafe callback) cause a #GP in the guest with the SYSENTER
handler's first instruction as the source, which in turn would allow guest
user-mode code to crash the guest kernel.

Inject a #GP on the fake (NULL) address of the SYSENTER instruction instead,
just like in the case where the guest kernel didn't register a corresponding
entry point.

This is CVE-2013-1917 / XSA-44.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

This is even more so as the field doesn't have a comment to that effect in
the structure definition.

While modifying the respective assembly code, also convert the IRQSTAT_shift
users to do a 32-bit shift only (as we won't support 48M CPUs any time soon)
and use "cmpl" instead of "testl" when checking the field (both reducing code
size).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>

Otherwise, we may end up in the scheduler, keeping NMIs masked for a possibly
unbounded period of time (until whenever the next IRET gets executed).
Enforce timely event processing by sending a self IPI.

Of course it's open for discussion whether to always use the straight exit
path from handle_ist_exception.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
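
For illustration, queueing an interrupt to the local CPU can be done through
the local APIC's ICR using the "self" destination shorthand, roughly as
sketched below; the MMIO base, register offsets, and helper names are
assumptions about a standard xAPIC layout, not a description of Xen's actual
IPI path.

    #include <stdint.h>

    #define APIC_MMIO_BASE  0xfee00000UL    /* assumed xAPIC MMIO base */
    #define APIC_ICR        0x300           /* Interrupt Command Register (low) */
    #define APIC_DEST_SELF  (1u << 18)      /* destination shorthand: self */
    #define APIC_DM_FIXED   0x000           /* fixed delivery mode */

    static inline void apic_mmio_write(uint32_t reg, uint32_t val)
    {
        *(volatile uint32_t *)(APIC_MMIO_BASE + reg) = val;
    }

    /* Queue an interrupt to this CPU; it is delivered once delivery becomes
     * possible again, e.g. after the IRET ending the NMI/MCE handler. */
    static void send_self_ipi(uint8_t vector)
    {
        apic_mmio_write(APIC_ICR, APIC_DEST_SELF | APIC_DM_FIXED | vector);
    }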

Signed-off-by: Keir Fraser <keir@xen.org>

Experimentally, certain crash kernels will triple fault very early after
starting if started with NMIs disabled. This was discovered when
experimenting with a debug keyhandler which deliberately created a reentrant
NMI, causing stack corruption.

Because of this discovered bug, and because future changes to the NMI
handling will make the kexec path more fragile, take the time now to
bullet-proof the kexec behaviour to be safer in more circumstances.

This patch adds three new low level routines:
 * nmi_crash
   This is a special NMI handler for use during a kexec crash.
 * enable_nmis
   This function enables NMIs by executing an iret-to-self, to disengage the
   hardware NMI latch.
 * trap_nop
   This is a no-op handler which irets immediately. It is not declared with
   ENTRY() to avoid the extra alignment overhead.

And adds three new IDT entry helper routines:
 * _write_gate_lower
   This is a substitute for using cmpxchg16b to update a 128-bit structure
   at once. It assumes that the top 64 bits are unchanged (and ASSERT()s the
   fact) and performs a regular write on the lower 64 bits.
 * _set_gate_lower
   This is functionally equivalent to the already present _set_gate(),
   except it uses _write_gate_lower rather than updating both 64-bit values.
 * _update_gate_addr_lower
   This is designed to update an IDT entry handler only, without altering
   any other settings in the entry. It also uses _write_gate_lower.

The IDT entry helpers are required because:
 * It is unsafe to attempt a disable/update/re-enable cycle on the NMI or
   MCE IDT entries.
 * We need to be able to update NMI handlers without changing the IST entry.

As a result, the new behaviour of the kexec_crash path is:

nmi_shootdown_cpus() will:
 * Disable the crashing CPU's NMI/MCE interrupt stack tables.
   Disabling the stack tables removes race conditions which would lead to
   corrupt exception frames and infinite loops. As this pcpu is never
   planning to execute a sysret back to a pv vcpu, the update is safe from a
   security point of view.
 * Swap the NMI trap handlers.
   The crashing pcpu gets the nop handler, to prevent it getting stuck in an
   NMI context, causing a hang instead of a crash. The non-crashing pcpus
   all get the nmi_crash handler, which is designed never to return.

do_nmi_crash() will:
 * Save the crash notes and shut the pcpu down.
   There is now an extra per-cpu variable to prevent us from executing this
   multiple times. In the case where we reenter midway through, attempt the
   whole operation again in preference to not completing it in the first
   place.
 * Set up another NMI at the LAPIC.
   Even when the LAPIC has been disabled, the ID and command registers are
   still usable. As a result, we can deliberately queue up a new NMI to
   re-interrupt us later if NMIs get unlatched. Because of the call to
   __stop_this_cpu(), we have to hand-craft self_nmi() to be safe from
   General Protection Faults.
 * Fall into an infinite loop.

machine_kexec() will:
 * Swap the MCE handlers to be a nop.
   We cannot prevent MCEs from being delivered when we pass off to the crash
   kernel, and the less Xen context is being touched the better.
 * Explicitly enable NMIs.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>

Minor style changes.

Signed-off-by: Keir Fraser <keir@xen.org>
Committed-by: Keir Fraser <keir@xen.org>
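
Of the new routines, enable_nmis has the least obvious mechanism: the CPU
latches NMIs from NMI delivery until the next IRET, so IRET-ing through a
hand-built frame that simply returns to the next instruction releases the
latch. A hedged C/inline-assembly sketch of the idea is below; the selector
value is an assumption, and the real routine lives in assembly.

    /*
     * Hedged sketch of "iret-to-self": build a fake interrupt frame
     * (SS, RSP, RFLAGS, CS, RIP) and IRET through it to the label just
     * past the IRETQ, which re-enables NMI delivery.
     */
    #define FAKE_CS 0xe008   /* assumed ring-0 code selector */

    static inline void enable_nmis(void)
    {
        unsigned long tmp;

        asm volatile (
            "mov  %%rsp, %[tmp]     \n\t"
            "push $0                \n\t"  /* SS (NULL is fine for ring 0 in long mode) */
            "push %[tmp]            \n\t"  /* RSP value to restore */
            "pushfq                 \n\t"  /* RFLAGS */
            "push %[cs]             \n\t"  /* CS */
            "lea  1f(%%rip), %[tmp] \n\t"
            "push %[tmp]            \n\t"  /* RIP: the label below */
            "iretq                  \n\t"
            "1:"
            : [tmp] "=&r" (tmp)
            : [cs] "i" (FAKE_CS)
            : "memory" );
    }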

... and make restore conditional not only upon having saved the state, but
also upon whether the saved state was actually modified (and register values
are known to have been preserved).

Note that RBP is unconditionally considered a volatile register (i.e.
irrespective of CONFIG_FRAME_POINTER), since the RBP handling would become
overly complicated due to the need to save/restore it on the compat mode
hypercall path [6th argument].

Note further that for compat mode code paths, saving/restoring R8...R15 is
entirely unnecessary - we don't allow those guests to enter 64-bit mode, and
hence they have no way of seeing these registers' contents (and there
consequently also is no information leak, except if the context-saving
domctl would be considered such).

Finally, note that this may not properly deal with gdbstub's needs yet (but
if so, I can't really suggest adjustments, as I don't know that code).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
Tracing functions that don't check tb_init_done are (by convention)
prefixed with __.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Committed-by: Keir Fraser <keir@xen.org>

The use of "or" in GET_CPUINFO_FIELD so far wasn't ideal, as it doesn't lend
itself to folding this operation with a possibly subsequent one (e.g. the
well-known mov+add=lea conversion). Split out the sub-operations, shortening
the assembly code slightly in the process.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
This utilizes the fact that the two bytes of interest are adjacent to
one another and that the resulting 16-bit values of interest are within
a contiguous range of numbers.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
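
As a generic illustration of that technique (the field names and bounds below
are hypothetical, not taken from the patch): two adjacent byte-sized fields
can be fetched as a single 16-bit value and matched against a contiguous
range with one compare.

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical pair of adjacent byte-sized fields. */
    struct two_bytes {
        uint8_t type;      /* first byte */
        uint8_t subtype;   /* second byte, adjacent in memory */
    };

    /* Hypothetical bounds: the (type, subtype) pairs of interest form a
     * contiguous range when read as little-endian 16-bit values. */
    #define PAIR_FIRST  0x0105u
    #define PAIR_LAST   0x0109u

    static bool pair_in_range(const struct two_bytes *p)
    {
        uint16_t v;

        memcpy(&v, p, sizeof(v));    /* one 16-bit load instead of two byte loads */
        return (uint16_t)(v - PAIR_FIRST) <= (PAIR_LAST - PAIR_FIRST);
    }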

This was set to zero immediately before the #GP injection code, since
SYSENTER doesn't really have a return address.

Reported-by: Ian Campbell <Ian.Campbell@citrix.com>

Furthermore, UREGS_cs and UREGS_rip don't need to be written a second time,
as the PUSHes above already take care of putting the intended values in
place.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Keir Fraser <keir@xen.org>

Check for a non-canonical guest RIP before attempting to execute sysret.

If sysret is executed with a non-canonical value in RCX, Intel CPUs take the
fault in ring 0, but we will necessarily already have switched to the user's
stack pointer.

This is a security vulnerability, XSA-7 / CVE-2012-0217.

Signed-off-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Ian Campbell <Ian.Campbell@citrix.com>
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Tested-by: Ian Campbell <Ian.Campbell@citrix.com>
Acked-by: Keir Fraser <keir.xen@gmail.com>
Committed-by: Ian Jackson <ian.jackson@eu.citrix.com>
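
For reference, a canonical-address check on a 48-bit virtual address space
can be written as below; this is a generic sketch, not necessarily the exact
test the patch added.

    #include <stdbool.h>
    #include <stdint.h>

    /* An address is canonical (48-bit VA) when bits 63:47 are all copies of
     * bit 47, i.e. a sign-extension of the low 48 bits. */
    static inline bool is_canonical_address(uint64_t addr)
    {
        return ((int64_t)addr >> 47) == ((int64_t)addr >> 63);
    }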

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Committed-by: Keir Fraser <keir@xen.org>

Likely through copy-and-paste, all three instances of guest MCE processing
jumped to the wrong place (where the NMI processing code correctly jumps)
when MCEs are temporarily masked (due to one currently being processed by
the guest). A nested, unmasked NMI should get delivered immediately, however.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
(I spotted this copy-and-paste mistake only when backporting c/s
25200:80f4113be500 to 4.1 and 4.0.)
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>

When guest use of sysenter (64-bit PV guest) or syscall (32-bit PV guest)
gets converted into a GP fault (due to no callback having been registered),
we must
- honor the GP fault handler's request to keep event delivery enabled or to
  mask it
- not allow TBF_EXCEPTION to remain set past the generation of the (guest)
  exception in the vCPU's trap_bounce.flags, as that would otherwise allow
  the next exception occurring in guest mode, should it happen to get
  handled in Xen itself, to nevertheless get bounced to the guest kernel.

Also, just like compat mode syscall handling already did, native mode
sysenter handling should, when converting to #GP, subtract 2 (the length of
either instruction) from the RIP present in the frame, so that the guest's
GP fault handler sees the fault pointing to the offending instruction
instead of past it.

Finally, since those exception-generating code blocks needed to be modified
anyway, convert them to make use of UNLIKELY_{START,END}().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
ENTRY() already includes an ALIGN.
Signed-off-by: Jan Beulich <jbeulich@novell.com>

#PF is, in all "normal" usage models, the only potentially high-frequency
(and hence performance-sensitive) exception. Thus make it the fall-through
case into handle_exception (rather than divide_error for x86-32, and not
having one at all for x86-64).

Signed-off-by: Jan Beulich <jbeulich@novell.com>

In the absence of VT-d interrupt remapping support, a device can send
arbitrary APIC messages to host CPUs. One class of attack that results is to
confuse the hypervisor by delivering asynchronous interrupts to vectors that
are expected to handle only synchronous traps/exceptions.

We block this class of attack by:
(1) setting APIC.TPR=0x10, to block all interrupts below vector 0x20. This
    blocks delivery to all architectural exception vectors.
(2) checking APIC.ISR[vec] for vectors 0x80 (fast syscall) and 0x82
    (hypercall). In these cases we BUG if we detect we are handling a
    hardware interrupt -- turning a potentially more severe infiltration
    into a straightforward system crash (i.e., DoS).

Thanks to Invisible Things Lab <http://www.invisiblethingslab.com> for
discovery and detailed investigation of this attack.

Signed-off-by: Keir Fraser <keir@xen.org>
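
The ISR check mentioned in (2) amounts to reading the in-service register
bank covering the vector; a hedged sketch is below, assuming the usual xAPIC
MMIO layout (helper names invented for illustration).

    #include <stdbool.h>
    #include <stdint.h>

    #define APIC_MMIO_BASE  0xfee00000UL
    #define APIC_ISR        0x100   /* 8 x 32-bit In-Service registers, 0x10 apart */

    static inline uint32_t apic_mmio_read(uint32_t reg)
    {
        return *(volatile uint32_t *)(APIC_MMIO_BASE + reg);
    }

    /* True if 'vector' is currently marked in-service, i.e. we really are
     * handling a hardware-delivered interrupt rather than a software trap. */
    static bool apic_vector_in_service(uint8_t vector)
    {
        uint32_t reg = APIC_ISR + ((vector & ~0x1f) >> 1);  /* bank for this vector */

        return apic_mmio_read(reg) & (1u << (vector & 0x1f));
    }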

This is accomplished by splitting the guest_context member, which by itself
is larger than a page on x86-64. Quite a number of fields of this structure
are completely meaningless for HVM guests, and thus a new struct pv_vcpu gets
introduced, which gets overlaid with struct hvm_vcpu in struct arch_vcpu. The
one member that is mostly responsible for the large size is trap_ctxt, which
now gets allocated separately (unless it fits on the same page as struct
arch_vcpu, as is currently the case for x86-32), and only for non-HVM,
non-idle domains.

This change pointed out a latent problem in arch_set_info_guest(), which is
permitted to be called on already initialized vCPUs, but so far copied the
new state into struct arch_vcpu without (in this case) actually going through
all the necessary accounting/validation steps. The logic gets changed so that
the pieces that bypass accounting will at least be verified to be no
different from the currently active bits, and the whole change will fail in
case they are. The logic does *not* get adjusted here to do full error
recovery, that is, partially modified state continues to not get unrolled in
case of failure.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
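
Schematically, the overlay described above might look like the sketch below;
the member names and the illustrative fields are assumptions, not the actual
structure contents.

    /* Hedged sketch: PV-only and HVM-only state overlaid in the per-vCPU
     * architecture block, so only one variant's footprint is paid. */
    struct trap_info;                    /* per-vector guest trap settings */

    struct pv_vcpu {
        struct trap_info *trap_ctxt;     /* allocated separately, per the text */
        unsigned long kernel_sp;         /* ... further PV-only fields ... */
    };

    struct hvm_vcpu {
        unsigned long guest_cr[5];       /* ... HVM-only fields ... */
    };

    struct arch_vcpu {
        /* ... fields common to PV and HVM ... */
        union {
            struct pv_vcpu  pv;
            struct hvm_vcpu hvm;
        } u;                             /* names assumed for illustration */
    };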

This fixes the build with perfc=y.

Signed-off-by: Keir Fraser <keir@xen.org>

... thus allowing the entries to be made half their current size. Rather
than adjusting all instances to the new layout, abstract the construction of
the table entries via a macro (paralleling a similar one in recent Linux).

Also change the name of the section (to allow easier detection of missed
cases) and merge the final resulting output sections into .data.read_mostly.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
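
The size reduction follows the general technique of recording 32-bit
self-relative offsets instead of 64-bit absolute pointers; below is a hedged
sketch of such an entry layout and an assembly-side construction macro in the
spirit of Linux's _ASM_EXTABLE() (the names and section name here are
invented for illustration).

    #include <stdint.h>

    /* Each entry stores two 32-bit offsets, each relative to its own
     * location, instead of two 64-bit absolute addresses. A consumer
     * recovers an address as (unsigned long)&field + field. */
    struct table_entry {
        int32_t addr;   /* offset to the instruction of interest */
        int32_t cont;   /* offset to the continuation/fixup code */
    };

    /* Assembly-side constructor; section name is made up here. */
    #define _ASM_TABLE_ENTRY(from, to)          \
        " .pushsection .rel_table, \"a\"\n"     \
        " .balign 4\n"                          \
        " .long (" #from ") - .\n"              \
        " .long (" #to ") - .\n"                \
        " .popsection\n"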

Signed-off-by: Jan Beulich <jbeulich@novell.com>
... by moving the respective code out of line (into sub-section 1 of
the particular section). A few other branches could be eliminated
altogether.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
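
The out-of-line placement relies on GNU as subsections: anything emitted
under ".subsection 1" is laid out after the ".subsection 0" contents of the
same section, so rarely-executed paths move off the hot cache lines. A hedged
sketch of macros in this spirit, for assembly sources run through the C
preprocessor, is below; these definitions are illustrative, not Xen's actual
UNLIKELY_START/UNLIKELY_END.

    /* Illustrative only: branch to a cold label placed in subsection 1,
     * then jump back to the hot path, which continues in subsection 0. */
    #define UNLIKELY_START(cond, tag)       \
        j##cond .Lunlikely_##tag;           \
        .subsection 1;                      \
        .Lunlikely_##tag:

    #define UNLIKELY_END(tag)               \
        jmp .Llikely_##tag;                 \
        .subsection 0;                      \
        .Llikely_##tag: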

Signed-off-by: Jan Beulich <jbeulich@novell.com>

Signed-off-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
Largely this involves avoiding assumptions about 'struct cpu_info'.
Signed-off-by: Keir Fraser <keir.fraser@citrix.com>
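
For context, a hedged sketch of the kind of arrangement struct cpu_info
describes: a per-CPU block kept at the top of each CPU's stack, located by
aligning the stack pointer. The STACK_SIZE value and field list below are
assumptions for illustration, not the real layout.

    #define STACK_SIZE (8 * 4096UL)   /* assumed per-CPU stack size */

    struct cpu_info {
        /* ... saved guest register frame, per-CPU pointers, ... */
        unsigned long processor_id;
    };

    /* The block sits at the very top of the current stack, so it can be
     * found by rounding RSP up to the stack boundary. */
    static inline struct cpu_info *get_cpu_info(void)
    {
        unsigned long sp;

        asm ( "mov %%rsp, %0" : "=r" (sp) );
        return (struct cpu_info *)((sp | (STACK_SIZE - 1)) + 1) - 1;
    }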

This attempts to address all the concerns raised in
http://lists.xensource.com/archives/html/xen-devel/2009-11/msg01195.html,
but I'm nevertheless still not convinced that all aspects of the injection
handling really work reliably. In particular, while the patch here, on top of
the fixes for the problems mentioned in the referenced mail, also adds code
to keep send_guest_trap() from injecting multiple events at a time, I don't
think this is the right mechanism - it should be possible to handle NMI/MCE
nested within each other.

Another fix on top of the ones for the earlier described problems is that the
vCPU affinity restore logic didn't account for software-injected NMIs - these
never set cpu_affinity_tmp, but due to it most likely being different from
cpu_affinity it would have got restored (to a potentially random value)
nevertheless.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Make various data items const or __read_mostly where
possible/reasonable.
Signed-off-by: Jan Beulich <jbeulich@novell.com>

Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Signed-off-by: Keir Fraser <keir.fraser@citrix.com>

Signed-off-by: Keir Fraser <keir.fraser@citrix.com>

Signed-off-by: Christoph Egger <Christoph.Egger@amd.com>
This prevents it appearing in crash traces, where it can be a bit confusing.
Signed-off-by: Keir Fraser <keir.fraser@citrix.com>
specified as not disabling event delivery, just like any other vector.
Signed-off-by: Keir Fraser <keir@xensource.com>
Looking at the Linux patch as an example, it adds more code and
complexity than it removes, for no obvious performance benefit.
Signed-off-by: Keir Fraser <keir@xensource.com>
in 64-bit pv guests and 32on64.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Keir Fraser <keir@xensource.com>
Based on a patch by Yosuke Iwamatsu <y-iwamatsu@ab.jp.nec.com>
Signed-off-by: Keir Fraser <keir@xensource.com>
userspace to guest kernel. The flags are saved on the guest kernel
stack anyway, but some guests rely on %r11 instead.
Signed-off-by: Keir Fraser <keir@xensource.com>
From: George Dunlap <gdunlap@xensource.com>
Signed-off-by: Keir Fraser <keir@xensource.com>

do_device_not_available(), following the naming convention for all other C
exception handlers.

Signed-off-by: Keir Fraser <keir@xensource.com>

Signed-off-by: George Coker <gscoker@alpha.ncsc.mil>

Signed-off-by: George Coker <gscoker@alpha.ncsc.mil>

Signed-off-by: Jan Beulich <jbeulich@novell.com>

Signed-off-by: Keir Fraser <keir@xensource.com>

Properly handle MCE (connecting the existing, but so far unused,
vendor-specific handlers). HVM guests don't own CR4.MCE (and hence can't
suppress the exception) anymore, preventing silent machine shutdown.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Keir Fraser <keir@xensource.com>

Signed-off-by: Keir Fraser <keir@xensource.com>
Fixes suggested by Jan Beulich.
Signed-off-by: Keir Fraser <keir@xensource.com>

disable interrupts before exiting to guest context. Also sprinkle in some
assertions about interrupt-enable status.

Signed-off-by: Keir Fraser <keir@xensource.com>