x86/kexec: Change NMI and MCE handling on kexec path

Experimentally, certain crash kernels will triple fault very early
after starting if started with NMIs disabled.  This was discovered
when experimenting with a debug keyhandler which deliberately created
a reentrant NMI, causing stack corruption.

Because of this discovered bug, and because future changes to the NMI
handling will make the kexec path more fragile, take the time now to
bullet-proof the kexec behaviour to be safer in more circumstances.

This patch adds three new low level routines:
 * nmi_crash
     This is a special NMI handler for use during a kexec crash.
 * enable_nmis
     This function enables NMIs by executing an iret-to-self, to
     disengage the hardware NMI latch.
 * trap_nop
     This is a no op handler which irets immediately.  It is not
     declared with ENTRY() to avoid the extra alignment overhead.

And adds three new IDT entry helper routines:
 * _write_gate_lower
     This is a substitute for using cmpxchg16b to update a 128bit
     structure at once.  It assumes that the top 64 bits are unchanged
     (and ASSERT()s the fact) and performs a regular write on the lower
     64 bits.
 * _set_gate_lower
     This is functionally equivalent to the already present
     _set_gate(), except it uses _write_gate_lower rather than updating
     both 64bit values.
 * _update_gate_addr_lower
     This is designed to update an IDT entry handler only, without
     altering any other settings in the entry.  It also uses
     _write_gate_lower.

The IDT entry helpers are required because:
 * It is unsafe to attempt a disable/update/re-enable cycle on the NMI
   or MCE IDT entries.
 * We need to be able to update NMI handlers without changing the IST
   entry.

As a result, the new behaviour of the kexec_crash path is:

nmi_shootdown_cpus() will:

 * Disable the crashing cpu's NMI/MCE interrupt stack tables.
     Disabling the stack tables removes race conditions which would
     lead to corrupt exception frames and infinite loops.  As this pcpu
     is never planning to execute a sysret back to a pv vcpu, the
     update is safe from a security point of view.

 * Swap the NMI trap handlers.
     The crashing pcpu gets the nop handler, to prevent it getting
     stuck in an NMI context, causing a hang instead of a crash.  The
     non-crashing pcpus all get the nmi_crash handler which is designed
     never to return.

do_nmi_crash() will:

 * Save the crash notes and shut the pcpu down.
     There is now an extra per-cpu variable to prevent us from
     executing this multiple times.  In the case where we reenter
     midway through, attempt the whole operation again in preference
     to not completing it in the first place.

 * Set up another NMI at the LAPIC.
     Even when the LAPIC has been disabled, the ID and command
     registers are still usable.  As a result, we can deliberately
     queue up a new NMI to re-interrupt us later if NMIs get
     unlatched.  Because of the call to __stop_this_cpu(), we have to
     hand craft self_nmi() to be safe from General Protection Faults.

 * Fall into an infinite loop.

machine_kexec() will:

 * Swap the MCE handlers to be a nop.
     We cannot prevent MCEs from being delivered when we pass off to
     the crash kernel, and the less Xen context is being touched the
     better.

 * Explicitly enable NMIs.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>

Minor style changes.
Signed-off-by: Keir Fraser <keir@xen.org>
Committed-by: Keir Fraser <keir@xen.org>
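
The "lower 64 bits only" update described for _write_gate_lower and _update_gate_addr_lower can be illustrated with a small standalone sketch. It is not the code from this patch: struct gate, make_gate(), write_gate_lower(), update_gate_addr_lower() and gate_addr() are hypothetical names, and assert() stands in for Xen's ASSERT(). The sketch assumes the standard x86-64 interrupt gate layout, where handler offset bits 31:0 live in the low 64-bit word and bits 63:32 in the high word.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a 16-byte x86-64 IDT gate descriptor. */
struct gate {
    uint64_t lo;  /* offset[15:0], selector, IST, type/DPL/P, offset[31:16] */
    uint64_t hi;  /* offset[63:32], reserved */
};

/* Bits of the low word that carry handler offset bits 15:0 and 31:16. */
#define GATE_LO_ADDR_MASK 0xFFFF00000000FFFFULL

/* Build a present interrupt gate (type 14) for illustration. */
static struct gate make_gate(uint64_t addr, uint16_t sel, unsigned int ist)
{
    struct gate g;

    g.hi = addr >> 32;
    g.lo = (addr & 0xFFFFULL)
         | ((uint64_t)sel << 16)
         | ((uint64_t)(ist & 7) << 32)
         | (0xEULL << 40)                  /* type 14: interrupt gate */
         | (1ULL << 47)                    /* present */
         | ((addr & 0xFFFF0000ULL) << 32);
    return g;
}

/* Replace only the low word; the high word must already match. */
static void write_gate_lower(volatile struct gate *g, const struct gate *new)
{
    assert(g->hi == new->hi);
    g->lo = new->lo;  /* a single aligned 64-bit store */
}

/* Change the handler address, preserving selector, IST, type, DPL and P. */
static void update_gate_addr_lower(volatile struct gate *g, uint64_t addr)
{
    struct gate new;

    new.hi = addr >> 32;
    new.lo = (g->lo & ~GATE_LO_ADDR_MASK)
           | (addr & 0xFFFFULL)
           | ((addr & 0xFFFF0000ULL) << 32);
    write_gate_lower(g, &new);
}

/* Reassemble the 64-bit handler address from a gate. */
static uint64_t gate_addr(const volatile struct gate *g)
{
    return (g->lo & 0xFFFFULL)
         | ((g->lo >> 32) & 0xFFFF0000ULL)
         | (g->hi << 32);
}
```

Because the entry is never marked not-present and only one 64-bit word changes, a concurrently delivered NMI or MCE sees either the old or the new handler, which is the property the message relies on when it rules out a disable/update/re-enable cycle. The sketch only works when old and new handlers share the same upper 32 address bits, which is what the assertion checks.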
author: Andrew Cooper <andrew.cooper3@citrix.com> 2012-12-13 14:39:31 +0000
committer: Andrew Cooper <andrew.cooper3@citrix.com> 2012-12-13 14:39:31 +0000
commit: 77ad1faa6b6147770feece6a59e21b28c0e0788f (patch)
tree: 66bd6e20dce259f6eb59b728ee4759ce877a1388 /xen/arch/x86/crash.c
parent: fd91a2a662bc59677e0f217423a7a155d5465886 (diff)
download: xen-77ad1faa6b6147770feece6a59e21b28c0e0788f.tar.gz
xen-77ad1faa6b6147770feece6a59e21b28c0e0788f.tar.bz2
xen-77ad1faa6b6147770feece6a59e21b28c0e0788f.zip
1 files changed, 104 insertions, 15 deletions
diff --git a/xen/arch/x86/crash.c b/xen/arch/x86/crash.c
index 88cba19a40..b6c0a7f569 100644
--- a/xen/arch/x86/crash.c
+++ b/xen/arch/x86/crash.c
@@ -32,41 +32,130 @@
 
 static atomic_t waiting_for_crash_ipi;
 static unsigned int crashing_cpu;
+static DEFINE_PER_CPU_READ_MOSTLY(bool_t, crash_save_done);
 
-static int crash_nmi_callback(struct cpu_user_regs *regs, int cpu)
+/* This becomes the NMI handler for non-crashing CPUs, when Xen is crashing. */
+void __attribute__((noreturn)) do_nmi_crash(struct cpu_user_regs *regs)
 {
-    /* Don't do anything if this handler is invoked on crashing cpu.
-     * Otherwise, system will completely hang. Crashing cpu can get
-     * an NMI if system was initially booted with nmi_watchdog parameter.
+    int cpu = smp_processor_id();
+
+    /* nmi_shootdown_cpus() should ensure that this assertion is correct. */
+    ASSERT(cpu != crashing_cpu);
+
+    /* Save crash information and shut down CPU.  Attempt only once. */
+    if ( !this_cpu(crash_save_done) )
+    {
+        /* Disable the interrupt stack table for the MCE handler.  This
+         * prevents race conditions between clearing MCIP and receiving a
+         * new MCE, during which the exception frame would be clobbered
+         * and the MCE handler fall into an infinite loop.  We are soon
+         * going to disable the NMI watchdog, so the loop would not be
+         * caught.
+         *
+         * We do not need to change the NMI IST, as the nmi_crash
+         * handler is immune to corrupt exception frames, by virtue of
+         * being designed never to return.
+         *
+         * This update is safe from a security point of view, as this
+         * pcpu is never going to try to sysret back to a PV vcpu.
+         */
+        set_ist(&idt_tables[cpu][TRAP_machine_check], IST_NONE);
+
+        kexec_crash_save_cpu();
+        __stop_this_cpu();
+
+        this_cpu(crash_save_done) = 1;
+        atomic_dec(&waiting_for_crash_ipi);
+    }
+
+    /* Poor man's self_nmi().  __stop_this_cpu() has reverted the LAPIC
+     * back to its boot state, so we are unable to rely on the regular
+     * apic_* functions, due to 'x2apic_enabled' being possibly wrong.
+     * (The likely scenario is that we have reverted from x2apic mode to
+     * xapic, at which point #GPFs will occur if we use the apic_*
+     * functions)
+     *
+     * The ICR and APIC ID of the LAPIC are still valid even during
+     * software disable (Intel SDM Vol 3, 10.4.7.2).  As a result, we
+     * can deliberately queue up another NMI at the LAPIC which will not
+     * be delivered as the hardware NMI latch is currently in effect.
+     * This means that if NMIs become unlatched (e.g. following a
+     * non-fatal MCE), the LAPIC will force us back here rather than
+     * wandering back into regular Xen code.
      */
-    if ( cpu == crashing_cpu )
-        return 1;
-    local_irq_disable();
+    switch ( current_local_apic_mode() )
+    {
+        u32 apic_id;
 
-    kexec_crash_save_cpu();
+    case APIC_MODE_X2APIC:
+        apic_id = apic_rdmsr(APIC_ID);
 
-    __stop_this_cpu();
+        apic_wrmsr(APIC_ICR, APIC_DM_NMI | APIC_DEST_PHYSICAL
+                   | ((u64)apic_id << 32));
+        break;
+
+    case APIC_MODE_XAPIC:
+        apic_id = GET_xAPIC_ID(apic_mem_read(APIC_ID));
 
-    atomic_dec(&waiting_for_crash_ipi);
+        while ( apic_mem_read(APIC_ICR) & APIC_ICR_BUSY )
+            cpu_relax();
+
+        apic_mem_write(APIC_ICR2, apic_id << 24);
+        apic_mem_write(APIC_ICR, APIC_DM_NMI | APIC_DEST_PHYSICAL);
+        break;
+
+    default:
+        break;
+    }
 
     for ( ; ; )
         halt();
-
-    return 1;
 }
 
 static void nmi_shootdown_cpus(void)
 {
     unsigned long msecs;
+    int i, cpu = smp_processor_id();
 
     local_irq_disable();
 
-    crashing_cpu = smp_processor_id();
+    crashing_cpu = cpu;
     local_irq_count(crashing_cpu) = 0;
 
     atomic_set(&waiting_for_crash_ipi, num_online_cpus() - 1);
-    /* Would it be better to replace the trap vector here? */
-    set_nmi_callback(crash_nmi_callback);
+
+    /* Change NMI trap handlers.  Non-crashing pcpus get nmi_crash which
+     * invokes do_nmi_crash (above), which causes them to write state and
+     * fall into a loop.  The crashing pcpu gets the nop handler to
+     * cause it to return to this function ASAP.
+     */
+    for ( i = 0; i < nr_cpu_ids; i++ )
+    {
+        if ( idt_tables[i] == NULL )
+            continue;
+
+        if ( i == cpu )
+        {
+            /*
+             * Disable the interrupt stack tables for this cpu's MCE and NMI 
+             * handlers, and alter the NMI handler to have no operation.  
+             * Disabling the stack tables prevents stack corruption race 
+             * conditions, while changing the handler helps prevent cascading 
+             * faults; we are certainly going to crash by this point.
+             *
+             * This update is safe from a security point of view, as this pcpu 
+             * is never going to try to sysret back to a PV vcpu.
+             */
+            _set_gate_lower(&idt_tables[i][TRAP_nmi], 14, 0, &trap_nop);
+            set_ist(&idt_tables[i][TRAP_machine_check], IST_NONE);
+        }
+        else
+        {
+            /* Do not update stack table for other pcpus. */
+            _update_gate_addr_lower(&idt_tables[i][TRAP_nmi], &nmi_crash);
+        }
+    }
+
     /* Ensure the new callback function is set before sending out the NMI. */
     wmb();
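
The two branches of the hand-crafted self-NMI in do_nmi_crash() differ only in how the same ICR value is laid out. The following standalone sketch shows the values being composed; it does not touch hardware. DM_NMI, DEST_PHYSICAL, struct xapic_icr and both helper functions are hypothetical names chosen for illustration, with the constants taken from the architectural ICR layout (delivery mode in bits 10:8, destination mode in bit 11).

```c
#include <stdint.h>

#define DM_NMI        0x00000400u  /* delivery mode 100b (NMI), ICR bits 10:8 */
#define DEST_PHYSICAL 0x00000000u  /* destination mode bit 11 clear */

/*
 * x2APIC mode: the ICR is a single 64-bit MSR, with the 32-bit
 * destination APIC ID in bits 63:32.
 */
static uint64_t x2apic_self_nmi_icr(uint32_t apic_id)
{
    return (uint64_t)(DM_NMI | DEST_PHYSICAL) | ((uint64_t)apic_id << 32);
}

/*
 * xAPIC mode: the ICR is split across two 32-bit MMIO registers.  The
 * 8-bit destination goes in bits 31:24 of the high register (ICR2).
 */
struct xapic_icr {
    uint32_t icr2;  /* written first */
    uint32_t icr;   /* writing the low register sends the IPI */
};

static struct xapic_icr xapic_self_nmi_icr(uint8_t apic_id)
{
    struct xapic_icr v = {
        .icr2 = (uint32_t)apic_id << 24,
        .icr  = DM_NMI | DEST_PHYSICAL,
    };
    return v;
}
```

In xAPIC mode the order of the two writes matters, because the write to the low register is what triggers delivery. That is also why the xAPIC branch in the diff polls the busy bit before writing, while the x2APIC branch performs a single MSR write.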