Commit log: subject (author, date; files changed, -lines/+lines)

* tools/ocaml: clean META files (Vincent Bernardoff, 2013-04-17; 1 file, -1/+1)
  As META files are generated from META.in files, they should be cleaned by clean rules.
  Signed-off-by: Vincent Bernardoff <vincent.bernardoff@citrix.com>
  Acked-by: Ian Campbell <ian.campbell@citrix.com>

* Merge branch 'staging' of ssh://xenbits.xen.org/home/xen/git/xen into staging (Ian Campbell, 2013-04-17; 2 files, -9/+10)

* x86: allow VCPUOP_register_vcpu_info to work again on PVHVM guests (Konrad Rzeszutek Wilk, 2013-04-17; 2 files, -9/+10)
  For details on the hypercall please see commit c58ae69360ccf2495a19bf4ca107e21cf873c75b (VCPUOP_register_vcpu_info) and c/s 23143 (git commit 6b063a4a6f44245a727aa04ef76408b2e00af9c7) (x86: move pv-only members of struct vcpu to struct pv_vcpu) that introduced the regression.

  The current code allows the PVHVM guest to make this hypercall, but for a PVHVM guest it always returns -EINVAL (-22) on Xen 4.2 and above. Xen 4.1 and earlier worked. The reason is that the check in map_vcpu_info would fail at:

      if ( v->arch.vcpu_info_mfn != INVALID_MFN )

  because vcpu_info_mfn for PVHVM guests ends up by default with the value of zero (introduced by c/s 23143). The code in vcpu_initialise which initialized vcpu_info_mfn to a valid value (INVALID_MFN) would never be called for PVHVM:

      if ( is_hvm_domain(d) )
      {
          rc = hvm_vcpu_initialise(v);
          goto done;
      }
      v->arch.pv_vcpu.vcpu_info_mfn = INVALID_MFN;

  while previously it was:

      v->arch.vcpu_info_mfn = INVALID_MFN;

  [right at the start of the function in Xen 4.1]

  This fixes the problem with Linux advertising this error:

      register_vcpu_info failed: err=-22

  Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

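  A minimal sketch of the ordering issue and one way it can be addressed (simplified structures and hypothetical names; an illustration, not necessarily the code of this commit):

      /* The field defaults to 0 when the struct is allocated, so it must get
       * its INVALID_MFN sentinel before any HVM early return. */
      #define INVALID_MFN_SKETCH (~0UL)

      struct vcpu_sketch {
          unsigned long vcpu_info_mfn;   /* zero when freshly allocated */
      };

      static void vcpu_initialise_sketch(struct vcpu_sketch *v, int is_hvm)
      {
          v->vcpu_info_mfn = INVALID_MFN_SKETCH;  /* set before bailing out */

          if ( is_hvm )
              return;   /* the buggy flow returned before the assignment */

          /* ... PV-only initialisation ... */
      }

      static int map_vcpu_info_sketch(struct vcpu_sketch *v)
      {
          if ( v->vcpu_info_mfn != INVALID_MFN_SKETCH )
              return -22;   /* -EINVAL, hit spuriously while the field was still 0 */
          /* ... map the page and record the new mfn ... */
          return 0;
      }
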
* docs: rearrange and update NUMA placement documentation (Dario Faggioli, 2013-04-17; 1 file, -11/+83)
  To include the new concept of NUMA aware scheduling and describe its impact.
  Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
  Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
  Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

* xl: add node-affinity to the output of `xl list` (Dario Faggioli, 2013-04-17; 2 files, -63/+105)
  Node-affinity is now something that is under (some) control of the user, so show it upon request as part of the output of `xl list' via the `-n' option.

  Regarding the patch, the print_bitmap() related hunk is _mostly_ code motion, although there is a very minor change in the code, basically to allow using the function for printing both cpu and node bitmaps (in case all bits are set, it used to print "any cpu", which doesn't fit the nodemap case).

  Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
  Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
  Acked-by: Ian Campbell <ian.campbell@citrix.com>

* libxl: automatic placement deals with node-affinity (Dario Faggioli, 2013-04-17; 2 files, -17/+33)
  Which basically means the following two things:
    1) during domain creation, it is the node-affinity of the domain --rather than the vcpu-affinities of its VCPUs-- that is affected by automatic placement;
    2) during automatic placement, when counting how many VCPUs are already "bound" to a placement candidate (as part of the process of choosing the best candidate), both vcpu-affinity and node-affinity are considered.
  Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
  Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
  Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>

* libxl: optimize the calculation of how many VCPUs can run on a candidate (Dario Faggioli, 2013-04-17; 1 file, -22/+48)
  For choosing the best NUMA placement candidate, we need to figure out how many VCPUs are runnable on each of them. That requires going through all the VCPUs of all the domains and checking their affinities.

  With this change, instead of doing the above for each candidate, we do it once for all, populating an array while counting. This way, when we later are evaluating candidates, all we need is summing up the right elements of the array itself.

  This reduces the complexity of the overall algorithm, as it moves a potentially expensive operation (for_each_vcpu_of_each_domain {}) outside the core placement loop, so that it is performed only once instead of (potentially) tens or hundreds of times. More specifically, we go from a worst case computation time complexity of:

      O(2^n_nodes) * O(n_domains*n_domain_vcpus)

  to, with this change:

      O(n_domains*n_domain_vcpus) + O(2^n_nodes) = O(2^n_nodes)

  (with n_nodes<=16, otherwise the algorithm suggests partitioning with cpupools and does not even start.)

  Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
  Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
  Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
  Acked-by: Ian Campbell <ian.campbell@citrix.com>

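  An illustrative sketch of the precomputation idea (hypothetical names and types, not the libxl code): count the bound VCPUs per node once, then score each candidate by summing over its nodemask.

      #include <stdint.h>

      #define SKETCH_MAX_NODES 16

      /* Filled once, by a single pass over every VCPU of every domain,
       * incrementing the counter of each node its affinity allows. */
      static unsigned int vcpus_on_node[SKETCH_MAX_NODES];

      /* Evaluating a candidate is then just a sum, instead of re-walking all
       * VCPUs of all domains for each of the O(2^n_nodes) candidates. */
      static unsigned int candidate_bound_vcpus(uint32_t candidate_nodemask)
      {
          unsigned int node, total = 0;

          for ( node = 0; node < SKETCH_MAX_NODES; node++ )
              if ( candidate_nodemask & (1u << node) )
                  total += vcpus_on_node[node];

          return total;
      }
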
* libxl: allow for explicitly specifying node-affinity (Dario Faggioli, 2013-04-17; 5 files, -0/+39)
  By introducing a nodemap in libxl_domain_build_info and providing the get/set methods to deal with it.
  Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
  Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
  Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
  Acked-by: Ian Campbell <ian.campbell@citrix.com>

* libxc: allow for explicitly specifying node-affinity (Dario Faggioli, 2013-04-17; 2 files, -0/+103)
  By providing the proper get/set interface and wiring them to the new domctl-s from the previous commit.
  Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
  Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
  Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

* xen: allow for explicitly specifying node-affinity (Dario Faggioli, 2013-04-17; 14 files, -21/+182)
  Make it possible to pass the node-affinity of a domain to the hypervisor from the upper layers, instead of always being computed automatically.

  Note that this also required generalizing the Flask hooks for setting and getting the affinity, so that they now deal with both vcpu and node affinity.

  Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
  Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
  Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
  Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
  Acked-by: Keir Fraser <keir@xen.org>

* xen: sched_credit: let the scheduler know about node-affinity (Dario Faggioli, 2013-04-17; 2 files, -153/+319)
  As vcpu-affinity tells where VCPUs must run, node-affinity tells where they prefer to. While respecting vcpu-affinity remains mandatory, node-affinity is not that strict; it only expresses a preference, although honouring it will bring significant performance benefits (especially as compared to not having any affinity at all).

  This change modifies the VCPU load balancing algorithm (for the credit scheduler only), introducing a two-step logic. During the first step, we use both the vcpu-affinity and the node-affinity masks (by looking at their intersection). The aim is giving precedence to the PCPUs where the domain prefers to run, as expressed by its node-affinity (with the intersection with the vcpu-affinity being necessary in order to avoid running a VCPU where it never should). If that fails to find a valid PCPU, the node-affinity is just ignored and, in the second step, we fall back to using vcpu-affinity only.

  Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
  Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
  Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
  Acked-by: Keir Fraser <keir@xen.org>

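  A sketch of the two-step pick described above (generic bitmask operations and hypothetical names, not the sched_credit implementation):

      #include <stdint.h>

      typedef uint64_t pcpu_mask_t;   /* one bit per PCPU (sketch only) */

      static int pick_pcpu_sketch(pcpu_mask_t vcpu_affinity,
                                  pcpu_mask_t node_affinity,
                                  pcpu_mask_t online)
      {
          /* Step 1: prefer PCPUs allowed by both the mandatory vcpu-affinity
           * and the preferred node-affinity. */
          pcpu_mask_t preferred = vcpu_affinity & node_affinity & online;
          if ( preferred )
              return __builtin_ctzll(preferred);

          /* Step 2: node-affinity is only a preference, so fall back to the
           * vcpu-affinity alone. */
          pcpu_mask_t allowed = vcpu_affinity & online;
          return allowed ? __builtin_ctzll(allowed) : -1;
      }
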
* xen: sched_credit: when picking, make sure we get an idle one, if any (Dario Faggioli, 2013-04-17; 1 file, -0/+15)
  The pcpu picking algorithm treats the two threads of an SMT core the same. More specifically, if one is idle and the other one is busy, they both will be assigned a weight of 1. Therefore, when picking begins, if the first target pcpu is the busy thread (and if there is no idle pcpu other than its sibling), that will never change.

  This change fixes this by ensuring that, before entering the core of the picking algorithm, the target pcpu is an idle one (if there is an idle pcpu at all, of course).

  Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
  Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
  Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
  Acked-by: Keir Fraser <keir@xen.org>

* xen, libxc: introduce xc_nodemap_t (Dario Faggioli, 2013-04-17; 3 files, -1/+38)
  And its handling functions, following suit from xc_cpumap_t.
  Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
  Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
  Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
  Acked-by: Keir Fraser <keir@xen.org>

* xen, libxc: rename xenctl_cpumap to xenctl_bitmap (Dario Faggioli, 2013-04-17; 15 files, -49/+71)
  More specifically:
    1. replace xenctl_cpumap with xenctl_bitmap;
    2. provide bitmap_to_xenctl_bitmap and the reverse;
    3. re-implement cpumask_to_xenctl_bitmap with bitmap_to_xenctl_bitmap and the reverse.
  Other than #3, no functional changes. The interface is only slightly affected.

  This is in preparation for introducing NUMA node-affinity maps.

  Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
  Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
  Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
  Acked-by: Keir Fraser <keir@xen.org>

* xl: Fix 'free_memory' to include outstanding_claims value. (Konrad Rzeszutek Wilk, 2013-04-16; 4 files, -32/+27)
  Updating to make it clear that free_memory reported by 'xl info' is influenced by the outstanding claim value; that is, it is the free memory that will be available to the host once all outstanding claims have been completed. This modifies the behavior introduced by the patch titled "xl: 'xl info' print outstanding claims if enabled (claim_mode=1 in xl.conf)", which reported the outstanding claims and nothing else.

  The free_pages value as reported by the hypervisor is the currently available count of pages on the heap. The outstanding pages value is the total amount of pages reserved for guests (so not taken from the heap yet). As guests are being populated, the memory from the heap shrinks and the outstanding count of pages decreases, while the total memory used for guests increases. As the available count of pages on the heap and the outstanding claims are intertwined, report the amount of free memory available as a combination of the two: the free heap memory minus the outstanding pages.

  We also make some odd choices in reporting. By default we will only display 'outstanding_claims' if the claim_mode is enabled in the global configuration file. However, if there are outstanding claims, we will ignore the claim_mode and report these values.

  Suggested-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
  Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

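  As a worked example with made-up numbers: if the hypervisor heap reports 4096 MiB free and 1024 MiB worth of pages are still claimed but not yet populated, 'xl info' now reports free_memory = 4096 - 1024 = 3072 MiB.
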
* xl: 'xl claims' print outstanding per domain claims (Konrad Rzeszutek Wilk, 2013-04-16; 5 files, -6/+68)
  This is similar to "xl: 'xl info' print outstanding claims if enabled (claim_mode=1 in xl.conf)", which exposes the global claim value. This patch provides the value of the currently outstanding pages claimed for each domain. This is a per-domain value which is added to the global claim value, which influences the hypervisor's MM system. When a claim call is done, a reservation for a specific amount of pages is set (and this patch lists said number) and also a global value is incremented. This global value is then reduced as the domain's memory is populated and eventually reaches zero.

  The toolstack (libxc) also sets the domain's claim to zero when the population of memory has completed, as an extra step. Any call to destroy the domain will also set the domain's claim to zero. If the reservation cannot be met, the guest creation fails immediately instead of taking seconds or minutes (depending on the size of the guest) while the toolstack populates memory.

  See patch: "xl: Implement XENMEM_claim_pages support via 'claim_mode' global config" for details on how it is implemented.

  The value fluctuates quite often, so the value is stale once it is provided to user-space. However, it is useful for diagnostic purposes. It is printed regardless of the global "claim_mode" option in xl.conf(5). That is because the user might have enabled it, launched a guest, and then disabled the option - and we should still report the correct outstanding claim value.

  The 'man xl' shows the details of this argument. The output is close to what 'xl list' looks like:

      Name            ID   Mem VCPUs      State   Time(s)  Claimed
      Domain-0         0  2047     4     r-----      19.7        0
      OL5              2  2048     1     --p---       0.0      847
      OL6              3  1024     4     r-----       5.9        0
      Windows_XP       4  2047     1     --p---       0.0     1989

  [In which it can be seen that the OL5 guest still has 847MB of claimed memory (out of the total 2048MB, where 1191MB has been allocated to the guest).]

  Please note that the 'Mem' column has the cumulative value of outstanding claims and the total amount of memory that has been allocated to the guest.

  [v1: claims, not claim-list]
  [v2: Add outstanding and current memkb in the output list]
  [v3: Clarify docs and relax some checks]
  [v4: Removed comments about guest config memory being the same as 'Mem']
  Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

* xl: export 'outstanding_pages' value from xcinfo (Konrad Rzeszutek Wilk, 2013-04-16; 2 files, -0/+2)
  This patch provides the value of the currently outstanding pages claimed for a specific domain. This is a value that influences the global outstanding claims value (see patch: "xl: 'xl info' print outstanding claims if enabled") returned via the xc_domain_get_outstanding_pages hypercall. This domain value decrements as the memory is populated for the guest and eventually reaches zero. With this patch it is possible to utilize this field.
  Acked-by: Ian Campbell <ian.campbell@citrix.com>
  [v2: s/unclaimed/outstanding/ per Tim's suggestion]
  [v3: Don't use SXP printout file per Ian's suggestion]
  Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
  Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

* xc: export outstanding_pages value in xc_dominfo structure. (Dan Magenheimer, 2013-04-16; 2 files, -0/+2)
  This patch provides the value of the currently outstanding pages claimed for a specific domain. This is a value that influences the global outstanding claims value (see patch: "xl: 'xl info' print outstanding claims if enabled") returned via the xc_domain_get_outstanding_pages hypercall. This domain value decrements as the memory is populated for the guest and eventually reaches zero.

  This patch is necessary for the "xl: export 'outstanding_pages' value from xcinfo" patch.

  Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
  [v2: s/unclaimed_pages/outstanding_pages/ per Tim's suggestion]
  Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
  Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

* xl: 'xl info' print outstanding claims if enabled (claim_mode=1 in xl.conf) (Konrad Rzeszutek Wilk, 2013-04-16; 4 files, -0/+49)
  This patch provides the value of the currently outstanding pages claimed for all domains. This is a total global value that influences the hypervisor's MM system. When a claim call is done, a reservation for a specific amount of pages is set and also a global value is incremented. This global value is then reduced as the domain's memory is populated and eventually reaches zero.

  The toolstack (libxc) also sets the domain's claim to zero when the population of memory has completed, as an extra step. Any call to destroy the domain will also set the domain's claim to zero. If the reservation cannot be met, the guest creation fails immediately instead of taking seconds or minutes (depending on the size of the guest) while the toolstack populates memory.

  See patch: "xl: Implement XENMEM_claim_pages support via 'claim_mode' global config" for details on how it is implemented.

  The value fluctuates quite often, so the value is stale once it is provided to user-space. However, it is useful for diagnostic purposes. It is only printed when the global "claim_mode" option in xl.conf(5) is set to enabled (1). The 'man xl' shows the details of this item.

  [v1: s/unclaimed/outstanding/]
  [v2: Made libxl_get_claiminfo return just MemKB, suggested by Ian Campbell]
  [v3: Made libxl_get_claiminfo return MemMB to conform to the other values printed]
  [v4: Improvements suggested by Ian Jackson, also added docs to xl.pod.1]
  [v5: Clarify how claims are cancelled, split >72 characters - Ian Jackson]
  Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
  Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

* xl: Implement XENMEM_claim_pages support via 'claim_mode' global config (Konrad Rzeszutek Wilk, 2013-04-16; 9 files, -3/+60)
  The XENMEM_claim_pages hypercall operates per domain and it should be used system wide. As such this patch introduces a global configuration option 'claim_mode' that by default is disabled. If this option is enabled, then when a guest is created there will be a guarantee that there is memory available for the guest. This is a particularly acute problem on hosts with memory-overprovisioned guests that use tmem and have self-balloon enabled (which is the default option for them). The self-balloon mechanism can deflate/inflate the balloon quickly, and the amount of free memory (which 'xl info' can show) is stale the moment it is printed.

  When claim is enabled, a reservation for the amount of memory ('memory' in the guest config) is set, which is then reduced as the domain's memory is populated and eventually reaches zero. If the reservation cannot be met, the guest creation fails immediately instead of taking seconds/minutes (depending on the size of the guest) while the guest is populated.

  Note that to enable tmem-type guests, one needs to provide 'tmem' on the Xen hypervisor command line as well as on the Linux kernel command line.

  There are two boolean options:
    (0) No claim is made. Memory population during guest creation will be attempted as normal and may fail due to memory exhaustion.
    (1) Normal memory and the freeable pool of ephemeral pages (tmem) are used when calculating whether there is enough memory free to launch a guest. This guarantees immediate feedback on whether the guest can be launched due to memory exhaustion (which can take a long time to find out when launching massively huge guests), and in parallel.

  [v1: Removed own claim_mode type, using just bool, improved docs, all per Ian's suggestion]
  [v2: Updated the comments]
  [v3: Rebase on top 733b9c524dbc2bec318bfc3588ed1652455d30ec (xl: add vif.default.script)]
  [v4: Fixed up comments]
  [v5: s/global_claim_mode/claim_mode/]
  [v6: Ian Jackson's feedback: use libxl_defbool, better comments, etc]
  Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
  Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

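  A minimal illustration of the option described above, as it would appear in xl.conf (the file path is an assumption and may differ per installation):

      # /etc/xen/xl.conf (typical location)
      # Claim the guest's memory up front so creation fails fast if the
      # reservation cannot be met.
      claim_mode=1
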
* xc: use XENMEM_claim_pages hypercall during guest creation. (Dan Magenheimer, 2013-04-16; 6 files, -4/+70)
  We add an extra parameter to the structures passed to the PV routine (arch_setup_meminit) and HVM routine (setup_guest) that determines whether the claim hypercall is to be done. The contents of 'claim_enabled' are defined as an 'int' in case the hypercall expands in the future with extra flags (for example for per-NUMA allocation). For now the proper values are 0 to disable it or 1 to enable it.

  If the hypervisor does not support this function, xc_domain_claim_pages and xc_domain_get_outstanding_pages will silently return 0 (and set errno to zero).

  Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
  [v2: Updated per Ian's recommendations]
  [v3: Added support for out-of-sync hypervisor]
  Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
  Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

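  A sketch of the claim-then-populate flow this enables, assuming the libxc call takes (handle, domid, nr_pages) and that claiming 0 pages drops an outstanding claim; treat both as assumptions, not a description of this patch's code:

      #include <xenctrl.h>

      static int build_memory_sketch(xc_interface *xch, uint32_t domid,
                                     unsigned long nr_pages, int claim_enabled)
      {
          if ( claim_enabled &&
               xc_domain_claim_pages(xch, domid, nr_pages) != 0 )
              return -1;   /* fails immediately if the reservation cannot be met */

          /* ... populate the guest's memory as usual ... */

          if ( claim_enabled )
              xc_domain_claim_pages(xch, domid, 0);   /* drop any leftover claim */

          return 0;
      }
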
* VTD: Remove the check for reserved device scope type (Yang Zhang, 2013-04-16; 2 files, -3/+8)
  Though we only have four valid types now, new types may be added in the future. It's better to remove the check and only deal with the types that we can recognize.
  Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
  Signed-off-by: Xiantao Zhang <xiantao.zhang@Intel.com>
  Acked-by: Keir Fraser <keir@xen.org>

  Add log message for this case.
  Signed-off-by: Jan Beulich <jbeulich@suse.com>

* iommu/crash: Interrupt remapping is also disabled on crash (Andrew Cooper, 2013-04-16; 1 file, -1/+1)
  This fixes a regression side-effect caused by:

      IOMMU: properly check whether interrupt remapping is enabled
      git: fae0372140befb88d890a30704a8ec058c902af8
      hg:  26742:e1ec14bad0cb

  On the crash path in nmi_shootdown_cpus(), we shut down the IOMMU, then disable the IOAPIC. On systems which support interrupt remapping, the variable iommu_intremap remains set, meaning that disable_IO_APIC() issues interrupt remapping invalidate requests.

  IOAPIC interrupt remapping used to be conditional on iommu_enabled, but is now conditional on iommu_intremap, following the above changeset.

  This behaviour can be fixed by also indicating that interrupt remapping is not enabled after shutting down the IOMMU.

  Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

* x86/AMD: Dump AMD VPMU info (Boris Ostrovsky, 2013-04-15; 1 file, -1/+41)
  Dump VPMU registers on AMD in the 'q' keyhandler.
  Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

* x86/AMD: Clean up context_update() in AMD VPMU code (Boris Ostrovsky, 2013-04-15; 1 file, -11/+17)
  Clean up context_update() in AMD VPMU code. Rename the restore routine to "load" to be consistent with the Intel code and with the arch_vpmu_ops names.
  Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
  Reviewed-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>

* x86/VPMU: Save/restore VPMU only when necessary (Boris Ostrovsky, 2013-04-15; 7 files, -27/+157)
  VPMU doesn't need to always be saved during context switch. If we are coming back to the same processor and no other VCPU has run here, we can simply continue running. This is especially useful on Intel processors where the Global Control MSR is stored in the VMCS, thus not requiring us to stop the counters during the save operation. On AMD we need to explicitly stop the counters, but we don't need to save them.
  Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
  Reviewed-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>
  Tested-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>

* x86/VPMU: Factor out VPMU common code (Boris Ostrovsky, 2013-04-15; 5 files, -72/+47)
  Factor out common code from SVM and VMX into VPMU.
  Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
  Reviewed-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>
  Tested-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>
  Acked-by: Jun Nakajima <jun.nakajima@intel.com>

* x86/VPMU: Add Haswell support (Boris Ostrovsky, 2013-04-15; 1 file, -0/+1)
  Initialize VPMU on Haswell CPUs.
  Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
  Reviewed-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>

* x86/AMD: Stop counters on VPMU save (Boris Ostrovsky, 2013-04-15; 1 file, -16/+6)
  Stop the counters during the VPMU save operation since they shouldn't be running when the VCPU that controls them is not. This also makes it unnecessary to check for overflow in context_restore().

  Set the LVTPC vector before loading the context during vpmu_restore(). Otherwise it is possible to trigger an interrupt without a proper vector.

  Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
  Reviewed-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>

* x86/AMD: Load context when attempting to read VPMU MSRs (Boris Ostrovsky, 2013-04-15; 1 file, -1/+20)
  Load the context (and mark it as LOADED) on any MSR access. This will allow us to always read the most up-to-date value of an MSR: the guest may write into an MSR without enabling it (thus not marking the context as RUNNING) and then be migrated. Without first loading the context, reading this MSR from HW will not match the previous write since the registers will not have been loaded into HW in amd_vpmu_load().

  In addition, we should be saving the context when it is LOADED, not RUNNING --- otherwise we need to save it any time it becomes non-RUNNING, which may be a frequent occurrence.

  Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
  Reviewed-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>

* x86/AMD: Do not intercept access to performance counters MSRs (Boris Ostrovsky, 2013-04-15; 2 files, -1/+44)
  Access to performance counters and reads of event selects don't need to always be intercepted.
  Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
  Reviewed-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>

* x86/AMD: Allow more fine-grained control of VMCB MSR Permission Map (Boris Ostrovsky, 2013-04-15; 2 files, -10/+13)
  Currently the VMCB's MSRPM can be updated to either intercept both reads and writes to an MSR or to intercept neither. In some cases we may want to be more selective and intercept one but not the other.
  Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
  Reviewed-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>

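  A simplified sketch of read/write-separable intercept control (the real MSR permission map splits MSRs into ranges with per-range offsets, which is ignored here; names are hypothetical):

      #include <stdint.h>

      enum msr_intercept { MSR_NO_INTERCEPT = 0, MSR_INTERCEPT_READ = 1,
                           MSR_INTERCEPT_WRITE = 2, MSR_INTERCEPT_RW = 3 };

      /* Two adjacent bits per MSR: the lower one covers reads, the higher one
       * covers writes; each can be set or cleared independently. */
      static void msrpm_set_sketch(uint64_t *msrpm, unsigned int msr_nr,
                                   enum msr_intercept flags)
      {
          unsigned int rd_bit = 2 * msr_nr, wr_bit = rd_bit + 1;

          if ( flags & MSR_INTERCEPT_READ )
              msrpm[rd_bit / 64] |=  ((uint64_t)1 << (rd_bit % 64));
          else
              msrpm[rd_bit / 64] &= ~((uint64_t)1 << (rd_bit % 64));

          if ( flags & MSR_INTERCEPT_WRITE )
              msrpm[wr_bit / 64] |=  ((uint64_t)1 << (wr_bit % 64));
          else
              msrpm[wr_bit / 64] &= ~((uint64_t)1 << (wr_bit % 64));
      }
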
* IOMMU: allow MSI message to IRTE propagation to fail (Jan Beulich, 2013-04-15; 10 files, -40/+72)
  With the need to allocate multiple contiguous IRTEs for multi-vector MSI, the chance of failure here increases. While on the AMD side there's no allocation of IRTEs at present at all (and hence no way for this allocation to fail, which is going to change with a later patch in this series), VT-d already ignores an eventual error here, which this patch fixes.
  Signed-off-by: Jan Beulich <jbeulich@suse.com>
  Acked-by: "Zhang, Xiantao" <xiantao.zhang@intel.com>

* MAINTAINERS: Add myself as XSM maintainer (Daniel De Graaf, 2013-04-15; 1 file, -0/+8)
  Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
  Acked-by: Keir Fraser <keir@xen.org>
  Acked-by: Jan Beulich <jbeulich@suse.com>

* stubdom/grub: send kernel measurements to vTPM (Daniel De Graaf, 2013-04-12; 5 files, -10/+89)
  This allows a domU with an arbitrary kernel and initrd to take advantage of the static root of trust provided by a vTPM.
  Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
  Acked-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
  Acked-by: Matthew Fioravante <matthew.fioravante@jhuapl.edu>

* stubdom/vtpm: constrain locality by XSM label (Daniel De Graaf, 2013-04-12; 1 file, -2/+74)
  This adds the ability for a vTPM to constrain what localities a given client domain can use based on its XSM label. For example:

      locality=user_1:vm_r:domU_t=0,1,2
      locality=user_1:vm_r:watcher_t=5

  An arbitrary prefix can be matched by using a '*'.
  Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>

* stubdom/vtpm: support multiple backends (Daniel De Graaf, 2013-04-12; 1 file, -12/+2)
  Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>

* stubdom/vtpm: make state save operation atomic (Daniel De Graaf, 2013-04-12; 1 file, -14/+60)
  This changes the save format of the vtpm stubdom to include two copies of the saved data: one active, and one inactive. When saving the state, data is written to the inactive slot before updating the key and hash saved with the TPM Manager, which determines the active slot when the vTPM starts up.
  Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>

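  A generic sketch of the two-slot pattern described above (illustrative names, not the vtpm stubdom code):

      #include <string.h>

      struct vtpm_blob_sketch { unsigned char data[4096]; unsigned int len; };

      static struct vtpm_blob_sketch slot[2];
      static int active_slot;   /* in the real design, selected via the key/hash
                                   stored with the TPM Manager at startup */

      static void save_state_sketch(const struct vtpm_blob_sketch *s)
      {
          int inactive = !active_slot;

          /* 1. Write the new state into the slot that is NOT active. */
          memcpy(&slot[inactive], s, sizeof(*s));

          /* 2. Only once the write is complete, flip the externally stored
           *    selector; a crash before this point leaves the old state intact. */
          active_slot = inactive;
      }
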
* stubdom/vtpm: Support locality field (Daniel De Graaf, 2013-04-12; 3 files, -1/+52)
  The vTPM protocol now contains a field allowing the locality of a command to be specified; pass this to the TPM when processing a packet. While the locality is not currently checked for validity, a binding between locality and some distinguishing feature of the client domain (such as the XSM label) will need to be defined in order to properly support a multi-client vTPM.
  Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
  Acked-by: Matthew Fioravante <matthew.fioravante@jhuapl.edu>

* stubdom/vtpm: correct the buffer size returned by TPM_CAP_PROP_INPUT_BUFFER (Daniel De Graaf, 2013-04-12; 2 files, -0/+14)
  The vtpm2 ABI supports packets of up to 4088 bytes by default; expose this property through the TPM's interface so clients do not attempt to send larger packets.
  Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>

* mini-os/tpmback: add tpmback_get_peercontext (Daniel De Graaf, 2013-04-12; 4 files, -0/+36)
  This allows the XSM label of the TPM's client domain to be retrieved.
  Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
  Acked-by: Samuel Thibault <samuel.thibault@ens-lyon.org>

* mini-os/tpmback: Replace UUID field with opaque pointer (Daniel De Graaf, 2013-04-12; 4 files, -7/+43)
  Instead of only recording the UUID field, which may not be of interest to all tpmback implementations, provide a user-settable opaque pointer associated with the tpmback instance.
  Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>

* mini-os/tpmback: set up callbacks before enumeration (Daniel De Graaf, 2013-04-12; 4 files, -41/+6)
  The open/close callbacks in tpmback cannot be properly initialized in order to catch the initial enumeration events, because init_tpmback clears the callbacks and then asynchronously starts the enumeration of existing tpmback devices. Fix this by passing the callbacks to init_tpmback so they can be installed before enumeration. This also removes the unused callbacks for suspend and resume.
  Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>

* mini-os/tpm{back, front}: Allow device reopens (Daniel De Graaf, 2013-04-12; 2 files, -3/+34)
  Allow the vtpm device to be disconnected and reconnected so that a bootloader (like pv-grub) can submit measurements and return the vtpm device to its initial state before booting the target kernel.
  Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>

* mini-os/tpm{back, front}: Change shared page ABI (Daniel De Graaf, 2013-04-12; 5 files, -152/+232)
  This changes the vTPM shared page ABI from a copy of the Xen network interface to a single-page interface that better reflects the expected behavior of a TPM: only a single request packet can be sent at any given time, and every packet sent generates a single response packet. This protocol change should also increase efficiency as it avoids mapping and unmapping grants when possible. The vtpm xenbus device now requires a feature-protocol-v2 node in xenstore to avoid conflicts with existing (xen-patched) kernels supporting the old interface.

  While the contents of the shared page have been defined to allow packets larger than a single page (actually 4088 bytes) by allowing the client to add extra grant references, the mapping of these extra references has not been implemented; a feature node in xenstore may be used in the future to indicate full support for the multi-page protocol. Most uses of the TPM should not require this feature.

  Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
  Cc: Jan Beulich <JBeulich@suse.com>

* libxl: beautify json with YAJL2 (M A Young, 2013-04-12; 1 file, -1/+5)
  xl list -l should produce readable output when built with yajl2 so it is compatible with the xendomains script.
  Signed-off-by: Michael Young <m.a.young@durham.ac.uk>
  Acked-by: Ian Campbell <ian.campbell@citrix.com>

* tools+stubdom: install under /usr/local by default. (Ian Campbell, 2013-04-12; 8 files, -12/+0)
  Now that the hotplug scripts have been fixed to remove hardcoded paths, let's try this again. From 26470:acaf29203cf9:

  This is the de facto (or FHS mandated?) standard location for software built from source, in order to avoid clashing with packaged software which is installed under /usr/bin etc. I think there is benefit in having Xen's install behave more like the majority of other OSS software out there.

  The major downside here is in the transition from 4.2 to 4.3, where people who have built from source will inevitably discover breakage because 4.3 no longer overwrites stuff in /usr like it used to, so they pick up old stale bits from /usr instead of new stuff from /usr/local. Packages will use ./configure --prefix=/usr or whatever helper macro their package manager gives them.

  I have confirmed that doing this results in the same list of installed files as before this patch was applied.

  The hypervisor remains in /boot/ and there is no intention to move it.

  Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
  Acked-by: Roger Pau Monné <roger.pau@citrix.com>

* libxl: regenerate libxlu cfg parser after 9e14c516b6cb (Ian Campbell, 2013-04-12; 2 files, -30/+30)
  Fixup whitespace alignment while I'm there.
  Signed-off-by: Ian Campbell <ian.campbell@citrix.com>

* Switch to poll() in cxenstored's IO loop (Wei Liu, 2013-04-11; 2 files, -60/+133)
  poll() can support more file descriptors than select(). We've done this for xenconsoled; now do this for cxenstored as well. The code is taken from xenconsoled and modified to adapt to cxenstored.

  Note that poll() semantics are a bit different from select(). In Linux, if a fd is set in an IN/OUT fd_set and an error occurs inside select(), this fd is still considered readable / writable, and it is set in the returned IN/OUT fd_set. So in the later handle_input / handle_output, the connection will eventually be talloc_free()'d. After switching to poll(), we should take care of any error right away, making the code clearer.

  Signed-off-by: Wei Liu <wei.liu2@citrix.com>
  Acked-by: Ian Campbell <ian.campbell@citrix.com>

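  A generic sketch of the error-handling point above (plain POSIX poll(), not the cxenstored code): with poll(), failed fds show up as POLLERR/POLLHUP/POLLNVAL revents and can be dealt with immediately, instead of being folded into the readable/writable sets as select() does.

      #include <poll.h>

      static void io_loop_iteration_sketch(struct pollfd *fds, nfds_t nfds)
      {
          nfds_t i;

          if ( poll(fds, nfds, -1) <= 0 )
              return;

          for ( i = 0; i < nfds; i++ )
          {
              if ( fds[i].revents & (POLLERR | POLLHUP | POLLNVAL) )
              {
                  /* tear down the broken connection right away */
                  continue;
              }
              if ( fds[i].revents & POLLIN )
                  ;   /* handle_input(connection owning fds[i].fd) */
              if ( fds[i].revents & POLLOUT )
                  ;   /* handle_output(connection owning fds[i].fd) */
          }
      }
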
* mini-os: implement poll(2) (Wei Liu, 2013-04-11; 2 files, -1/+117)
  It is just a wrapper around select(2). This implementation mimics Linux's do_poll.
  Signed-off-by: Wei Liu <wei.liu2@citrix.com>
  Acked-by: Samuel Thibault <samuel.thibault@ens-lyon.org>

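  A generic sketch of implementing poll() on top of select(), in the spirit described above (simplified: no sub-millisecond timeout handling, no FD_SETSIZE checks; not the mini-os code):

      #include <poll.h>
      #include <sys/select.h>

      int poll_via_select_sketch(struct pollfd *fds, nfds_t nfds, int timeout_ms)
      {
          fd_set rd, wr, ex;
          struct timeval tv = { timeout_ms / 1000, (timeout_ms % 1000) * 1000 };
          int maxfd = -1, n, ready = 0;
          nfds_t i;

          FD_ZERO(&rd); FD_ZERO(&wr); FD_ZERO(&ex);
          for ( i = 0; i < nfds; i++ )
          {
              if ( fds[i].events & POLLIN )  FD_SET(fds[i].fd, &rd);
              if ( fds[i].events & POLLOUT ) FD_SET(fds[i].fd, &wr);
              FD_SET(fds[i].fd, &ex);
              if ( fds[i].fd > maxfd ) maxfd = fds[i].fd;
          }

          n = select(maxfd + 1, &rd, &wr, &ex, timeout_ms < 0 ? NULL : &tv);
          if ( n <= 0 )
              return n;           /* timeout or error, as with the real poll() */

          for ( i = 0; i < nfds; i++ )
          {
              fds[i].revents = 0;
              if ( FD_ISSET(fds[i].fd, &rd) ) fds[i].revents |= POLLIN;
              if ( FD_ISSET(fds[i].fd, &wr) ) fds[i].revents |= POLLOUT;
              if ( FD_ISSET(fds[i].fd, &ex) ) fds[i].revents |= POLLERR;
              if ( fds[i].revents ) ready++;
          }
          return ready;
      }
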