aboutsummaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* mini-os/x86-64 entry: code clean-ups; no functional changesXu Zhang2013-04-221-30/+31
| | | | | | | Make use of tabs and spaces consistent in arch/x86/x86_64.S Signed-off-by: Xu Zhang <xzhang@cs.uic.edu> Acked-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
* VMX: adjust correct table when there's no posted interrupt supportJan Beulich2013-04-191-2/+2
| | | | | | | The caller of start_vmx() will overwrite hvm_funcs, so there's no point in adjusting that table in start_vmx(). Signed-off-by: Jan Beulich <jbeulich@suse.com>
* x86/S3: Fix cpu pool scheduling after suspend/resumeBen Guthro2013-04-192-11/+46
| | | | | | | | | | | | | | | | | | | | | | | | | This review is another S3 scheduler problem with the system_state variable introduced with the following changeset: http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=269f543ea750ed567d18f2e819e5d5ce58eda5c5 Specifically, the cpu_callback function that takes the CPU down during suspend, and back up during resume. We were seeing situations where, after S3, only CPU0 was in cpupool0. Guest performance suffered greatly, since all vcpus were only on a single pcpu. Guests under high CPU load showed the problem much more quickly than an idle guest. Removing this if condition forces the CPUs to go through the expected online/offline state, and be properly scheduled after S3. This also includes a necessary partial change proposed earlier by Tomasz Wroblewski here: http://lists.xen.org/archives/html/xen-devel/2013-01/msg02206.html It should also resolve the issues discussed in this thread: http://lists.xen.org/archives/html/xen-devel/2012-11/msg01801.html Signed-off-by: Ben Guthro <benjamin.guthro@citrix.com> Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
* x86: remove IS_PRIV bypass on IRQ checkDaniel De Graaf2013-04-191-20/+9
| | | | | | | | | | This prevents a process in dom0 from granting a domU access to an IRQ without adding the IRQ to the domU's list of permitted IRQs. This operation currently succeeds in dom0 but would fail if the device model were running in a stubdom, so making the failure consistent should ease debugging of the device-model stubdoms. Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
* x86: remove IS_PRIV access check bypassesDaniel De Graaf2013-04-182-6/+3
| | | | | | | | | | | Several domctl functions dealing with rangesets contain a short-circuit bypass if the domain is privileged. Since the construction of domain 0 permits access to all I/O ranges, the call to irq_access_permitted will normally return true even without the IS_PRIV check, and the presence of the IS_PRIV check prevents the creation of a privileged domain without access to specific devices or IO memory ranges. Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
* Merge branch 'staging' of ssh://xenbits.xen.org/home/xen/git/xen into stagingIan Campbell2013-04-1810-30/+100
|\
| * x86: fix various issues with handling guest IRQsJan Beulich2013-04-189-30/+93
| | | | | | | | | | | | | | | | | | | | | | | | | | - properly revoke IRQ access in map_domain_pirq() error path - don't permit replacing an in use IRQ - don't accept inputs in the GSI range for MAP_PIRQ_TYPE_MSI - track IRQ access permission in host IRQ terms, not guest IRQ ones (and with that, also disallow Dom0 access to IRQ0) This is CVE-2013-1919 / XSA-46. Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
| * x86: clear EFLAGS.NT in SYSENTER entry pathJan Beulich2013-04-181-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ... as it causes problems if we happen to exit back via IRET: In the course of trying to handle the fault, the hypervisor creates a stack frame by hand, and uses PUSHFQ to set the respective EFLAGS field, but expects to be able to IRET through that stack frame to the second portion of the fixup code (which causes a #GP due to the stored EFLAGS having NT set). And even if this worked (e.g if we cleared NT in that path), it would then (through the fail safe callback) cause a #GP in the guest with the SYSENTER handler's first instruction as the source, which in turn would allow guest user mode code to crash the guest kernel. Inject a #GP on the fake (NULL) address of the SYSENTER instruction instead, just like in the case where the guest kernel didn't register a corresponding entry point. This is CVE-2013-1917 / XSA-44. Reported-by: Andrew Cooper <andrew.cooper3@citirx.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
* | arm: vgic: fix race in vgic_vcpu_inject_irqIan Campbell2013-04-181-4/+8
| | | | | | | | | | | | | | | | | | | | | | | | The initial check for a still pending interrupt (!list_empty(&n->inflight)) needs to be covered by the vgic lock to avoid trying to insert the IRQ into the inflight list simultaneously on 2 pCPUS. Expand the area covered by the lock appropriately. Also consolidate the unlocks on the exit path into one location. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
* | arm64: correct secondary CPU bringupIan Campbell2013-04-181-1/+1
| | | | | | | | | | | | | | The current cpuid is held in x22 not x12. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
* | arm: gic: implement IPIs using SGI mechanismIan Campbell2013-04-184-18/+105
|/ | | | | Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
* VMX: Use posted interrupt to deliver virutal interruptYang Zhang2013-04-181-2/+6
| | | | | | | | | | Deliver virtual interrupt through posted way if posted interrupt is enabled. Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Reviewed-by: Jun Nakajima <jun.nakajima@intel.com> Acked-by: Keir Fraser <keir@xen.org> Acked-by: George Dunlap <george.dunlap@eu.citrix.com> (from a release perspective)
* x86/HVM: Call vlapic_set_irq() to delivery virtual interruptYang Zhang2013-04-184-11/+11
| | | | | | | | | Move kick_vcpu into vlapic_set_irq. And call it to deliver virtual interrupt instead set vIRR directly. Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Acked-by: Keir Fraser <keir@xen.org> Acked-by: George Dunlap <george.dunlap@eu.citrix.com> (from a release perspective)
* VMX: Add posted interrupt supportingYang Zhang2013-04-186-18/+114
| | | | | | | | | Add the supporting of using posted interrupt to deliver interrupt. Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Reviewed-by: Jun Nakajima <jun.nakajima@intel.com> Acked-by: Keir Fraser <keir@xen.org> Acked-by: George Dunlap <george.dunlap@eu.citrix.com> (from a release perspective)
* VMX: Turn on posted interrupt bit in vmcsYang Zhang2013-04-184-0/+20
| | | | | | | | | Turn on posted interrupt for vcpu if posted interrupt is avaliable. Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Reviewed-by: Jun Nakajima <jun.nakajima@intel.com> Acked-by: Keir Fraser <keir@xen.org> Acked-by: George Dunlap <george.dunlap@eu.citrix.com> (from a release perspective)
* VMX: Detect posted interrupt capabilityYang Zhang2013-04-182-1/+17
| | | | | | | | | Check whether the Hardware supports posted interrupt capability. Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Reviewed-by: Jun Nakajima <jun.nakajima@intel.com> Acked-by: Keir Fraser <keir@xen.org> Acked-by: George Dunlap <george.dunlap@eu.citrix.com> (from a release perspective)
* libxl: properly initialize device structuresDaniel De Graaf2013-04-171-3/+5
| | | | | | | | | This avoids returning unallocated memory in the libxl_device_vtpm structure in libxl_device_vtpm_list, and uses libxl_device_nic_init instead of memset when initializing libxl_device_nics. Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Acked-by: Ian Campbell <ian.campbell@citrix.com>
* libxl: postpone backend name resolutionDaniel De Graaf2013-04-1710-330/+359
| | | | | | | | | | | | | | | | | | | | This adds a backend_domname field in libxl devices that contain a backend_domid field, allowing either a domid or a domain name to be specified in the configuration structures. The domain name is resolved into a domain ID in the _setdefault function when adding the device. This change allows the backend of the block devices to be specified (which previously required passing the libxl_ctx down into the block device parser), and will simplify specification of backend domains in other users of libxl. The check on run_hotplug_scripts in parse_config_data is removed because it is a duplicate of the one in libxl__device_nic_setdefault, and is removed here because it no longer has the resolved domain ID to check. Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> [ ijc -- reran flex ]
* docs: Don't use non-portable cp -d flag unnecessarily (no links are involved)Patrick Welche2013-04-171-3/+3
| | | | | | | | cp -d is a GNU extension, and only useful for links, which don't seem to exist in the directories being copied, so just remove said flag. Signed-off-by: Patrick Welche <prlw1@cam.ac.uk> Acked-by: Ian Campbell <ian.campbell@citrix.com>
* configure: test(1) uses = not == for string comparisonPatrick Welche2013-04-174-15/+15
| | | | | | | | Avoids a bash-ism. Signed-off-by: Patrick Welche <prlw1@cam.ac.uk> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
* tools/ocaml: clean META filesVincent Bernardoff2013-04-171-1/+1
| | | | | | | | As META files are generated from META.in files, they should be cleaned by clean rules. Signed-off-by: Vincent Bernardoff <vincent.bernardoff@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
* Merge branch 'staging' of ssh://xenbits.xen.org/home/xen/git/xen into stagingIan Campbell2013-04-172-9/+10
|\
| * x86: allow VCPUOP_register_vcpu_info to work again on PVHVM guestsKonrad Rzeszutek Wilk2013-04-172-9/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For details on the hypercall please see commit c58ae69360ccf2495a19bf4ca107e21cf873c75b (VCPUOP_register_vcpu_info) and the c/s 23143 (git commit 6b063a4a6f44245a727aa04ef76408b2e00af9c7) (x86: move pv-only members of struct vcpu to struct pv_vcpu) that introduced the regression. The current code allows the PVHVM guest to make this hypercall. But for PVHVM guest it always returns -EINVAL (-22) for Xen 4.2 and above. Xen 4.1 and earlier worked. The reason is that the check in map_vcpu_info would fail at: if ( v->arch.vcpu_info_mfn != INVALID_MFN ) The reason is that the vcpu_info_mfn for PVHVM guests ends up by defualt with the value of zero (introduced by c/s 23143). The code in vcpu_initialise which initialized vcpu_info_mfn to a valid value (INVALID_MFN), would never be called for PVHVM: if ( is_hvm_domain(d) ) { rc = hvm_vcpu_initialise(v); goto done; } v->arch.pv_vcpu.vcpu_info_mfn = INVALID_MFN; while previously it would be: v->arch.vcpu_info_mfn = INVALID_MFN; [right at the start of the function in Xen 4.1] This fixes the problem with Linux advertising this error: register_vcpu_info failed: err=-22 Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
* | docs: rearrange and update NUMA placement documentationDario Faggioli2013-04-171-11/+83
| | | | | | | | | | | | | | | | | | To include the new concept of NUMA aware scheduling and describe its impact. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com> Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
* | xl: add node-affinity to the output of `xl list`Dario Faggioli2013-04-172-63/+105
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Node-affinity is now something that is under (some) control of the user, so show it upon request as part of the output of `xl list' by the `-n' option. Re the patch, the print_bitmap() related hunk is _mostly_ code motion, although there is a very minor change in the code, basically to allow using the function for printing both cpu and node bitmaps (as, in case all bits are sets, it used to print "any cpu", which doesn't fit the nodemap case). Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
* | libxl: automatic placement deals with node-affinityDario Faggioli2013-04-172-17/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Which basically means the following two things: 1) during domain creation, it is the node-affinity of the domain --rather than the vcpu-affinities of its VCPUs-- that is affected by automatic placement; 2) during automatic placement, when counting how many VCPUs are already "bound" to a placement candidate (as part of the process of choosing the best candidate), both vcpu-affinity and node-affinity are considered. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
* | libxl: optimize the calculation of how many VCPUs can run on a candidateDario Faggioli2013-04-171-22/+48
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For choosing the best NUMA placement candidate, we need to figure out how many VCPUs are runnable on each of them. That requires going through all the VCPUs of all the domains and check their affinities. With this change, instead of doing the above for each candidate, we do it once for all, populating an array while counting. This way, when we later are evaluating candidates, all we need is summing up the right elements of the array itself. This reduces the complexity of the overall algorithm, as it moves a potentially expensive operation (for_each_vcpu_of_each_domain {}) outside from the core placement loop, so that it is performed only once instead of (potentially) tens or hundreds of times. More specifically, we go from a worst case computation time complaxity of: O(2^n_nodes) * O(n_domains*n_domain_vcpus) To, with this change: O(n_domains*n_domains_vcpus) + O(2^n_nodes) = O(2^n_nodes) (with n_nodes<=16, otherwise the algorithm suggests partitioning with cpupools and does not even start.) Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com> Acked-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
* | libxl: allow for explicitly specifying node-affinityDario Faggioli2013-04-175-0/+39
| | | | | | | | | | | | | | | | | | | | By introducing a nodemap in libxl_domain_build_info and providing the get/set methods to deal with it. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com> Acked-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
* | libxc: allow for explicitly specifying node-affinityDario Faggioli2013-04-172-0/+103
| | | | | | | | | | | | | | | | | | By providing the proper get/set interface and wiring them to the new domctl-s from the previous commit. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com> Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
* | xen: allow for explicitly specifying node-affinityDario Faggioli2013-04-1714-21/+182
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make it possible to pass the node-affinity of a domain to the hypervisor from the upper layers, instead of always being computed automatically. Note that this also required generalizing the Flask hooks for setting and getting the affinity, so that they now deal with both vcpu and node affinity. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Acked-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com> Acked-by: Keir Fraser <keir@xen.org>
* | xen: sched_credit: let the scheduler know about node-affinityDario Faggioli2013-04-172-153/+319
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As vcpu-affinity tells where VCPUs must run, node-affinity tells where they prefer to. While respecting vcpu-affinity remains mandatory, node-affinity is not that strict, it only expresses a preference, although honouring it will bring significant performance benefits (especially as compared to not having any affinity at all). This change modifies the VCPUs load balancing algorithm (for the credit scheduler only), introducing a two steps logic. During the first step, we use both the vcpu-affinity and the node-affinity masks (by looking at their intersection). The aim is giving precedence to the PCPUs where the domain prefers to run, as expressed by its node-affinity (with the intersection with the vcpu-afinity being necessary in order to avoid running a VCPU where it never should). If that fails in finding a valid PCPU, the node-affinity is just ignored and, in the second step, we fall back to using cpu-affinity only. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com> Acked-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Keir Fraser <keir@xen.org>
* | xen: sched_credit: when picking, make sure we get an idle one, if anyDario Faggioli2013-04-171-0/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The pcpu picking algorithm treats two threads of a SMT core the same. More specifically, if one is idle and the other one is busy, they both will be assigned a weight of 1. Therefore, when picking begins, if the first target pcpu is the busy thread (and if there are no other idle pcpu than its sibling), that will never change. This change fixes this by ensuring that, before entering the core of the picking algorithm, the target pcpu is an idle one (if there is an idle pcpu at all, of course). Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com> Acked-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Keir Fraser <keir@xen.org>
* | xen, libxc: introduce xc_nodemap_tDario Faggioli2013-04-173-1/+38
| | | | | | | | | | | | | | | | | | And its handling functions, following suit from xc_cpumap_t. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com> Acked-by: Keir Fraser <keir@xen.org>
* | xen, libxc: rename xenctl_cpumap to xenctl_bitmapDario Faggioli2013-04-1715-49/+71
|/ | | | | | | | | | | | | | | | | | More specifically: 1. replaces xenctl_cpumap with xenctl_bitmap 2. provides bitmap_to_xenctl_bitmap and the reverse; 3. re-implement cpumask_to_xenctl_bitmap with bitmap_to_xenctl_bitmap and the reverse; Other than #3, no functional changes. Interface only slightly afected. This is in preparation of introducing NUMA node-affinity maps. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com> Acked-by: Keir Fraser <keir@xen.org>
* xl: Fix 'free_memory' to include outstanding_claims value.Konrad Rzeszutek Wilk2013-04-164-32/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Updating to make it clear that free_memory reported by 'xl info' is influenced by the outstanding claim value. That is the free memory that will be available to the host once all outstanding claims have been completed. This modifies the behavior that the patch titled "xl: 'xl info' print outstanding claims if enabled (claim_mode=1 in xl.conf)" had - which reported the outstanding claims and nothing else. The free_pages as reported by the hypervisor is the currently available count of pages on the heap. The outstanding pages is the total amount of pages reserved for guests (so not taken from the heap yet). As guests are being populated the memory from the heap shrinks and the outstanding count of pages decreases. The total memory used for guests increases. As the available count of pages on the heap and outstanding claims are intertwined, report the amount of free memory available to be a combination of that. That is free heap memory minus the outstanding pages. We also make some odd choices in reporting. By default we will only display 'outstanding_claims' if the claim_mode is enabled in the global configuration file. However, if there are outstanding claims, we will ignore the claim_mode and report these values. Suggested-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
* xl: 'xl claims' print outstanding per domain claimsKonrad Rzeszutek Wilk2013-04-165-6/+68
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is similar to "xl: 'xl info' print outstanding claims if enabled (claim_mode=1 in xl.conf)" which exposes the global claim value. This patch provides the value of the currently outstanding pages claimed for each domains. This is per domain value which is added to the global claim value which influences the hypervisors' MM system. When a claim call is done, a reservation for a specific amount of pages is set (and this patch lists said number) and also a global value is incremented. This global value is then reduced as the domain's memory is populated and eventually reaches zero. The toolstack (libxc) also sets the domain's claim to zero when the population of memory has completed as an extra step. Any call to destroy the domain will also set the domain's claim to zero. If the reservation cannot be meet the guest creation fails immediately instead of taking seconds or minutes (depending on the size of the guest) while the toolstack populates memory. See patch: "xl: Implement XENMEM_claim_pages support via 'claim_mode' global config" for details on how it is implemented. The value fluctuates quite often so the value is stale once it is provided to the user-space. However it is useful for diagnostic purposes. It is printed irregardless of global "claim_mode" option in xl.conf(5). That is b/c the user might have enabled, launched a guest, and then disabled the option - and we should still report the correct outstanding claim value. The 'man xl' shows the details of this argument. The output is close to what 'xl list' looks like: Name ID Mem VCPUs State Time(s) Claimed Domain-0 0 2047 4 r----- 19.7 0 OL5 2 2048 1 --p--- 0.0 847 OL6 3 1024 4 r----- 5.9 0 Windows_XP 4 2047 1 --p--- 0.0 1989 [In which it can be seen that the OL5 guest still has 847MB of claimed memory (out of the total 2048MB where 1191MB has been allocated to the guest).] Please note that the 'Mem' column has the cumulative value of outstanding claims and the total amount of memory that has been allocated to the guest. [v1: claims, not claim-list] [v2: Add outstanding and current memkb in the output list] [v3: Clairy docs and relax some checks] [v4: Removed comments about guest config memory being the same as 'Mem'] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
* xl: export 'outstanding_pages' value from xcinfoKonrad Rzeszutek Wilk2013-04-162-0/+2
| | | | | | | | | | | | | | | | | | This patch provides the value of the currently outstanding pages claimed for a specific domain. This is a value that influences the global outstanding claims value (See patch: "xl: 'xl info' print outstanding claims if enabled") returned via xc_domain_get_outstanding_pages hypercall. This domain value decrements as the memory is populated for the guest and eventually reaches zero. With this patch it is possible to utilize this field. Acked-by: Ian Campbell <ian.campbell@citrix.com> [v2: s/unclaimed/outstanding/ per Tim's suggestion] [v3: Don't use SXP printout file per Ian's suggestion] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
* xc: export outstanding_pages value in xc_dominfo structure.Dan Magenheimer2013-04-162-0/+2
| | | | | | | | | | | | | | | | | | This patch provides the value of the currently outstanding pages claimed for a specific domain. This is a value that influences the global outstanding claims value (See patch: "xl: 'xl info' print outstanding claims if enabled") returned via xc_domain_get_outstanding_pages hypercall. This domain value decrements as the memory is populated for the guest and eventually reaches zero. This patch is neccessary for "xl: export 'outstanding_pages' value from xcinfo" patch. Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com> [v2: s/unclaimed_pages/outstanding_pages/ per Tim's suggestion] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
* xl: 'xl info' print outstanding claims if enabled (claim_mode=1 in xl.conf)Konrad Rzeszutek Wilk2013-04-164-0/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch provides the value of the currently outstanding pages claimed for all domains. This is a total global value that influences the hypervisors' MM system. When a claim call is done, a reservation for a specific amount of pages is set and also a global value is incremented. This global value is then reduced as the domain's memory is populated and eventually reaches zero. The toolstack (libxc) also sets the domain's claim to zero when the population of memory has completed as an extra step. Any call to destroy the domain will also set the domain's claim to zero. If the reservation cannot be meet the guest creation fails immediately instead of taking seconds or minutes (depending on the size of the guest) while the toolstack populates memory. See patch: "xl: Implement XENMEM_claim_pages support via 'claim_mode' global config" for details on how it is implemented. The value fluctuates quite often so the value is stale once it is provided to the user-space. However it is useful for diagnostic purposes. It is only printed when the global "claim_mode" option in xl.conf(5) is set to enabled (1). The 'man xl' shows the details of this item. [v1: s/unclaimed/outstanding/] [v2: Made libxl_get_claiminfo return just MemKB suggested by Ian Campbell] [v3: Made libxl_get_claininfo return MemMB to conform to the other values printed] [v4: Improvements suggested by Ian Jackson, also added docs to xl.pod.1] [v5: Clarify how claims are cancelled, split >72 characters - Ian Jackson] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
* xl: Implement XENMEM_claim_pages support via 'claim_mode' global configKonrad Rzeszutek Wilk2013-04-169-3/+60
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The XENMEM_claim_pages hypercall operates per domain and it should be used system wide. As such this patch introduces a global configuration option 'claim_mode' that by default is disabled. If this option is enabled then when a guest is created there will be an guarantee that there is memory available for the guest. This is an particularly acute problem on hosts with memory over-provisioned guests that use tmem and have self-balloon enabled (which is the default option for them). The self-balloon mechanism can deflate/inflate the balloon quickly and the amount of free memory (which 'xl info' can show) is stale the moment it is printed. When claim is enabled a reservation for the amount of memory ('memory' in guest config) is set, which is then reduced as the domain's memory is populated and eventually reaches zero. If the reservation cannot be meet the guest creation fails immediately instead of taking seconds/minutes (depending on the size of the guest) while the guest is populated. Note that to enable tmem type guests, one needs to provide 'tmem' on the Xen hypervisor argument and as well on the Linux kernel command line. There are two boolean options: (0) No claim is made. Memory population during guest creation will be attempted as normal and may fail due to memory exhaustion. (1) Normal memory and freeable pool of ephemeral pages (tmem) is used when calculating whether there is enough memory free to launch a guest. This guarantees immediate feedback whether the guest can be launched due to memory exhaustion (which can take a long time to find out if launching massively huge guests) and in parallel. [v1: Removed own claim_mode type, using just bool, improved docs, all per Ian's suggestion] [v2: Updated the comments] [v3: Rebase on top 733b9c524dbc2bec318bfc3588ed1652455d30ec (xl: add vif.default.script)] [v4: Fixed up comments] [v5: s/global_claim_mode/claim_mode/] [v6: Ian Jackson's feedback: use libxl_defbool, better comments, etc] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
* xc: use XENMEM_claim_pages hypercall during guest creation.Dan Magenheimer2013-04-166-4/+70
| | | | | | | | | | | | | | | | | | | | | | We add an extra parameter to the structures passed to the PV routine (arch_setup_meminit) and HVM routine (setup_guest) that determines whether the claim hypercall is to be done. The contents of the 'claim_enabled' is defined as an 'int' in case the hypercall expands in the future with extra flags (for example for per-NUMA allocation). For right now the proper values are: 0 to disable it or 1 to enable it. If the hypervisor does not support this function, the xc_domain_claim_pages and xc_domain_get_outstanding_pages will silently return 0 (and set errno to zero). Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com> [v2: Updated per Ian's recommendations] [v3: Added support for out-of-sync hypervisor] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
* VTD: Remove the check for reserved device scope typeYang Zhang2013-04-162-3/+8
| | | | | | | | | | | | | | Though we only have four valid types now, the new type may be added in future. It's better to remove the check and only deal with the type that we can recognize. Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Signed-off-by: Xiantao Zhang <xiantao.zhang@Intel.com> Acked-by: Keir Fraser <keir@xen.org> Add log message for this case. Signed-off-by: Jan Beulich <jbeulich@suse.com>
* iommu/crash: Interrupt remapping is also disabled on crashAndrew Cooper2013-04-161-1/+1
| | | | | | | | | | | | | | | | | | | | | | This fixes a regression side-effect caused by: IOMMU: properly check whether interrupt remapping is enabled git: fae0372140befb88d890a30704a8ec058c902af8 hg: 26742:e1ec14bad0cb On the crash path in nmi_shootdown_cpus(), we shut down the IOMMU, then disable the IOAPIC. On systems which support interrupt remapping, the variable iommu_intremap remains set, meaning that disable_IO_APIC() issues interrupt remapping invalidate requests. IOAPIC interrupt remapping used to be conditional on iommu_enabled, but is now conditional on iommu_intremap, following the above changeset. This behaviour can be fixed by also indicating that interrupt remapping is not enabled after shutting down the IOMMU. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
* x86/AMD: Dump AMD VPMU infoBoris Ostrovsky2013-04-151-1/+41
| | | | | | Dump VPMU registers on AMD in the 'q' keyhandler. Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
* x86/AMD: Clean up context_update() in AMD VPMU codeBoris Ostrovsky2013-04-151-11/+17
| | | | | | | | | | Clean up context_update() in AMD VPMU code. Rename restore routine to "load" to be consistent with Intel code and with arch_vpmu_ops names Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>
* x86/VPMU: Save/restore VPMU only when necessaryBoris Ostrovsky2013-04-157-27/+157
| | | | | | | | | | | | | VPMU doesn't need to always be saved during context switch. If we are comming back to the same processor and no other VPCU has run here we can simply continue running. This is especailly useful on Intel processors where Global Control MSR is stored in VMCS, thus not requiring us to stop the counters during save operation. On AMD we need to explicitly stop the counters but we don't need to save them. Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com> Tested-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>
* x86/VPMU: Factor out VPMU common codeBoris Ostrovsky2013-04-155-72/+47
| | | | | | | | | Factor out common code from SVM amd VMX into VPMU. Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com> Tested-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com> Acked-by: Jun Nakajima <jun.nakajima@intel.com>
* x86/VPMU: Add Haswell supportBoris Ostrovsky2013-04-151-0/+1
| | | | | | | Initialize VPMU on Haswell CPUs. Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>
* x86/AMD: Stop counters on VPMU saveBoris Ostrovsky2013-04-151-16/+6
| | | | | | | | | | | | Stop the counters during VPMU save operation since they shouldn't be running when VPCU that controls them is not. This also makes it unnecessary to check for overflow in context_restore() Set LVTPC vector before loading the context during vpmu_restore(). Otherwise it is possible to trigger an interrupt without proper vector. Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>
* x86/AMD: Load context when attempting to read VPMU MSRsBoris Ostrovsky2013-04-151-1/+20
| | | | | | | | | | | | | | | | Load context (and mark it as LOADED) on any MSR access. This will allow us to always read the most up-to-date value of an MSR: guest may write into an MSR without enabling it (thus not marking the context as RUNNING) and then be migrated. Without first loading the context reading this MSR from HW will not match the pervious write since registers will not be loaded into HW in amd_vpmu_load(). In addition, we should be saving the context when it is LOADED, not RUNNING --- otherwise we need to save it any time it becomes non-RUNNING, which may be a frequent occurrence. Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>