| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
- MMUEXT_SET_LDT should behave as similarly to the LLDT instruction as
possible: fail only if the base address is non-canonical
- instead LDT descriptor accesses should fault if the descriptor
address ends up being non-canonical (by ensuring this we at once
avoid reading an entry from the mach-to-phys table and consider it a
page table entry)
- fault propagation on using LDT selectors must distinguish #PF and #GP
(the latter must be raised for a non-canonical descriptor address,
which also applies to several other uses of propagate_page_fault(),
and hence the problem is being fixed there)
- map_ldt_shadow_page() should properly wrap addresses for 32-bit VMs
At once remove the odd invocation of map_ldt_shadow_page() from the
MMUEXT_SET_LDT handler: There's nothing really telling us that the
first LDT page is going to be preferred over others.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Keir Fraser <keir@xen.org>
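The canonical-address rule and the 32-bit wrapping mentioned in the message above can be sketched as follows. This is only an illustration, assuming 48-bit virtual addresses; is_canonical() and ldt_desc_addr() are hypothetical helper names and are not taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumption: 48-bit virtual addresses, so an address is canonical when
 * bits 63..47 are all zero or all one.
 */
bool is_canonical(uint64_t addr)
{
    uint64_t upper = addr >> 47;            /* the 17 topmost bits */

    return upper == 0 || upper == 0x1ffff;
}

/*
 * Hypothetical helper: linear address of the descriptor that selector
 * 'sel' names in an LDT starting at 'base'. For a 32-bit guest the sum
 * wraps at 4GiB instead of carrying into bit 32.
 */
uint64_t ldt_desc_addr(uint64_t base, uint16_t sel, bool is_32bit)
{
    uint64_t addr = base + (sel & ~7u);     /* strip the TI and RPL bits */

    return is_32bit ? (uint32_t)addr : addr;
}
```

With such helpers, a caller could accept any canonical base at MMUEXT_SET_LDT time and raise #GP rather than #PF only when a computed descriptor address turns out to be non-canonical.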
|
|
|
|
|
|
|
|
|
|
|
|
| |
Now that the direct map area can extend all the way up to almost the
end of address space, this is wasteful.
Also fold two almost redundant messages in SRAT parsing into one.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Malcolm Crossley <malcolm.crossley@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Keir Fraser <keir@xen.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In the original patch 7 of the series addressing XSA-45 I mistakenly
took the addition of the call to get_page_light() in alloc_page_type()
to cover two decrements that would happen: One for the PGT_partial bit
that is getting set along with the call, and the other for the page
reference the caller holds (and would be dropping on its error path).
But of course the additional page reference is tied to the PGT_partial
bit, and hence any caller of a function that may leave
->arch.old_guest_table non-NULL for error cleanup purposes has to make
sure a respective page reference gets retained.
Similar issues were then also spotted elsewhere: In effect all callers
of get_page_type_preemptible() need to deal with errors in similar
ways. To make sure error handling can work this way without leaking
page references, a respective assertion gets added to that function.
This is CVE-2013-1432 / XSA-58.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Tim Deegan <tim@xen.org>
|
|
|
|
|
|
|
|
|
|
|
|
| |
This drops the "preemptible" parameters from various functions where
now they can't (or shouldn't, validated by assertions) be run in non-
preemptible mode anymore, to prove that manipulations of at least L3
and L4 page tables and page table entries are now always preemptible,
i.e. the earlier patches actually fulfill their purpose of fixing the
resulting security issue.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There are BIOSes that want to map the IO-APIC MMIO region from some
ACPI method(s), and there is at least one BIOS flavor that wants to
use this mapping to clear an RTE's mask bit. While we can't allow the
latter, we can permit reads and simply drop write attempts, leveraging
the already existing infrastructure introduced for dealing with AMD
IOMMUs' representation as PCI devices.
This fixes an interrupt setup problem on a system where _CRS evaluation
involved the above described BIOS/ACPI behavior, and is expected to
also deal with a boot time crash of pv-ops Linux upon encountering the
same kind of system.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
|
|
|
|
|
|
|
|
|
|
| |
... as dropping the old page tables may take significant amounts of
time.
This is part of CVE-2013-1918 / XSA-45.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
... as it may take significant amounts of time.
The function, being moved to mm.c as the better home for it anyway, and
to avoid having to make a new helper function there non-static, is
given a "preemptible" parameter temporarily (until, in a subsequent
patch, its other caller is also being made capable of dealing with
preemption).
This is part of CVE-2013-1918 / XSA-45.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
SNB graphics devices have a bug that prevents them from accessing certain
memory ranges, namely anything below 1MB and the pages listed in the
table.
Xen does not add memory below 1MB to the heap, i.e. pages below 1MB never
get allocated, so it's unnecessary to reserve memory below the 1MB mark
that has not already been reserved.
So reserve the pages listed in the table at Xen boot if an SNB gfx
device is detected on the CPU, to avoid GPU hangs.
Signed-off-by: Xudong Hao <xudong.hao@intel.com>
Acked-by: Keir Fraser <keir@xen.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since pointer overflow is undefined behavior in C, some compilers such
as clang optimize away the check !((ptr) + 1) in the macro IS_NIL().
This patch fixes the issue by casting the pointer type to uintptr_t,
the operations of which are well-defined.
Signed-off-by: Xi Wang <xi@mit.edu>
With that, we also need to avoid the overflow in NIL().
Note that either part of the change makes the respective macros
unsuitable for use with "void".
Signed-off-by: Jan Beulich <jbeulich@suse.com>
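A minimal sketch of overflow-free macros in the spirit of the message above, assuming the sentinel is "one element below address zero". The exact definitions in the patch may differ. Both macros use sizeof on the pointed-to type, which is why they cannot be used with "void".

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Sentinel pointer built from an integer, so no pointer arithmetic (and
 * hence no undefined pointer overflow) is involved.
 */
#define NIL(type)   ((type *)-sizeof(type))

/*
 * True if ptr is the sentinel. The addition is done on uintptr_t, where
 * wrap-around to zero is well-defined.
 */
#define IS_NIL(ptr) (!((uintptr_t)(ptr) + sizeof(*(ptr))))
```

Because the check is performed on unsigned integers, a compiler has no licence to assume that the sum cannot be zero and drop the test.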
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
... using the new per-domain mapping management functions, adding
destroy_perdomain_mapping() to the previously introduced pair.
Rather than using an order-1 Xen heap allocation, use (currently 2)
individual domain heap pages to populate space in the per-domain
mapping area.
Also fix a benign off-by-one mistake in is_compat_arg_xlat_range().
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
... as well as free_perdomain_mappings(), and use them to carry out the
existing per-domain mapping setup/teardown. This at once makes the
setup of the first sub-range PV domain specific (with idle domains also
excluded), as the GDT/LDT mapping area is needed only for those.
Also fix an improperly scaled BUILD_BUG_ON() expression in
mapcache_domain_init().
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
|
|
|
|
|
|
|
| |
So far this has been repeated in 3 places, so any change had to be
remembered and applied to all of them.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
... noticed while putting together the 16Tb support patches for x86.
Briefly, this (in order of the changes below)
- fixes an inefficiency in x86's context switch code (translations to/
from struct page are more involved than to/from MFNs)
- drops unnecessary MFN-to-page conversions
- drops a redundant call to destroy_xen_mappings() (an identical call
is being made a few lines up)
- simplifies a VA-to-MFN translation
- drops dead code (several occurrences)
- adds a missing __init annotation
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
- fix super page frame table setup for memory hotplug case (should
create full table, or else the hotplug code would need to do the
necessary table population)
- simplify super page frame table setup (can re-use frame table setup
code)
- slightly streamline frame table setup code
- fix (tighten) a BUG_ON() and an ASSERT() condition
- fix spage <-> pdx conversion macros (they had no users so far, and
hence no-one noticed how broken they were)
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
|
|
|
|
|
|
|
|
| |
It's not very easy to find them if you don't know to look for the
TYPE_SAFE() macro.
Signed-off-by: Tim Deegan <tim@xen.org>
Committed-by: Tim Deegan <tim@xen.org>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Note: these changes don't make any difference on x86.
Replace XEN_GUEST_HANDLE with XEN_GUEST_HANDLE_PARAM when it is used as
a hypercall argument.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Keir Fraser <keir@xen.org>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
|
|
|
|
| |
Signed-off-by: Keir Fraser <keir@xen.org>
|
|
|
|
| |
Signed-off-by: Keir Fraser <keir@xen.org>
|
|
|
|
| |
Signed-off-by: Keir Fraser <keir@xen.org>
|
|
|
|
|
|
|
|
|
| |
mm.h's __page_to_virt() has a rather opaque expression. Comment it.
Reported-By: Ian Campbell <ian.campbell@citrix.com>
Suggested-by: Ian Jackson <ian.jackson@eu.citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Recent Dom0 kernels want to disable PCI MSI on all devices, yet doing
so on AMD IOMMUs (which get represented by a PCI device) disables part
of the functionality set up by the hypervisor.
Add a mechanism to mark certain PCI devices as having write protected
config spaces (both through port based [method 1] accesses and, for
x86-64, mmconfig), and use that for AMD's IOMMUs.
Note that due to ptwr_do_page_fault() being run first, there'll be a
MEM_LOG() issued for each such mmconfig based write attempt. If that's
undesirable, the order of the calls in fixup_page_fault() would need
to be swapped.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Wei Wang <wei.wang2@amd.com>
Acked-by: Keir Fraser <keir@xen.org>
|
|
|
|
|
|
|
| |
Because the p2m lock was already recursive, we need to add a new
mm-lock class of recursive rwlocks.
Signed-off-by: Tim Deegan <tim@xen.org>
|
|
|
|
|
|
|
|
| |
Replace its only user with paging_mode_refcounts().
Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
Acked-by: Tim Deegan <tim@xen.org>
Committed-by: Tim Deegan <tim@xen.org>
|
|
|
|
|
|
|
|
|
| |
Otherwise we wind up with zombie domains, still holding onto refs to the mem
event ring pages.
Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
Acked-by: Tim Deegan <tim@xen.org>
Committed-by: Tim Deegan <tim@xen.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We achieve this by locking/unlocking the global p2m_lock in get/put_gfn.
The lock is always taken recursively, as there are many paths that
call get_gfn, and later, make another attempt at grabbing the p2m_lock.
The lock is not taken for shadow lookups. We believe there are no problems
remaining for synchronized p2m+shadow paging, but we are not enabling this
combination due to lack of testing. Unlocked shadow p2m accesses are tolerable as
long as shadows do not gain support for paging or sharing.
HAP (EPT) lookups and all modifications do take the lock.
Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
Acked-by: Tim Deegan <tim@xen.org>
Committed-by: Tim Deegan <tim@xen.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The name 'shared_info' for the list of shared pages backed by a share frame
collided with the identifier also used for a domain's shared info page. To
avoid grep/cscope/etc aliasing, rename the shared memory token to 'sharing'.
This patch only addresses style, and performs no functional changes. To ease
reviewing, the patch was left as a stand-alone last-slot addition to the queue
to avoid propagating changes throughout the whole series.
Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
Acked-by: Tim Deegan <tim@xen.org>
Committed-by: Tim Deegan <tim@xen.org>
|
|
|
|
|
|
|
|
|
|
|
| |
Use the ordering constructs in mm-locks.h to enforce an order
for the p2m and page locks in the sharing code. Applies to either
the global sharing lock (in audit mode) or the per page locks.
Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
Signed-off-by: Adin Scanneell <adin@scannell.ca>
Acked-by: Tim Deegan <tim@xen.org>
Committed-by: Tim Deegan <tim@xen.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
With the removal of the hash table, all that is needed now is locking
of individual shared pages, as (gfn,domain) pairs are added to or
removed from the list of mappings.
We recycle PGT_locked and use it to lock individual pages. We ensure deadlock
is averted by locking pages in increasing order.
The global lock remains for the benefit of the auditing code, and is
thus enabled only as a compile-time option.
Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
Signed-off-by: Adin Scannell <adin@scannell.ca>
Acked-by: Tim Deegan <tim@xen.org>
Committed-by: Tim Deegan <tim@xen.org>
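The "lock pages in increasing order" rule from the message above can be illustrated with a single-threaded sketch. Everything here is hypothetical: struct page, page_lock() and lock_page_pair() are made-up names, and a plain flag stands in for the PGT_locked bit, which a real implementation would set atomically.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a page carrying a lock bit. */
struct page {
    unsigned long mfn;
    bool locked;
};

/* Records the order in which pages got locked, for illustration only. */
unsigned long lock_order[2];
size_t n_locked;

void page_lock(struct page *pg)
{
    /* A real implementation would spin on an atomic bit here. */
    pg->locked = true;
    if (n_locked < 2)
        lock_order[n_locked++] = pg->mfn;
}

/* Lock two pages in increasing MFN order, whatever the argument order. */
void lock_page_pair(struct page *a, struct page *b)
{
    struct page *first = a->mfn < b->mfn ? a : b;
    struct page *second = first == a ? b : a;

    page_lock(first);
    if (second != first)
        page_lock(second);
}
```

Since every path acquires the lower-numbered page first, two CPUs locking the same pair can never each hold one lock while waiting for the other.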
|
|
|
|
|
|
|
|
|
|
|
|
| |
Eliminate the sharing hash table mechanism by storing a list head directly in the
page info for the case when the page is shared. This does not add any extra
space to the page_info and serves to remove significant complexity from
sharing.
Signed-off-by: Adin Scannell <adin@scannell.ca>
Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
Acked-by: Tim Deegan <tim@xen.org>
Committed-by: Tim Deegan <tim@xen.org>
|
|
|
|
|
|
|
| |
This is a prerequisite for calling set_gpfn_from_mfn() unconditionally
from free_heap_pages().
Signed-off-by: Keir Fraser <keir@xen.org>
|
|
|
|
|
|
|
|
|
| |
Define the lock and unlock functions once, and list all the locks in one
place so (a) it's obvious what the locking discipline is and (b) none of
the locks are visible to non-mm code. Automatically enforce that these
locks never get taken in the wrong order.
Signed-off-by: Tim Deegan <Tim.Deegan@citrix.com>
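One way to enforce a lock order automatically is to give each lock a numeric level and refuse to take a lock whose level is below the one currently held. The sketch below assumes that scheme. The names, the levels and the single global (which would be per-CPU in a hypervisor) are hypothetical, and real code would more likely panic than return false.

```c
#include <stdbool.h>

/* Hypothetical levels: locks must be taken in non-decreasing level order. */
enum { LOCK_A = 1, LOCK_B = 2, LOCK_C = 3 };

/* Level of the most recently taken lock (per-CPU in a real system). */
int cur_level;

/*
 * Take a lock of the given level. Fails if a higher-level lock is already
 * held. An equal level is allowed here, which permits recursive locks.
 * The previous level is stored in *saved so that unlock can restore it.
 */
bool ordered_lock(int level, int *saved)
{
    if (level < cur_level)
        return false;           /* would invert the locking order */
    *saved = cur_level;
    cur_level = level;
    return true;
}

void ordered_unlock(int saved)
{
    cur_level = saved;
}
```

Keeping the level bookkeeping in one pair of functions means that a wrongly ordered acquisition is caught at the point where it happens rather than as a rare deadlock.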
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Again, (out-of-memory) errors must not cause hypervisor crashes, and
hence ought to be propagated.
This also adjusts the cache attribute changing loop in
get_page_from_l1e() to not go through an unnecessary iteration. While
this could be considered mere cleanup, it is actually a requirement
for the subsequent now necessary error recovery path.
Also make a few functions static, easing the check for potential
callers needing adjustment.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
|
|
|
|
|
|
|
|
|
|
|
| |
... decreasing cache footprint. As a prerequisite this requires making
cmdline_parse() a little more flexible.
Also remove a few variables altogether, and adjust sections
annotations for several others.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Keir Fraser <keir@xen.org>
|
|
|
|
|
|
| |
now that multi-page shadows need not be contiguous.
Signed-off-by: Tim Deegan <Tim.Deegan@citrix.com>
|
|
|
|
|
|
|
|
| |
together using their list headers. Update the users of the
pinned-shadows list to expect l2_32 shadows to have four entries
in the list, which must be kept together during updates.
Signed-off-by: Tim Deegan <Tim.Deegan@citrix.com>
|
|
|
|
|
|
|
|
| |
(where the refcounts are) and check that none of the routines
that do refcounting ever see the second, third or fourth page.
This is just stating and enforcing an existing implicit requirement.
Signed-off-by: Tim Deegan <Tim.Deegan@citrix.com>
|
|
|
|
|
|
|
|
|
|
|
| |
This is because the P2M table, when placed at a kernel specified
location, gets populated with large pages, which the domain must have
a way to unmap/recycle.
Additionally when allowing Dom0 to use superpages, they ought to be
tracked accordingly in the superpage frame table.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The current version of superpage mapping takes a PGT_writable
reference to every page in a superpage each time it is mapped. This
is extremely slow, so slow that applications become unusable.
My solution for this is to introduce a superpage table in the
hypervisor, similar to the frametable structure for pages. Currently
this table only has a type_info element. There are three types a
superpage can have, SGT_mark, SGT_dynamic, or SGT_none.
In normal operation, the first time a superpage is mapped, a
PGT_writable reference is taken to each page in the superpage, and the
superpage is set to type SGT_dynamic and the superpage typecount is
incremented. On subsequent mappings and unmappings, only the
superpage typecount changes. On the last unmap, the PGT_writable
reference on each page is removed.
The SGT_mark type is set and cleared through two new MMUEXT
hypercalls, mark_super and unmark_super. When the hypercall is made,
the superpage's type is set to SGT_mark and a PGT_writable reference
is taken to its pages. On unmark, the type is cleared and the
reference removed.
If a page is already set to SGT_dynamic when mark_super is called, the
type is changed to SGT_mark and no additional PGT_writable reference
is taken. If there are still outstanding mappings of this superpage
when unmark_super is called, the type is set to SGT_dynamic and the
PGT_writable reference is not removed.
Fast superpage mapping is only supported on 64 bit hypervisors. For
32 bit hypervisors, superpage mapping is supported but will be
extremely slow.
Signed-off-by: Dave McCracken <dave.mccracken@oracle.com>
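The type transitions described above can be modelled as a small state machine. This is a sketch under stated assumptions, not the patch's code: struct spage and the spage_*() functions are hypothetical, and a single boolean stands in for the PGT_writable references held on the constituent pages.

```c
#include <stdbool.h>

enum sgt_type { SGT_none, SGT_dynamic, SGT_mark };

struct spage {
    enum sgt_type type;
    unsigned long count;    /* superpage typecount: outstanding mappings */
    bool pages_ref;         /* PGT_writable refs held on the small pages */
};

void spage_map(struct spage *sp)
{
    if (sp->type == SGT_none) {     /* first mapping: take the page refs */
        sp->pages_ref = true;
        sp->type = SGT_dynamic;
    }
    sp->count++;                    /* later mappings only bump the count */
}

void spage_unmap(struct spage *sp)
{
    if (--sp->count == 0 && sp->type == SGT_dynamic) {
        sp->pages_ref = false;      /* last unmap: drop the page refs */
        sp->type = SGT_none;
    }
}

void spage_mark(struct spage *sp)
{
    if (sp->type == SGT_none)       /* already dynamic: refs are reused */
        sp->pages_ref = true;
    sp->type = SGT_mark;
}

void spage_unmark(struct spage *sp)
{
    if (sp->type != SGT_mark)
        return;
    if (sp->count > 0) {            /* mappings remain: keep the refs */
        sp->type = SGT_dynamic;
    } else {
        sp->pages_ref = false;
        sp->type = SGT_none;
    }
}
```

The point of the design is visible in spage_map(): the expensive per-page work happens only on the transition out of SGT_none, and every other mapping is a single counter update.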
|
|
|
|
|
|
|
|
|
|
|
|
| |
This has two advantages:
(a) We can move the allocations to a context where we can handle
failure.
(b) We can implement matching deallocations on CPU offline.
Only the idle vcpu structure is now not freed on CPU offline. This
probably does not really matter.
Signed-off-by: Keir Fraser <keir.fraser@citrix.com>
|
|
|
|
|
|
|
| |
This naming scheme is more rational. Also use non-x86-specific
function sync_local_execstate() where possible.
Signed-off-by: Keir Fraser <keir.fraser@citrix.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
int), each corresponding to a single sharable page. Externally all sharing related
operations (e.g. nominate/share) will use sharing handles, thus solving a lot of
consistency problems (like: is this sharable page still the same sharable page
as before).
Internally, sharing handles can be translated to MFNs (using a newly created
hashtable), and then for each MFN a doubly linked list of GFNs translating to
this MFN is maintained. Finally, the sharing handle is stored in the page_info
structure for each sharable MFN.
All this allows pages to be shared and unshared efficiently. However, at the moment a
single lock is used to protect the sharing handle hash table. For scalability
reasons, the locking needs to be made more granular.
Signed-off-by: Grzegorz Milos <Grzegorz.Milos@citrix.com>
|
|
|
|
|
|
|
|
|
|
| |
when an MFN is shared. However, all existing calls can either infer the GFN (for
example p2m table destructor) or will not need to know GFN for shared pages.
This patch identifies and fixes all the M2P accessors, either by removing the
translation altogether or by making the relevant modifications. Shared MFNs have
a special value of SHARED_M2P_ENTRY stored in their M2P table slot.
Signed-off-by: Grzegorz Milos <Grzegorz.Milos@citrix.com>
|
|
|
|
|
|
|
|
|
|
| |
domain called 'dom_cow'. In order to share a page, the type needs to be changed
to PGT_shared_page and the owner to dom_cow. Only pages with PGT_none, and no
type count are allowed to become sharable. Conversely, sharable pages can only be
made 'private' if type count equals one. page_make_sharable() and
page_make_private() handle these transitions.
Signed-off-by: Grzegorz Milos <Grzegorz.Milos@citrix.com>
|
|
|
|
|
| |
Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com>
|
|
|
|
|
|
|
|
|
|
|
| |
The basic work flow to handle the memory hotadd is:
Update node information
Map new pages to xen 1:1 mapping
Setup frametable for new memory range
Setup m2p table for new memory range
Put the new pages to domheap
Signed-off-by: Jiang, Yunhong <yunhong.jiang@intel.com>
|
|
|
|
|
|
|
|
|
|
| |
hotplug in page fault handler.
In the compat guest situation, the compat m2p table is copied, not
directly mapped in L3, so we have to sync it. The direct mapping range
may change, and we need to sync it with the guest's table.
Signed-off-by: Jiang, Yunhong <yunhong.jiang@intel.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently the Xen hypervisor uses nodes to keep the start/end address of
each node. It assumes that memory of different nodes does not overlap,
which is not always true, especially if we have memory hotplug support
in the system.
This patch backports the Linux kernel's memblks to support overlap
among nodes. The memblks will be used both for checking conflicts and
for calculating memnode_shift.
Also, currently if there is no memory populated in a node when the system
boots, the node will be unparsed later, and the corresponding CPU's
NUMA information will be removed as well. This patch keeps the CPU
information.
One thing to note is that currently we calculate memnode_shift with all
memory, including un-populated ranges. This should work if the smallest
chunk is not too small. Other options would be flags in the page_info
structure, etc.
The memnodemap is changed from paddr to pdx, both to save space and
because currently most accesses are by pfn.
A flag, mem_hotplug, is added if there is a hotplug memory range.
Signed-off-by: Jiang, Yunhong <yunhong.jiang@intel.com>
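A lookup through a pdx-indexed memnodemap can be sketched as below. The table size, the shift value and the helper names are assumptions made for illustration, not values from the patch. Callers of set_node_range() are assumed to pass start_pdx < end_pdx with both inside the table.

```c
#include <stdint.h>

#define NO_NODE 0xff

/* Hypothetical: each map slot covers 2^4 = 16 page indices. */
unsigned int memnode_shift = 4;

/* One node ID per slot; 8 slots cover pdx values 0..127 in this sketch. */
uint8_t memnodemap[8] = {
    NO_NODE, NO_NODE, NO_NODE, NO_NODE, NO_NODE, NO_NODE, NO_NODE, NO_NODE
};

/* Assign the slots covering [start_pdx, end_pdx) to node 'nid'. */
void set_node_range(unsigned long start_pdx, unsigned long end_pdx,
                    uint8_t nid)
{
    unsigned long first = start_pdx >> memnode_shift;
    unsigned long last = (end_pdx - 1) >> memnode_shift;

    for (unsigned long i = first; i <= last; i++)
        memnodemap[i] = nid;
}

/* Node lookup is a shift plus one array access. */
uint8_t pdx_to_nid(unsigned long pdx)
{
    return memnodemap[pdx >> memnode_shift];
}
```

The larger memnode_shift can be chosen, the fewer slots the map needs, which is why the size of the smallest chunk matters for the calculation.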
|
|
|
|
|
|
|
| |
... to where it really is needed and meaningful (i.e. in some places
it seems to make more sense to use __x86_64__ instead).
Signed-off-by: Jan Beulich <jbeulich@novell.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Avoid backing frame table holes with memory, when those holes are
large enough to cover an exact multiple of large pages. This is based
on the introduction of a bit map, where each bit represents one such
range, thus allowing mfn_valid() checks to easily filter out those
MFNs that now shouldn't be used to index the frame table.
This allows for saving a couple of 2M pages even on "normal" systems.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
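The bitmap idea can be sketched as follows, with one bit per group of MFNs whose frame table entries would share a large page. The group size, the limit and the function names are hypothetical, and sketch_mfn_valid() is only a stand-in for the real mfn_valid() check.

```c
#include <limits.h>
#include <stdbool.h>

#define GROUP_SHIFT   4         /* hypothetical: 16 MFNs per group */
#define MAX_MFN       256UL     /* hypothetical upper bound */
#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* One bit per group; a set bit means the frame table is backed there. */
unsigned long group_valid[(MAX_MFN >> GROUP_SHIFT) / BITS_PER_LONG + 1];

/* Record that the group containing 'mfn' has frame table memory. */
void mark_group_backed(unsigned long mfn)
{
    unsigned long g = mfn >> GROUP_SHIFT;

    group_valid[g / BITS_PER_LONG] |= 1UL << (g % BITS_PER_LONG);
}

/* An MFN may index the frame table only if its group is backed. */
bool sketch_mfn_valid(unsigned long mfn)
{
    unsigned long g = mfn >> GROUP_SHIFT;

    return mfn < MAX_MFN &&
           ((group_valid[g / BITS_PER_LONG] >> (g % BITS_PER_LONG)) & 1);
}
```

Groups that fall entirely into a hole simply never get their bit set, so no memory has to be spent on frame table entries for them.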
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Introduces a virtual space conserving transformation on the MFN thus
far used to index 1:1 mapping and frame table, removing the largest
range of contiguous bits (below the most significant one) which are
zero for all valid MFNs from the MFN representation, to be used to
index into those arrays, thereby cutting the virtual range these
tables must cover approximately by half with each bit removed.
Since this should account for hotpluggable memory (so as not to
require a rewrite when that gets supported), the determination of
which bits are candidates for removal must not be based on the E820
information, but instead has to use the SRAT. That in turn requires a
change to the ordering of steps done during early boot.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
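The transformation can be sketched as removing a run of always-zero bits from the middle of the MFN. The variable and function names below are assumptions for illustration, and pdx_setup() takes the hole position directly instead of deriving it from the SRAT.

```c
/* Describes the removed bit range; the defaults mean "no compression". */
unsigned int hole_shift;            /* number of bits removed */
unsigned long bottom_mask = ~0UL;   /* bits below the hole */
unsigned long top_mask;             /* bits above the hole */

/* Remove 'hole_bits' bits starting at bit position 'bottom_bits'. */
void pdx_setup(unsigned int bottom_bits, unsigned int hole_bits)
{
    hole_shift = hole_bits;
    bottom_mask = (1UL << bottom_bits) - 1;
    top_mask = ~((1UL << (bottom_bits + hole_bits)) - 1);
}

/* Squeeze the hole out: the upper part moves down by hole_shift bits. */
unsigned long pfn_to_pdx(unsigned long pfn)
{
    return (pfn & bottom_mask) | ((pfn & top_mask) >> hole_shift);
}

/* Inverse transformation: re-insert the zero bits. */
unsigned long pdx_to_pfn(unsigned long pdx)
{
    return (pdx & bottom_mask) | ((pdx << hole_shift) & top_mask);
}
```

For example, after pdx_setup(20, 8) the frame number 0x10012345 maps to index 0x112345. The index space is 2^8 times smaller than the frame number space, in line with the message's point that each removed bit halves the range the tables must cover.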
|