| Commit message | Author | Age | Files | Lines |
* vm/vm_page.c(vm_page_setup): %lu -> %zu
vm/vm_page.c: In function 'vm_page_setup':
vm/vm_page.c:1425:41: warning: format '%lu' expects argument of type 'long unsigned int', but argument 2 has type 'size_t' {aka 'unsigned int'} [-Wformat=]
1425 | printf("vm_page: page table size: %lu entries (%luk)\n", nr_pages,
| ~~^ ~~~~~~~~
| | |
| long unsigned int size_t {aka unsigned int}
| %u
vm/vm_page.c:1425:54: warning: format '%lu' expects argument of type 'long unsigned int', but argument 3 has type 'size_t' {aka 'unsigned int'} [-Wformat=]
1425 | printf("vm_page: page table size: %lu entries (%luk)\n", nr_pages,
| ~~^
| |
| long unsigned int
| %u
1426 | table_size >> 10);
| ~~~~~~~~~~~~~~~~
| |
| size_t {aka unsigned int}
Message-ID: <20241020190744.2522-1-jbranso@dismail.de>
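For illustration, a minimal sketch of the corrected call, assuming the variable names shown in the warning above (the surrounding function body is not quoted here):

    #include <stdio.h>
    #include <stddef.h>

    /* Sketch: %zu matches size_t on both ILP32 and LP64 builds, which is
     * what silences the -Wformat warnings quoted above. */
    static void print_page_table_stats(size_t nr_pages, size_t table_size)
    {
        printf("vm_page: page table size: %zu entries (%zuk)\n",
               nr_pages, table_size >> 10);
    }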
When operating on the kernel map, vm_map_pageable_scan() does what
the code itself describes as "HACK HACK HACK HACK": it unlocks the map,
and calls vm_fault_wire() with the map unlocked. This hack is required
to avoid a deadlock in case vm_fault or one of its callees (perhaps, a
pager) needs to allocate memory in the kernel map. The hack relies on
other kernel code being "well-behaved", in particular on that nothing
will do any serious changes to this region of memory while the map is
unlocked, since this region of memory is "owned" by the caller.

Even if the kernel code is "well-behaved" and doesn't alter VM regions
that it doesn't "own", it can still access adjacent regions. While this
doesn't affect the region being wired down as such, it can still end up
causing trouble due to extension & coalescence (merging) of VM entries.
VM entry coalescence is an optimization where two adjacent VM entries
with identical properties are merged into a single one that spans the
combined region of the two original entries. VM entry extension is a
similar optimization where an existing VM entry is extended to cover
an adjacent region, instead of a new VM entry being created to describe
the region.

These optimizations are a private implementation detail of vm_map, and
(while they can be observed through e.g. vm_region) they are not
supposed to cause any visible effects to how the described regions of
memory behave; coalescence/extension and clipping happen automatically
as needed when adding or removing mappings, or changing their
properties. This is why it's fine for "well-behaved" kernel code to
unknowingly cause extension or coalescence of VM entries describing a
region by operating on adjacent VM regions.

The "HACK HACK HACK HACK" code path relies on the VM entries in the
region staying intact while it keeps the map unlocked, as it passes
direct pointers to the entries into vm_fault_wire(), and also walks the
list of entries in the region by following the vme_next pointers in the
entries. Yet, this assumption is violated by the entries getting
concurrently modified by other kernel code operating on adjacent VM
regions, as described above. This is not only undefined behavior in the
sense of the C language standard, but can also cause very real issues.

Specifically, we've been seeing the VM subsystem deadlock when building
Mach with SMP support and running a test program that calls
mach_port_names() concurrently and repeatedly. The mach_port_names()
implementation allocates and wires down memory, and when called from
multiple threads, it was likely to allocate, and wire, several adjacent
regions of memory, which would then cause entry coalescence/extension
and clipping to kick in. The specific sequence of events that led to a
deadlock appears to have been:
1. Multiple threads execute mach_port_names() concurrently.
2. One of the threads is wiring down a memory region, another is
unwiring an adjacent memory region.
3. The wiring thread has unlocked the ipc_kernel_map, and called into
vm_fault_wire().
4. Due to entry coalescence/extension, the entry the wiring thread was
going to wire down now describes a broader region of memory, namely
it includes an adjacent region of memory that has previously been
wired down by the other thread that is about to unwire it.
5. The wiring thread sets the busy bit on a wired-down page that the
unwiring thread is about to unwire, and is waiting to take the map
lock for reading in vm_map_verify().
6. The unwiring thread holds the map lock for writing, and is waiting
for the page to lose its busy bit.
7. Deadlock!

To prevent this from happening, we have to ensure that the VM entries,
at least as passed into vm_fault_wire() and as used for walking the list
of such entries, stay intact while we have the map unlocked. One simple
way to achieve that, which I have proposed previously, is to make a
temporary copy of the VM entries in the region, and pass the copies into
vm_fault_wire(). The entry copies would not be affected by coalescence/
extension, even if the original entries in the map are. This is however
only straightforward to do when there's just a single entry describing
the whole region, and there are further concerns with e.g. whether the
underlying memory objects could, too, get coalesced.

Arguably, making copies of the memory entries is making the hack even
bigger. This patch instead implements a relatively clean solution that,
arguably, makes the whole thing less of a hack: namely, making use of
the in-transition bit on VM entries to prevent coalescence and any other
unwanted effects. The entry in-transition bit was introduced for a very
similar use case: the VM map copyout logic has to temporarily unlock the
map to run its continuation, so it marks the VM entries it copied out
into the map up to that point as being "in transition", asking other
code to hold off making any serious changes to those entries. There's a
companion "needs wakeup" bit that other code can set to block on the VM
entry exiting this in-transition state; the code that puts an entry into
the in-transition state is expected, when clearing the in-transition bit
again, to check for needs_wakeup being set, and wake any waiters up in
that case, so they can retry whatever operation they wanted to do. There
is no need to check for needs_wakeup in case of vm_map_pageable_scan(),
however, exactly because we expect kernel code to be "well-behaved" and
not make any attempts to modify the VM region.

This relies on the in-transition bit inhibiting coalescence/extension,
as implemented in the previous commit.

Also, fix a tiny sad misaligned comment line.

Reported-by: Damien Zammit <damien@zamaudio.com>
Helped-by: Damien Zammit <damien@zamaudio.com>
Message-ID: <20240405151850.41633-3-bugaevc@gmail.com>
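For illustration only, a rough sketch of the pattern described above; the field names (in_transition, needs_wakeup, vme_start, vme_next) are real vm_map_entry fields, but the function shape and variable names are assumed, not taken from the actual vm_map_pageable_scan():

    /* Hypothetical sketch of wiring a kernel-map region with the
     * in-transition bit held, as the commit describes. */
    static void
    wire_region_sketch(vm_map_t map, vm_map_entry_t start_entry,
                       vm_offset_t end_addr)
    {
        vm_map_entry_t entry;

        /* Mark every entry in the region as in transition, so operations
         * on adjacent regions can neither coalesce nor extend them. */
        for (entry = start_entry;
             (entry != vm_map_to_entry(map)) && (entry->vme_start < end_addr);
             entry = entry->vme_next)
            entry->in_transition = TRUE;

        vm_map_unlock(map);                 /* the "HACK HACK HACK HACK" unlock */

        for (entry = start_entry;
             (entry != vm_map_to_entry(map)) && (entry->vme_start < end_addr);
             entry = entry->vme_next)
            vm_fault_wire(map, entry);      /* may fault with the map unlocked */

        vm_map_lock(map);

        /* Leave the in-transition state; no needs_wakeup check is needed
         * here, because kernel code is assumed to be "well-behaved" and
         * not to touch this region. */
        for (entry = start_entry;
             (entry != vm_map_to_entry(map)) && (entry->vme_start < end_addr);
             entry = entry->vme_next)
            entry->in_transition = FALSE;
    }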
The in-transition mechanism exists to make it possible to unlock a map
while still making sure some VM entries won't disappear from under you.
This is currently used by the VM copyin mechanics.
Entries in this state are better left alone, and extending/coalescing is
only an optimization, so it makes sense to skip it if the entry to be
extended is in transition. vm_map_coalesce_entry() already checks for
this; check for it in other similar places too.
This is in preparation for using the in-transition mechanism for wiring,
where it's much more important that the entries are not extended while
in transition.
Message-ID: <20240405151850.41633-2-bugaevc@gmail.com>
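A minimal sketch of the kind of guard this adds; the surrounding extension logic and variable names are assumed:

    /* Extension/coalescing is only an optimization: skip it when either
     * entry is in transition, so the code that set the bit can rely on
     * the entry keeping its exact bounds while the map is unlocked. */
    if (entry->in_transition || next_entry->in_transition)
        return FALSE;       /* caller falls back to inserting a new entry */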
When operating on the kernel map, vm_map_pageable_scan() does what
the code itself describes as "HACK HACK HACK HACK": it unlocks the map,
and calls vm_fault_wire() with the map unlocked. This hack is required
to avoid a deadlock in case vm_fault or one of its callees (perhaps, a
pager) needs to allocate memory in the kernel map. The hack relies on
other kernel code being "well-behaved", in particular on that nothing
will do any serious changes to this region of memory while the map is
unlocked, since this region of memory is "owned" by the caller.
This reasoning doesn't apply to the validity of the 'end' entry (the
first entry after the region to be wired), since it's not a part of the
region, and is "owned" by someone else. Once the map is unlocked, the
'end' entry could get deallocated. Alternatively, a different entry
could get inserted after the VM region in front of 'end', which would
break the 'for (entry = start; entry != end; entry = entry->vme_next)'
loop condition.
This was not an issue in the original Mach 3 kernel, since it used an
address range check for the loop condition, but got broken in commit
023401c5b97023670a44059a60eb2a3a11c8a929 "VM: rework map entry wiring".
Fix this by switching the iteration back to use an address check.
This partly fixes a deadlock with concurrent mach_port_names() calls on
SMP, which was
Reported-by: Damien Zammit <damien@zamaudio.com>
Message-ID: <20240405151850.41633-1-bugaevc@gmail.com>
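Roughly, the change is from a pointer-based loop bound back to an address-based one (a sketch; the actual variable names in vm_map_pageable_scan() are assumed):

    /* Fragile once the map is unlocked: 'end' may be clipped away, or a
     * new entry may be inserted in front of it. */
    for (entry = start; entry != end; entry = entry->vme_next)
        vm_fault_wire(map, entry);

    /* Address-based termination, as in the original Mach 3 code: */
    for (entry = start; entry->vme_start < end_address; entry = entry->vme_next)
        vm_fault_wire(map, entry);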
non-pageable
Otherwise, if the allocated memory is passed over for returning data such as
in device_read, we end up with
../vm/vm_map.c:4245: vm_map_copyin_page_list: Assertion `src_entry->wired_count > 0' failed.
Debugger invoked: assertion failure
This will prevent calling vm_map_delete without the map locked
unless ref_count is zero.
Message-ID: <20240223081505.458240-1-damien@zamaudio.com>
This adds a parameter called keep_map_locked to vm_map_lookup()
that allows the function to return with the map locked.
This is to prepare for fixing a bug with gsync where the map
is locked twice by mistake.
Co-Authored-By: Sergey Bugaev <bugaevc@gmail.com>
Message-ID: <20240222082410.422869-3-damien@zamaudio.com>
In principle we are actually writing to the allocated area outside of
the vm lock, but better be safe in case somebody changes things.
For rumpdisk to efficiently determine the physical address, both
for checking whether it is below 4GiB, and for giving it to the disk
driver, we need a gnumach primitive (one that is not conditioned by
MACH_VM_DEBUG like mach_vm_region_info and mach_vm_object_pages_phys
are).
* vm/vm_map.c: use actual limits instead of min/max boundaries to
change pageability of the currently mapped memory.
Using the min/max boundaries caused the initial vm_wire_all(host, task,
VM_WIRE_ALL) in glibc startup to fail with KERN_NO_SPACE.
Message-ID: <20240111210907.419689-5-luca@orpolo.org>
When
- extending an existing entry,
- changing protection or inheritance of a range of entries,
we can get several entries that could be coalesced. Attempt to do that.
Message-ID: <20230705141639.85792-4-bugaevc@gmail.com>
This function attempts to coalesce a VM map entry with its preceding
entry. It wraps vm_object_coalesce.
Message-ID: <20230705141639.85792-3-bugaevc@gmail.com>
This is convenient when tracking buffer overflows
*result_paddr + size points exactly past the allocated memory, so it can be
equal to the requested bound.
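In other words, the upper-bound check has to be inclusive; a sketch with hypothetical variable names:

    /* The allocation occupies [*result_paddr, *result_paddr + size);
     * its end address may land exactly on the requested bound. */
    assert(*result_paddr + size <= bound);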
In case pmax is inside a segment, we should avoid using it, and stay
with the previous segment, thus being sure to respect the caller's
constraints.
Rumpdisk needs to allocate dma32 memory areas, so we always need this
limit.
The non-Xen x86_64 case had a typo, and the 32bit PAE case didn't have
the DMA32 limit.
Also, we have to cope with VM_PAGE_DMA32_LIMIT being either above or below
VM_PAGE_DIRECTMAP_LIMIT depending on the case.
vm_object_coalesce() callers used to rely on the fact that it always
merged the next_object into prev_object, potentially destroying
next_object and leaving prev_object the result of the whole operation.
After ee65849bec5da261be90f565bee096abb4117bdd
"vm: Allow coalescing null object with an internal object", this is no
longer true, since in case of prev_object == VM_OBJECT_NULL and
next_object != VM_OBJECT_NULL, the overall result is next_object, not
prev_object. The current callers are prepared to deal with this since
they handle this case separately anyway, but the following commit will
introduce another caller that handles both cases in the same code path.
So, declare the way vm_object_coalesce() coalesces the two objects its
implementation detail, and make it return the resulting object and the
offset into it explicitly. This simplifies the callers, too.
Message-Id: <20230705141639.85792-2-bugaevc@gmail.com>
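A sketch of what the new calling convention looks like from a caller's perspective (parameter and variable names here are illustrative, not the exact prototype):

    vm_object_t result_object;
    vm_offset_t result_offset;

    /* vm_object_coalesce() now reports which object survived and at what
     * offset the combined range starts, instead of callers assuming that
     * prev_object is always the result. */
    if (vm_object_coalesce(prev_object, next_object,
                           prev_offset, next_offset,
                           prev_size, next_size,
                           &result_object, &result_offset)) {
        entry->object.vm_object = result_object;
        entry->offset = result_offset;
    }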
If a deallocated VM map entry refers to an object that only has a single
reference and doesn't have a pager port, we can eagerly release any
physical pages that were contained in the deallocated range.
This is not a 100% solution: it is still possible to "leak" physical
pages that can never appear in virtual memory again by creating several
references to a memory object (perhaps by forking a VM map with
VM_INHERIT_SHARE) and deallocating the pages from all the maps referring
to the object. That being said, it should help to release the pages in
the common case sooner.
Message-Id: <20230626112656.435622-6-bugaevc@gmail.com>
When entering an object into a map, try to extend the next entry
backward, in addition to the previously existing attempt to extend the
previous entry forward.
Message-Id: <20230626112656.435622-5-bugaevc@gmail.com>
Previously, vm_object_coalesce would only succeed with next_object being
VM_OBJECT_NULL (and with the previous patch, with the two object
references pointing to the same object). This patch additionally allows
the inverse: prev_object being VM_OBJECT_NULL and next_object being some
internal VM object that we have not created a pager port for, provided
the offset of the existing mapping in the object allows for placing the
new mapping before it.
This is not used anywhere at the moment (the only caller, vm_map_enter,
ensures that next_object is either VM_OBJECT_NULL or an object that has
a pager port), but it will get used with the next patch.
Message-Id: <20230626112656.435622-4-bugaevc@gmail.com>
If a mapping of an object is made right next to another mapping of the
same object that has the same properties (protection, inheritance, etc.),
Mach will now expand the previous VM map entry to cover the new address
range instead of creating a new entry.
Message-Id: <20230626112656.435622-3-bugaevc@gmail.com>
struct vm_page is supposed to be a "small structure", but it takes up 96
bytes on x86_64 (to represent a 4k page). By utilizing bitfields and
strategically reordering members to avoid excessive padding, it can be
shrunk to 80 bytes.
- page_lock and unlock_request only need to store a bitmask of
VM_PROT_READ, VM_PROT_WRITE, and VM_PROT_EXECUTE. Even though the
special values VM_PROT_NO_CHANGE and VM_PROT_NOTIFY are defined, they
are not used for the two struct vm_page members.
- type and seg_index both need to store one of the four possible values
in the range from 0 to 3. Two bits are sufficient for this.
- order needs to store a number from 0 to VM_PAGE_NR_FREE_LISTS (which
is 11), or a special value VM_PAGE_ORDER_UNLISTED. Four bits are
sufficient for this.
No functional change.
Message-Id: <20230626112656.435622-2-bugaevc@gmail.com>
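For illustration, a sketch of the packing the commit describes (field names and widths follow the text above; the remaining members of struct vm_page are omitted):

    struct vm_page_packed_sketch {
        unsigned page_lock      : 3;  /* VM_PROT_READ|WRITE|EXECUTE mask */
        unsigned unlock_request : 3;  /* same: only a protection mask    */
        unsigned type           : 2;  /* one of four values, 0..3        */
        unsigned seg_index      : 2;  /* one of four values, 0..3        */
        unsigned order          : 4;  /* 0..11 or VM_PAGE_ORDER_UNLISTED */
    };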
types are correct now
Message-Id: <Y+SfNtIRuwj0Zap1@jupiter.tail36e24.ts.net>
The documentation of vm_page_insert says that the object must be locked.
Moreover, the unlock call was there, but no matching lock call was present.
Message-Id: <20230208225436.23365-1-etienne.brateau@gmail.com>
(this is actually a no-op for i386)
When generating stubs, MIG will take the vm_size_array_t and define the input
request struct using rpc_vm_size_t, since the size is variable. This will in
turn cause a mismatch between types (vm_size_t * vs rpc_vm_size_t *). We could
also ask MIG to produce a prototype using rpc_vm_size_t *, but we would need
to change the implementation of the RPC to use rpc_* types anyway, since we
want to avoid another allocation of the array.
Message-Id: <Y9iwScHpmsgY3V0N@jupiter.tail36e24.ts.net>
* i386/i386/io_map.c: code is unused.
* i386/i386/io_perm.c: include mig prototypes.
* i386/i386/mp_desc.c: Deleted interrupt_stack_alloc since it is not
used.
* i386/i386/seg.h: Moved descriptor structs to i386/include/mach/i386/mach_i386_types.h
as that represents the interface types for RPCs.
Defined aliases for real_descriptor since those are used by the i386 RPCs. Inlined many
functions here too and removed seg.c.
* i386/i386/seg.c: Removed. All the functions are inline now.
* i386/i386/trap.c: Use static.
* i386/i386/trap.h: Define missing prototypes.
* i386/i386/tss.h: Use static inline for ltr.
* i386/i386/user_ldt.c: Include mig prototypes.
* i386/include/mach/i386/mach_i386.defs: Define real_descriptor_t types
since those are used in the RPC definition. Now both prototypes and
definitions will match.
* i386/include/mach/i386/mach_i386_types.h: Move struct descriptor
from seg.h since we need those for the RPC interfaces. Removed include
of io_perm.h since it generates circular includes otherwise.
* i386/intel/pmap.c: pmap_map is unused. Added static qualifier for
several functions.
* i386/intel/pmap.h: pmap_update_interrupt declared for non-SMP and SMP.
Message-Id: <Y89+R2VekOQK4IUo@jupiter.lan>
Message-Id: <Y8mYd/pt/og4Tj5I@mercury.tail36e24.ts.net>
This also reverts 566c227636481b246d928772ebeaacbc7c37145b and
963b1794d7117064cee8ab5638b329db51dad854
Message-Id: <Y8d75KSqNL4FFInm@mercury.tail36e24.ts.net>
stack_statistics, swapin_thread_continue, and memory_object_lock_page are
not used outside their module.
mach4.defs and mach_host.defs.
Also move more mach_debug rpcs to kern/mach_debug.h.
Message-Id: <Y7+LPMLOafUQrNHZ@jupiter.tail36e24.ts.net>
Marked some functions as static (private) as needed and added missing
includes.
This also revealed some dead code which was removed.
Note that -Wmissing-prototypes is not enabled here since there are a
bunch more warnings left.
Message-Id: <Y6j72lWRL9rsYy4j@mars>
Most of the changes include defining and using proper function type
declarations (with argument types declared) and avoiding using the
K&R style of function declarations.
Message-Id: <Y6Jazsuis1QA0lXI@mars>
It seems we hit the "unable to recycle any page" case even when there is no
memory pressure, probably just because the pageout thread somehow got
kicked but there is nothing left to page out.
We already use this built-in in other places and this will move us closer to
being able to build the kernel without libc.
Message-Id: <Y5l80/VUFvJYZTjy@jupiter.tail36e24.ts.net>
This allows *printf to use %zd/%zu/%zx to print vm_size_t and
vm_offset_t. Warnings using the incorrect specifiers were fixed.
Note that MACH_PORT_NULL became just 0 because GCC thinks that we were
comparing a pointer to a character (due to it being an unsigned int) so
I removed the explicit cast.
Message-Id: <Y47UNdcUF35Ag4Vw@reue>
If a "wire_required" process calls vm_map_protect(0), the
memory gets unwired as expected. But if the process then calls
vm_map_protect(VM_PROT_READ) again, we need to wire that memory.
(This happens to be exactly what glibc does for its heap.)
This fixes Hurd hangs on lack of memory, during which Mach was swapping
pieces of mach-defpager out.
* vm/memory_object_proxy.c: truncate vm array types as if they were
the rpc_ version because MIG can't handle that. This rpc can't
handle more than one element anyway.
Note that the same issue with vm arrays is present at least with
syscall emulation, but that functionality seems unused for now.
A better fix could be to add a vm descriptor type in include/mach/message.h,
but then probably we don't need to use the rpc_ types in MIG anymore,
they would be needed only for the syscall definitions.
Signed-off-by: Luca Dariz <luca@orpolo.org>
Message-Id: <20220628101054.446126-15-luca@orpolo.org>
* vm/vm_user.c: sign-extend mask with USER32
Signed-off-by: Luca Dariz <luca@orpolo.org>
Message-Id: <20220628101054.446126-6-luca@orpolo.org>
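Presumably the change amounts to something like the following (a sketch only; the exact cast in vm_user.c is not quoted here):

    #ifdef USER32
        /* The 32-bit task passes a 32-bit alignment mask; sign-extend it
         * so its high bits remain set in the kernel's 64-bit vm_offset_t. */
        mask = (vm_offset_t) (int32_t) mask;
    #endif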
Signed-off-by: Luca Dariz <luca@orpolo.org>
Message-Id: <20220628101054.446126-13-luca@orpolo.org>
This allows contiguous allocations aligned to values
smaller than one page, but still a power of 2,
by rounding the alignment up to a full page.
This works because PAGE_SIZE is itself a power of two.
Message-Id: <20220821065732.269573-1-damien@zamaudio.com>
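The core of the idea is just to round the requested alignment up (a minimal sketch; variable names are assumed):

    /* Any power-of-two alignment smaller than a page is implied by page
     * alignment, since PAGE_SIZE is itself a power of two. */
    if (align < PAGE_SIZE)
        align = PAGE_SIZE;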