author     Sergey Bugaev <bugaevc@gmail.com>               2024-04-05 18:18:50 +0300
committer  Samuel Thibault <samuel.thibault@ens-lyon.org>  2024-04-05 17:45:12 +0200
commit     1a4ce71588630224412891fae16c59cde94190bf (patch)
tree       78073b83e7f657820c278d8ce73b0a01878bcf53 /vm
parent     6e5e1f09ef6e1ae0b92d38f0c7960ef3fed9f63b (diff)
vm: Mark entries as in-transition while wiring down
When operating on the kernel map, vm_map_pageable_scan() does what the code itself describes as "HACK HACK HACK HACK": it unlocks the map, and calls vm_fault_wire() with the map unlocked. This hack is required to avoid a deadlock in case vm_fault or one of its callees (perhaps a pager) needs to allocate memory in the kernel map. The hack relies on other kernel code being "well-behaved", in particular on the assumption that nothing will make any serious changes to this region of memory while the map is unlocked, since this region of memory is "owned" by the caller.

Even if the kernel code is "well-behaved" and doesn't alter VM regions that it doesn't "own", it can still access adjacent regions. While this doesn't affect the region being wired down as such, it can still end up causing trouble due to extension & coalescence (merging) of VM entries.

VM entry coalescence is an optimization where two adjacent VM entries with identical properties are merged into a single one that spans the combined region of the two original entries. VM entry extension is a similar optimization where an existing VM entry is extended to cover an adjacent region, instead of a new VM entry being created to describe the region. These optimizations are a private implementation detail of vm_map, and (while they can be observed through e.g. vm_region) they are not supposed to cause any visible effects on how the described regions of memory behave; coalescence/extension and clipping happen automatically as needed when adding or removing mappings, or changing their properties. This is why it's fine for "well-behaved" kernel code to unknowingly cause extension or coalescence of VM entries describing a region by operating on adjacent VM regions.
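To make the idea concrete, here is a minimal sketch in C of the eligibility check such a coalescing routine might perform. The field names follow struct vm_map_entry as used elsewhere in this patch, but the exact set of compared properties is illustrative, not the verbatim logic of vm_map_coalesce_entry() (which the last hunk below touches):

    /*
     * Illustrative sketch only: when may `entry` be merged into its
     * predecessor `prev`?  Assumes gnumach's <vm/vm_map.h> for
     * struct vm_map_entry and boolean_t; the real check in
     * vm/vm_map.c compares more properties than shown here.
     */
    static boolean_t
    can_coalesce(const struct vm_map_entry *prev,
                 const struct vm_map_entry *entry)
    {
            return prev->vme_end == entry->vme_start      /* adjacent */
                && prev->protection == entry->protection
                && prev->max_protection == entry->max_protection
                && prev->inheritance == entry->inheritance
                && prev->wired_count == entry->wired_count
                /* With this series applied, an entry that is in
                 * transition must never be merged or extended: */
                && !prev->in_transition
                && !entry->in_transition;
    }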
The "HACK HACK HACK HACK" code path relies on the VM entries in the region staying intact while it keeps the map unlocked, as it passes direct pointers to the entries into vm_fault_wire(), and also walks the list of entries in the region by following the vme_next pointers in the entries. Yet, this assumption is violated by the entries getting concurrently modified by other kernel code operating on adjacent VM regions, as described above. This is not only undefined behavior in the sense of the C language standard, but can also cause very real issues.

Specifically, we've been seeing the VM subsystem deadlock when building Mach with SMP support and running a test program that calls mach_port_names() concurrently and repeatedly. The mach_port_names() implementation allocates and wires down memory, and when called from multiple threads, it was likely to allocate, and wire, several adjacent regions of memory, which would then cause entry coalescence/extension and clipping to kick in. The specific sequence of events that led to a deadlock appears to have been:

1. Multiple threads execute mach_port_names() concurrently.
2. One of the threads is wiring down a memory region, another is unwiring an adjacent memory region.
3. The wiring thread has unlocked the ipc_kernel_map, and called into vm_fault_wire().
4. Due to entry coalescence/extension, the entry the wiring thread was going to wire down now describes a broader region of memory: namely, it includes an adjacent region of memory that has previously been wired down by the other thread that is about to unwire it.
5. The wiring thread sets the busy bit on a wired-down page that the unwiring thread is about to unwire, and is waiting to take the map lock for reading in vm_map_verify().
6. The unwiring thread holds the map lock for writing, and is waiting for the page to lose its busy bit.
7. Deadlock!

To prevent this from happening, we have to ensure that the VM entries, at least as passed into vm_fault_wire() and as used for walking the list of such entries, stay intact while we have the map unlocked. One simple way to achieve that, which I have proposed previously, is to make a temporary copy of the VM entries in the region, and pass the copies into vm_fault_wire(). The entry copies would not be affected by coalescence/extension, even if the original entries in the map are. This is however only straightforward to do when there's just a single entry describing the whole region, and there are further concerns with e.g. whether the underlying memory objects could, too, get coalesced. Arguably, making copies of the memory entries would make the hack even bigger.

This patch instead implements a relatively clean solution that, arguably, makes the whole thing less of a hack: namely, making use of the in-transition bit on VM entries to prevent coalescence and any other unwanted effects.

The entry in-transition bit was introduced for a very similar use case: the VM map copyout logic has to temporarily unlock the map to run its continuation, so it marks the VM entries it has copied out into the map up to that point as being "in transition", asking other code to hold off making any serious changes to those entries. There's a companion "needs wakeup" bit that other code can set to block on the VM entry exiting this in-transition state; the code that puts an entry into the in-transition state is expected, when clearing the in-transition bit again, to check for needs_wakeup being set, and to wake any waiters up in that case, so they can retry whatever operation they wanted to do (a sketch of this wait/wakeup protocol is given below). There is no need to check for needs_wakeup in the case of vm_map_pageable_scan(), however, exactly because we expect kernel code to be "well-behaved" and not make any attempts to modify the VM region. This relies on the in-transition bit inhibiting coalescence/extension, as implemented in the previous commit.

Also, fix a tiny sad misaligned comment line.

Reported-by: Damien Zammit <damien@zamaudio.com>
Helped-by: Damien Zammit <damien@zamaudio.com>
Message-ID: <20240405151850.41633-3-bugaevc@gmail.com>
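For reference, the wait/wakeup protocol described in the message can be sketched as follows. This is illustrative C, assuming the Mach-style helpers vm_map_entry_wait() (taken here to unlock the map and sleep) and vm_map_entry_wakeup(); it is not a verbatim excerpt from vm/vm_map.c:

    /*
     * Sketch of the in-transition / needs_wakeup convention.
     * Both fragments run with the map locked for writing.
     */

    /* Waiter side: another thread finds the entry in transition. */
    if (entry->in_transition) {
            entry->needs_wakeup = TRUE;
            vm_map_entry_wait(map, FALSE);  /* unlocks the map, sleeps */
            vm_map_lock(map);
            /* The entry may have been clipped or merged while the map
             * was unlocked, so it must be looked up again before the
             * operation is retried. */
            goto relookup;
    }

    /* Owner side: the thread that set the bit, once it is done. */
    entry->in_transition = FALSE;
    if (entry->needs_wakeup) {
            entry->needs_wakeup = FALSE;
            vm_map_entry_wakeup(map);       /* unblock any waiters */
    }

As the message notes, vm_map_pageable_scan() only needs the owner side's assertion that needs_wakeup was never set, since well-behaved kernel code should not have touched the region at all.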
Diffstat (limited to 'vm')
-rw-r--r--  vm/vm_map.c  27
1 file changed, 26 insertions(+), 1 deletion(-)
diff --git a/vm/vm_map.c b/vm/vm_map.c
index 6bfc527f..8a9c6988 100644
--- a/vm/vm_map.c
+++ b/vm/vm_map.c
@@ -1555,9 +1555,21 @@ static void vm_map_pageable_scan(
* while we have it unlocked. We cannot trust user threads
* to do the same.
*
+ * We set the in_transition bit in the entries to prevent
+ * them from getting coalesced with their neighbors at the
+ * same time as we're accessing them.
+ *
* HACK HACK HACK HACK
*/
if (vm_map_pmap(map) == kernel_pmap) {
+ for (entry = start_entry;
+ (entry != vm_map_to_entry(map)) &&
+ (entry->vme_end <= end);
+ entry = entry->vme_next) {
+ assert(!entry->in_transition);
+ entry->in_transition = TRUE;
+ entry->needs_wakeup = FALSE;
+ }
vm_map_unlock(map); /* trust me ... */
} else {
vm_map_lock_set_recursive(map);
@@ -1583,6 +1595,19 @@ static void vm_map_pageable_scan(
if (vm_map_pmap(map) == kernel_pmap) {
vm_map_lock(map);
+ for (entry = start_entry;
+ (entry != vm_map_to_entry(map)) &&
+ (entry->vme_end <= end);
+ entry = entry->vme_next) {
+ assert(entry->in_transition);
+ entry->in_transition = FALSE;
+ /*
+ * Nothing should've tried to access
+ * this VM region while we had the map
+ * unlocked.
+ */
+ assert(!entry->needs_wakeup);
+ }
} else {
vm_map_lock_clear_recursive(map);
}
@@ -5037,7 +5062,7 @@ vm_map_coalesce_entry(
/*
* Get rid of the entry without changing any wirings or the pmap,
- *	 and without altering map->size.
+ *	and without altering map->size.
*/
prev->vme_end = entry->vme_end;
vm_map_entry_unlink(map, entry);