Commit message | Author | Date | Files | Lines
...
* Add missing caddr_t type | Samuel Thibault | 2024-10-17 | 1 | -2/+0
  Now that gnumach does not define it any more.
* fpu_set_state: accept fp_save_kind being incoherent if initialized is not set | Samuel Thibault | 2024-09-08 | 1 | -1/+1
* add tests for FLOAT/XFLOAT state | Luca Dariz | 2024-09-08 | 4 | -1/+259
  Message-ID: <20240904201806.510082-2-luca@orpolo.org>
* fpu_set_state: return an error on incoherent fp_save_kind | Samuel Thibault | 2024-09-08 | 1 | -1/+3
* add xfloat thread state interface | Luca Dariz | 2024-09-08 | 5 | -43/+161
  * i386/i386/fpu.c: extend the current getter and setter to support the
    extended state; move the struct casting here to reuse the locking and
    allocation logic for the thread state; make sure the new state is set
    as valid, otherwise it won't be applied; add i386_get_xstate_size()
    to dynamically retrieve the FPU state size.
  * i386/i386/fpu.h: update prototypes to accept a generic thread state.
  * i386/i386/pcb.c: forward the raw thread state to the getter and
    setter, only checking for minimum size, and use the new
    i386_get_xstate_size() helper.
  * i386/include/mach/i386/mach_i386.defs: expose the new helper
    i386_get_xstate_size().
  * i386/include/mach/i386/thread_status.h: add the interface definition
    for I386_XFLOAT_STATE and the corresponding data structure.
  Message-ID: <20240904201806.510082-1-luca@orpolo.org>
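  A hedged sketch of how userland might use the new helper; the MIG
  stub's exact signature and header location are assumptions, not
  something this log confirms:

      #include <mach.h>
      #include <mach/machine/mach_i386.h>  /* assumed header for the stub */

      kern_return_t fetch_xstate_size(vm_size_t *size)
      {
          /* Ask the kernel how many bytes the extended FPU state needs,
             so a correctly sized buffer can be handed to
             thread_get_state() with the I386_XFLOAT_STATE flavor. */
          return i386_get_xstate_size(mach_host_self(), size);
      }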
* x86_64: fix double fault handler | Luca Dariz | 2024-09-08 | 1 | -1/+1
  * x86_64/locore.S: adjust to the changes in the thread state structure
    (segment registers), and add the missing opcode.
  Message-ID: <20240904201806.510082-3-luca@orpolo.org>
* add rpc interrupted test | Luca Dariz | 2024-08-22 | 3 | -0/+96
  * tests/test-machmsg.c: add two use cases used by glibc during signal
    handling.
  * tests/include/testlib.h, tests/testlib.c: add a new
    wait_thread_terminated() helper.
  Message-ID: <20240821163616.189307-3-luca@orpolo.org>
* fpu: Drop conflicting alignment | Samuel Thibault | 2024-07-31 | 1 | -1/+1
  The struct i386_xfp_xstate_header header field is at offset 440 of
  struct i386_xfp_save, and 440 mod 64 = 56, so the offset is not a
  multiple of 64 anyway.
* Add thread_get_name RPC to get the name of a thread. | Flavio Cruz | 2024-07-14 | 2 | -0/+29
  Message-ID: <6qm4fdtthi5nrmmleum7z2xemxz77adohed454eaeuzlmvfx4d@l3pyff4tqwry>
* Include stddef.h in sys/types.h to get size_t and NULL. | Flavio Cruz | 2024-07-10 | 1 | -22/+1
  Remove unnecessary definitions from sys/types.h.
  Message-ID: <oitneneybjishhqq7bgedkasrqqd6nq7vselruaacw27sbe47e@6rt3xbi7fnie>
* Ensure we always pass -ffreestanding -nostdlib even if CFLAGS are overridden. | Flavio Cruz | 2024-07-10 | 2 | -1/+10
  Message-ID: <4cea36qrjeo7tkklmqcwgkrxstxiqykdofha65zxmpni2o6lp3@2offokab6fvn>
* Fix bogus format | Samuel Thibault | 2024-07-09 | 1 | -1/+1
* Fix xen build | Samuel Thibault | 2024-07-07 | 3 | -0/+4
  with -Werror=incompatible-pointer-types and
  -Werror=implicit-function-declaration.
* Disable specific warnings that are now errors in GCC 14 | Flavio Cruz | 2024-06-24 | 1 | -1/+7
  Message-ID: <376mwj4qtzxqgg2p4teqefxep7qz2kxll25synb3sulgof24j5@wxhqtaf7ei32>
* tests/machmsg: check rx message size on different code paths | Luca Dariz | 2024-06-12 | 1 | -3/+114
  * tests/test-machmsg.c: add more combinations to the existing cases:
    - make tx and rx ports independent in the send/receive tests
    - add two more variants of the send/receive tests, using two separate
      system calls, which exercise different code paths in mach_msg().
  Message-ID: <20240612062755.116308-2-luca@orpolo.org>
* x86_64: fix msg size forwarding in case it's not set by userspace | Luca Dariz | 2024-06-12 | 1 | -1/+3
  * ipc/copy_user.c: recent MIG stubs should always fill in the size
    correctly in the msg header, but we shouldn't rely on that. Instead,
    we use the size that was correctly copied in, overwriting the value
    in the header. This is already done by the 32-bit copyinmsg(), and
    was missing in the 64-bit version. Furthermore, the assertion about
    user/kernel size makes sense with and without USER32, so take it out
    of the #ifdef.
  Message-ID: <20240612062755.116308-1-luca@orpolo.org>
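  The gist of the fix lends itself to a short sketch; this is a
  simplified, hypothetical rendition, not the actual ipc/copy_user.c
  code:

      #include <stddef.h>
      #include <mach/message.h>

      /* Kernel primitive; returns nonzero if the user range is bad. */
      extern int copyin(const void *userbuf, void *kernelbuf, size_t size);

      mach_msg_return_t copyinmsg_sketch(const void *userbuf,
                                         void *kernelbuf, size_t usize)
      {
          mach_msg_header_t *kmsg = kernelbuf;

          if (copyin(userbuf, kernelbuf, usize))
              return MACH_SEND_INVALID_DATA;

          /* Don't rely on the size userspace left in the header;
             overwrite it with the size that was actually copied in,
             as the 32-bit copyinmsg() already does. */
          kmsg->msgh_size = (mach_msg_size_t) usize;
          return MACH_MSG_SUCCESS;
      }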
* Add a test for thread state | Sergey Bugaev | 2024-04-16 | 2 | -1/+217
  This tests generating and handling exceptions, thread_get_state(),
  thread_set_state(), and newly added thread_set_self_state(). It does
  many of the same things that glibc does when handling a signal.
  Message-ID: <20240416071013.85596-1-bugaevc@gmail.com>
* Add thread_set_self_state() trap | Sergey Bugaev | 2024-04-16 | 5 | -1/+50
  This is a new Mach trap that sets the calling thread's state to the
  passed value, as if with a call to the thread_set_state() RPC. If the
  flavor of state being set is the one that contains the register used
  for the syscall return value (i386_THREAD_STATE or
  i386_REGS_SEGS_STATE on x86, AARCH64_THREAD_STATE on AArch64), the set
  register value is *not* overwritten with KERN_SUCCESS when the state
  gets set successfully, yet errors do get reported if the syscall
  fails.

  Although the trap is intended to enable userland to implement
  sigreturn functionality in the AArch64 port (more on which below), the
  trap itself is architecture-independent, and fully implemented in
  terms of the existing kernel routines (thread_setstatus &
  thread_set_syscall_return).

  This trap's functionality is similar to sigreturn() on Unix or
  NtContinue() on NT. The use case for all of these is restoring the
  local state of an interrupted thread in the following set-up:

  1. A thread is running some arbitrary code.
  2. An event happens that deserves the thread's immediate attention,
     analogous to a hardware interrupt request. This might be caused by
     the thread itself (e.g. running into a Mach exception that was
     arranged to be handled by the same thread), or by external events
     (e.g. receiving a Unix SIGCHLD).
  3. Another thread (or perhaps the kernel, although this is not the
     case on Mach) suspends the thread, saves its state at the point of
     interruption, alters its state to execute some sort of handler for
     the event, and resumes the thread again, now running the handler.
  4. Once the thread is done running the handler, it wants to return to
     what it was doing at the time it was interrupted. To do this, it
     needs to restore the state as saved at the moment of interruption.

  Unlike with setjmp()/longjmp(), we cannot rely on the interrupted
  logic collaborating in any way, as it's not aware that it's being
  interrupted. This means that we have to fully restore the state,
  including the values of all the general-purpose registers, as well as
  the stack pointer, program counter, and any state flags.

  Depending on the instruction set, this may or may not be possible to
  do fully in userland, simply by loading all the registers with their
  saved values. It should be more or less easy to load the saved values
  into the general-purpose registers, but state flags and the program
  counter can be more of a challenge. Loading the program counter value
  (in other words, performing an indirect jump to the interrupted
  instruction) has to be the very last thing we do, since we don't
  control the program flow after that. The only real place the program
  counter can be loaded from is off the stack, since all general-purpose
  registers would already contain their restored values by that point,
  and using global storage is incompatible with another interruption of
  the same kind happening at the time we were about to return. For the
  same reason, the saved program counter cannot really be stored outside
  of the "active" stack area (such as below the stack pointer), since
  otherwise it can get clobbered by another interruption.

  This means that to support fully-userland returns, the instruction set
  must provide a single instruction that loads an address from the
  stack, adjusts the stack pointer, and performs an indirect jump to the
  loaded address. The instruction must also either preserve the
  (previously restored) state flags, or load state flags from the stack
  in addition to the jump address.

  On x86, 'ret' is such an instruction: it pops an address from the
  stack, adjusting the stack pointer without modifying flags, and
  performs an indirect jump to the address. On x86_64, where the ABI
  mandates a red zone, one can use the 'ret imm16' variant to
  additionally adjust the stack pointer by the size of the red zone,
  atomically restoring the value of the stack pointer at the time of the
  interruption while loading the return address from outside the red
  zone. This is how sigreturn is implemented in glibc for the Hurd on
  x86.

  On ARM AArch32, 'pop {pc}' (alternatively written 'ldr pc, [sp], #4')
  is such an instruction: since SP and PC are just general-purpose,
  directly accessible registers (r13 and r15), it is possible to perform
  a load from the address pointed to by SP into PC, with a
  post-increment of SP. It is, in fact, possible to restore all the
  other general-purpose registers too in a single instruction this way:
  'pop {r0-r12, r14, r15}' will do that; here r13, the stack pointer,
  gets incremented after all the other registers get loaded from the
  stack. This also preserves the CPSR flags, which would need to be
  restored just prior to the 'pop'.

  On ARM AArch64, however, PC is no longer a directly accessible
  general-purpose register (and SP is only accessible that way by some
  of the instructions), so it is no longer possible to load PC from
  memory in a single instruction. The only way to perform an indirect
  jump is by using one of the dedicated branching instructions ('br',
  'blr', or 'ret'). All of them accept the address to branch to in a
  general-purpose register, which is incompatible with our use case.

  Moreover, with the BTI extension, there is a BTYPE field in PSTATE
  that tracks which type (if any) of an indirect branch was the last
  executed instruction; this is then used to raise an exception if the
  instruction the indirect branch lands on was not intended to be a
  target of an indirect branch (of a matching type). It is important to
  restore the BTYPE (among the other state) when returning to an
  interrupted context; failing to do that will either cause an
  unexpected BTI failure exception (if the last executed instruction
  before the interruption was not an indirect branch, but the last
  instruction of the restoration logic is), or open up a window for
  exploitation (if the last executed instruction before the interruption
  was an indirect branch, but the last instruction of the restoration
  logic is not -- note that 'ret' is not considered an indirect branch
  for the purposes of BTI).

  So, it is not possible to fully restore the state of an interrupted
  context in userland on AArch64. The kernel can do that, however (and
  is in fact doing just that every time it handles a fault or an IRQ):
  the 'eret' instruction for returning from an exception is accessible
  to EL1 (the kernel), but not EL0 (the user). 'eret' atomically
  restores PC from the ELR_EL1 system register, and PSTATE from the
  SPSR_EL1 system register (and does other things); both of these system
  registers are inaccessible from userland, and so couldn't have been
  used by the interrupted context for any purpose, meaning their values
  don't need to be restored. (They can be used by the kernel code, which
  presents an additional complication when it's the kernel context that
  gets interrupted and has to be returned to. To make this work, the
  kernel masks interrupt requests and avoids doing anything that could
  cause a fault when using those registers.)

  The above justifies the need for a kernel API to atomically restore
  saved userland state on AArch64 (and possibly other platforms that
  aren't x86). Mach already has an API to set the state of a thread,
  namely the thread_set_state() RPC; however, a thread calling
  thread_set_state() on itself is explicitly disallowed. We have
  previously relaxed this restriction to allow setting i386_DEBUG_STATE
  and i386_FSGS_BASE_STATE on the current thread, so one way to address
  the need for such an API on AArch64 would be to also allow setting
  AARCH64_THREAD_STATE on the current thread. That is what I originally
  proposed and implemented.

  Like the thread_set_self_state() trap implemented by this patch, the
  implementation of setting AARCH64_THREAD_STATE on the current thread
  needs to ensure that the set value of the x0 register does not get
  immediately overwritten with the return value of the mach_msg() trap.
  However, it's not only the return value of the mach_msg() trap that is
  important, but also the RPC reply message. The thread_set_state() RPC
  should not generate a reply message when used for returning to an
  interrupted context, since there'd be nobody expecting the message.
  This could be achieved by special-casing that in the kernel as well,
  or (simpler) by userland not passing a valid reply port in the first
  place. Note that the implementation of sigreturn in glibc already uses
  the strategy of passing an invalid reply port for the last RPC it does
  before returning to the interrupted context (which is deallocating the
  reply port used by the signal handler).

  Not passing a valid reply port and consequently not blocking on
  awaiting the reply message works, since the way Mach is implemented,
  kernel RPCs are always executed synchronously when userland sends the
  request message (unless the routine implementation includes explicit
  asynchrony, as device RPCs do, and gsync_wait() should do, but
  currently doesn't), meaning the RPC caller never has to *wait* for the
  reply message, as one is produced immediately. In other words, the
  mere act of invoking a kernel RPC (that does not involve explicit
  asynchrony) is enough to ensure it completes when mach_msg() returns,
  even if a reply message is not received (whether because an invalid
  reply port has been specified, or because MACH_RCV_MSG wasn't passed
  to mach_msg(), or because a message other than the kernel RPC's reply
  was received by the call).

  However, the same is not true when interposing is involved, and the
  thread's self port does not in fact point directly to the kernel, but
  to a userspace proxy of some sort. The two primary examples of this
  are Hurd's rpctrace tool, which interposes all the task's ports and
  proxies all RPCs after tracing them, and Mach's old netmsg/netname
  server, which proxies ports and messages over the network. In this
  case, the actual implementation only runs once the request reaches the
  actual kernel, and not once the request message has been sent by the
  original caller, so it *is* necessary for the caller to await the
  reply message if it wants to make sure that the requested action has
  been completed. This does not cause many issues for the deallocation
  of a reply port on the sigreturn code path in glibc, since that only
  delays when the port is deallocated, but does not otherwise change the
  program behavior. With thread_set_state(mach_thread_self()), however,
  this would be quite catastrophic, since the message-send would return
  to the caller without changing its state, and the actual change of
  state would only happen at some later point.

  This issue is avoided nicely by turning the functionality into an
  explicit Mach trap rather than an RPC. As it's not an RPC, it doesn't
  involve messaging, and doesn't need a reply port or a reply message.
  It is always a direct call to the kernel (and not to any interposer),
  and it's always guaranteed to have completed synchronously once the
  trap returns. That also means that the thread_set_self_state() call
  won't be visible to rpctrace or forwarded over the network for netmsg,
  but this is fine, since all it does is set thread state (i.e. register
  values); the thread could do the same on its own by issuing the
  relevant machine instructions, without involving any Mach abstractions
  (traps or RPCs) at all, if it weren't for the need for atomicity.

  Finally, this new trap is unfortunately somewhat of a security concern
  (as any sigreturn-like functionality is in general), since it would
  potentially allow an attacker who already has a way to invoke a
  function with 3 controlled argument values to set the values of all
  registers to any desired values (sigreturn-oriented programming).
  There is currently no mitigation for this other than the generic ones
  such as PAC and stack check guards.

  The limit of 150 used in the implementation has been chosen to be
  large enough to fit the largest thread state flavor so far, namely
  AARCH64_FLOAT_STATE, but small enough to not overflow the 4K stack. If
  a new thread state flavor is added that is too big to fit on the
  stack, the implementation should be switched to use kalloc instead of
  on-stack storage.
  Message-ID: <20240415090149.38358-9-bugaevc@gmail.com>
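  As a rough illustration, a sigreturn-style user of the trap might look
  like this; the prototype is inferred from the description above ("a
  function with 3 controlled argument values") and is an assumption, as
  are the AArch64 state names:

      #include <mach/message.h>
      #include <mach/thread_status.h>

      /* Assumed prototype: flavor, new state, state word count. */
      extern kern_return_t thread_set_self_state(int flavor,
                                                 thread_state_t state,
                                                 mach_msg_type_number_t count);

      void sigreturn_sketch(thread_state_t saved,
                            mach_msg_type_number_t count)
      {
          /* On success this never returns: the kernel atomically
             restores all registers, PC and PSTATE, and x0 is *not*
             clobbered with KERN_SUCCESS. */
          thread_set_self_state(AARCH64_THREAD_STATE, saved, count);

          __builtin_trap();  /* only reached if the trap failed */
      }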
* aarch64: Add thread state types | Sergey Bugaev | 2024-04-16 | 2 | -0/+44
  Notes:
  * TPIDR_EL0, the TLS pointer, is included in the generic state
    directly.
  * TPIDR2_EL0, part of the SME extension, is not included in the
    generic state. If we add SME support, it will be a part of something
    like aarch64_sme_state.
  * CPSR is not a real register in AArch64 (unlike in AArch32), but a
    collection of individually accessible bits and pieces from PSTATE.
    Due to how the kernel accesses user mode's PSTATE (via SPSR), it's
    convenient to represent PSTATE as a pseudo-register in the same
    format as SPSR. This is also what QEMU and XNU do.
  * There is no hardware-enforced 'natural' order to place the registers
    in, since no registers get pushed onto the stack on exception entry.
    Saving and restoring registers from an instance of struct
    aarch64_thread_state is implemented entirely in software, and the
    format is essentially arbitrary.
  * aarch64_float_state includes registers of a 128-bit type; this may
    create issues for compilers other than GCC.
  * fp_reserved is not a register, but a placeholder. If and when Arm
    adds another floating-point meta-register, this will be changed to
    represent it, and that would not be considered a compatibility
    break, so don't access fp_reserved by name, or rely on its value,
    from userland. Instead, memset the whole structure to 0 if starting
    from scratch, or memcpy an existing structure.

  More thread state types could be added in the future, such as
  aarch64_debug_state, aarch64_virt_state (for hardware-accelerated
  virtualization), potentially ones for PAC, SVE/SME, etc.
  Message-ID: <20240415090149.38358-8-bugaevc@gmail.com>
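  Based purely on the notes above, the generic state plausibly looks
  something like the following; every field name here is an assumption,
  not taken from the actual header:

      #include <stdint.h>

      struct aarch64_thread_state_sketch {
          uint64_t x[31];  /* x0-x30; the layout is purely a software
                              convention, nothing hardware-enforced */
          uint64_t sp;     /* stack pointer */
          uint64_t pc;     /* program counter */
          uint64_t tpidr;  /* TPIDR_EL0, the TLS pointer */
          uint64_t cpsr;   /* PSTATE bits as a pseudo-register, in SPSR
                              format, as QEMU and XNU do */
      };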
* aarch64: Add exception type definitions | Sergey Bugaev | 2024-04-16 | 2 | -0/+91
  A few yet-unimplemented codes are also sketched out; these are
  included so you know roughly what to expect once the missing
  functionality gets implemented, but are not in any way stable or
  usable.
  Message-ID: <20240415090149.38358-7-bugaevc@gmail.com>
* aarch64: Add mach_aarch64 API | Sergey Bugaev | 2024-04-16 | 4 | -0/+196
  This currently contains a single RPC to get Linux-compatible hwcaps,
  as well as the values of the MIDR_EL1 and REVIDR_EL1 system registers.
  In the future, this is expected to host the APIs to manage PAC keys,
  and possibly some sort of AArch64-specific APIs for userland IRQ
  handlers.
  Message-ID: <20240415090149.38358-6-bugaevc@gmail.com>
* aarch64: Add vm_param.h | Sergey Bugaev | 2024-04-16 | 3 | -3/+38
  And make it so that the generic vm_param.h doesn't require the
  machine-specific one to define PAGE_SIZE etc.

  We *don't* want a PAGE_SIZE constant to be statically exported to
  userland; instead, userland should initialize vm_page_size by querying
  vm_statistics(), and then use vm_page_size. We'd also like to
  eventually avoid exporting VM_MAX_ADDRESS, but this is not feasible at
  the moment. To make it feasible in the future, userland should try to
  avoid relying on the definition where possible.
  Message-ID: <20240415090149.38358-5-bugaevc@gmail.com>
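  A minimal sketch of the intended userland pattern, assuming the
  vm_statistics() RPC and its pagesize field as found in gnumach (this
  mirrors what glibc does at startup):

      #include <mach.h>

      vm_size_t page_size;  /* discovered at runtime, not compiled in */

      void init_page_size(void)
      {
          vm_statistics_data_t stats;

          /* Query the kernel rather than hardcoding PAGE_SIZE. */
          if (vm_statistics(mach_task_self(), &stats) == KERN_SUCCESS)
              page_size = stats.pagesize;
      }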
* aarch64: Add public syscall ABI | Sergey Bugaev | 2024-04-16 | 3 | -0/+73
  We use largely the same ABI as Linux: a syscall is invoked with the
  "svc #0" instruction, passing arguments the same way as for a regular
  function call. Specifically, up to 8 arguments are passed in the x0-x7
  registers, and the rest are placed on the stack (this is only
  necessary for the vm_map() syscall). w8 should contain the (negative)
  Mach trap number. A syscall preserves all registers except for x0,
  which upon returning contains the return value.
  Message-ID: <20240415090149.38358-4-bugaevc@gmail.com>
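  The described ABI can be sketched as a GCC inline-assembly wrapper;
  the register usage is as stated above, but this helper itself is
  illustrative, not part of the commit:

      /* Invoke a Mach trap with one argument on AArch64. */
      static inline long mach_trap1(long trap_number, long arg0)
      {
          register long x0 __asm__("x0") = arg0;
          register long x8 __asm__("x8") = trap_number;  /* negative; read as w8 */

          __asm__ volatile("svc #0"
                           : "+r"(x0)   /* x0 comes back as the return value */
                           : "r"(x8)
                           : "memory"); /* all other registers are preserved */
          return x0;
      }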
* aarch64: Add the basics | Sergey Bugaev | 2024-04-16 | 9 | -0/+391
  This adds "aarch64" host support to the build system, along with some
  uninteresting installed headers. The empty aarch64/aarch64/ast.h
  header is also added to create the aarch64/aarch64/ directory (due to
  a Git peculiarity). With this, it should be possible to run 'configure
  --host=aarch64-gnu' and 'make install-data' successfully.
  Message-ID: <20240415090149.38358-3-bugaevc@gmail.com>
* Add CPU_TYPE_ARM64 | Sergey Bugaev | 2024-04-16 | 1 | -0/+1
  This is distinct from CPU_TYPE_ARM, since we're going to exclusively
  use AArch64 / A64, which CPU_TYPE_ARM was never meant to support, and
  to match EM_AARCH64, which is also separate from EM_ARM.
  CPU_TYPE_X86_64 was similarly made distinct from CPU_TYPE_I386.

  This is named CPU_TYPE_ARM64 rather than CPU_TYPE_AARCH64, since
  AArch64 is an "execution state" (analogous to long mode on x86_64)
  rather than a CPU type. "ARM64" here is not a name of the
  architecture, but simply means an ARM CPU that is capable of (and in
  our case, will only really be) running in the 64-bit mode (AArch64).

  There are no subtypes defined, and none are expected to be defined in
  the future. Support for individual features/extensions should be
  discovered by other means, i.e. the aarch64_get_hwcaps() RPC.
  Message-ID: <20240415090149.38358-2-bugaevc@gmail.com>
* tests: give more time | Samuel Thibault | 2024-04-06 | 1 | -1/+1
  If the host is loaded it may take some time to boot.
* tests: Reboot the VM after the test | Samuel Thibault | 2024-04-06 | 1 | -1/+1
  So it does not have to time out.
* tests: Disable parallelism | Samuel Thibault | 2024-04-06 | 1 | -0/+2
  The makefile pieces are not ready for this.
* Xen: Fix missing include | Samuel Thibault | 2024-04-06 | 1 | -0/+1
  For thread_wakeup.
* tests: Fix running on 32bit host | Samuel Thibault | 2024-04-06 | 1 | -1/+1
  qemu-system-i386 says at most 2047 MB of RAM can be simulated.
* tests: Fix include path | Samuel Thibault | 2024-04-06 | 1 | -1/+1
* tests: Add missing test files shipping | Samuel Thibault | 2024-04-06 | 2 | -4/+22
* SMP: force APIC | Samuel Thibault | 2024-04-05 | 1 | -0/+6
  We need it to properly drive interrupts etc. of the APs.
* linux: Do not enable in SMP, it is not MP-safe | Samuel Thibault | 2024-04-05 | 1 | -0/+5
* vm: Mark entries as in-transition while wiring down | Sergey Bugaev | 2024-04-05 | 1 | -1/+26
  When operating on the kernel map, vm_map_pageable_scan() does what the
  code itself describes as "HACK HACK HACK HACK": it unlocks the map,
  and calls vm_fault_wire() with the map unlocked. This hack is required
  to avoid a deadlock in case vm_fault or one of its callees (perhaps, a
  pager) needs to allocate memory in the kernel map. The hack relies on
  other kernel code being "well-behaved", in particular on that nothing
  will do any serious changes to this region of memory while the map is
  unlocked, since this region of memory is "owned" by the caller.

  Even if the kernel code is "well-behaved" and doesn't alter VM regions
  that it doesn't "own", it can still access adjacent regions. While
  this doesn't affect the region being wired down as such, it can still
  end up causing trouble due to extension & coalescence (merging) of VM
  entries.

  VM entry coalescence is an optimization where two adjacent VM entries
  with identical properties are merged into a single one that spans the
  combined region of the two original entries. VM entry extension is a
  similar optimization where an existing VM entry is extended to cover
  an adjacent region, instead of a new VM entry being created to
  describe the region.

  These optimizations are a private implementation detail of vm_map, and
  (while they can be observed through e.g. vm_region) they are not
  supposed to cause any visible effects to how the described regions of
  memory behave; coalescence/extension and clipping happen automatically
  as needed when adding or removing mappings, or changing their
  properties. This is why it's fine for "well-behaved" kernel code to
  unknowingly cause extension or coalescence of VM entries describing a
  region by operating on adjacent VM regions.

  The "HACK HACK HACK HACK" code path relies on the VM entries in the
  region staying intact while it keeps the map unlocked, as it passes
  direct pointers to the entries into vm_fault_wire(), and also walks
  the list of entries in the region by following the vme_next pointers
  in the entries. Yet, this assumption is violated by the entries
  getting concurrently modified by other kernel code operating on
  adjacent VM regions, as described above. This is not only undefined
  behavior in the sense of the C language standard, but can also cause
  very real issues.

  Specifically, we've been seeing the VM subsystem deadlock when
  building Mach with SMP support and running a test program that calls
  mach_port_names() concurrently and repeatedly. The mach_port_names()
  implementation allocates and wires down memory, and when called from
  multiple threads, it was likely to allocate, and wire, several
  adjacent regions of memory, which would then cause entry
  coalescence/extension and clipping to kick in. The specific sequence
  of events that led to a deadlock appears to have been:

  1. Multiple threads execute mach_port_names() concurrently.
  2. One of the threads is wiring down a memory region, another is
     unwiring an adjacent memory region.
  3. The wiring thread has unlocked the ipc_kernel_map, and called into
     vm_fault_wire().
  4. Due to entry coalescence/extension, the entry the wiring thread was
     going to wire down now describes a broader region of memory, namely
     it includes an adjacent region of memory that has previously been
     wired down by the other thread that is about to unwire it.
  5. The wiring thread sets the busy bit on a wired-down page that the
     unwiring thread is about to unwire, and is waiting to take the map
     lock for reading in vm_map_verify().
  6. The unwiring thread holds the map lock for writing, and is waiting
     for the page to lose its busy bit.
  7. Deadlock!

  To prevent this from happening, we have to ensure that the VM entries,
  at least as passed into vm_fault_wire() and as used for walking the
  list of such entries, stay intact while we have the map unlocked. One
  simple way to achieve that that I have proposed previously is to make
  a temporary copy of the VM entries in the region, and pass the copies
  into vm_fault_wire(). The entry copies would not be affected by
  coalescence/extension, even if the original entries in the map are.
  This is however only straightforward to do when there's just a single
  entry describing the whole region, and there are further concerns with
  e.g. whether the underlying memory objects could, too, get coalesced.
  Arguably, making copies of the memory entries is making the hack even
  bigger.

  This patch instead implements a relatively clean solution that,
  arguably, makes the whole thing less of a hack: namely, making use of
  the in-transition bit on VM entries to prevent coalescence and any
  other unwanted effects. The entry in-transition bit was introduced for
  a very similar use case: the VM map copyout logic has to temporarily
  unlock the map to run its continuation, so it marks the VM entries it
  copied out into the map up to that point as being "in transition",
  asking other code to hold off making any serious changes to those
  entries. There's a companion "needs wakeup" bit that other code can
  set to block on the VM entry exiting this in-transition state; the
  code that puts an entry into the in-transition state is expected to,
  when unsetting the in-transition bit back, check for needs_wakeup
  being set, and wake any waiters up in that case, so they can retry
  whatever operation they wanted to do. There is no need to check for
  needs_wakeup in case of vm_map_pageable_scan(), however, exactly
  because we expect kernel code to be "well-behaved" and not make any
  attempts to modify the VM region.

  This relies on the in-transition bit inhibiting coalescence/extension,
  as implemented in the previous commit. Also, fix a tiny sad misaligned
  comment line.

  Reported-by: Damien Zammit <damien@zamaudio.com>
  Helped-by: Damien Zammit <damien@zamaudio.com>
  Message-ID: <20240405151850.41633-3-bugaevc@gmail.com>
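  The resulting flow can be sketched roughly as follows; this is a
  paraphrase of the description above using the field names it implies,
  not the actual vm/vm_map.c code:

      for (entry = start; entry != end; entry = entry->vme_next)
          entry->in_transition = TRUE;  /* inhibits coalescence/extension */

      vm_map_unlock(map);               /* the "HACK HACK HACK HACK" part */

      for (entry = start; entry != end; entry = entry->vme_next)
          vm_fault_wire(map, entry);    /* safe: the entries stay intact */

      vm_map_lock(map);

      for (entry = start; entry != end; entry = entry->vme_next)
          entry->in_transition = FALSE; /* no needs_wakeup check needed:
                                           well-behaved kernel code never
                                           waits on these entries */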
* vm: Don't attempt to extend in-transition entries | Sergey Bugaev | 2024-04-05 | 1 | -0/+4
  The in-transition mechanism exists to make it possible to unlock a map
  while still making sure some VM entries won't disappear from under
  you. This is currently used by the VM copyin mechanics.

  Entries in this state are better left alone, and extending/coalescing
  is only an optimization, so it makes sense to skip it if the entry to
  be extended is in transition. vm_map_coalesce_entry() already checks
  for this; check for it in other similar places too.

  This is in preparation for using the in-transition mechanism for
  wiring, where it's much more important that the entries are not
  extended while in transition.
  Message-ID: <20240405151850.41633-2-bugaevc@gmail.com>
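  The shape of the added checks, as a hypothetical helper (the real
  checks are inlined at the relevant call sites):

      static boolean_t entry_extendable(vm_map_entry_t entry)
      {
          /* Extension/coalescence is only an optimization; entries in
             transition are better left alone. */
          return !entry->in_transition;
      }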
* vm: Fix use-after-free in vm_map_pageable_scan() | Sergey Bugaev | 2024-04-05 | 1 | -10/+16
  When operating on the kernel map, vm_map_pageable_scan() does what the
  code itself describes as "HACK HACK HACK HACK": it unlocks the map,
  and calls vm_fault_wire() with the map unlocked. This hack is required
  to avoid a deadlock in case vm_fault or one of its callees (perhaps, a
  pager) needs to allocate memory in the kernel map. The hack relies on
  other kernel code being "well-behaved", in particular on that nothing
  will do any serious changes to this region of memory while the map is
  unlocked, since this region of memory is "owned" by the caller.

  This reasoning doesn't apply to the validity of the 'end' entry (the
  first entry after the region to be wired), since it's not a part of
  the region, and is "owned" by someone else. Once the map is unlocked,
  the 'end' entry could get deallocated. Alternatively, a different
  entry could get inserted after the VM region in front of 'end', which
  would break the 'for (entry = start; entry != end; entry =
  entry->vme_next)' loop condition.

  This was not an issue in the original Mach 3 kernel, since it used an
  address range check for the loop condition, but got broken in commit
  023401c5b97023670a44059a60eb2a3a11c8a929 "VM: rework map entry
  wiring". Fix this by switching the iteration back to use an address
  check.

  This partly fixes a deadlock with concurrent mach_port_names() calls
  on SMP, which was
  Reported-by: Damien Zammit <damien@zamaudio.com>
  Message-ID: <20240405151850.41633-1-bugaevc@gmail.com>
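  A hedged before/after sketch of the loop condition change, with
  wire_one() standing in for the per-entry work:

      /* Before: 'end' may be freed, or bypassed by a newly inserted
         entry, while the map is unlocked. */
      for (entry = start; entry != end; entry = entry->vme_next)
          wire_one(entry);

      /* After: an address range check, as in the original Mach 3
         kernel, stays correct no matter what happens to 'end'. */
      for (entry = start; entry->vme_start < end_address;
           entry = entry->vme_next)
          wire_one(entry);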
* elf-load: Respect PT_GNU_STACK | Sergey Bugaev | 2024-03-29 | 4 | -4/+14
  If a bootstrap ELF contains a PT_GNU_STACK phdr, take stack protection
  from there. Otherwise, default to VM_PROT_ALL.
* kd: Include i386/irq.h | Samuel Thibault | 2024-03-27 | 1 | -0/+1
  To get the unmask_irq declaration.
* tests: Create tests/ in the build tree before trying to use it | Sergey Bugaev | 2024-03-27 | 1 | -0/+1
  Message-ID: <20240327161841.95685-18-bugaevc@gmail.com>
* tests: Don't ask for executable stack | Sergey Bugaev | 2024-03-27 | 2 | -0/+4
  Message-ID: <20240327161841.95685-17-bugaevc@gmail.com>
* tests: Make exception subcode a long | Sergey Bugaev | 2024-03-27 | 1 | -1/+1
  Message-ID: <20240327161841.95685-16-bugaevc@gmail.com>
* tests: Use vm_page_size | Sergey Bugaev | 2024-03-27 | 1 | -8/+11
  Message-ID: <20240327161841.95685-15-bugaevc@gmail.com>
* tests: Add vm_page_size | Sergey Bugaev | 2024-03-27 | 2 | -0/+15
  Message-ID: <20240327161841.95685-14-bugaevc@gmail.com>
* tests: Add a more serious mach_msg_server() routine | Sergey Bugaev | 2024-03-27 | 3 | -37/+142
  Message-ID: <20240327161841.95685-13-bugaevc@gmail.com>
* tests: Fix halt() | Sergey Bugaev | 2024-03-27 | 2 | -2/+3
  Mark it as noreturn, and make sure to halt, not reboot.
  Message-ID: <20240327161841.95685-12-bugaevc@gmail.com>
* Make -fno-PIE etc. architecture-dependent | Sergey Bugaev | 2024-03-27 | 3 | -4/+8
  There might be good reasons why Mach on x86 shouldn't be built as
  PIC/PIE, but there are also very good reasons to support PIE on other
  architectures. Potentially implementing KASLR is one such reason; but
  also the Linux AArch64 boot protocol (that the AArch64 port will use
  for booting) lets the bootloader load the kernel image at any address,
  which makes PIC pretty much required.
  Message-ID: <20240327161841.95685-11-bugaevc@gmail.com>
* ipc: Turn ipc_entry_lookup_failed() into a macro | Sergey Bugaev | 2024-03-27 | 1 | -9/+13
  ipc_entry_lookup_failed() is used with both mach_msg_user_header_t and
  mach_msg_header_t arguments, which are different types. Make it into a
  macro, so it works with both.
  Message-ID: <20240327161841.95685-9-bugaevc@gmail.com>
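  Illustrative only (the real macro differs in what it does): a macro
  body is untyped, so one definition serves both header types, which
  share the fields it touches:

      #define ipc_entry_lookup_failed(msg, name)                        \
      MACRO_BEGIN                                                       \
          printf("ipc_entry_lookup failed for %lu (msgh_id %d)\n",      \
                 (unsigned long) (name), (int) (msg)->msgh_id);         \
      MACRO_END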
* kern/rdxtree: Fix undefined behavior | Sergey Bugaev | 2024-03-27 | 1 | -2/+2
  Initializing a variable with itself is undefined, and GCC 14
  rightfully produces a warning about the variable being used (to
  initialize itself) prior to initialization. X15 sets the variables to
  0 instead, so do the same in Mach.
  Message-ID: <20240327161841.95685-8-bugaevc@gmail.com>
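  The pattern at issue, sketched with hypothetical names:

      void before(void)
      {
          void *slot = slot;  /* UB: 'slot' is read before it has a value */
          (void) slot;
      }

      void after(void)
      {
          void *slot = 0;     /* defined: what X15 does, and now Mach too */
          (void) slot;
      }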
* gsync: Use copyin()/copyout() to access user memory | Sergey Bugaev | 2024-03-27 | 1 | -7/+31
  Depending on the architecture and setup, it may not be possible to
  access user memory directly, for example, due to user mode mappings
  not being accessible from kernel mode (x86 SMAP, AArch64 PAN). There
  are dedicated machine-specific copyin()/copyout() routines that know
  how to access user memory from the kernel; use them.
  Message-ID: <20240327161841.95685-6-bugaevc@gmail.com>
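  A minimal sketch of the pattern the commit switches to, with
  read_user_word() as a hypothetical helper (the real changes are inside
  kern/gsync.c):

      static kern_return_t read_user_word(vm_offset_t addr,
                                          unsigned int *out)
      {
          /* copyin() knows how to reach user memory from kernel mode,
             even with SMAP/PAN enabled; a plain dereference may not. */
          if (copyin((const void *) addr, out, sizeof(*out)))
              return KERN_INVALID_ADDRESS;
          return KERN_SUCCESS;
      }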