path: root/kern
Commit message (Author, Date, Files changed, Lines -deleted/+added)
* kern: Add a mach host operation which returns elapsed time since bootup  (Zhaoming Luo, 2024-12-29, 2 files, -0/+40)
    Add the host_get_uptime64() mach interface operation. It can be used to get the time elapsed since boot.
    * doc/mach.texi: Add the documentation for the operation
    * include/mach/mach_host.defs: Add the interface
    * include/mach/time_value.h: Extend the mappable time variable
    * kern/mach_clock.c: Operation implementation
    * kern/mach_clock.h: Add a new variable for storing uptime
    Signed-off-by: Zhaoming Luo <zhmingluo@163.com>
    Message-ID: <20241224015751.1282-1-zhmingluo@163.com>
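A minimal userland sketch of calling the new operation; it assumes the MIG user stub is exposed via <mach/mach_host.h> and that time_value64_t has the seconds/nanoseconds layout from include/mach/time_value.h:

    #include <stdio.h>
    #include <mach.h>
    #include <mach/mach_host.h>

    int
    main (void)
    {
      time_value64_t uptime;

      /* Elapsed time since boot, as a 64-bit seconds/nanoseconds pair.  */
      if (host_get_uptime64 (mach_host_self (), &uptime) != KERN_SUCCESS)
        return 1;

      printf ("up %lld.%09lld s\n",
              (long long) uptime.seconds, (long long) uptime.nanoseconds);
      return 0;
    }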
* kern: Comment fixed  (Zhaoming Luo, 2024-12-24, 2 files, -2/+2)
    Read also: https://mail.gnu.org/archive/html/bug-hurd/2024-12/msg00219.html
    Signed-off-by: Zhaoming Luo <zhmingluo@163.com>
    Message-ID: <20241224024417.1403-1-zhmingluo@163.com>
* smp: Parallel SMP init  (Damien Zammit via Bug reports for the GNU Hurd, 2024-12-22, 1 file, -4/+0)
    Now that things are in place, we switch to parallel init. The key to this change is that the INIT/STARTUP sequence is done in one step, and all cpus wake up at the same time. Synchronisation is done by waiting for individual flags stored in separate memory locations (see the sketch below).
    Message-ID: <20241222014306.430098-2-damien@zamaudio.com>
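A rough illustration of that synchronisation scheme; the names here are illustrative, not the actual symbols in the tree:

    /* Illustrative only: each AP reports readiness through its own flag,
       so no cpu ever spins on a word that another cpu also writes.  */
    volatile int ap_ready[NCPUS];

    void ap_start(int cpu)              /* runs on each application processor */
    {
        /* ... per-cpu init: descriptor tables, percpu area, ... */
        ap_ready[cpu] = 1;              /* separate memory location per cpu */
    }

    void bsp_wait_for_aps(int ncpus)    /* runs on the bootstrap processor */
    {
        for (int cpu = 1; cpu < ncpus; cpu++)
            while (!ap_ready[cpu])
                ;                       /* spin until that AP checks in */
    }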
* task_info: Fix resident_size overflow  (Samuel Thibault, 2024-12-15, 1 file, -1/+1)
* fix a compiler warning.  (jbranso@dismail.de, 2024-10-21, 1 file, -1/+1)
    * kern/slab.c (kalloc_init): %lu -> %zu

    kern/slab.c: In function 'kalloc_init':
    kern/slab.c:1349:33: warning: format '%lu' expects argument of type 'long unsigned int', but argument 3 has type 'size_t' {aka 'unsigned int'} [-Wformat=]
     1349 |     sprintf(name, "kalloc_%lu", size);
          |                           ~~^   ~~~~
          |                             |   |
          |                             |   size_t {aka unsigned int}
          |                             long unsigned int
          |                           %u
    Message-ID: <20241020190744.2522-2-jbranso@dismail.de>
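The fix itself is just the format specifier change described above:

    /* size has type size_t, so use %zu: */
    sprintf(name, "kalloc_%zu", size);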
* Add thread_get_name RPC to get the name of a thread.  (Flavio Cruz, 2024-07-14, 1 file, -0/+21)
    Message-ID: <6qm4fdtthi5nrmmleum7z2xemxz77adohed454eaeuzlmvfx4d@l3pyff4tqwry>
* Add thread_set_self_state() trap  (Sergey Bugaev, 2024-04-16, 3 files, -1/+47)
    This is a new Mach trap that sets the calling thread's state to the passed value, as if with a call to the thread_set_state() RPC. If the flavor of state being set is the one that contains the register used for the syscall return value (i386_THREAD_STATE or i386_REGS_SEGS_STATE on x86, AARCH64_THREAD_STATE on AArch64), the set register value is *not* overwritten with KERN_SUCCESS when the state gets set successfully, yet errors do get reported if the syscall fails.

    Although the trap is intended to enable userland to implement sigreturn functionality in the AArch64 port (more on which below), the trap itself is architecture-independent, and fully implemented in terms of the existing kernel routines (thread_setstatus & thread_set_syscall_return).

    This trap's functionality is similar to sigreturn() on Unix or NtContinue() on NT. The use case for all of these is restoring the local state of an interrupted thread in the following set-up:

    1. A thread is running some arbitrary code.
    2. An event happens that deserves the thread's immediate attention, analogous to a hardware interrupt request. This might be caused by the thread itself (e.g. running into a Mach exception that was arranged to be handled by the same thread), or by external events (e.g. receiving a Unix SIGCHLD).
    3. Another thread (or perhaps the kernel, although this is not the case on Mach) suspends the thread, saves its state at the point of interruption, alters its state to execute some sort of handler for the event, and resumes the thread again, now running the handler.
    4. Once the thread is done running the handler, it wants to return to what it was doing at the time it was interrupted. To do this, it needs to restore the state as saved at the moment of interruption.

    Unlike with setjmp()/longjmp(), we cannot rely on the interrupted logic collaborating in any way, as it's not aware that it's being interrupted. This means that we have to fully restore the state, including the values of all the general-purpose registers, as well as the stack pointer, program counter, and any state flags.

    Depending on the instruction set, this may or may not be possible to do fully in userland, simply by loading all the registers with their saved values. It should be more or less easy to load the saved values into general-purpose registers, but state flags and the program counter can be more of a challenge. Loading the program counter value (in other words, performing an indirect jump to the interrupted instruction) has to be the very last thing we do, since we don't control the program flow after that. The only real place the program counter can be loaded from is the stack, since all general-purpose registers would already contain their restored values by that point, and using global storage is incompatible with another interruption of the same kind happening at the time we were about to return. For the same reason, the saved program counter cannot really be stored outside of the "active" stack area (such as below the stack pointer), since otherwise it can get clobbered by another interruption.

    This means that to support fully-userland returns, the instruction set must provide a single instruction that loads an address from the stack, adjusts the stack pointer, and performs an indirect jump to the loaded address. The instruction must also either preserve (previously restored) state flags, or additionally load state flags from the stack in addition to the jump address.

    On x86, 'ret' is such an instruction: it pops an address from the stack, adjusting the stack pointer without modifying flags, and performs an indirect jump to the address. On x86_64, where the ABI mandates a red zone, one can use the 'ret imm16' variant to additionally adjust the stack pointer by the size of the red zone, atomically restoring the value of the stack pointer at the time of the interruption while loading the return address from outside the red zone. This is how sigreturn is implemented in glibc for the Hurd on x86.

    On ARM AArch32, 'pop {pc}' (alternatively written 'ldr pc, [sp], #4') is such an instruction: since SP and PC are just general-purpose, directly accessible registers (r13 and r15), it is possible to perform a load from the address pointed to by SP into PC, with a post-increment of SP. It is, in fact, possible to restore all the other general-purpose registers too in a single instruction this way: 'pop {r0-r12, r14, r15}' will do that; here r13, the stack pointer, gets incremented after all the other registers get loaded from the stack. This also preserves the CPSR flags, which would need to be restored just prior to the 'pop'.

    On ARM AArch64 however, PC is no longer a directly accessible general-purpose register (and SP is only accessible that way by some of the instructions); so it is no longer possible to load PC from memory in a single instruction. The only way to perform an indirect jump is by using one of the dedicated branching instructions ('br', 'blr', or 'ret'). All of them accept the address to branch to in a general-purpose register, which is incompatible with our use case. Moreover, with the BTI extension, there is a BTYPE field in PSTATE that tracks which type (if any) of an indirect branch was the last executed instruction; this is then used to raise an exception if the instruction the indirect branch lands on was not intended to be a target of an indirect branch (of a matching type). It is important to restore the BTYPE (among the other state) when returning to an interrupted context; failing to do that will either cause an unexpected BTI failure exception (if the last executed instruction before the interruption was not an indirect branch, but the last instruction of the restoration logic is), or open up a window for exploitation (if the last executed instruction before the interruption was an indirect branch, but the last instruction of the restoration logic is not -- note that 'ret' is not considered an indirect branch for the purposes of BTI).

    So, it is not possible to fully restore the state of an interrupted context in userland on AArch64. The kernel can do that however (and is in fact doing just that every time it handles a fault or an IRQ): the 'eret' instruction for returning from an exception is accessible to EL1 (the kernel), but not EL0 (the user). 'eret' atomically restores PC from the ELR_EL1 system register, and PSTATE from the SPSR_EL1 system register (and does other things); both of these system registers are inaccessible from userland, and so couldn't have been used by the interrupted context for any purpose, meaning their values don't need to be restored. (They can be used by the kernel code, which presents an additional complication when it's the kernel context that gets interrupted and has to be returned to. To make this work, the kernel masks interrupt requests and avoids doing anything that could cause a fault when using those registers.)

    The above justifies the need for a kernel API to atomically restore saved userland state on AArch64 (and possibly other platforms that aren't x86). Mach already has an API to set the state of a thread, namely the thread_set_state() RPC; however, a thread calling thread_set_state() on itself is explicitly disallowed. We have previously relaxed this restriction to allow setting i386_DEBUG_STATE and i386_FSGS_BASE_STATE on the current thread, so one way to address the need for such an API on AArch64 would be to also allow setting AARCH64_THREAD_STATE on the current thread. That is what I had originally proposed and implemented.

    Like the thread_set_self_state() trap implemented by this patch, the implementation of setting AARCH64_THREAD_STATE on the current thread needs to ensure that the set value of the x0 register does not get immediately overwritten with the return value of the mach_msg() trap. However, it's not only the return value of the mach_msg() trap that is important, but also the RPC reply message. The thread_set_state() RPC should not generate a reply message when used for returning to an interrupted context, since there'd be nobody expecting the message. This could be achieved by special-casing that in the kernel as well, or (simpler) by userland not passing a valid reply port in the first place. Note that the implementation of sigreturn in glibc already uses the strategy of passing an invalid reply port for the last RPC it does before returning to the interrupted context (which is deallocating the reply port used by the signal handler).

    Not passing a valid reply port and consequently not blocking on awaiting the reply message works, since the way Mach is implemented, kernel RPCs are always executed synchronously when userland sends the request message (unless the routine implementation includes explicit asynchrony, as device RPCs do, and gsync_wait() should do, but currently doesn't), meaning the RPC caller never has to *wait* for the reply message, as one is produced immediately. In other words, the mere act of invoking a kernel RPC (that does not involve explicit asynchrony) is enough to ensure it completes when mach_msg() returns, even if a reply message is not received (whether because an invalid reply port has been specified, or because MACH_RCV_MSG wasn't passed to mach_msg(), or because a message other than the kernel RPC's reply was received by the call).

    However, the same is not true when interposing is involved, and the thread's self port does not in fact point directly to the kernel, but to a userspace proxy of some sort. The two primary examples of this are Hurd's rpctrace tool, which interposes all the task's ports and proxies all RPCs after tracing them, and Mach's old netmsg/netname server, which proxies ports and messages over the network. In this case, the actual implementation only runs once the request reaches the actual kernel, and not once the request message has been sent by the original caller, so it *is* necessary for the caller to await the reply message if it wants to make sure that the requested action has been completed. This does not cause many issues for the deallocation of a reply port on the sigreturn code path in glibc, since that only delays when the port is deallocated, but does not otherwise change the program behavior. With thread_set_state(mach_thread_self()), however, this would be quite catastrophic, since the message-send would return back to the caller without changing its state, and the actual change of state would only happen at some later point.

    This issue is avoided nicely by turning the functionality into an explicit Mach trap rather than an RPC. As it's not an RPC, it doesn't involve messaging, and doesn't need a reply port or a reply message. It is always a direct call to the kernel (and not to any interposer), and it's always guaranteed to have completed synchronously once the trap returns. That also means that the thread_set_self_state() call won't be visible to rpctrace or forwarded over the network for netmsg, but this is fine, since all it does is set thread state (i.e. register values); the thread could do the same on its own by issuing the relevant machine instructions, without involving any Mach abstractions (traps or RPCs) at all, if it weren't for the need of atomicity.

    Finally, this new trap is unfortunately somewhat of a security concern (as any sigreturn-like functionality is in general), since it would potentially allow an attacker who already has a way to invoke a function with 3 controlled argument values to set the values of all registers to any desired values (sigreturn-oriented programming). There is currently no mitigation for this other than the generic ones such as PAC and stack check guards.

    The limit of 150 used in the implementation has been chosen to be large enough to fit the largest thread state flavor so far, namely AARCH64_FLOAT_STATE, but small enough to not overflow the 4K stack. If a new thread state flavor is added that is too big to fit on the stack, the implementation should be switched to use kalloc instead of on-stack storage.
    Message-ID: <20240415090149.38358-9-bugaevc@gmail.com>
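A hedged sketch of the userland side on AArch64. The trap prototype is an assumption based on the three-argument description above, and the AArch64 state struct/constant names are likewise assumed rather than taken from the actual headers:

    #include <mach.h>
    #include <mach/thread_status.h>
    #include <stdlib.h>

    /* Assumed prototype: flavor/state/count mirror thread_set_state().  */
    extern kern_return_t thread_set_self_state (int flavor,
                                                thread_state_t new_state,
                                                mach_msg_type_number_t count);

    static void
    restore_interrupted_context (struct aarch64_thread_state *saved)
    {
      /* On success this never returns here: pc, sp, the general-purpose
         registers and PSTATE (including BTYPE) are atomically replaced
         with the saved values.  */
      thread_set_self_state (AARCH64_THREAD_STATE,
                             (thread_state_t) saved,
                             AARCH64_THREAD_STATE_COUNT);
      abort ();   /* only reached if the trap itself failed */
    }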
* elf-load: Respect PT_GNU_STACK  (Sergey Bugaev, 2024-03-29, 2 files, -4/+11)
    If a bootstrap ELF contains a PT_GNU_STACK phdr, take stack protection from there. Otherwise, default to VM_PROT_ALL.
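A sketch of the p_flags to vm_prot_t mapping this implies; the helper name is hypothetical and the actual loader code may differ:

    /* Translate ELF PF_R/PF_W/PF_X into Mach VM protections for the
       bootstrap stack when a PT_GNU_STACK header is present.  */
    static vm_prot_t
    stack_prot_from_phdr (unsigned int p_flags)
    {
      vm_prot_t prot = VM_PROT_NONE;

      if (p_flags & PF_R)
        prot |= VM_PROT_READ;
      if (p_flags & PF_W)
        prot |= VM_PROT_WRITE;
      if (p_flags & PF_X)
        prot |= VM_PROT_EXECUTE;
      return prot;
    }
    /* Without PT_GNU_STACK, the loader keeps the old VM_PROT_ALL default.  */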
* kern/rdxtree: Fix undefined behavior  (Sergey Bugaev, 2024-03-27, 1 file, -2/+2)
    Initializing a variable with itself is undefined, and GCC 14 rightfully produces a warning about the variable being used (to initialize itself) prior to initialization. X15 sets the variables to 0 instead, so do the same in Mach.
    Message-ID: <20240327161841.95685-8-bugaevc@gmail.com>
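The class of fix, illustratively (the variable name here is a placeholder, not the one in kern/rdxtree.c):

    /* Before: the initializer reads an uninitialized variable, which is
       undefined behavior and is what GCC 14 warns about.  */
    unsigned long long bitmap = bitmap;

    /* After: follow X15 and initialize to 0.  */
    unsigned long long bitmap = 0;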
* gsync: Use copyin()/copyout() to access user memory  (Sergey Bugaev, 2024-03-27, 1 file, -7/+31)
    Depending on the architecture and setup, it may not be possible to access user memory directly, for example, due to user mode mappings not being accessible from kernel mode (x86 SMAP, AArch64 PAN). There are dedicated machine-specific copyin()/copyout() routines that know how to access user memory from the kernel; use them.
    Message-ID: <20240327161841.95685-6-bugaevc@gmail.com>
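The shape of the change, sketched (not the literal diff; the zero-on-success return convention assumed for copyin() here follows the traditional Mach one):

    unsigned int val;

    /* Before: direct dereference of a user pointer; faults under SMAP/PAN.  */
    val = *(unsigned int *) user_addr;

    /* After: go through the machine-specific user accessor.  */
    if (copyin ((const void *) user_addr, &val, sizeof val))
      return KERN_INVALID_ADDRESS;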
* Load 64-bit ELFs on all 64-bit ports  (Sergey Bugaev, 2024-03-27, 1 file, -1/+1)
    Not only on x86_64.
    Message-ID: <20240327161841.95685-5-bugaevc@gmail.com>
* Disable host_kernel_version() everywhere but on i386  (Sergey Bugaev, 2024-03-27, 1 file, -2/+2)
    It's not only x86_64: none of the new architectures are going to have it.
    Message-ID: <20240327161841.95685-3-bugaevc@gmail.com>
* move x86 copy_user.[ch] to ipc/ and make it arch-independent  (LD, 2024-03-09, 1 file, -1/+1)
    Message-ID: <20240309140244.347835-3-luca@orpolo.org>
* remove machine/machspl.h as it duplicates machine/spl.h  (LD, 2024-03-09, 14 files, -14/+14)
    Message-ID: <20240309140244.347835-2-luca@orpolo.org>
* Check for null ports in task_set_essential, task_set_name and thread_set_name.  (Flavio Cruz, 2024-02-28, 2 files, -0/+9)
    Otherwise, it is easy to crash the kernel if userland passes arbitrary port names.
    Message-ID: <ZdriTgNhPsfu7c2M@jupiter.tail36e24.ts.net>
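The added checks boil down to rejecting null conversions before touching the structures (sketch; the exact signatures are those of the respective RPC implementations):

    /* e.g. in thread_set_name(): an arbitrary port name from userland
       may convert to THREAD_NULL, so bail out before dereferencing.  */
    if (thread == THREAD_NULL)
      return KERN_INVALID_ARGUMENT;

    /* task_set_name()/task_set_essential() get the matching TASK_NULL check.  */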
* kern: move pset_idle_lock/unlock to header  (Samuel Thibault, 2024-02-23, 3 files, -14/+16)
    so that kern/machine.c can use it.
* kern: Use _nocheck variants of locks taken at splsched()  (Damien Zammit, 2024-02-23, 2 files, -16/+28)
    Fixes assertion errors when LDEBUG is compiled in.
    Message-ID: <20240223081404.458062-1-damien@zamaudio.com>
* kern: Use _irq variant of lock and disable interrupts  (Damien Zammit, 2024-02-23, 3 files, -4/+4)
    During quantum adjustment, disable interrupts and take the appropriate lock.
    Message-ID: <20240223080948.457792-1-damien@zamaudio.com>
* kern/processor: Do not set default_pset.empty on bootstrap  (Damien Zammit, 2024-02-23, 1 file, -2/+0)
    This is not needed because cpu_up() calls pset_add_processor() when the cpu comes online.
    Message-ID: <20240223080357.457465-1-damien@zamaudio.com>
* kern/gsync: Use vm_map_lookup with keep_map_locked  (Damien Zammit, 2024-02-22, 1 file, -13/+6)
    This prevents a deadlock in smp where a read lock on the map is taken in gsync and the map is then locked again inside vm_map_lookup(), but another thread had a pre-existing write lock, so the second read lock blocks. This is fixed by removing the initial gsync read lock on the map while keeping the read lock held upon returning from vm_map_lookup().
    Co-Authored-By: Sergey Bugaev <bugaevc@gmail.com>
    Message-ID: <20240222082410.422869-4-damien@zamaudio.com>
* vm_map_lookup: Add parameter for keeping map locked  (Damien Zammit, 2024-02-22, 1 file, -1/+1)
    This adds a parameter called keep_map_locked to vm_map_lookup() that allows the function to return with the map locked. This is to prepare for fixing a bug with gsync where the map is locked twice by mistake.
    Co-Authored-By: Sergey Bugaev <bugaevc@gmail.com>
    Message-ID: <20240222082410.422869-3-damien@zamaudio.com>
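A hedged sketch of a caller using the new flag; the parameter order is approximate, and vm/vm_map.h holds the real prototype:

    /* Ask vm_map_lookup() to return with the map still read-locked, so
       the caller can keep using the result without re-taking the lock.  */
    kr = vm_map_lookup (&map, addr, VM_PROT_READ, /* keep_map_locked */ TRUE,
                        &version, &object, &offset, &prot, &wired);
    if (kr == KERN_SUCCESS)
      {
        /* ... use the looked-up object/offset ... */
        vm_map_unlock_read (map);
      }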
* Fix compile with MACH_LOCK_MON  (Damien Zammit, 2024-02-19, 1 file, -0/+7)
* Introduce and use assert_splsched()  (Samuel Thibault, 2024-02-19, 3 files, -6/+6)
* kern: Fix parenthesis around assignment used as value  (Damien Zammit, 2024-02-19, 1 file, -1/+1)
* smp: Set processor set to non-empty when adding a processor  (Damien Zammit, 2024-02-12, 1 file, -0/+1)
    This allows the slave_pset to be used for actual tasks with the processor_set RPCs.
    Message-ID: <20240212053817.1919056-1-damien@zamaudio.com>
* Add thread_set_name RPC.  (Flavio Cruz, 2024-02-12, 2 files, -0/+28)
    Like task_set_name, we use the same size as the task name, and the thread name is inherited from the task name whenever it exists. This will be used to implement pthread_setname_np.
    Message-ID: <20240212062634.1082207-2-flaviocruz@gmail.com>
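A minimal sketch of the userland usage this enables (not the actual glibc pthread_setname_np implementation; the kernel truncates the name to the fixed task-name size):

    kern_return_t
    name_current_thread (char *name)
    {
      mach_port_t self = mach_thread_self ();
      kern_return_t kr = thread_set_name (self, name);

      mach_port_deallocate (mach_task_self (), self);
      return kr;
    }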
* Replace kernel header includes in include/mach/mach_types.h with forward declarations.  (Flavio Cruz, 2024-02-12, 3 files, -0/+7)
    I was trying to reuse TASK_NAME_SIZE in kern/thread.h but it was impossible because files included from kern/task.h end up requiring kern/thread.h (through percpu.h), creating a recursive dependency. With this change, mach_types.h only defines forward declarations and modules have to explicitly include the appropriate header file if they want to be able to touch those structures. Most of the other includes are required because we no longer grab many different includes through mach_types.h.
    Message-ID: <20240212062634.1082207-1-flaviocruz@gmail.com>
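The pattern, in short (illustrative, not the literal header contents):

    /* mach_types.h style: forward declarations only, no kernel includes.  */
    struct task;
    struct thread;
    typedef struct task *task_t;
    typedef struct thread *thread_t;

    /* A module that actually touches the fields includes the real header
       itself, e.g.:  */
    #include <kern/task.h>   /* for struct task and TASK_NAME_SIZE */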
* task: fix addressability of assign_active field  (Samuel Thibault, 2024-02-11, 1 file, -2/+2)
    It is used for thread_wakeup and the like.
* smp: Create AP processor set and put all APs inside it  (Damien Zammit, 2024-02-11, 3 files, -1/+18)
    This has the effect of running with one cpu only under smp, but the APs can be enabled from userspace with the right processor set RPCs.
    Message-ID: <20240211120051.1889789-1-damien@zamaudio.com>
* smp: Fix parenthesis around logic expression value  (Damien Zammit, 2024-02-11, 1 file, -1/+1)
    Message-ID: <20240211070915.1879676-1-damien@zamaudio.com>
* mach_msg: Fix checking reception size  (Samuel Thibault, 2023-10-01, 1 file, -1/+1)
    We need to check against the actual user size that will be used, not the current kernel size. Usually userland uses an amply-large reception buffer, but better to be exact.
* Add and use ikm_cache_alloc/free/_try  (Samuel Thibault, 2023-10-01, 2 files, -23/+13)
* slab: Make whatis look further  (Samuel Thibault, 2023-10-01, 1 file, -3/+53)
    Without a tree, we can still look up by hand in the buffers. This also allows finding freed objects.
* ddb: Add whatis command  (Samuel Thibault, 2023-10-01, 2 files, -1/+40)
    This is convenient when tracking buffer overflows.
* Allow disabling of MACH_PCSAMPLE and disable by default  (Damien Zammit, 2023-09-30, 1 file, -16/+16)
    This fixes a page fault when the sampling occurs in MP. Perhaps it is not MP safe yet.
    Message-Id: <20230930063032.75232-4-damien@zamaudio.com>
* kdb: Add "show all runqs" debug command  (Damien Zammit, 2023-09-29, 1 file, -0/+1)
    Message-Id: <20230929045936.31535-1-damien@zamaudio.com>
* percpu: active_stack with gs  (Damien Zammit, 2023-09-25, 3 files, -8/+3)
    Message-Id: <20230925002417.467022-1-damien@zamaudio.com>
* SMP: Fix setting up initial gdt  (Samuel Thibault, 2023-09-24, 1 file, -1/+1)
    We cannot access cpu_id_lut from the initial AP state, so update the percpu segment after loading the gdt.
* percpu active_thread using gs segment  (Damien Zammit, 2023-09-24, 7 files, -10/+8)
    TESTED: As per previous commit
    Message-Id: <20230924052824.449219-4-damien@zamaudio.com>
* percpu area using gs segment  (Damien Zammit, 2023-09-24, 4 files, -19/+11)
    This speeds up smp again by storing the struct processor in a percpu area and avoiding an expensive cpu_number() on every call of current_processor(), as well as getting the cpu number by an offset into the percpu area. Untested on 64 bit; work remains to use other percpu arrays.

    TESTED: (NCPUS=8) -smp 1 boots to login shell ~2x slower than uniprocessor
    TESTED: (NCPUS=8) -smp 2 boots to INIT but hangs there
    TESTED: (NCPUS=8) -smp 4 gets stuck seemingly within rumpdisk and hangs
    TESTED: (NCPUS=1) uniprocessor is a bit faster than normal
    Message-Id: <20230924103428.455966-3-damien@zamaudio.com>
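A simplified sketch of the %gs-relative access this enables; the real macros live in the i386 percpu header, and the names and details here are illustrative only:

    #include <stddef.h>

    /* Each cpu's %gs base points at its own struct percpu, so reading a
       percpu field is a single %gs-relative load, with no cpu_number()
       table lookup on the fast path.  */
    #define percpu_get(type, field)                                   \
    ({                                                                \
        type val_;                                                    \
        asm ("mov %%gs:%c1, %0"                                       \
             : "=r" (val_)                                            \
             : "i" (offsetof (struct percpu, field)));                \
        val_;                                                         \
    })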
* cpu_number: Inline widely used simple function  (Damien Zammit, 2023-09-24, 3 files, -3/+4)
    TESTED: on uniprocessor and smp, both behaved as normal.
    Message-Id: <20230924103428.455966-2-damien@zamaudio.com>
* sched_prim.c: Check all run queues not just master processor  (Damien Zammit, 2023-08-22, 1 file, -2/+8)
    Message-Id: <20230816014835.2322718-6-damien@zamaudio.com>
* eventcount: Fix locking thread while calling thread_setrun  (Samuel Thibault, 2023-08-22, 1 file, -1/+1)
* sched_prim.c: Lock thread when calling thread_setrun  (Damien Zammit, 2023-08-22, 1 file, -0/+2)
    Message-Id: <20230816014835.2322718-5-damien@zamaudio.com>
* slab: Optimize non-slab PAGE_SIZE allocations  (Samuel Thibault, 2023-08-21, 1 file, -0/+4)
    In case there is no slab for PAGE_SIZE allocations, we can use direct physical allocation rather than consuming the kernel virtual space.
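The idea, sketched; the page-allocation and phys-to-virt helpers named here are assumptions, and the actual code in kern/slab.c may use different entry points:

    /* With no cache backing PAGE_SIZE requests, hand out a physical page
       reachable through the direct map instead of carving out kernel
       virtual space for it.  (vm_page_grab()/phystokv() are assumed
       names for whatever the slab code actually calls.)  */
    if (size == PAGE_SIZE)
      {
        struct vm_page *page = vm_page_grab (VM_PAGE_SEL_DIRECTMAP);

        if (page != NULL)
          return phystokv (vm_page_to_pa (page));
      }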
* pmap+slab: Add more smoketests  (Samuel Thibault, 2023-08-14, 1 file, -0/+3)
    Checking the range of addresses for operations on the kernel_pmap is quite cheap, and allows catching oddities early enough.
* slab [SLAB_VERIFY]: Fix not enabling KMEM_CF_VERIFY on 4K slabs  (Samuel Thibault, 2023-08-14, 1 file, -1/+1)
* slab [SLAB_VERIFY]: Do not enable KMEM_CF_VERIFY on large slabs  (Samuel Thibault, 2023-08-13, 1 file, -3/+3)
    That would be refused by kmem_cache_compute_properties later on anyway, and would prevent the kernel from booting at all.
* lock: Fix SMP build  (Samuel Thibault, 2023-08-13, 1 file, -1/+1)
* kern/sched_prim: Cause ast on cpu coming out of idle  (Damien Zammit, 2023-08-13, 1 file, -0/+6)
    Message-Id: <20230811083424.2154350-3-damien@zamaudio.com>