Improvements in forking, threading, and signal code

May 07, 2019 posted by Kamil Rytarowski

Improvements in forking, threading, and signal code

I am improving signaling code in the NetBSD kernel, covering corner cases with regression tests, and improving the documentation. I've been working at the level of sytems calls (syscalls): forking, threading, handling these with GDB, and tracing syscalls. Some work happens behind the scenes as I support the work of Michal Gorny on LLDB/ptrace features.

clone(2)/__clone(2) tracing fixes

clone(2) (and its alias __clone(2)) is a Linux-compatible system call that is equivalent to fork(2) and vfork(2). However it was more customization options. Some programs use clone(2) directly and in some cases it's just easier to precompile the same program also for the NetBSD distribution, without extra changes, using clone(2) directly in the program for NetBSD.

During my work on the fork1(9) kernel function -- which handles debugger-related events -- I implemented regression tests of this syscall. This was combined with certain supported modes of operation of clone(2), particularly checking supported flags. These combinations did not use more than a one in the same test.

Naturally, a judicious selection of edge cases in the regression tests should give meaningful results. I plan to stress the kernel with a random set of flags with a kernel fuzzer. In turn, this will help to catch immediate kernel problems quickly.

During my work I have discovered that support for clone(2) for a debugger has been defective since inception. This never worked due to a small 1-byte programming mistake. The fix landed in sys/kern/kern_fork.c r.1.207. As the fork1(9) code evolved since the introduction of PSL_TRACEFORK, the fix is no longer a single-liner, but still it removes only 3 bytes from the kernel code (in the past it would be 1 byte removal)!

@@ -477,11 +477,11 @@ fork1(struct lwp *l1, int flags, int exitsig, void *stack, size_t stacksize,
         * Trace fork(2) and vfork(2)-like events on demand in a debugger.
         */

        tracefork = (p1->p_slflag & (PSL_TRACEFORK|PSL_TRACED)) ==
-           (PSL_TRACEFORK|PSL_TRACED) && (flags && FORK_PPWAIT) == 0;
+           (PSL_TRACEFORK|PSL_TRACED) && (flags & FORK_PPWAIT) == 0;

        tracevfork = (p1->p_slflag & (PSL_TRACEVFORK|PSL_TRACED)) ==
-           (PSL_TRACEVFORK|PSL_TRACED) && (flags && FORK_PPWAIT) != 0;
+           (PSL_TRACEVFORK|PSL_TRACED) && (flags & FORK_PPWAIT) != 0;

        tracevforkdone = (p1->p_slflag & (PSL_TRACEVFORK_DONE|PSL_TRACED)) ==
-           (PSL_TRACEVFORK_DONE|PSL_TRACED) && (flags && FORK_PPWAIT);
+           (PSL_TRACEVFORK_DONE|PSL_TRACED) && (flags & FORK_PPWAIT);

        if (tracefork || tracevfork)
                proc_changeparent(p2, p1->p_pptr);

        if (tracefork) {

(flags & FORK_PPWAIT) & implements bitwise AND. Logical AND (&&) was intended. Despite many eyes reading and editing this code, this particular issue was overlooked until the introduction of the regression tests. The effect of this change was that every clone(2) variation was incorrectly mapped into corresponding fork(2)/vfork(2) event.

More information about the C semantics can be checked in web resources on web pages.

Now clone(2) should work comparably well, for example, when compared to fork(2) and vfork(2). Current work is to map all clone(2) calls that have the property of a stopped parent to vfork(2); otherwise, if they don't have this property, they should be mapped to fork(2). This approach allows me to directly map clone(2) variations into well defined interfaces in debuggers that distinguish 3 types of forking events:

FORK
VFORK
VFORK_DONE

From a debugger's point of view it doesn't matter whether or not clone(2) shares the file descriptor table with its parent. It's an implementation detail, and either way it is expected to be handled by a tracer in the same way.

More options of clone(2) can be found in the NetBSD manual pages.

I still plan to check a similar interface, posix_spawn(3), which performs both operations in one call: clone(2) and exec(2). Most likely, according to my code reading, the syscall is not handled appropriately in the kernel space and I will need to map it into proper forking events. My motivation here is to support all system interfaces spawning new processes.

child_return(9) refactoring

child_return(9) is a kernel function that prepares a newly spawned child to return value 0 from fork(2), while its parent process returns child's process id. Originally the child_return(9) function was purely implemented in the MD part of each NetBSD port. I've since changed this and converted child_return(9) into MI code that is shared between all architectures. md_child_return() is now used for port specific code only.

The updated child_return(9) contains ptrace(2)- and ktruss(1)-related code that is shared now between all ports.

Incidentally, I noted a bug in a set of functions in NetBSD's aarch64 (ARM64) port. A set of functions called in thir original child_return() failed to call userret(9). In turn, the return path to user-mode procedures was incorrect. The bug has since been corrected and this resulted in passing several ATF tests.

This code has been also hardened for races that are theoretically possible, but unlikely to happen in practice.. on the other hand such statement usually means that a bug can be triggered easily in a loop within a short period of time. In order to stop risking and assuming that all the code now and in future changes is safe enough, I've added additional checks that assume that we won't generate a debugger related event in abnormal conditions like just receiving a SIGKILL, ignoring it and overwriting it with another signal SIGTRAP. The code has been modified to use racy-check, check for condition and if it evaluates to true, I am performing locking operations and recheck in new conditions the state, rechecking the integrity state before generating an event for a debugger.

/*
 * MI code executed in each newly spawned process before returning to userland.
 */
void
child_return(void *arg)
{
        struct lwp *l = arg;
        struct proc *p = l->l_proc;

        if (p->p_slflag & PSL_TRACED) {
                /* Paranoid check */
                mutex_enter(proc_lock);
                if (!(p->p_slflag & PSL_TRACED)) {
                        mutex_exit(proc_lock);
                        goto my_tracer_is_gone;
                }

                mutex_enter(p->p_lock);
                eventswitch(TRAP_CHLD);
        }

my_tracer_is_gone:
        md_child_return(l);

        /*
         * Return SYS_fork for all fork types, including vfork(2) and clone(2).
         *
         * This approach simplifies the code and avoids extra locking.
         */
        ktrsysret(SYS_fork, 0, 0);
}

Forking improvements under ptrace(2)

I've refactored the test cases and verified some of the forking semantics in the kernel. This included narrow cases such as nested vfork(2). Thankfully this worked correctly "as is" (except typical vforking(2) races that still exist).

I've added support to fork1(9) scenarios for detaching or killing a process in the middle of the process of spawning of a new child. This is needed in debuggers such as GDB that can either follow forkers or forkees, immediately detaching the other one. Bugs in these scenarios have been corrected and I have verified that GDB behaves correctly in these situations.

Threading improvements under ptrace(2)

I've reworked the current code for reporting threading (LWP) events to debuggers (LWP_CREATE and LWP_EXITED). The updated revision is also no longer prone to masking SIGTRAP in a child. Since these improvements, LWP events are now significantly more stable than they used to be. Reporting LWP_EXITED is still incorrect as there is a race condition between WAIT() and EXIT(). The effect of this is that the signal from EXIT() is never delivered to a debugger that is polling for it, and therefore it is missed.

Other changes

ATF ptrace(2) test corrections for handling of SIGILL crashes on SPARC, and detection of FPU availability on ARM, have been added.

The PT_IO operation in ptrace(2) can result in a false positive success value, however no byte transfer operation has been performed. Michal Gorny detected this problem in LLDB for invalid frame pointer reads. The NetBSD driver overlooked this scenario and was looping infinitely. This surprising property was also detected in PT_WRITE/PT_READ operations and found to be triggered in GDB. As a workaround I have disabled 0-length transfer requests in PT_IO by returning EINVAL. Zero-byte transfer operations bring the PT_WRITE/PT_READ calls into question, as we have no way to distinguish successful operation from an empty-byte transfer returning success.

In turn, this means PT_WRITE/PT_READ should be deprecated. I plan to document this clearly.

I've decided to finally forbid setting Program Counter to 0x0 in PT_CONTINUE/PT_DETACH/etc as it's hardly ever a valid executable address. There are two main factors here:

The Linux kernel does not implement this argument and by a convention Linux software sets it to 0. This causes portability issues for software originally developed on Linux.

If we overwrite a program counter with 0x0, a user will be confused. They will have no idea whether the debugger is at fault, there is a bug in the kernel, or there is a fault in a program that jumped to 0x0 for misleading reasons.

This change raised concerns about making 0x0 a magic value. This can be mitigated by allowing 0x0 to take a documented meaningful value, or making it indicative of the sysctl(3) value vm.user_va0_disable.

I've added previously missing support for KTR (ktrace(1)) events. In particular, this is for debugger-related signals except vfork(2) because this creates unkillable processes. I am considering fixing this by synchronizing the parent of the vfork(2)ed process and its child. This change will enable the debugger to process event signals in ktruss(1) logs.

GDB support improvements

During the last month I introduced a regression bug in passing crash signals to the debugger. I reduced some specific information passed to a tracer, indirectly improving NetBSD stability support while debugging in GDB. The trade-off was a slight reduction in readability of LLDB backtraces and crash reports.

Independently I've been asked by Christos Zoulas to fix GDB support for threads. I addressed the kernel shortage quickly and reworked the NetBSD platform code in GDB. Dead code (which was leftover from the original FreeBSD original code) was removed, missing code added, and monitoring debugger-related events was reworked. The latter supports the improved kernel APIs produced during my overall work. GDB still exhibits issues with threads, for example, for convoluted Golang binaries, but has been improved to the extent that ATF regression tests for GDB pass again.

Syscall tracing API

The episode of GDB fixes stimulated me to add support for passing the syscall number along with the SIGTRAP signal. I've described the interface in the commit message:

commit 7dd3c39f7d951a10642fce0f99d9e86d28156836
Author: kamil
Date:   Mon May 6 08:05:03 2019 +0000

    Ship with syscall information with SIGTRAP TRAP_SCE/TRAP_SCX for tracers

    Expand siginfo_t (struct size not changed) to new values for
    SIGTRAP TRAP_SCE/TRAP_SCX events.

     - si_sysnum  -- syscall number (int)
     - si_retval  -- return value (2 x int)
     - si_error   -- error code (int)
     - si_args    -- syscall arguments (8 x uint64_t)

    TRAP_SCE delivers si_sysnum and si_args.
    TRAP_SCX delivers si_sysnum, si_retval, si_error and si_args.

    Users: debuggers (like GDB) and syscall tracers (like strace, truss).

    This MI interface is similar to the Linux kernel proposal of
    PTRACE_GET_SYSCALL_INFO by the strace developer team.

In order to verify the updated API before merging it into the kernel, I wrote a truss-like or strace-like tool for the kernel interfaces. I authored three versions of picotrace: the first in C; the second in Lua+C; and the third one with C. The final 3rd version has been published and imported as pkgsrc/devel/picotrace.

The upstream source code is available online at https://github.com/krytarowski/picotrace.

It is documented in pkgsrc as follows:

picotrace enables syscall trace logging for the specified processes.

The tracer uses the ptrace(2) system call to perform the tracing.

The picotrace program implements bare functionality by design.  It has
no pretty printing of data structures or interpretation of numberical arguments
to their corresponding syscalls.

picotrace is designed to be a framework for other more advanced tracers,
and to illustrate canonical usage of the ptrace(2).  New features are
not expected unless they reflect a new feature in the kernel.

Summary

I was able run literally all existing ATF ptrace(2) from the testsuite and pass all of them. Unfortunately, VFORK and LWP operations still present race conditions in the kernel and can cause failures. In order to reduce concerns from other developers, I have disabled the racy tests by default. There is also a new observation that one test that used to be rock stable is now sometimes flaky. It has not been investigated, but I suspect that something with the pipe(2) kernel code has a regression or surfaced an old problem. I plan to investigate this once I will finish other ptrace(2) tasks.

Plan for the next milestone

I will visit BSDCan this month to speak about NVMM and HAXM. I will resume my work on forking and threading bugs after the conference. The lwp_exit() and wait() race will be prioritized as it affects most current users. After resolving this problem, I will be back to posix_spawn(2), followed by addressing vfork(2) races.

This work was sponsored by The NetBSD Foundation.

The NetBSD Foundation is a non-profit organization and welcomes any donations to help us continue funding projects and services to the open-source community. Please consider visiting the following URL to chip in what you can:

http://netbsd.org/donations/#how-to-donate

[0 comments]

« Announcing Google... | Main | NetBSD 8.1 Release... »