posix_spawn syscall added

February 26, 2012 posted by Martin Husemann

Charles Zhang implemented the posix_spawn syscall during Google Summer of Code 2011. After a lot of polishing and rework based on feedback during public discussion of the code, this has now been committed to NetBSD-current.

This caused some fallout and ended in a tight race with the imminent branch date for NetBSD 6. Now that the dust has settled, it is time for a look back at the mistakes made and lessons learned.

What is posix_spawn?

Traditionally BSD systems used the vfork(2) hack to improve speed of process creation. However, this does (in general) not play well with multi-threaded applications. The posix_spawn call is a thread-safe way to create new processes and manipulate a tiny bit of state (like dup/close/open file descriptors) upfront.

Work continued after GSoC

The results Charles had at the end of his GSoC term were a working in-kernel implementation of posix_spawn and a few free-form test cases, one of which failed. The kernel code duplicated a lot of other code, which clearly was not acceptable for commit to the NetBSD source tree. The reason Charles solved it this way was the short time frame available - and that the best solution we could think of during the summer was very intrusive.

In preparation for a potential merge into the NetBSD code base, I reworked the code to avoid copying helper functions (like file descriptor manipulations for other processes), cleaned up and debugged a bit using a LOCKDEBUG kernel, which pointed out a few more issues. After solving those as well as intensively testing all error paths, I posted a patch for review.

At this point the integration was already prepared completely - a new syscall, new libc functions, new manual pages need a lot of set lists updates and test building a "release" at least once (preferably on an architecture providing 32bit compat libraries), furthermore the posix_spawn code needed (simple) machine-dependent code to be added to all architectures, which at least requires test-building a representative set of kernels.

Another complete rework

In response to the posted, very intrusive, patch, YAMAMOTO Takashi suggested a pretty elegant way to solve the problem without a lot of the intrusive changes. The idea was simple, and it actually worked after a few adjustments. This led to another public patch for review.

This version already included an atf version of the test programs, which all passed (both on amd64 and sparc64). I felt pretty confident with this state and expected a smooth integration.

Unexpected fallout

More for completeness I did a full test run (not only the posix_spawn related tests) - and found some unexpected test failures, all in rump based tests. I retried and got different failures. Suspicious - I did not touch rump, besides regenerating the syscall definitions. I rebooted a standard kernel (without posix_spawn), did a full test run and only got failures in the posix_spawn tests (of course). So something in the change must have broken something else.

Analysis was a painful process, so only a short summary of the results: the modified kernel exec path used a pointer to a kernel stack variable, which was later copied to a saved data structure - but the pointer was not adjusted accordingly. Later the pointer was referenced, and only a single bit checked. Depending on what was in memory at the stale old stack location at that time, a branch was taken or not. This caused the ELF auxiliary data vector to sometimes contain a different effective UID, and ld.elf_so switching into secure mode - in which case it ignores environment variables like LD_PRELOAD. This causes big failure in many test programs using rump (at least).

While I was debugging this, discussions continued. We were not sure if we should add complex code like this to the kernel, where a pure userland implementation clearly is possible (FreeBSD uses this, for example). I did a few benchmark runs, but was unable to show any clear performance benefit for either implementation - the differences were in the sub-promille ranges, with noise in the 2-3 percent range, clearly no usable result from a statistical point of view. Another topic under discussion was the near planned branch for NetBSD 6. According to our rules, we do not want to add a syscall post-branch to a release branch.

Go ahead, finally!

The discussions ended with the core team voting for a kernel version, and the release engineering team voting for a pre-netbsd-6-branch integration. So I updated my posix_spawn source tree, did another test build, ran tests (again on amd64 and sparc64), updated again - and committed in a few steps.

Big fallout

Checking mails early next morning (a Sunday, before walking the dog) I found a PR already: running the m4 configure script crashed i386 and amd64 kernels. Tsutsui kindly had provided a backtrace in the report, and it looked suspiciously familiar to me. While walking the dog I thought about it and when I got home I checked: indeed I had seen and fixed this before, when testing error paths in the first instance of the change. However, when dropping all the intrusive modifications I had in my tree and redoing the version without them, I must have accidentally dropped the fix for this (it was in sys/uvm instead of sys/kern). No big deal, I had fixed it once already, so I could fix it again. Committed, asked for verification - and did get a NAK. However, with a different back trace this time. Tried on my amd64 notebook - worked for me. Duh?

Looking at the code and fixing the second fallout now was straight forward, and also provided the hint why I did not see it before: I was not running a GENERIC kernel on my notebook, and had (some time way back in the past) removed options DIAGNOSTIC from this configuration. Stupid me!

I received more feedback (YAMAMOTO-san pointed out some race conditions) and had a discussion about the place where the test programs should live in the source. To not risk delaying the netbsd-6 branch, I applied a minimal fix for the races, moved the test programs - and added a few more test cases covering the initial m4 configure problems (the rework earlier had made it pretty simple now to test all error paths from atf test cases).

This caused the automatic test setup to crash on every run ("Tests did not complete"). At this point I am still not sure why I did not catch this before commit - but there is no point in arguing, human failure - my fault (most likely explanation: after the last changes to the test cases, I did not test again on amd64 but only sparc64 - the test cases triggered a KASSERT in the x86 pmap, but not in the sparc64 one).

I fixed this, and also another PR, interestingly about m4 configure again. Simple argument validation bug, not covered by the test cases yet - so I added another test.

Are we there yet?

Luckily fallout seems to have stopped now, but we are not completely there yet. The new process created by posix_spawn keeps the parent lwp blocked until it is done with all file descriptor modifications and setup, and the new process is ready to go to userland first time. This provides a proper error return value from the parent (the posix_spawn syscall itself), but it stops the new child from (for example) already running on another CPU early. This will be simple to change, but after all the fallout we have seen, I will only touch it after very extensive testing again.

Lessons learned

When bringing in a new syscall with several supporting libc functions, fallout is always to be expected. It can be minimized by including test programs early - but in the end, real life will teach you what tests you have missed when writing the test programs. It is also important do full test suite runs early, and test on different architectures. Even better if you test on kernels with (at least) DIAGNOSTIC enabled. But in the end, mistakes will happen nevertheless.

[0 comments]

« domU SMP support... | Main | NetBSD 6.0_BETA... »