LLDB from trunk is running on NetBSD once again!
Upstream describes LLDB as a next generation, high-performance debugger. It is built on top of LLVM/Clang toolchain, and features great integration with it. At the moment, it primarily supports debugging C, C++ and ObjC code, and there is interest in extending it to more languages.
Originally, LLDB was ported to NetBSD by Kamil Rytarowski. However, multiple upstream changes and a lack of continuous testing have resulted in a decline in support. Until now, we had not been able to restore the previous state.
In February, I started working on LLDB, as contracted by the NetBSD Foundation. My first four goals, as detailed in the previous report, were:
Restore tracing in LLDB for NetBSD (i386/amd64/aarch64) for single-threaded applications.
Restore execution of LLDB regression tests; where a fix would need significant LLDB or kernel work, mark the detected bugs as failing or unsupported.
Enable execution of LLDB regression tests on the buildbot in order to catch regressions.
Upstream NetBSD (i386/amd64) core(5) support. Develop LLDB regression tests (and the testing framework enhancement) as requested by upstream.
Of those tasks, I consider running regression tests on the buildbot the highest priority. Bisecting regressions post-factum is hard due to long build times, and having continuous integration working is going to be very helpful to maintaining the code long-term.
In this report, I'd like to summarize what I achieved and what technical difficulties I met.
The kqueue interoperability issues
Given no specific clue as to why LLDB was no longer able to start processes on NetBSD, I've decided to start by establishing the status of the test suites. More specifically, I've started with a small subset of LLDB test suite — unittests. In this section, I'd like to focus on two important issues I had with them.
Firstly, one of the tests was hanging indefinitely. As I established, the purpose of the test was to check whether the main loop implementation correctly detects and reports when all the slaves of a pty are disconnected (and therefore the reads on master would fail). Through debugging, I've come to the conclusion that kevent() is not reporting this particular scenario.
I have built a simple test case (which is now part of kqueue ATF tests) and confirmed it. Afterwards, I have attempted to establish whether this behavior is correct. While kqueue(2) does not mention ptys specifically, it states the following for pipes:
- Fifos, Pipes
Returns when there is data to read; data contains the number of bytes available.
When the last writer disconnects, the filter will set EV_EOF in flags. This may be cleared by passing in EV_CLEAR, at which point the filter will resume waiting for data to become available before returning.
Furthermore, my test program indicated that FreeBSD exhibits the described EV_EOF behavior. Therefore, I decided to write a kernel patch adding this functionality, submitted it for review and eventually committed it after applying helpful suggestions from Robert Elz ([PATCH v3] kern/tty_pty: Fix reporting EOF via kevent and add a test case). I have also disabled the test case temporarily since the functionality is non-critical to LLDB (r353545).
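The behavior the hanging test relies on can be illustrated with a pipe, which is the case kqueue(2) documents explicitly. The sketch below is in Python and uses select() instead of kevent() so that it also runs on systems without kqueue. It is an analogy for the scenario rather than the ATF test case, and the helper name wait_for_eof is made up for the example.

```python
import os
import select


def wait_for_eof(read_fd, timeout=1.0):
    """Return True if read_fd reports end-of-file within `timeout` seconds.

    Hypothetical helper: select() stands in for kevent() here.
    """
    readable, _, _ = select.select([read_fd], [], [], timeout)
    if not readable:
        # The event loop was never woken up; with no timeout this
        # would be the indefinite hang described above.
        return False
    # A readable descriptor that yields no bytes means end-of-file.
    return os.read(read_fd, 1) == b""


r, w = os.pipe()
os.close(w)                 # the last writer disconnects
eof_seen = wait_for_eof(r)  # the reader must be notified of this
os.close(r)
```

If the notification were never delivered, a loop waiting without a timeout would block forever, which is what the unit test was doing.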
Secondly, a few gdbserver-based tests were flaky, i.e. they unpredictably passed or failed from one iteration to the next. I've started debugging this with a test whose purpose was to check verbose error messages support in the protocol. To my surprise, it seemed as if gdbserver worked fine as far as error message exchange was concerned. This packet was followed by a termination request from the client, and it seemed that the server sometimes replied to it correctly and sometimes terminated just before receiving it.
While working on this particular issue, I've noticed a few deficiencies in LLDB's error handling. In this case, this involved two major issues:
gdbserver ignored errors from the main loop. As a result, if kevent() failed, it silently exited with a successful status. I've fixed it to catch and report the error verbosely instead: r354030.
The main loop reported a meaningless return value (-1) from kevent(). I've established that most likely all kevent() implementations use errno instead, and made the function return it: r354029.
After applying those two fixes, gdbserver clearly indicated the problem: kevent() returned due to EINTR (i.e. the process receiving a signal). Lacking correct handling for this value, the main loop implementation wrongly treated it as a fatal error and terminated the program. I've fixed this by implementing EINTR support for kevent() in r354122.
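The general shape of such a fix is a retry loop around the blocking call. The following is a minimal Python sketch of that pattern, not the actual C++ change from r354122. Python itself already retries most system calls on EINTR, so the interrupted wait is simulated with a hypothetical callable; the names wait_retrying_on_eintr and fake_wait are assumptions made for the example.

```python
import errno


def wait_retrying_on_eintr(wait_once):
    """Call wait_once() until it completes without being interrupted."""
    while True:
        try:
            return wait_once()
        except OSError as exc:
            if exc.errno != errno.EINTR:
                raise  # a real failure: report it instead of hiding it
            # EINTR only means a signal arrived; wait again.


attempts = []


def fake_wait():
    """Stand-in for a blocking wait: interrupted twice, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise OSError(errno.EINTR, "Interrupted system call")
    return "event"


result = wait_retrying_on_eintr(fake_wait)
```

Any error other than EINTR still propagates to the caller, in line with the error reporting fixes listed above.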
This trivial fix not only resolved most of the flaky tests but also turned out to be the root cause for LLDB being unable to start processes. Therefore, at this point tracing for single-threaded processes was restored on amd64. Testing on other platforms is pending.
Now, for the moral: working error reporting can save a lot of time.
Socket issues
The next issue I hit while working on the unittests is rather curious, and I have to admit I have managed neither to find the root cause nor to build a good reproducer for it. Nevertheless, I seem to have caught the gist of it and found a good workaround.
The test in question focuses on the high-level socket API in LLDB. It is rather trivial — it binds a server in one thread, and tries to connect to it from a second thread. So far, so good. Most of the time the test works just fine. However, sometimes — especially early after booting — it hangs forever.
I've debugged this thoroughly and came to the following conclusion: the test binds to 127.0.0.1 (i.e. purely IPv4) but tries to connect to localhost. The latter results in the client trying IPv6 first, failing and then succeeding with IPv4. The connection is accepted, the test case moves forward and terminates successfully.
Now, in the failing case, the IPv6 connection attempt succeeds, even though there is no server bound to that port. As a result, the client part is happily connected to a non-existing service, and the server part hangs forever waiting for the connection to come.
I have attempted to reproduce this with an isolated test case that replicates the use of threads, binding to port zero and the IPv4/IPv6 mixup, but I simply haven't been able to trigger it. Curiously enough, however, my test case actually fixes the problem: if I start it before the LLDB unit tests, they work fine afterwards (until the next reboot).
Being unable to make any further progress on this weird behavior, I've decided to fix the test design instead — and make it connect to the same address it binds to: r353868.
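The corrected test design can be sketched as follows. This is a Python illustration under stated assumptions, not the LLDB unit test itself: the function name connect_to_bound_address is hypothetical. The point it shows is that the client connects to the exact address returned by the bound socket, so no name resolution (and no IPv6 attempt) is involved.

```python
import socket
import threading


def connect_to_bound_address():
    """Bind to 127.0.0.1 on an ephemeral port and connect to that same address."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.settimeout(5)           # fail instead of hanging forever
    server.bind(("127.0.0.1", 0))  # port zero: let the kernel pick a port
    server.listen(1)
    address = server.getsockname()

    accepted = []

    def accept_one():
        conn, _ = server.accept()
        accepted.append(conn.getsockname())
        conn.close()

    thread = threading.Thread(target=accept_one)
    thread.start()

    # Connect to the bound IPv4 address rather than to "localhost".
    client = socket.create_connection(address, timeout=5)
    peer = client.getpeername()
    client.close()

    thread.join(timeout=5)
    server.close()
    return address, peer, accepted


address, peer, accepted = connect_to_bound_address()
```

Here both sides agree on the address family by construction, which sidesteps the mixup without explaining it.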
Getting the right toolchain for live testing
The largest problem so far was getting LLDB tests to interoperate with NetBSD's clang driver correctly. On other systems, clang either defaults to libstdc++, or has libc++ installed as part of the system (FreeBSD, Darwin). The NetBSD driver wants to use libc++ but we do not have it installed by default.
While this could be solved via installing libc++ on the buildbot host, I thought it would be better to establish a solution that would allow LLDB to use just-built clang — similarly to how other LLVM projects (such as OpenMP) do. This way, we would be testing the matching libc++ revision and users would be able to run the tests in a single checkout out of the box.
Sadly, this is non-trivial. While it could be all hacked into the driver itself, it does not really belong there. While it is reasonable to link tests against the build tree, we wouldn't want regular executables built by the user to bind to it. This is why normally this is handled via the test system. However, the tests in LLDB are an accumulation of at least three different test systems, each one calling the compiler separately.
In order to establish a baseline for this, I have created wrappers for clang that added the necessary command-line options. The latest version of the wrapper for clang looked like the following:
#!/usr/bin/env bash

topdir=/home/mgorny/llvm-project/build-rel-master

cxxinc="-cxx-isystem $topdir/include/c++/v1"
lpath="-L $topdir/lib"
rpath="-Wl,-rpath,$topdir/lib"
pthread="-pthread"
libs="-lunwind"

# needed to handle 'clang -v' correctly
[ $# -eq 1 ] && [ "$1" = -v ] && exec $topdir/bin/clang-9-real "$@"
exec $topdir/bin/clang-9-real $cxxinc $lpath $rpath "$@" $pthread $libs
The actual executable I renamed to clang-9-real, and this wrapper replaced clang and a similar one replaced clang++. clang-cl was linked to the real executable (as it wasn't called in wrapper-relevant contexts), while clang-9 was linked to the wrapper.
After establishing a baseline of working tests, I've looked into migrating the necessary bits one by one to the driver and/or LLDB test system, removing the migrated parts and verifying whether tests pass the same.
My proposal so far involves, respectively:
Replacing -cxx-isystem with libc++ header search using a path relative to the compiler executable: D58592.
Integrating -L and -Wl,-rpath with the LLDB test system: D58630.
Adding NetBSD to the list of platforms needing -pthread: r355274.
Removing the need for -lunwind by switching the test that failed without it to use libc++ instead of libstdc++: r355273.
The reason for adjusting libc++ header search in the driver rather than in LLDB tests is that the path is specific to building against libc++, and the driver makes it convenient to adjust the path conditionally on the standard C++ library being used. In other words, it saves us from hard-relying on the assumption that tests will be run against libc++ only.
I went for integrating -L in the test system since we do not want to link arbitrary programs to the libraries in LLVM's build directory. Appending this path unconditionally should be otherwise harmless to LLDB's tests, so that is the easier way to go.
Originally I wanted to avoid appending RPATHs. However, it seems that the LD_LIBRARY_PATH solution that works for Linux does not reliably work on NetBSD with LLDB. Therefore, passing -Wl,-rpath along with -L allowed me to solve the problem more simply.
Furthermore, those design solutions match other LLVM projects. I've mentioned OpenMP before: so far we had to pass -cxx-isystem to its tests explicitly but it passed -L for us. Those patches render passing -cxx-isystem unnecessary, and therefore make LLDB follow suit with OpenMP.
Finishing touches
Having a reasonably working compiler and major regressions fixed, I have focused on establishing a baseline for running tests. The goal is to mark broken tests XFAIL or skip them. With all tests marked appropriately, we would be able to start running tests on the buildbot and catch regressions compared to this baseline. The current progress on this can be seen in D58527.
Sadly, besides failing tests there is still a small number of flaky or hanging tests which are non-trivial to detect. The upstream maintainer, Pavel Labath, is very helpful and I hope to be able to finally get all the flaky tests either fixed or covered with his help.
Other fixes not worth a separate section include:
fixing compiler warnings about empty format strings: r354922,
fixing two dlopen() based test cases not to link -ldl on NetBSD: r354617,
finishing Kamil's patch for core file support: r354466, followup fix in r354483,
removing dead code in main loop: r354050,
fixing stand-alone builds after they've been switched to LLVMConfig.cmake: r353925,
skipping lldb-mi tests when Python support (needed by lldb-mi) is disabled: r353700,
fixing incorrect initialization of sigset_t (not actually used right now): r353675.
Buildbot updates
The last part worth mentioning is that the NetBSD LLVM buildbot has seen some changes. Notably, zorg r354820 included:
fixing the bot commit filtering to include all projects built,
renaming the bot to shorter netbsd-amd64,
and moving it to toolchain category.
One of the most useful functions of buildbot is that it associates every successive build with new commits. If the build fails, it blames the authors of those commits and reports the failure to them. However, for this to work buildbot needs to be aware of which projects are being tested.
Our buildbot configuration was initially based on the one used for LLDB, and it assumed LLVM, Clang and LLDB are the only projects built and tested. Over time, we've added additional projects but we failed to update the buildbot configs appropriately. Finally, with the help of Jonas Hahnfeld, Pavel Labath and Galina Kistanova we've managed to update the list and make the bot blame all projects correctly.
While at it, it was suggested that we rename the bot. The previous name was lldb-amd64-ninja-netbsd8, and others suggested that developers may ignore failures in other projects when they see lldb there. Kamil Rytarowski also pointed out that the version number misleads users into believing that we're running separate bots for different versions. The new name and category are meant to clearly indicate that we're running a single bot instance for multiple projects.
Quick summary and future plans
At this point, the most important regressions in LLDB have been fixed and it is able to debug simple programs on amd64 once again. The test suite patches are still waiting for review, and once they're approved I still need to work on flaky tests before we can reliably enable that on the buildbot. This is the first priority.
The next item on the TODO list is to take over and finish Kamil's patch for core files with threads. Most notably, the patch requires writing tests, and verifying that there are no new bugs affecting it.
On a semi-related note, LLVM 8.0.0 will be released in a few days and I will probably be working on updating src to the new version. I will also try to convince Joerg to switch from unmaintained libcxxrt to upstream libc++abi. Kamil also wanted to change the libc++ include path to match upstream (NetBSD is dropping the /v1 suffix at the moment).
Once this is done, the next big step is to fix threading support. Testing on non-amd64 arches is deferred until I gain access to some hardware.
This work is sponsored by The NetBSD Foundation
The NetBSD Foundation is a non-profit organization and welcomes any donations to help us continue funding projects and services to the open-source community. Please consider visiting the following URL to chip in what you can:
http://netbsd.org/donations/#how-to-donate