LLDB Threading support now ready for mainline


November 09, 2019 posted by Michał Górny

Upstream describes LLDB as a next generation, high-performance debugger. It is built on top of LLVM/Clang toolchain, and features great integration with it. At the moment, it primarily supports debugging C, C++ and ObjC code, and there is interest in extending it to more languages.

In February, I have started working on LLDB, as contracted by the NetBSD Foundation. So far I've been working on reenabling continuous integration, squashing bugs, improving NetBSD core file support, extending NetBSD's ptrace interface to cover more register types and fix compat32 issues and fixing watchpoint support. Then, I've started working on improving thread support which is taking longer than expected. You can read more about that in my September 2019 report.

So far the number of issues uncovered while enabling proper threading support has stopped me from merging the work-in-progress patches. However, I've finally reached the point where I believe that the current work can be merged and the remaining problems can be resolved afterwards. More on that and other LLVM-related events happening during the last month in this report.

LLVM news and buildbot status update

LLVM switched to git

Probably the most important event to note is that the LLVM project has switched from Subversion to git, and moved their repositories to GitHub. While the original plan provided for maintaining the old repositories as read-only mirrors, as of today this still hasn't been implemented. For this reason, we were forced to quickly switch buildbot to the git monorepo.

The buildbot is operational now, and seems to be handling git correctly. However, it is connected to the staging server for the time being. Its URL changed to http://lab.llvm.org:8014/builders/netbsd-amd64 (i.e. the port from 8011 to 8014).

Monthly regression report

Now for the usual list of 'what they broke this time'.

LLDB has been given a new API for handling files, in particular for passing them to Python scripts. The change of API has caused some 'bad file descriptor' errors, e.g.:

ERROR: test_SBDebugger (TestDefaultConstructorForAPIObjects.APIDefaultConstructorTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/motus/netbsd8/netbsd8/llvm/tools/lldb/packages/Python/lldbsuite/test/decorators.py", line 343, in wrapper
    return func(self, *args, **kwargs)
  File "/data/motus/netbsd8/netbsd8/llvm/tools/lldb/packages/Python/lldbsuite/test/python_api/default-constructor/TestDefaultConstructorForAPIObjects.py", line 133, in test_SBDebugger
    sb_debugger.fuzz_obj(obj)
  File "/data/motus/netbsd8/netbsd8/llvm/tools/lldb/packages/Python/lldbsuite/test/python_api/default-constructor/sb_debugger.py", line 13, in fuzz_obj
    obj.SetInputFileHandle(None, True)
  File "/data/motus/netbsd8/netbsd8/build/lib/python2.7/site-packages/lldb/__init__.py", line 3890, in SetInputFileHandle
    self.SetInputFile(SBFile.Create(file, borrow=True))
  File "/data/motus/netbsd8/netbsd8/build/lib/python2.7/site-packages/lldb/__init__.py", line 5418, in Create
    return cls.MakeBorrowed(file)
  File "/data/motus/netbsd8/netbsd8/build/lib/python2.7/site-packages/lldb/__init__.py", line 5379, in MakeBorrowed
    return _lldb.SBFile_MakeBorrowed(BORROWED)
IOError: [Errno 9] Bad file descriptor
Config=x86_64-/data/motus/netbsd8/netbsd8/build/bin/clang-10
----------------------------------------------------------------------

I've been able to determine that the error was produced by flush() method call invoked on a file descriptor referring to stdin. Appropriately, I've fixed the type conversion method not to flush read-only fds.

Afterwards, Lawrence D'Anna was able to find and fix another fflush() issue.

A newly added test revealed that platform process list -v command on NetBSD missed listing the process name. I've fixed it to provide Arg0 in process info.

Another new test failed due to our target not implementing ShellExpandArguments() API. Apparently the only target actually implementing it is Darwin, so I've just marked TestCustomShell XFAIL on all BSD targets.

LLDB upstream was forced to reintroduce readline module override that aims to prevent readline and libedit from being loaded into a single program simultaneously. This module failed to build on NetBSD. I've discovered that the original was meant to be built on Linux only, and since the problem still doesn't affect other platforms, I've made it Linux-only again.

libunwind build has been changed to link using the C compiler rather than C++. This caused some libc++ failures on NetBSD. The author has reverted the change for now, and is looking for a better way of resolving the problem.

Finally, I have disabled another OpenMP test that caused NetBSD to hang. While ideally I'd like to have the underlying kernel problem fixed, this is non-trivial and I prefer to focus on LLDB right now.

New LLD work

I've been asked to rebase my LLD patches for the new code. While doing it, I've finally committed the -z nognustack option patch from January.

In the meantime, Kamil's been working on finally resolving the long-standing impasse on LLD design. He is working on a new NetBSD-specific frontend to LLD that would satisfy our system-wide linker requirements without modifying the standard driver used by other platforms.

Upgrade to NetBSD 9 beta

Our recent work, especially the work on threading support has required a number of fixes in the NetBSD kernel. Those fixes were backported to NetBSD 9 branch but not to 8. The 8 kernel used by the buildbot was therefore suboptimal for testing new features. Furthermore, with the 9.0 release coming soon-ish, it became necessary to start actively testing it for regressions.

The buildbot has been upgraded to NetBSD 9 beta on 2019-11-06. Initially, the upgrade has caused LLDB to start crashing on startup. I have not been able to pinpoint the exact issue yet. However, I've established that it happens with -O3 optimization level only, and I've worked it around by switching the build to -O2. I am planning to look into the problem more once the buildbot is restored fully.

The upgrade to nb9 has caused 4 LLDB tests to start succeeding, and 6 to start failing. Namely:

********************
Unexpected Passing Tests (4):
    lldb-api :: commands/watchpoints/watchpoint_commands/condition/TestWatchpointConditionCmd.py
    lldb-api :: commands/watchpoints/watchpoint_commands/command/TestWatchpointCommandPython.py
    lldb-api :: lang/c/bitfields/TestBitfields.py
    lldb-api :: commands/watchpoints/watchpoint_commands/command/TestWatchpointCommandLLDB.py

********************
Failing Tests (6):
    lldb-shell :: Reproducer/Functionalities/TestExpressionEvaluation.test
    lldb-api :: commands/expression/call-restarts/TestCallThatRestarts.py
    lldb-api :: functionalities/signal/handle-segv/TestHandleSegv.py
    lldb-unit :: tools/lldb-server/tests/./LLDBServerTests/StandardStartupTest.TestStopReplyContainsThreadPcs
    lldb-api :: functionalities/inferior-crashing/TestInferiorCrashingStep.py
    lldb-api :: functionalities/signal/TestSendSignal.py

I am going to start investigating the new failures shortly.

Further LLDB threading work

Fixes to register support

Enabling thread support revealed a problem in register API introspection specific to NetBSD. The API responsible for passing registers in groups to Python was unable to name some of the groups on NetBSD, and the null names have caused the TestRegistersIterator to fail. Threading support made this specifically visible by replacing a regular test failure with Python code error.

In order to resolve the problem, I had to describe all supported register sets in NetBSD register context. The code was roughly based on the Linux equivalent, modified to match register sets used by our ptrace() API. Interestingly, I had to also include MPX registers that are currently unimplemented, as otherwise LLDB implicitly put them in an anonymous group.

While at it, I've also changed the register set numbering to match the more common ordering, in order to avoid issues in the future.

Finished basic thread support patch

I've finally completed and submitted the patch for NetBSD thread support. Besides fixing a few mistakes, I've implemented thread affinity support for all relevant SIGTRAP events (breakpoints, traces, hardware watchpoints) and removed incomplete hardware breakpoint stub that caused LLDB to crash.

In its current form, this patch combines three changes essential to correct support of threaded programs:

  1. It enables reporting of new and exited threads, and maintains debugged thread list based on that.

  2. It modifies the signal (generic and SIGTRAP) handling functions to read the thread identifier and associate the event with correct thread(s). Previously, all events were assigned to all threads.

  3. It updates the process resuming function to support controlling the state (running, single-stepping, stopped) of individual threads, and raising a signal either to the whole process or to a single thread. Previously, the code used only the requested action for the first thread and populated it to all threads in the process.

Proper watchpoint support in multi-threaded programs

I've submitted a separate patch to copy watchpoints to newly-created threads. This is necessary due to the design of Debug Register support in NetBSD. Quoting the ptrace(2) manpage:

  • debug registers are only per-LWP, not per-process globally
  • debug registers must not be inherited after (v)forking a process
  • debug registers must not be inherited after forking a thread
  • a debugger is responsible to set global watchpoints/breakpoints with the debug registers, to achieve this PTRACE_LWP_CREATE / PTRACE_LWP_EXIT event monitoring function is designed to be used

LLDB supports per-process watchpoints only at the moment. To fit this into NetBSD model, we need to monitor new threads and copy watchpoints to them. Since LLDB does not keep explicit watchpoint information at the moment (it relies on querying debug registers), the proposed implementation verbosely copies dbregs from the currently selected thread (all existing threads should have the same dbregs).

Fixed support for concurrent watchpoint triggers

The final problem I've been investigating was a server crash with the new code when multiple watchpoints were triggered concurrently. My final patch aims to fix handling concurrent watchpoint events.

When a watchpoint is triggered, the kernel delivers SIGTRAP with TRAP_DBREG to the debugger. The debugger investigates DR6 register of the specified thread in order to determine which watchpoint was triggered, and reports it. When multiple watchpoints are triggered simultaneously, the kernel reports that as series of successive SIGTRAPs. Normally, that works just fine.

However, on x86 watchpoint triggers are reported before the instruction is executed. For this reason, LLDB temporarily disables the breakpoint, single-steps and reenables it. The problem with that is that the GDB protocol doesn't control watchpoints per thread, so the operation disables and reenables the watchpoint on all threads. As a side effect of this, DR6 is cleared everywhere.

Now, if multiple watchpoints were triggered concurrently, DR6 is set on all relevant threads. However, after handling SIGTRAP on the first one, the disable/reenable (or more specifically, remove/readd) wipes DR6 on all threads. The handler for next SIGTRAP can't establish the correct watchpoint number, and starts looking for breakpoints. Since hardware breakpoints are not implemented, the relevant method returns an error and lldb-server eventually exits.

There are two problems to be solved there. Firstly, lldb-server should not exit in this circumstances. This is already solved in the first patch as mentioned above. Secondly, we need to be able to handle concurrent watchpoint hits independently of the clear/set packets. This is solved by this patch.

There are multiple different approaches to this problem. I've chosen to remodel clear/set watchpoint method in order to prevent it from resetting DR6 if the same watchpoint is being restored, as the alternatives (such as pre-storing DR6 on the first SIGTRAP) have more corner conditions to be concerned about.

The current design of these two methods assumes that the 'clear' method clears both the triggered state in DR6 and control bits in DR7, while the 'set' method sets the address in DR0..3, and the control bits in DR7.

The new design limits the 'clear' method to disabling the watchpoint by clearing the enable bit in DR7. The remaining bits, as well as trigger status and address are preserved. The 'set' method uses them to determine whether a new watchpoint is being set, or the previous one merely reenabled. In the latter case, it just updates DR7, while preserving the previous trigger. In the former, it updates all registers and clears the trigger from DR6.

This solution effectively prevents the disable/reenable logic of LLDB from clearing concurrent watchpoint hits, and therefore makes it possible for the SIGTRAP handler to report them correctly. If the user manually replaces the watchpoint with another one, DR6 is cleared and LLDB does not associate the concurrent trigger to the watchpoint that no longer exists.

Thread status summary

The current version of the patches fixes approximately 47 test failures, and causes approximately 4 new test failures and 2 hanging tests. There is around 7 new flaky tests, related to signals concurrent with breakpoints or watchpoints.

Future plans

The first immediate goal is to investigate and resolve test suite regressions related to NetBSD 9 upgrade. The second goal is to get the threading patches merged, and simultaneously work on resolving the remaining test failures and hangs.

When that's done, I'd like to finally move on with the remaining TODO items. Those are:

  1. Add support to backtrace through signal trampoline and extend the support to libexecinfo, unwind implementations (LLVM, nongnu). Examine adding CFI support to interfaces that need it to provide more stable backtraces (both kernel and userland).

  2. Add support for i386 and aarch64 targets.

  3. Stabilize LLDB and address breaking tests from the test suite.

  4. Merge LLDB with the base system (under LLVM-style distribution).

This work is sponsored by The NetBSD Foundation

The NetBSD Foundation is a non-profit organization and welcomes any donations to help us continue funding projects and services to the open-source community. Please consider visiting the following URL to chip in what you can:

https://netbsd.org/donations/#how-to-donate

[0 comments]

 



Post a Comment:
Comments are closed for this entry.