Threading support in LLDB continued


October 05, 2019 posted by Michał Górny

Upstream describes LLDB as a next generation, high-performance debugger. It is built on top of the LLVM/Clang toolchain, and features great integration with it. At the moment, it primarily supports debugging C, C++ and ObjC code, and there is interest in extending it to more languages.

In February, I started working on LLDB, as contracted by the NetBSD Foundation. So far I've been working on reenabling continuous integration, squashing bugs, improving NetBSD core file support, extending NetBSD's ptrace interface to cover more register types and fix compat32 issues, and fixing watchpoint support. Then I started working on improving thread support. You can read more about that in my July 2019 report.

I was on vacation in August, and in September I resumed work on LLDB. I started by fixing new regressions in the LLVM suite, then improved my previous patches and continued debugging the test failures and timeouts resulting from them.

LLVM 8 and 9 in NetBSD

Updates to LLVM 8 src branch

I have been asked to rebase my llvm8 branch of NetBSD src tree. I've done that, and updated it to LLVM 8.0.1 while at it.

LLVM 9 release

LLVM 9.0.0 final was tagged in September. I did the pre-release testing for it, and discovered that the following tests were hanging:

LLVM :: ExecutionEngine/MCJIT/eh-lg-pic.ll
LLVM :: ExecutionEngine/MCJIT/eh.ll
LLVM :: ExecutionEngine/MCJIT/multi-module-eh-a.ll
LLVM :: ExecutionEngine/OrcMCJIT/eh-lg-pic.ll
LLVM :: ExecutionEngine/OrcMCJIT/eh.ll
LLVM :: ExecutionEngine/OrcMCJIT/multi-module-eh-a.ll

I couldn't reproduce the problem with LLVM trunk, so I instead focused on looking for the fix. I came to the conclusion that the problem had been fixed by adding a missing linked library. I requested a backport in bug 43196, and it has been merged in r371042.

I didn't put more effort into figuring out why the lack of this linkage caused issues for us. However, as Lang Hames said on the bug, ‘adding the dependency was the right thing to do’.

LLVM 9 for NetBSD src

Afterwards, I started working on updating my NetBSD src branch for LLVM 9. However, in the middle of that I was informed that Joerg had already finished doing that independently, so I stopped.

Furthermore, I was informed that LLVM 9.0.0 will not make it to src, since it still lacks some fixes (most notably, adding a pass to lower is.constant and objectsize intrinsics). Joerg plans to import some revision of the trunk instead.

Buildbot regressions

Initial regressions

The first problem that needed solving was an LLDB build failure caused by replacing std::once_flag with llvm::once_flag. I came to the conclusion that the build failed because the call site in LLDB combined std::call_once with llvm::once_flag. The solution was to replace the former with llvm::call_once.

After fixing the build failure, we had a bunch of test failures on buildbot to address. Kamil helped me and tracked one of them down to a new test for stack exhaustion handling. The test author decided that it ‘is only a best-effort mitigation for the case where things have already gone wrong’, and marked it unsupported on NetBSD.

On the plus side, two of the tests previously failing on NetBSD have been fixed upstream. I've un-XFAIL-ed them appropriately. Five new test failures in LLDB were related to those tests being unconditionally skipped before — I've marked them XFAIL pending further investigation in the future.

Another set of issues was caused by enabling -fvisibility=hidden for libc++, which broke builds done with GCC. After being pinged, the author decided to enable it only for builds done using clang.

New issues through September

During September, two new issues arose. The first one was my fault, so I'm going to cover it in the appropriate section below. The second one was a new thread_local test failing. Since it was a newly added test that failed on most of the supported platforms, I've just added NetBSD to the list of failing platforms.

Current buildbot status

After fixing the immediate issues, the buildbot returned to its previous status. The majority of tests pass, with one flaky test repeatedly timing out. Normally, I would skip this specific test in order to have the buildbot report only fresh failures. However, since it is threading-related, I'm waiting to finish my threading update and will reassess afterwards.

Furthermore, I have added --shuffle to lit arguments in order to randomize the order in which the tests are run. According to upstream, this reduces the chance of load-intensive tests being run simultaneously and therefore causing timeouts.

The buildbot host seems to have started crashing recently. OpenMP tests were causing similar issues in the past, and I'm currently trying to figure out whether they are the culprit again.

__has_feature(leak_sanitizer)

Kamil asked me to implement a feature check for leak sanitizer being used. The __has_feature(leak_sanitizer) preprocessor macro is complementary to __SANITIZE_LEAK__ used in NetBSD gcc and is used to avoid reports when leaks are known but the cost of fixing them exceeds the gain.
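As a minimal sketch of how such a check might be used, the snippet below combines the Clang-style feature test with the GCC-style macro. The names USING_LEAK_SANITIZER, leak_sanitizer_enabled() and release_cache_at_exit() are hypothetical and chosen only for this illustration; freeing a cache solely in sanitized builds is just one possible way of avoiding reports about a known leak.

```c
#include <assert.h>
#include <stdlib.h>

/* Compilers without __has_feature get a fallback that always evaluates
 * to 0, so the #if below stays valid everywhere. */
#ifndef __has_feature
#define __has_feature(x) 0
#endif

/* USING_LEAK_SANITIZER is a hypothetical name used only in this sketch. */
#if __has_feature(leak_sanitizer) || defined(__SANITIZE_LEAK__)
#define USING_LEAK_SANITIZER 1
#else
#define USING_LEAK_SANITIZER 0
#endif

/* Returns 1 when the build is leak-sanitized, 0 otherwise. */
int leak_sanitizer_enabled(void) {
    return USING_LEAK_SANITIZER;
}

/* Hypothetical exit hook: frees a long-lived cache only in sanitized
 * builds, and otherwise leaves it for the OS to reclaim at exit. */
void release_cache_at_exit(void **cache) {
    assert(cache != NULL);
#if USING_LEAK_SANITIZER
    free(*cache);
    *cache = NULL;
#endif
}
```

A program would call release_cache_at_exit(&cache) just before exiting; in a regular build the call does nothing.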

Progress in threading support

Fixing LLDB bugs

In the course of previous work, I had a patch for threading support in LLDB partially ready. However, the improvements also caused some of the tests to start hanging. The main focus of my recent work was investigating those problems.

The first issue that I discovered was an inconsistency in how LLDB expressed that no signal is to be sent. In some places it used LLDB_INVALID_SIGNAL (-1), in others it used 0. So far this had gone unnoticed, since the end result in ptrace calls was the same. However, the reworked NetBSD threading support uses an explicit PT_SET_SIGINFO call which, combined with the wrong signal parameter, wiped the previously queued signal.
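The idea of treating both spellings of "no signal" the same way can be sketched as follows. This is an illustration only, not the code from the LLDB patches: SKETCH_INVALID_SIGNAL is a local stand-in that takes its value from the text above, and both helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-in for the sentinel mentioned in the text (value -1). */
#define SKETCH_INVALID_SIGNAL (-1)

/* True only when the caller asked for a real signal to be delivered.
 * Both the sentinel and 0 mean "no signal". */
bool has_signal_to_deliver(int signo) {
    assert(signo >= SKETCH_INVALID_SIGNAL);
    return signo != SKETCH_INVALID_SIGNAL && signo != 0;
}

/* Value to pass as the data argument of a continue-style ptrace
 * request, where 0 means "resume without sending a signal". */
int signal_for_continue(int signo) {
    return has_signal_to_deliver(signo) ? signo : 0;
}
```

With a helper like has_signal_to_deliver(), a caller could skip the explicit signal-information update entirely when there is nothing to deliver, instead of overwriting a queued signal with an empty one.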

I've fixed the C packet handler, then fixed the c, vCont and s handlers to use LLDB_INVALID_SIGNAL correctly. However, I had only tested the fixes with my updated thread support, and they caused a regression in the old code. Therefore, I also had to fix LLDB_INVALID_SIGNAL handling in the NetBSD plugin for the time being.

Thread suspend/resume kernel problem

Sadly, further investigation of hanging tests led me to the conclusion that they are caused by kernel bugs. The first bug I've noticed is that PT_SUSPEND/PT_RESUME do not cause the thread to be resumed correctly. I have written the following reproducer for it:

#include <assert.h>
#include <lwp.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

void* thread_func(void* foo) {
    int i;
    printf("in thread_func, lwp = %d\n", _lwp_self());
    for (i = 0; i < 100; ++i) {
        printf("t2 %d\n", i);
        sleep(2);
    }
    printf("out thread_func\n");
    return NULL;
}

int main() {
    int ret;
    int pid = fork();
    assert(pid != -1);
    if (pid == 0) {
        int i;
        pthread_t t2;

        ret = ptrace(PT_TRACE_ME, 0, NULL, 0);
        assert(ret != -1);
        printf("in main, lwp = %d\n", _lwp_self());
        ret = pthread_create(&t2, NULL, thread_func, NULL);
        assert(ret == 0);
        printf("thread started\n");

        for (i = 0; i < 100; ++i) {
            printf("t1 %d\n", i);
            sleep(2);
        }

        ret = pthread_join(t2, NULL);
        assert(ret == 0);
        printf("thread joined\n");
        return 0;
    }

    sleep(1);
    ret = kill(pid, SIGSTOP);
    assert(ret == 0);
    printf("stopped\n");

    pid_t waited = waitpid(pid, &ret, 0);
    assert(waited == pid);
    printf("wait: %d\n", ret);

    printf("t2 suspend\n");
    ret = ptrace(PT_SUSPEND, pid, NULL, 2);
    assert(ret == 0);
    ret = ptrace(PT_CONTINUE, pid, (void*)1, 0);
    assert(ret == 0);

    sleep(3);
    ret = kill(pid, SIGSTOP);
    assert(ret == 0);
    printf("stopped\n");

    waited = waitpid(pid, &ret, 0);
    assert(waited == pid);
    printf("wait: %d\n", ret);

    printf("t2 resume\n");
    ret = ptrace(PT_RESUME, pid, NULL, 2);
    assert(ret == 0);
    ret = ptrace(PT_CONTINUE, pid, (void*)1, 0);
    assert(ret == 0);

    sleep(5);
    ret = kill(pid, SIGTERM);
    assert(ret == 0);

    waited = waitpid(pid, &ret, 0);
    assert(waited == pid);
    printf("wait: %d\n", ret);

    return 0;
}

The program should run a two-threaded subprocess, with both threads outputting successive numbers. The second thread should be suspended briefly, then resumed. However, currently it does not resume.
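The reproducer prints the raw status returned by waitpid(), which is not easy to read at a glance. The following sketch shows one way to decode it using the standard POSIX macros. The function and type names are hypothetical, and wait_status_of_exiting_child() exists only to produce a real status for demonstration.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

enum wait_kind { WAIT_EXITED, WAIT_KILLED, WAIT_STOPPED, WAIT_OTHER };

/* Classifies a waitpid() status and stores the exit code or signal
 * number in *value. */
enum wait_kind classify_wait_status(int status, int *value) {
    assert(value != NULL);
    if (WIFSTOPPED(status)) {
        *value = WSTOPSIG(status);
        return WAIT_STOPPED;
    }
    if (WIFSIGNALED(status)) {
        *value = WTERMSIG(status);
        return WAIT_KILLED;
    }
    if (WIFEXITED(status)) {
        *value = WEXITSTATUS(status);
        return WAIT_EXITED;
    }
    *value = status;
    return WAIT_OTHER;
}

/* Prints the status in a readable form instead of a raw integer. */
void print_wait_status(int status) {
    static const char *const names[] = {
        "exited, code", "killed, signal", "stopped, signal", "other, raw"
    };
    int value;
    enum wait_kind kind = classify_wait_status(status, &value);
    printf("wait: %s %d\n", names[kind], value);
}

/* Forks a child that exits immediately with the given code and
 * returns the status reported by waitpid(). */
int wait_status_of_exiting_child(int code) {
    int status = 0;
    pid_t pid = fork();
    assert(pid != -1);
    if (pid == 0)
        _exit(code);
    pid_t waited = waitpid(pid, &status, 0);
    assert(waited == pid);
    return status;
}
```

In the reproducer, print_wait_status(ret) could replace the printf("wait: %d\n", ret) calls, so that a stop caused by SIGSTOP is easy to tell apart from a termination.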

I believe that this is caused by ptrace_startstop() altering process flags without reimplementing the complete logic used by lwp_suspend() and lwp_continue(). I've been able to move forward by calling the two latter functions from ptrace_startstop(). However, Kamil has indicated that he'd like to make those routines use separate bits (to distinguish LWPs stopped by the process from LWPs stopped by the debugger), so I haven't pushed my patch forward.
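To illustrate what using separate bits could mean, here is a toy model. It is not NetBSD kernel code: the flag names, the struct and the functions are all hypothetical, and the real design may differ. The point is only that a thread suspended by both parties stays suspended until both have resumed it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flags for this sketch; not actual NetBSD kernel flags. */
enum {
    SKETCH_SUSPENDED_BY_PROCESS = 1 << 0,  /* the process suspended it */
    SKETCH_SUSPENDED_BY_DEBUGGER = 1 << 1  /* the debugger suspended it */
};

struct sketch_lwp {
    unsigned suspend_flags;
};

/* Records who suspended the LWP. */
void sketch_suspend(struct sketch_lwp *l, unsigned who) {
    assert(l != NULL);
    l->suspend_flags |= who;
}

/* Clears only the bit owned by the party that resumes the LWP. */
void sketch_resume(struct sketch_lwp *l, unsigned who) {
    assert(l != NULL);
    l->suspend_flags &= ~who;
}

/* The LWP may run only when nobody keeps it suspended. */
bool sketch_runnable(const struct sketch_lwp *l) {
    assert(l != NULL);
    return l->suspend_flags == 0;
}
```

With a single shared bit, a resume issued by the debugger would also undo a suspension requested by the process itself; separate bits avoid that.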

Multiple thread reporting kernel problem

The second and more important problem is related to how new LWPs are reported to the debugger. Or rather, that they are not reported reliably. When many threads are started by the process in a short time (e.g. in a loop), the debugger receives reports only for some of them.

This can be reproduced using the following program:

#include <assert.h>
#include <lwp.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

void* thread_func(void* foo) {
    printf("in thread, lwp = %d\n", _lwp_self());
    sleep(10);
    return NULL;
}

int main() {
    int ret;
    int pid = fork();
    assert(pid != -1);
    if (pid == 0) {
        int i;
        pthread_t t[10];

        ret = ptrace(PT_TRACE_ME, 0, NULL, 0);
        assert(ret != -1);
        printf("in main, lwp = %d\n", _lwp_self());
        raise(SIGSTOP);
        printf("main resumed\n");

        for (i = 0; i < 10; i++) {
            ret = pthread_create(&t[i], NULL, thread_func, NULL);
            assert(ret == 0);
            printf("thread %d started\n", i);
        }

        for (i = 0; i < 10; i++) {
            ret = pthread_join(t[i], NULL);
            assert(ret == 0);
            printf("thread %d joined\n", i);
        }

        return 0;
    }

    pid_t waited = waitpid(pid, &ret, 0);
    assert(waited == pid);
    printf("wait: %d\n", ret);
    assert(WSTOPSIG(ret) == SIGSTOP);

    struct ptrace_event ev;
    ev.pe_set_event = PTRACE_LWP_CREATE | PTRACE_LWP_EXIT;

    ret = ptrace(PT_SET_EVENT_MASK, pid, &ev, sizeof(ev));
    assert(ret == 0);

    ret = ptrace(PT_CONTINUE, pid, (void*)1, 0);
    assert(ret == 0);

    while (1) {
        waited = waitpid(pid, &ret, 0);
        assert(waited == pid);
        printf("wait: %d\n", ret);
        if (WIFSTOPPED(ret)) {
            assert(WSTOPSIG(ret) == SIGTRAP);

            ptrace_siginfo_t info;
            ret = ptrace(PT_GET_SIGINFO, pid, &info, sizeof(info));
            assert(ret == 0);

            struct ptrace_state pst;
            ret = ptrace(PT_GET_PROCESS_STATE, pid, &pst, sizeof(pst));
            assert(ret == 0);
            printf("SIGTRAP: si_code = %d, ev = %d, lwp = %d\n",
                    info.psi_siginfo.si_code, pst.pe_report_event, pst.pe_lwp);

            ret = ptrace(PT_CONTINUE, pid, (void*)1, 0);
            assert(ret == 0);
        } else
            break;
    }

    return 0;
}

The program starts 10 threads, and the debugger should report 10 SIGTRAP events for LWPs being started (ev = 8) and the same number for LWPs exiting (ev = 16). However, initially I was getting at most 4 SIGTRAPs, and the remaining threads went unnoticed.
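Counting the reports makes missing events easier to spot than reading the raw output. The sketch below tallies events by type. The two constants are local stand-ins that use the values quoted above (8 and 16); on NetBSD one would use PTRACE_LWP_CREATE and PTRACE_LWP_EXIT from <sys/ptrace.h> instead. All other names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins using the event values quoted in the text. */
#define SKETCH_LWP_CREATE 8
#define SKETCH_LWP_EXIT 16

/* Maps a reported event value to a readable label. */
const char *sketch_event_name(int ev) {
    switch (ev) {
    case SKETCH_LWP_CREATE:
        return "LWP created";
    case SKETCH_LWP_EXIT:
        return "LWP exited";
    default:
        return "other event";
    }
}

struct sketch_tally {
    int created;
    int exited;
};

/* Adds one reported event to the tally; unrelated events are ignored. */
void sketch_count_event(struct sketch_tally *t, int ev) {
    assert(t != NULL);
    if (ev == SKETCH_LWP_CREATE)
        t->created++;
    else if (ev == SKETCH_LWP_EXIT)
        t->exited++;
}
```

Calling sketch_count_event() with pst.pe_report_event for every SIGTRAP, and comparing both counters against 10 once the child exits, would show directly how many reports were lost.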

The issue is that do_lwp_create() does not raise SIGTRAP directly but defers that to mi_startlwp(), which is called asynchronously as the LWP starts. This means that the former function can return before SIGTRAP is emitted, and the program can start another LWP. Since signals are not properly queued, multiple SIGTRAPs can end up being issued simultaneously and lost.
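Why missing queueing loses events can be shown with a toy model. This is not how the kernel implements signals; it only contrasts a single "pending" flag with a queue that remembers every event. All names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_QUEUE_MAX 16

/* Model 1: a single pending flag. Posting twice before the receiver
 * looks is indistinguishable from posting once. */
struct pending_bit {
    bool pending;
};

void bit_post(struct pending_bit *p) {
    assert(p != NULL);
    p->pending = true;
}

/* Returns the number of events the receiver gets to see (0 or 1). */
int bit_drain(struct pending_bit *p) {
    assert(p != NULL);
    int seen = p->pending ? 1 : 0;
    p->pending = false;
    return seen;
}

/* Model 2: a queue that stores the LWP id of every posted event. */
struct pending_queue {
    int lwp[SKETCH_QUEUE_MAX];
    int count;
};

void queue_post(struct pending_queue *q, int lwp) {
    assert(q != NULL && q->count < SKETCH_QUEUE_MAX);
    q->lwp[q->count++] = lwp;
}

/* Returns the number of events the receiver gets to see. */
int queue_drain(struct pending_queue *q) {
    assert(q != NULL);
    int seen = q->count;
    q->count = 0;
    return seen;
}
```

If ten events are posted before the receiver drains, the flag model reports one and the queue model reports ten. The queue also keeps the per-event data, which the flag model has no place for.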

Kamil has already worked on making simultaneous signal delivery more reliable. However, he reverted his commit as it caused regressions. Nevertheless, applying it made it possible for the test program to get all SIGTRAPs at least most of the time.

The ‘repeated’ SIGTRAPs did not include correct LWP information, though. Kamil has recently fixed that by moving the relevant data from process information to signal information struct. Combined with his earlier patch, this makes my test program pass most of the time (sadly, there seem to be some more race conditions involved).

Summary of threading work

My current work-in-progress patch can be found on Differential as D64647. However, it is currently unsuitable for merging as some tests start failing or hanging as a side effect of the changes. I'd like to try to get as many of them fixed as possible before pushing the changes to trunk, in order to avoid causing harm to the buildbot.

The status with the current set of Kamil's work-in-progress patches applied to the kernel includes approximately 4 failing tests and 10 hanging tests.

Other LLVM news

Manikishan Ghantasala has been working on NetBSD-specific clang-format improvements in this year's Google Summer of Code. He is continuing to work on clang-format, and has recently been given commit access to the LLVM project!

Besides NetBSD-specific work, I've been trying to improve a few other areas of LLVM. I've been working on fixing regressions in stand-alone build support and regressions in support for BUILD_SHARED_LIBS=ON builds. I have to admit that while a year ago I was the only person fixing those issues, nowadays I see more contributors submitting patches for breakages specific to those builds.

I have recently worked on fixing bad assumptions in LLDB's Python support. However, it seems that Haibo Huang has taken it from me and is doing a great job.

My most recent endeavor was fixing LLVM_DISTRIBUTION_COMPONENTS support in LLVM projects. This is going to make it possible to precisely fine-tune which components are installed, both in combined tree and stand-alone builds.

Future plans

My first goal right now is to determine what is causing the test host to crash, and restore buildbot stability. Afterwards, I'd like to continue investigating threading problems and provide more reproducers for any kernel issues we may be having. Once this is done, I'd like to finally push my LLDB patch.

Since threading is not the only goal left in the TODO, I may switch between working on it and on the remaining TODO items. Those are:

  1. Add support to backtrace through signal trampoline and extend the support to libexecinfo, unwind implementations (LLVM, nongnu). Examine adding CFI support to interfaces that need it to provide more stable backtraces (both kernel and userland).

  2. Add support for i386 and aarch64 targets.

  3. Stabilize LLDB and address breaking tests from the test suite.

  4. Merge LLDB with the base system (under LLVM-style distribution).

This work is sponsored by The NetBSD Foundation

The NetBSD Foundation is a non-profit organization and welcomes any donations to help us continue funding projects and services to the open-source community. Please consider visiting the following URL to chip in what you can:

https://netbsd.org/donations/#how-to-donate
