New build cluster speeds up daily autobuilds
Saying goodbye
For several years, our autobuild cluster at Columbia University has provided our CI and produced many official release builds.
Over the years, with new compiler versions and more targets to build, the build times (and space requirements) grew. A lot. A few weeks ago the old cluster needed slightly more than nine and a half hours for a full build of -current.
So it was time to replace the hardware. At the same time we moved to another colocation facility with better network connectivity.
Preparing for the future
Not only did the hardware age; the CI system we use (a bunch of shell scripts) also needed a serious overhaul.
As reported in the last AGM, riastradh@ did most of the work to make the scripts able to deal with mercurial instead of cvs.
This was a necessary step in preparing our upcoming move away from cvs, and that step is now complete. While the build cluster currently (again) runs from cvs, we completed a few full builds from mercurial in the last few weeks.
User visible changes
The cvs runs for builds were purely time driven. To deal with replication latency between cvs.NetBSD.org and anoncvs.NetBSD.org, a 10-minute lag was built into the system, so all timestamps (the names of the build directories) were exact to the minute only and always had a trailing zero in the minute field.
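A rough sketch of that old, purely time-driven scheme (the schedule, path, and script name here are illustrative, not the real setup):

    # crontab on the controller: start a HEAD build at fixed wall-clock
    # times, whether or not anything was committed in the meantime.
    0 4,16 * * *    /cluster/bin/start-build HEAD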
This does not fit well with modern distributed source control systems. For an hg or git build the cluster fetches the repository state from the master repository and updates the checkout directory to the latest state of a branch (or HEAD). So the repository state dictates the time stamp for the build, not the other way around as it did for cvs.
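A minimal sketch of the hg case, assuming the clone lives in $SRC and the branch name is in $BRANCH (both names are illustrative, not the real scripts):

    # Bring the clone up to date and move the checkout to the branch tip.
    hg -R "$SRC" pull
    hg -R "$SRC" update -r "$BRANCH"
    # The date of the newest changeset on the branch, in seconds since the
    # Unix epoch, becomes the build time / reproducible ID.
    BUILDSTAMP=$(hg -R "$SRC" log -r "$BRANCH" -T '{date|hgdate}\n' | cut -d' ' -f1)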
As a result there is a difference between the time a build is started (wall clock) and the time of the last change of the current branch in the repository (build time, or reproducible ID). You can see both times on the status page. Right now it displays:
Currently building tag HEAD-lint, build 2025-07-22 19:19:02 UTC (started at: 2025-07-22 19:26:40 UTC). No results yet.
And, obviously, we can now easily skip builds when nothing has changed, which happens often on old branches, but can also happen when all HEAD builds finish before anything new has been committed. Hurry up, you lazy developers!
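A sketch of that check, again with illustrative names, assuming the changeset of the last completed build is kept in a small state file:

    # Skip the build if the branch tip has not moved since the last run.
    last=$(cat "$STATE/last-rev.$BRANCH" 2>/dev/null)
    cur=$(hg -R "$SRC" log -r "$BRANCH" -T '{node}\n')
    [ "x$cur" = "x$last" ] && exit 0        # nothing new on this branch
    echo "$cur" > "$STATE/last-rev.$BRANCH"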
While we are still running from cvs, the process of checking whether anything has changed is quite slow (on the order of five minutes). So you may notice a status display like:
Currently searching for source changes...
while the "cvs update" is crawling. When running from hg (or git) this will be a lot faster.
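The slow cvs-era check amounts to something like this sketch (the path is illustrative): a full "cvs update" has to crawl the whole checkout, and any output line means there is new work to do.

    cd /cluster/src
    # Lines starting with U (updated) or P (patched) indicate new commits.
    if cvs -q update -dP | grep -q '^[UP] '; then
            start_build         # hypothetical helper that kicks off a build
    fi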
The new hardware
The new cluster consists of four build machines, each equipped with two 16-core EPYC CPUs and 256 GB of RAM. As the "brain" we have an additional controller node with 32 GB of RAM and a (somewhat weaker) Intel CPU. The controller node creates all source sets, does repository operations, and distributes sources, but does not perform any building itself.
On the build machines we run 8 builds in parallel each, every individual build with low parallelism (-j 4). This tries to keep all CPUs saturated for most of the duration of the overall build. Individual builds have several phases with little or reduced opportunity for parallelism, e.g. initially during configure runs, while linking llvm components, or near the end when compressing artefacts. The goal is not to run a single (individual) build as fast as possible, but the whole set of them in as little time overall.
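As an illustration, each of those eight concurrent builds boils down to an invocation of build.sh roughly like the following (the target machine and object directory are just examples):

    cd "$SRC"
    # -U: unprivileged build, -j 4: four parallel jobs within this build,
    # -m: target machine, -O: a separate object directory per target.
    ./build.sh -U -j 4 -m sparc64 -O "$OBJ/sparc64" release

Eight such builds at -j 4 add up to 32 jobs, matching the 32 cores of each build machine.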
The build log summary
The header of the result page created for each build now tries to give all the details necessary to recreate that particular build, e.g.:
From source dated Tue Jul 22 04:45:13 UTC 2025
Reproducible builds timestamp: 1753159513
Repository: src@cvs:HEAD:20250722044513-xsrc@cvs:HEAD:20250720221743
The two timestamps denote the same point in time, one in seconds since the Unix epoch, the other human readable.
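You can convert between the two representations with date(1); on NetBSD (and other BSDs), for example:

    $ date -u -r 1753159513
    Tue Jul 22 04:45:13 UTC 2025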
The repository ID is, for cvs, again based on timestamps. But for hg or git you will see the typical commit hash values here.
Why a custom CI system?
We considered using an off-the-shelf CI system and customizing it to our needs instead of moving the fully custom build system forward. We also talked to FreeBSD release engineering about it and asked about their experience. Overall, the work to customize an off-the-shelf solution looked roughly equivalent to the work needed now to modernize the existing, working solution, and we saw no huge benefit for the future in picking either of them.
What does our build infrastructure provide?
First and foremost, we can quickly verify that all of our supported branches indeed compile. While developers are supposed to commit only tested changes, they usually build at most one branch and architecture. But sometimes changes have unforeseen consequences, for example the install image of an exotic architecture growing more in size than expected.
Then, in theory, no NetBSD user has to waste any CPU cycles in order to install (for example) NetBSD-current or the latest NetBSD 9 tree. Instead you can simply download an image and install from that — no matter if your architecture is amd64 or a slightly more exotic one (but you will of course always be able to download the source tree and build that if that suits you better).
Note that a supported architecture is not supported just at one point in time, left to collect dust as time progresses until it no longer compiles, making it hard to merge current changes back into that tree. Instead, once an architecture is supported, our CI system will build it without rest as one of the many we support. Take the rather new Wii port as an example: now that it is supported, you can, at least for the foreseeable future, download the latest and greatest NetBSD release, populate an SD card with the image, and boot it. Now, in half a year, and in 10 years (and even further down the line).
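Writing such an image to an SD card might look roughly like this; the image file name and the sd0 device are assumptions here, so check dmesg for the device your card actually attaches as:

    # Uncompress the release image and write it to the raw SD card device.
    gunzip < NetBSD-wii.img.gz | dd of=/dev/rsd0d bs=1m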
Acknowledgements
We are grateful to Two Sigma Investments, LP for providing the space and connectivity for the new build cluster.