This article is part of a series on how to set up a bare-metal CI system for Linux driver development. Check out part 1 where we lay out the context/high-level principles of the whole CI system, and make the machine fully controllable remotely (power on, OS to boot, keyboard/screen emulation using a serial console).
In this article, we will start demystifying the boot process, and discuss different ways to generate and boot an OS image along with a kernel for your machine. Finally, we will introduce boot2container, a project that makes running containers on bare metal a breeze!
This work is sponsored by the Valve Corporation.
Generating a kernel & rootfs for your Linux-based testing
To boot your test environment, you will need to generate the following items:
- A kernel, providing all the necessary drivers for your test;
- A userspace, containing all the dependencies of your test (rootfs);
- An initramfs (optional), containing the drivers/firmware needed to access the userspace image, along with an init script performing the early boot sequence of the machine.
The initramfs is optional because the drivers and their firmware can be built into the kernel directly.
Let’s not generate these items just yet, but instead let’s look at the different ways one could generate them, depending on their experience.
The embedded way
If you are used to dealing with embedded devices, you are already familiar with projects such as Yocto or Buildroot. They are well-suited to generate a tiny rootfs, which can be useful for netbooted systems such as the one we set up in part 1 of this series. They usually allow you to describe everything you want on your rootfs, then configure, compile, and install all the wanted programs in the rootfs.
If you are wondering which one to use, I suggest you check out the presentation from Alexandre Belloni / Thomas Petazzoni, which will give you an overview of both projects and help you decide on what you need.
- Minimal size: Only what is needed is included
- Complete: Configures and compiles the kernel for you
- Slow to generate: Everything is compiled from source
- Small selection of software/libraries: Adding build recipes is however relatively easy
The Linux distribution way
If you are used to installing Linux distributions, your first instinct might be to install your distribution of choice in a chroot or a Virtual Machine, install the packages you want, and package the folder/virtual disk into a tarball.
Fortunately, building the kernel is relatively simple, and there are plenty of tutorials on the topic (see ArchLinux’s wiki). Just make sure to compile modules and firmware in the kernel, to avoid the complication of using an initramfs. Don’t forget to also compress your kernel if you decide to netboot it!
- Relatively fast: No compilation necessary (except for the kernel)
- Familiar environment: Closer to what users/developers use in the wild
- Larger: Packages tend to bring a lot of unwanted dependencies, drastically increasing the size of the image
- Limited choice of distros: Not all distributions are easy to install in a chroot
- Insecure: Requires root rights to generate the image, which may accidentally trash your distro
- Poor reproducibility: Distributions get updates continuously, leading to different outcomes when running the same command
- No caching: all the steps to generate the rootfs are re-done every time
- Incomplete: does not generate a kernel or initramfs for you
The refined distribution way: containers
Containers are an evolution of the old chroot trick, but made secure thanks to the addition of multiple namespaces to Linux. Containers and their runtimes have been addressing pretty much all the cons of the “Linux distribution way”, and became a standard way to share applications.
On top of generating a rootfs, containers also allow setting environment variables and controlling the command line of the program, and they have a standardized transport mechanism which simplifies sharing images.
Finally, container images are constituted of cacheable layers, which can be used to share base images between containers, and also speed up the generation of the container image by only re-computing the layer that changed and all the layers applied on top of it.
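To make the caching rule concrete, here is a toy model of it in Python. It is only a sketch: the key derivation (a SHA-256 chained over the parent key and the step description) and the step names are assumptions made for illustration, not how any particular container tool computes its cache keys.

```python
import hashlib


def layer_ids(steps, base="scratch"):
    """Return one cache key per build step; each key chains its parent's key."""
    ids = []
    parent = base
    for step in steps:
        parent = hashlib.sha256(f"{parent}\n{step}".encode()).hexdigest()
        ids.append(parent)
    return ids


def layers_to_rebuild(old_steps, new_steps):
    """Indices of the steps in new_steps whose layer is not already cached."""
    cached = set(layer_ids(old_steps))
    return [i for i, key in enumerate(layer_ids(new_steps)) if key not in cached]


# Hypothetical build steps: only the driver step differs between the two builds.
old = ["install base OS", "install driver v1", "install test suite"]
new = ["install base OS", "install driver v2", "install test suite"]
print(layers_to_rebuild(old, new))  # [1, 2]
```

Because each key depends on its parent, changing the driver step invalidates that layer and the test suite layer applied on top of it, while the base OS layer is reused.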
The biggest drawback of containers is that they are usually meant to be run on pre-configured hosts. This means that if you want to run the container directly, you will need to make sure to include an initscript or install systemd in your container, and set it as the entrypoint of the container. It is however possible to perform these tasks before running the container, as we’ll explain in the following sections.
- Fastest: No compilation necessary (except for the kernel), and layers cached
- Familiar: Shared environment between developers and the test system
- Flexible: Full choice of distro
- Secure: No root rights needed, everything is done in a user namespace
- Shareable: Containers come with a transport/storage mechanism (registries)
- Reproducible: Easily run the exact same userspace on your dev and test machines
- Larger: Packages tend to bring a lot of dependencies, drastically increasing the size of the image
- Incomplete: does not generate a kernel or initramfs for you
Deploying and booting a rootfs
Now we know how we could generate a rootfs, so the next step is to be able to deploy and boot it!
Challenge #1: Deploying the Kernel / Initramfs
There are multiple ways to deploy an operating system:
- Flash and reboot: Typical on ARM boards / Android phones;
- Netboot: Typical in big organizations that manage thousands of machines.
The former solution is great at preventing the bricking of a device that depends on an Operating System to be flashed again, as it enables checking the deployment on the device itself before rebooting.
The latter solution enables diskless test machines, which is an effective way to reduce state (the enemy #1 of reproducible results). It also enables a faster deployment/boot time as the CI system would not have to boot the machine, flash it, then reboot. Instead, the machine simply starts up, requests an IP address through BOOTP/DHCP, downloads the kernel/initramfs, and executes the kernel. This was the solution we opted for in part 1 of this blog series.
Whatever solution you end up picking, you will now be presented with your next challenge: making sure the rootfs remains the same across reboots.
Challenge #2: Deploying the rootfs efficiently
If you have chosen the Flash and reboot deployment method, you may be prepared to re-flash the entire Operating System image every time you boot. This would make sure that the state of a previous boot won’t leak into the following boots.
This method can however become a big burden on your network when scaled to tens of machines, so you may be tempted to use a Network File System such as NFS to spread the load over a longer period of time. Unfortunately, using NFS brings its own set of challenges (how deep is this rabbit hole?):
- The same rootfs directory cannot be shared across machines without duplication unless mounted read-only, as machines should not be able to influence each-other’s execution;
- The NFS server needs to remain available as long as at least one test machine is running;
- Network congestion might influence the testing happening on the machine, which can affect functional testing, but will definitely affect performance testing.
So, instead of trying to spread the load, we could try to reduce the size of the rootfs by only sending the content that changed. For example, the rootfs could be split into the following layers:
- The base Operating System needed for the testing;
- The driver you want to test (if it wasn’t in the kernel);
- The test suite(s) you want to run.
Layers can be downloaded by the test machine, through a short-lived-state network protocol such as HTTP, as individual SquashFS images. Additionally, SquashFS provides compression which further reduces the storage/network bandwidth needs.
The layers can then be directly combined by first mounting them to separate folders in read-only mode (the only mode supported by SquashFS), then merging them using OverlayFS. OverlayFS will store all the writes done to this file system in a separate writable directory (the upper directory, which is paired with a work directory). If these directories are backed by a ramdisk (tmpfs) or a never-reused temporary directory, then this guarantees that no information from previous boots would impact the new boots!
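The mounting sequence described above can be sketched as follows. This is a minimal illustration that only builds the commands rather than running them (they would need root on the test machine); the image file names and the /layers, /scratch and /merged paths are hypothetical.

```python
import shlex


def overlay_mount_commands(images, layers_dir="/layers", scratch="/scratch", merged="/merged"):
    """Return the commands (as argv lists) that stack SquashFS layers.

    `images` is ordered from the base layer to the topmost layer.
    Nothing is executed here.
    """
    commands = [["mkdir", "-p", scratch, merged]]
    mount_points = []
    for index, image in enumerate(images):
        mount_point = f"{layers_dir}/{index}"
        commands.append(["mkdir", "-p", mount_point])
        # SquashFS images are always mounted read-only
        commands.append(["mount", "-t", "squashfs", "-o", "loop,ro", image, mount_point])
        mount_points.append(mount_point)
    # All writes land in a tmpfs, so they vanish at the next boot
    commands.append(["mount", "-t", "tmpfs", "tmpfs", scratch])
    commands.append(["mkdir", "-p", f"{scratch}/upper", f"{scratch}/work"])
    # OverlayFS expects lower layers listed from topmost to bottommost
    lowerdir = ":".join(reversed(mount_points))
    options = f"lowerdir={lowerdir},upperdir={scratch}/upper,workdir={scratch}/work"
    commands.append(["mount", "-t", "overlay", "overlay", "-o", options, merged])
    return commands


commands = overlay_mount_commands(["base.squashfs", "driver.squashfs", "tests.squashfs"])
for command in commands:
    print(shlex.join(command))
```

Note that the upper and work directories are created inside the same tmpfs, since OverlayFS requires them to live on the same file system.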
If you are familiar with containers, you may have recognized this approach as what is used by containers: layers + overlay2 storage driver. The only difference is that container runtimes depend on tarballs rather than SquashFS images, probably because SquashFS is a Linux-only filesystem.
If you are anything like me, you should now be pretty tempted to simply use containers for the rootfs generation, transport, and boot! That would be a wise move, given that thousands of engineers have been working on them over the last decade or so, and whatever solution you may come up with will inevitably have even more quirks than these industry standards.
I would thus recommend using containers to generate your rootfs, as there are plenty of tools that will generate them for you, with varying degrees of complexity. Check out buildah if Docker or Podman are too high-level for your needs!
Let’s now brace for the next challenge, deploying a container runtime!
Challenge #3: Deploying a container runtime to run the test image
In the previous challenge, we realized that a great way to deploy a rootfs efficiently was to simply use a container runtime to do everything for us, rather than re-inventing the wheel.
This would enable us to create an initramfs which would be downloaded along with the kernel through the usual netboot process, and would be responsible for initializing the machine, connecting to the network, mounting the layer cache partition, setting the time, downloading a container, then executing it. The last two steps would be performed by the container runtime of our choice.
Generating an initramfs is way easier than one can expect. Projects like dracut are meant to simplify their creation, but my favourite has been u-root, coming from the LinuxBoot project. I generated my first initramfs in less than 5 minutes, so I was incredibly hopeful to achieve the outlined goals in no time!
Unfortunately, the first setback came quickly: container runtimes (Docker or Podman) are huge (~150 to 300 MB), if we are to believe the size of their respective packages and dependencies in Alpine Linux! While this may not be a problem for the Flash and reboot method, it is definitely a significant issue for the Netboot method, which would need to download it for every boot.
Challenge #3.5: Minifying the container runtime
After spending a significant amount of time studying container runtimes, I identified the following functions:
- Transport / distribution: Downloading a container image from a container registry to the local storage (spec);
- De-layer the rootfs: Unpack the layers’ tarball, and use OverlayFS to merge them (default storage driver, but there are many other ways);
- Generate the container manifest: A JSON-based config file specifying how the container should be run;
- Execute the container.
Thus started my quest to find lightweight solutions that could do all of these steps… and wonder just how deep is this rabbit hole??
The usual executor found in the likes of Podman and Docker is runc. It is written in Golang, which compiles everything statically and leads to giant binaries. In this case, runc clocks in at ~12 MB. Fortunately, a knight in shining armour came to the rescue, re-implemented runc in C, and named it crun. The final binary size is ~400 KB, and it is fully compatible with runc. That’s good enough for me!
To download and unpack the rootfs from the container image, I found genuinetools/img which supports that out of the box! Its size was however much bigger than expected, at ~28.5MB. Fortunately, compiling it ourselves, stripping the symbols, then compressing it using UPX led to a much more manageable ~9MB!
What was left was to generate the container manifest according to the runtime spec. I started by hardcoding it to verify that I could indeed run the container. I was relieved to see it would work on my development machine, even though it failed on my initramfs. After spending a couple of hours diffing straces, poking at a couple of sysfs/config files, and realizing that pivot_root does not work in an initramfs, I finally managed to run the container with crun run --no-pivot!
I was over the moon, as the only thing left was to generate the container manifest by patching genuinetools/img to generate it according to the container image manifest (like docker or podman do). This is where I started losing grip: lured by the prospect of a simple initramfs solving all my problems, being so close to the goal, I started free-falling down what felt like the deepest rabbit hole of my engineering career… Fortunately, after a couple of weeks, I emerged, covered in mud but victorious! Cue the gory battle log :)
When trying to access the container image’s manifest in img, I realized that it was re-creating the layers and manifest, and thus was losing the information such as entrypoint, environment variables, and other important parameters. After scouring through its source code and its 500 kLOC of dependencies, I came to the conclusion that it would be easier to start a project from scratch that would use Red Hat’s image and storage libraries to download and store the container on the cache partition. I then needed to unpack the layers, generate the container manifest, and start runc. After a couple of days, ~250 lines of code, and tons of staring at straces to get it working, it finally did! Out was img, and the new runtime’s size was under 10 MB \o/!
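To illustrate why the entrypoint and environment variables matter, here is a minimal sketch of the translation step: turning the configuration embedded in an OCI container image into the config.json expected by an OCI runtime such as runc or crun. It is not the code of the project described above; the example image config is hypothetical, and a real config.json usually also lists mounts, namespaces and capabilities, which are left out here.

```python
import json


def runtime_config(image_config, rootfs="rootfs"):
    """Translate an OCI image config into a minimal OCI runtime config."""
    cfg = image_config.get("config") or {}
    # The command line is the image's Entrypoint followed by its Cmd
    args = (cfg.get("Entrypoint") or []) + (cfg.get("Cmd") or [])
    if not args:
        raise ValueError("image defines neither Entrypoint nor Cmd")
    return {
        "ociVersion": "1.0.2",
        "process": {
            "args": args,
            "env": cfg.get("Env") or [],
            "cwd": cfg.get("WorkingDir") or "/",
        },
        # `rootfs` is the directory holding the unpacked/merged layers
        "root": {"path": rootfs, "readonly": False},
    }


# Hypothetical image config, as it could be found in a container image
image_config = {
    "config": {
        "Entrypoint": ["/bin/sh", "-c"],
        "Cmd": ["run-tests"],
        "Env": ["PATH=/usr/bin:/bin"],
    }
}
print(json.dumps(runtime_config(image_config), indent=2))
```

If the image config is lost or re-created along the way, there is nothing left to fill `args` and `env` from, which is the problem described above.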
The last missing piece in the puzzle was performance-related: use OverlayFS to merge the layers, rather than unpacking them ourselves.
This is when I decided to have another look at Podman, saw that they have their own internal library for all the major functions, and decided to compile podman to try it out. The binary size was ~50 MB, but after removing some features, setting the -w -s LDFLAGS, and compressing it using upx --best, I got the final size down to ~14 MB! Of course, Podman is more than just one binary, so trying to run a container with it failed. However, after a bit of experimentation and stracing, I realized that running the container with --privileged --network=host would work using crun… provided we force-added the --no-pivot parameter to crun. My happiness was however short-lived, replaced by a MAJOR FACEPALM MOMENT:
After a couple of minutes of constant facepalming, I realized I was also relieved, as Podman is a battle-tested container runtime, and I would not need to maintain a single line of Go! Also, I now knew how deep the rabbit hole was, and we just needed to package everything nicely in an initramfs and we would be good. Success, at last!
Boot2container: Run your containers from an initramfs!
If you have managed to read through the article up to this point, congratulations! For the others who just gave up and jumped straight to this section, I forgive you for teleporting yourself to the bottom of the rabbit hole directly! In both cases, you are likely wondering where is this breeze you were promised in the introduction?
Boot2container enters the chat.
Boot2container is a lightweight (sub-20 MB) and fast initramfs I developed that will allow you to ignore the subtleties of operating a container runtime and focus on what matters, your test environment!
Here is an example of how to run boot2container, using SYSLINUX:
MENU LABEL Run docker's hello world container, with caching disabled
APPEND b2c.container=docker://hello-world b2c.cache_device=none b2c.ntp_peer=auto
The hello-world container image will be run in privileged mode, without the host network, which is what you want when running the container for bare metal testing!
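In a CI system, such a boot entry would typically be generated per job rather than written by hand. The following sketch shows one way this could be done; the kernel and initramfs paths and the label name are placeholders, and only the b2c.* options shown in the example above are used.

```python
def syslinux_entry(label, title, kernel, initrd, b2c_options):
    """Format one SYSLINUX menu entry booting boot2container."""
    # Kernel command line: one key=value pair per boot2container option
    append = " ".join(f"{key}={value}" for key, value in b2c_options.items())
    return "\n".join([
        f"LABEL {label}",
        f"    MENU LABEL {title}",
        f"    LINUX {kernel}",
        f"    INITRD {initrd}",
        f"    APPEND {append}",
    ])


entry = syslinux_entry(
    label="hello",  # placeholder label
    title="Run docker's hello world container, with caching disabled",
    kernel="/vmlinuz",  # placeholder path to your kernel
    initrd="/initramfs.cpio.xz",  # placeholder path to the boot2container initramfs
    b2c_options={
        "b2c.container": "docker://hello-world",
        "b2c.cache_device": "none",
        "b2c.ntp_peer": "auto",
    },
)
print(entry)
```

Keeping the options in a dictionary makes it easy for the CI system to swap the container image for each job while leaving the rest of the entry untouched.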
Make sure to check out the list of features and options before either generating the initramfs yourself or downloading it from the releases page. Try it out with your kernel, or the example one bundled in the release!
With this project mostly done, we pretty much conclude the work needed to set up the test machines, and the next articles in this series will be focusing on the infrastructure needed to support a fleet of test machines, and expose it to GitLab/GitHub/…
That’s all for now, thanks for reading that far!