Why you should Dockerise your build environments
Docker and containers are all the buzz these days, but most of the talk about them relates to running online services. There is, however, another extremely useful mode where a Docker image is used as a "command line tool" to perform some task from a clean state and then exit. A perfect use case for this is building software: instead of having to configure your own computer with all the build environments needed, you simply use Docker images.
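To make the "command line tool" mode concrete, here is a minimal sketch of how such an invocation could be assembled. The image name `registry.example.com/build/ubuntu-clang:1`, the `/src` mount point, and the `make all` build command are hypothetical placeholders, not the setup described in this article; only the standard `docker run` options `--rm`, `-v`, and `-w` are used.

```python
import shlex
from pathlib import Path
from typing import List, Sequence


def docker_build_command(image: str, source_dir: str, build_cmd: Sequence[str]) -> List[str]:
    """Return the argv for a one-shot build inside a container.

    The container starts from the clean state of the image, runs the
    build command against the mounted source tree, and is removed
    again when the command finishes.
    """
    source = Path(source_dir).resolve()  # bind mounts need an absolute host path
    return [
        "docker", "run",
        "--rm",                   # delete the container once the build ends
        "-v", f"{source}:/src",   # mount the host source tree (hypothetical mount point)
        "-w", "/src",             # run the build from inside the mounted tree
        image,
        *build_cmd,
    ]


if __name__ == "__main__":
    # Hypothetical image name and build command, for illustration only.
    cmd = docker_build_command(
        "registry.example.com/build/ubuntu-clang:1", ".", ["make", "all"]
    )
    print(shlex.join(cmd))
```

Passing the returned list to `subprocess.run(cmd, check=True)` would launch the build on a machine that has Docker installed. Nothing but Docker itself needs to exist on the host, because the compiler and its dependencies live in the image.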
Before landing on the rather dull title of "Dockerising your build environments" I went through several other potential titles, all of which ran along the same lines as "How Docker solved every problem I ever had!". In this article I am going to discuss the common challenges in building software that I have encountered over the past 20 years, and how embracing Docker has eliminated all of them.
In the earlier days of software development I only dealt with Windows systems. My host was Windows, and the software I was building was for Windows. Visual Studio was the builder of choice at the time. There was no Continuous Integration (CI) system at that stage, and it was not too onerous to keep all the developers' machines in sync with the same version of VS. A few tweaks were required, and we got away with a simple readme file describing how to set up your build environment. Official builds were typically done on my computer. Over the years our tiny company became much larger, and with that came extra complications.
We started adding more platforms that we built for. While originally we only built for Windows, we later added several variants of Linux, MacOS, iOS, and Android to the list. On top of this we had many more developers, many of whom used hosts other than Windows. We also had many more projects to maintain. Keeping a sensible build system became far more difficult.
We started using Bamboo as our CI system, and it became our official builder for releases. It also meant everyone could set off builds for every platform. Developers could typically build only a subset of the products locally. For example, a developer with a MacOS host would probably only be able to build locally for MacOS and iOS, and even then usually with only one particular version of Xcode installed, which may have been "close enough" but was not quite the version the official builder would use. It was possible to use Virtual Machines (VMs) on your computer to build for other platforms, but setting up the environments was a lot of work and rarely gave you 100% coverage of all the configurations we built for. The CI system could be relied on to produce the official builds; however, maintaining it was problematic and resource contention was a large issue.
Our CI system ran on a cluster of VMware ESX servers, and each server hosted many VMs. Each VM had a particular build environment and the Bamboo runner agent installed on it. Due to various shortcomings and restrictions, each VM could only run one job at a time. In order to build in a timely manner we had multiple VMs for each build type so that builds could occur in parallel. For example, building for Windows alone required 8 jobs, as we had x86 and x64, Debug and Release, and both Windows user mode and Windows kernel mode builds. Each of these builders was an individual VM that was always running, waiting for jobs to be assigned. This had several issues:
- The VMs had considerable overhead as each one was a full operating system and the Bamboo agent itself would consume 500MB of RAM while idle.
- While the VMs were originally cloned from a template they all needed to be updated individually. Sometimes they would deviate as someone forgot to update all of them in sync.
- Only so many VMs could run at once, and they could not all be busy at the same time unless your project happened to require exactly the same number and types of jobs as we had builders. So for any given run several VMs would sit idle but still consume resources, while jobs queued on a smaller set.
- Getting a build environment change for a particular project was very cumbersome and required coordination with many other people.
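The figure of 8 Windows jobs mentioned earlier is simply the product of three two-way choices. A small sketch of that arithmetic follows; the job naming scheme is invented for illustration and is not the one we actually used.

```python
from itertools import product

# The three independent axes of the Windows builds described above.
ARCHITECTURES = ("x86", "x64")
CONFIGURATIONS = ("Debug", "Release")
MODES = ("user", "kernel")


def windows_build_jobs():
    """Enumerate every combination of architecture, configuration and mode."""
    return [
        f"windows-{arch}-{config}-{mode}"  # hypothetical naming scheme
        for arch, config, mode in product(ARCHITECTURES, CONFIGURATIONS, MODES)
    ]


jobs = windows_build_jobs()
print(len(jobs))  # 2 * 2 * 2 = 8
```

With one job per VM, building all eight in parallel means eight always-on Windows VMs for a single platform, before any other platform is counted.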
The last of those issues, getting a build environment changed, was really the most significant and frustrating problem. If your project decided it needed a newer version of some build tool, say the .NET Framework, it was not as simple as just upgrading the builder VMs. There were significant hurdles:
- The builders were shared between many projects, and the other projects might not be ready to update the version they built with.
- The servers were typically full already so adding brand new VMs was not possible without removing some older ones.
- To stop everyone from making ad hoc changes to the builders whenever they felt like it, one person was made responsible for the whole setup. Every change therefore required a lot of coordination and negotiation.
Despite our best efforts we were often tripped up. There were countless situations where it was deemed perfectly safe to update a bunch of builders to a newer version of something, and then a couple of days later we would hear screams from another team whose project build was now broken while they were trying to get a release out that day. A lot of compromises also had to be made, such as "can you not just use an older version of Python" or "you can have just 2 builders for that configuration and it's going to take forever to build". Developers also had to get used to random failures where you just knew you had to press rebuild and it would typically work. These were caused by a rogue builder being slightly different from the others and failing a particular job, but only when that job happened to land on that builder. Tracking down the issue was time consuming, so it was typically easier to hit rebuild and forget about it until it happened again. Only when the failure rate became too high did anyone get around to fixing it. Often the "fix" was simply to disable that builder and kick the problem down the road for a while.
All the above troubles were just for the CI system. Getting people to the point where they could build locally was even harder. In a typical scenario a new developer would set up their computer and, if it was Windows, install Visual Studio 2017 but forget to include the optional ATL component, so one particular project would break when compiling. If they did install VS2017 correctly, there was a good chance that they didn't install the WinDDK that we used, or didn't set up the environment variables properly. As a result it typically took days to get a new developer to the point of just being able to build the software, and even then only a subset of it.
In late 2020 I discovered Docker and my life changed completely, and shortly after our entire build system changed completely! Over the course of several months I experimented with and refined Docker images until they were at a level I called "pretty near perfect". Docker allows each individual build environment to be completely isolated and containerised without affecting anything else. For example, you can have a Docker image that contains a particular version of Ubuntu and a particular version of GCC or Clang, and it does not touch anything on the host computer or affect any other image. Any number of images can exist side by side without interfering with each other. This is a very different situation from trying to install multiple build environments on a single host, where conflicts almost always occur.
There is now no requirement for any developer to install any build system at all on their own computer. They need only install a host OS, their favourite editor, and Docker. Installing Docker on Linux, MacOS, or Windows is trivial. Once these are set up, along with access to the source code, a developer can build any configuration of the project locally. The Docker images are all hosted on our network and are automatically downloaded the first time they are used. The time needed to get a new developer machine building software has gone from a few days, to build some of it, to thirty minutes, to build 100% of it.
The CI system is now also almost infinitely better. Instead of a whole myriad of different VMs there is now only one type: a GitLab runner that can launch Docker images. (We also shifted from Bamboo to GitLab, which is far better suited to Docker use.) We only need a single VM per physical ESX host, configured with many CPU cores and plenty of memory. Each VM has a very simple setup which is scripted, and it takes approximately 15 minutes to provision a brand new host from nothing. Ubuntu is installed, Docker is installed, and GitLab Runner is installed. That is it. The maintenance of the VMs is pretty much zero as there is nothing to update. Each project can use whatever build environment it wishes, just by selecting the Docker image it wants for any particular job. There is no one in charge of maintaining a central set of Docker images; anyone can create their own customised image at any time and publish it, and it will immediately be available to everyone. As each image has its own name it does not affect anyone else's project. No resources are wasted because the images are only running while building; the rest of the time they are just an image on disc. Each machine only has one copy of the image on disc regardless of how many instances it may run at once. Images that are only small variations of another image take up even less space as they just store a difference layer.
It is now very easy for a project to make a small tweak to an existing build environment and publish a new image that they can use without wrecking anyone else's project. No compromises have to be made as to how many builders of that type we can have. And no one needs to wait for someone else to coordinate and decide what environments are available.
One other extremely good feature of Docker images is that they are created via scripts (Dockerfiles), which means the build environment is always documented. Rather than relying on people's memories and on build documents that rarely get updated, the exact method to install a build environment is explicitly listed in the Dockerfile.
All of this might sound too good to be true, so "what are the downsides?" you might ask.
Absolutely none! It truly is amazing! Although it took me quite a long time to get it to this level, no one else in our company needs to do the same, as it's extremely easy to just piggyback on the work already done. Anyone can take any of the build scripts and tweak them to their own requirements, or just use them directly if they are compatible. Some of the build environments were much harder to Dockerise initially than others. For example, a build environment consisting of Ubuntu 20.04 and Clang is not especially challenging and needs only a few lines in a Dockerfile. Building for MacOS and iOS was more of a challenge, as was getting Visual Studio to build using Wine in a Docker container. The challenges and time spent were well worth it given the results we now have.
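To give a feel for how short the simple case is, here is a hedged sketch that renders a Dockerfile for an Ubuntu 20.04 and Clang environment. It is not one of our actual build scripts: the package list (`clang`, `make`) and the `/src` working directory are assumptions chosen for illustration, and a real project would add whatever else its build needs.

```python
from typing import Sequence


def render_dockerfile(base_image: str, packages: Sequence[str]) -> str:
    """Return the text of a small Dockerfile that installs the given apt packages."""
    lines = [
        f"FROM {base_image}",
        # Keep apt from prompting for input while the image is built.
        "ARG DEBIAN_FRONTEND=noninteractive",
        "RUN apt-get update \\",
        f"    && apt-get install -y --no-install-recommends {' '.join(packages)} \\",
        "    && rm -rf /var/lib/apt/lists/*",
        # Hypothetical location where the source tree is mounted at build time.
        "WORKDIR /src",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Assumed package list; adjust to what the project actually requires.
    print(render_dockerfile("ubuntu:20.04", ["clang", "make"]), end="")
```

The rendered text is six lines long. Saved as `Dockerfile`, it can be turned into an image with `docker build`, and the file itself then serves as the documentation of that build environment.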
FAQ: "Okay that sounds great, but how do I actually do it?"
A: "A follow-up article will demonstrate some real-world build environments you can use."
The following article gives a working example of a Docker image that can build Windows drivers using the WinDDK running inside Wine. The resulting image can be run on any host.