Fast CI builds | Quentin Dufour


Historically, in the good old Jenkins days, a CI build would occur in a workspace that was kept across builds. So your previous artifacts could be re-used if they did not change (for example, make would detect that some files did not change since the last build and thus would not recompile them). It was also assumed that all dependencies were directly installed on the machine. Both of these properties allowed for very fast and efficient builds: only what changed needed to be rebuilt.

This approach had many shortcomings: a stale cache would break builds (or wrongly make them work), improper dependency tracking would make building on a new machine very hard, etc. In the end, developers stop trusting a CI that remains broken, bugs start crippling the project without being noticed, and finally the codebase becomes unmaintainable.

To avoid these problems, developers started to use a new generation of CI relying on VMs (like Travis CI) or containers (like Drone). All builds start with a fresh environment, often a well-known distribution like Ubuntu. Then, for each build, all the dependencies are installed and the build is done from scratch. Such an approach greatly helped developers better track their dependencies and make sure that building their project from scratch remains possible. However, build times skyrocketed. You can wait more than 10 minutes before running a command that would actually check your code. And as recommended by many people¹²³, the whole build cycle (lint, build, test) should remain below 10 minutes to be useful.

To speed up the CI, various optimizations have been explored. CI systems sometimes offer some sort of caching API, and when they do not, an object store like S3 can be used. This cache is used either by directly copying the dependency folder⁴ (for example the target/ folder for Rust or the node_modules/ folder for Node.js), or through dedicated tools like sccache⁵. In this scenario, fetching/updating the cache involves a non-negligible amount of filesystem and network I/O. Another approach relies on providing your own build image that will often be cached on workers. This image can contain your toolchain (for example Rust + Cargo + Clippy + etc.), but also your project dependencies (by copying your Cargo.toml or package.json file and pre-fetching/compiling them). This approach still involves some maintenance burden: the image must be rebuilt and published each time a dependency changes, it’s project-specific, it can easily break, you still do not track your dependencies correctly, etc.
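
To make the cost of the folder-copy strategy more concrete, here is a minimal shell sketch of it. It is only an illustration under assumptions: a local directory stands in for the object store, the cache is keyed on a hash of the lockfile, and the helper names (cache_key, save_cache, restore_cache) are hypothetical, not part of any CI's caching API.

```shell
# Hypothetical helper: derive a cache key from a prefix and the lockfile content.
# $1: prefix (e.g. "rust"), $2: path to the lockfile (e.g. Cargo.lock)
cache_key() {
  printf '%s-%s\n' "$1" "$(sha256sum "$2" | cut -d' ' -f1)"
}

# Archive a dependency folder under the given key.
# $1: key, $2: folder to archive (e.g. target), $3: cache directory
# (the cache directory is an assumption standing in for an S3 bucket)
save_cache() {
  tar -czf "$3/$1.tar.gz" "$2"
}

# Restore the archive for the given key; fails on a cache miss.
# $1: key, $2: cache directory
restore_cache() {
  if [ -f "$2/$1.tar.gz" ]; then
    tar -xzf "$2/$1.tar.gz"
  else
    return 1
  fi
}
```

Even in this toy version, every job pays for a full archive extraction on a hit and a full archive creation on a miss, which is where the filesystem and network I/O comes from.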

Can we cache without making our builds fragile?

Nix to the rescue

Following this short discussion, the question that surfaces is whether or not we can cache efficiently without making our builds fragile. Ideally, our project would be split into parts compiled in strict isolation, dependencies between parts would be strictly tracked, the cache would be kept locally, and a new job would only focus on rebuilding the changed components (avoiding steps like restoring cache & co).

That’s what Nix can do, at least in theory. A SaaS CI ecosystem is starting to develop around it, with solutions like Garnix or Hercules CI.

But personally, I am more interested in FOSS solutions, and existing ones like Hydra or Typhon seem more baroque. Worse, a Docker-based CI system (like Woodpecker, GitLab Runner, Forgejo Actions, etc.) is often already deployed in your organization, so you don’t really have a choice here: you must use what’s already there.

The Docker way

In the following, I will describe a Docker deployment that should be generic enough to be adapted to any Docker-based CI system. It’s inspired by my own experience⁶ and a blog post by Kevin Cox⁷.

First, we will spawn a unique nix-daemon on the worker, outside of the CI system:

docker run -i \
  -v nix:/nix \
  --privileged \
  nixpkgs/nix:nixos-22.05 \
  nix-daemon

Then we will mount this nix volume as read-only in our jobs. The job will be able to access the store to run the programs it needs. It can add new things to the store by scheduling builds in the daemon through a dedicated UNIX socket. This approach is called Multi-user Nix: trusted building⁸.

docker run -it --rm \
  -e "NIX_REMOTE=unix:///mnt/nix/var/nix/daemon-socket/socket?root=/mnt" \
  -e "NIX_CONFIG=extra-experimental-features = nix-command flakes" \
  -v nix:/mnt/nix:ro \
  -v "$(pwd)":/workspace \
  -w /workspace \
  nixpkgs/nix:nixos-24.05 \
  nix build .#

Note how the nix daemon and the nix interactive instance run different versions. It’s possible because, in the interactive instance, we did not mount the daemon store on the default path (/nix) but on another one (/mnt/nix), and instructed the client to use it through the NIX_REMOTE environment variable. This point is important as it enables you to decouple the lifecycle of your worker daemons from that of your projects, which drastically eases maintenance.
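
The NIX_REMOTE value is built from the mount point twice: once in the socket path, once in the root parameter. A small sketch makes that explicit. The function name nix_remote_for is hypothetical, and the sketch assumes the daemon store is mounted at "<mount point>/nix" as in the command above.

```shell
# Hypothetical helper: build the NIX_REMOTE URL for a daemon store
# mounted under the given prefix (the store itself is at "$1/nix").
# $1: mount prefix, e.g. /mnt
nix_remote_for() {
  printf 'unix://%s/nix/var/nix/daemon-socket/socket?root=%s\n' "$1" "$1"
}
```

It could then be used as -e "NIX_REMOTE=$(nix_remote_for /mnt)" in the docker run invocation, keeping the socket path and the root parameter in sync if you ever change the mount point.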

A Woodpecker integration

Basically, you want to run your nix-daemon next to your Woodpecker agent, for example in a Docker Compose file. Then, you need to pass specific parameters to your Woodpecker agent so that our volume and environment variables are automatically injected into all your builds:

version: '3.4'
services:
  nix-daemon:
    image: nixpkgs/nix:nixos-22.05
    restart: always
    command: nix-daemon
    privileged: true
    volumes:
      - "nix:/nix"

  woodpecker-runner:
    image: woodpeckerci/woodpecker-agent:v2.4.1
    restart: always
    environment:
      # -- our NixOS / CI specific env
      - WOODPECKER_BACKEND_DOCKER_VOLUMES=woodpecker_nix:/mnt/nix:ro
      - WOODPECKER_ENVIRONMENT=NIX_REMOTE:unix:///mnt/nix/var/nix/daemon-socket/socket?root=/mnt,NIX_CONFIG:extra-experimental-features = nix-command flakes
      # -- change these for each agent
      - WOODPECKER_HOSTNAME=i_forgot_to_change_my_runner_name
      - WOODPECKER_AGENT_SECRET=xxxx
      # -- should not need change
      - WOODPECKER_SERVER=woodpecker.example:1111
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock"

volumes:
  nix:

Note that the volume is named woodpecker_nix and not nix in the Woodpecker agent configuration (WOODPECKER_BACKEND_DOCKER_VOLUMES environment declaration). That’s because our docker-compose.yml is in a woodpecker folder, and docker compose prefixes the created volumes with the name of the deployment, which is by default the parent folder name. The prefix is not needed elsewhere, as the resolution is done dynamically by compose. But the WOODPECKER_BACKEND_DOCKER_VOLUMES declaration is not part of compose: it will be used later by Woodpecker when interacting directly with the Docker API.
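
If you are unsure which name to put in WOODPECKER_BACKEND_DOCKER_VOLUMES, the prefixing rule can be sketched as a tiny shell function. This is a simplification under assumptions: the name compose_volume_name is hypothetical, the project name is taken as the folder name verbatim (compose may normalize it further), and it ignores overrides such as the -p flag or the COMPOSE_PROJECT_NAME variable.

```shell
# Hypothetical helper: guess the Docker volume name created by compose.
# $1: directory containing docker-compose.yml, $2: volume key in the file
compose_volume_name() {
  printf '%s_%s\n' "$(basename "$1")" "$2"
}
```

You can then check the guess against what Docker actually created, for example with docker volume inspect "$(compose_volume_name /srv/woodpecker nix)", where /srv/woodpecker is only an example path.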

Then, in your project .woodpecker.yml, you can seamlessly use nix and enjoy efficient and quick caching:

steps:
  - name: build
    image: nixpkgs/nix:nixos-24.05
    commands:
      - nix build .#

Limitations

Anyone having access to your CI will have read access to your nix store. People will also be able to store data in your /nix/store.

Finally, if I remember correctly, there are some attacks that alter the output of a derivation (such that some content in /nix/store is not the product of the hashed derivation). In other words, it’s mainly a single-tenant solution.

So a great evolution would be a multi-tenant system, either by improving the nix-daemon isolation, or by running one nix-daemon per-project or per-user/per-organization. Today, none of these solutions is possible.

Another limitation is garbage collection: while the nix-daemon can do some garbage collection, none of its policies is suited to a CI. Mainly, if you activate it, it will ditch everything since, from its point of view, no store path is connected to a root. An LRU cache policy would be a great addition. At least, you can manually trigger a garbage collection once your disk is full…
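
That manual trigger can be scripted, for example from a cron job on the worker. The following is a minimal sketch, not a tested deployment: the function names are hypothetical, the host path of the nix volume and the daemon container name are assumptions you must adapt, and nix-collect-garbage run this way will remove every store path not reachable from a root, as explained above.

```shell
# Extract the usage percentage (integer) from `df -P` output read on stdin.
parse_usage() {
  awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# Succeed when usage ($1) is at or above the threshold ($2), both in percent.
over_threshold() {
  [ "$1" -ge "$2" ]
}

# Run a garbage collection in the daemon container when the disk is nearly full.
# $1: daemon container name (assumption), $2: host path of the nix volume
# (assumption), $3: threshold in percent (defaults to 90)
gc_if_full() {
  usage="$(df -P "$2" | parse_usage)"
  if over_threshold "$usage" "${3:-90}"; then
    docker exec "$1" nix-collect-garbage
  fi
}
```

A call could look like gc_if_full my-nix-daemon /var/lib/docker/volumes/woodpecker_nix/_data 90, where both the container name and the volume path are placeholders for your own setup.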

¹ How long should your CI take. Various industry resources suggest an ideal CI time of around 10 minutes for completing a full build, test, and analysis cycle. As Kent Beck, author of Extreme Programming, said, “A build that takes longer than ten minutes will be used much less often, missing the opportunity for feedback. A shorter build doesn’t give you time to drink your coffee.”
² Measure and Improve Your CI Speed with Semaphore. We’re convinced that having a build slower than 10 minutes is not proper continuous integration. When a build takes longer than 10 minutes, we waste too much precious time and energy waiting, or context switching back and forth. We merge rarely, making every deploy more risky. Refactoring is hard to do well.
³ Continuous Integration Certification. Finally he asks if, when the build fails, it’s usually back to green within ten minutes. With that last question only a few hands remain. Those are the people who pass his certification test.
⁴ Rust CI Cache. We can cache the build artifacts by caching the target directory of our workspace.
⁵ My ideal Rust workflow. The basic idea behind sccache, at least in the way I have it set up, it’s that it’s invoked instead of rustc, and takes all the inputs (including compilation flags, certain environment variables, source files, etc.) and generates a hash. Then it just uses that hash as a cache key, using in this case an S3 bucket in us-east-1 as storage.
⁶ I tried writing a CI on top of Nomad that would wrap a dockerized NixOS, and also deployed a Woodpecker/Drone CI NixOS runner.
⁷ Nix Build Caching Inside Docker Containers. I wanted to see if I could cache dependencies without uploading, downloading or copying them around for each job.
⁸ Untrusted CI: Using Nix to get automatic trusted caching of untrusted builds. This means that untrusted contributors can upload a “build recipe” to a privileged Nix daemon which takes care of running the build as an unprivileged user in a sandboxed context, and of persisting the build output to the local Nix store afterward.
