Implementing a content-addressed Nix

2 December 2021 — by Théophane Hufschmitt

Although the feature is still marked as experimental, the release of Nix 2.4 marks the entry of content-addressed derivations in a released Nix version. This is something to celebrate, but also an occasion to look back and see what this addition means, implementation wise.

This new feature required indeed some deep changes both in the high-level algorithm describing the build loop (and also as a consequence in its low-level details) as well as in the “plumbing” code around it.

Note: Because it touches the implementation, this post will be more technical than the others, and will also assume that you’re somewhat familiar with what content-addressed derivations means, and how it works. If you haven’t done it already, I invite you to read the previous blog articles on the topic (1, 2 and 3), and if you’re really curious, the corresponding RFC.

High-level overview of the build loop

The first − and most obvious − change is that the flow of the build loop itself had to be adapted to take into account this new way of realising derivations.

Before looking at what changed, let’s try to understand how things used to work. What happens (or rather happened) when someone runs nix-build default.nix? The first thing of course, is that the Nix expression evaluator will happily evaluate this default.nix file. We won’t detail this process here, suffice it to say that this will return a derivation which represents the thing that we want to build.

This derivation will be passed on to the “build loop”, whose behavior is roughly the following (I’m using a Python-like pseudocode, but the actual implementation is in C++):

def buildDerivation(derivation : Derivation) -> None:
    remainingOutputs = tryToSubstituteOutputs(derivation)
    if (remainingOutputs == []):
        return
    buildInputs()
    doBuild()
    registerOutputPaths()

In plain english, what happens is that the builder will first try to substitute the outputs of the derivation from the configured binary caches. If some couldn’t be substituted, then it will recursively build all the inputs of the derivation, build the derivation itself and eventually register its output paths in the database. Simple and straightforward (although as you might imagine the devil is in the details, like always).

So what changes with content-addressed derivations? Quite a bunch of things in fact.

Building resolved derivations

The first big change is that for early-cutoff to work, we don’t really want to build the derivation that’s given to us, but rather its resolved version, which is the same derivation but where all the inputs are replaced by their content-addressed path (this is explained more in details in the corresponding RFC). So our buildDerivation should actually be redefined as

def buildDerivation(derivation : Derivation) -> Set[Realisation]:
    remainingOutputs = tryToSubstituteOutputs(derivation)
    if (remainingOutputs == []):
        return
    buildInputs()
    return buildResolvedDerivation(Derivation.resolve(derivation))

where buildResolvedDerivation : ResolvedDerivation -> Set[Realisation] is the one that will do most of the job

Calculated output paths

Another change is that while the output paths used to be a given, they are now a product of the build (it’s the whole point of content-addressed derivations). It means that:

We must register what path each output maps to in the database, and
This mapping must be passed to any piece of code that needs to access the output paths as it can’t just be inferred anymore.

The first point is handled by a new function registerRealisation : Realisation -> (), where a Realisation associates a given derivation output (foo.drv!out) to a given store path (/nix/store/…-foo-out).

For the second point, we must change the registerOutputPaths function a bit: the way doBuild works is that it will build everything in a predetermined temporary location. Then registerOutputPaths will do all the magic to move these paths to their final content-addressed location, as described in my post about self-references. Eventually, this function will return a set of Realisation materializing the newly built derivation outputs.

Our new buildResolvedDerivation now looks like:

def buildResolvedDerivation(derivation : ResolvedDerivation)
      -> Set[Realisation]:
    # Maybe a substituter didn’t know about the original derivation,
    # but knows about the resolved one, so let’s try substituting
    # again
    remainingOutputs = tryToSubstituteOutputs(derivation)
    if (remainingOutputs == []):
        return
    # No need to try building the inputs as by definition
    # a resolved derivation already has all its inputs available
    doBuild()
    newRealisations = registerOutputPaths()
    for realisation in newRealisations:
        registerRealisation(realisation)
    return newRealisations

Registering realisations

In addition to the two points above, we must also register the new realisations for the original (non-resolved) derivation, meaning that buildDerivation should take the set of Realisation that buildResolvedDerivation provides, and register them as if they were its own build outputs. Something like:

def buildDerivation(derivation : Derivation) -> Set[Realisation]:
  remainingOutputs = tryToSubstituteOutputs(derivation)
  if (remainingOutputs == []):
      return
  buildInputs()
  newRealisations = buildResolvedDerivation(
          Derivation.resolve(derivation))
  registerRealisationsFromResolved(newRealisations)
  return newRealisations

Mixing content-addressed and non-content-addressed

These changes are obviously making the build process slightly more involved, but in a way this is only making explicit some stuff that was implicitly passed. The actual build loop is actually even more complex than that right now (even at this level of abstraction), because there’s something else to add on top of this: since content-addressed derivations are an experimental features, this new build loop must only be followed when the feature is enabled. So there’s a bunch of if (settings.isExperimentalFeatureEnabled(Xp::CaDerivations)) in a bunch of places. I won’t write the full algorithm, because it’s more messy than interesting, and it’s (hopefully 🤞) only temporary as the feature is supposed to get out of its experimental state one way or another at some point.

The build scheduling system

Getting more into the details and closer to the actual implementation, things don’t look that nicely structured, mostly because everything is plagued with a weird cps-like style that seems to come out of nowhere.

The reason for this style is that all this process has to be asynchronous, so that the internal scheduler can properly supervise an arbitrary number of concurrent builds and keep the user informed as to what’s going on. And because this part of the code base predates all the nice async frameworks that we can have right now, it uses its own framework, which indeed does the job but does impose a cost in terms of maintainability and readability of the code¹.

Explaining this mechanism in details would be too long to fit in this post², so I’ll just give a quick overview of the bits that really matter here.

At its core, this framework is a cooperative scheduling system. Each high-level task is represented by a subclass of Goal, which is some kind of state machine. The interface of a Goal, is essentially:

class Goal:
    # type State = Goal -> None
    State = Callable[[Goal], None]
    state : State

    dependencies : [Goal]

    # Called when a goal wants to say that it doesn’t have more to do
    def amDone(self) -> None:
      ...

The idea is that state should point to a method of the goal (which represents the next thing to do for the goal). The method pointed to by state should eventually return, and set state to a new method representing the next thing to run (so morally yielding the control back to the scheduler). On top of that, a scheduler will look at all the goals, and will schedule all the goals with no dependencies by calling their state method, until they call amDone to signal that they can be destroyed.

A consequence of this design is that there’s no function inside a goal, just “actions” (the States) that take no argument except the goal itself and return nothing. So any information that must be passed to or returned from a Sate must be present in a global environment (with respect to the current goal), meaning that it must be a field of the class (which is neither memory-efficient nor nice to work with, but has the advantage of making the whole scheduling system much simpler). Likewise, no information can be passed from a goal to another, except by side-effects.

For example, a simplified version of the interface of the DerivationGoal class (the one representing for the buildDerivation function above) would be something like:

class DerivationGoal(Goal):
    # type State = DerivationGoal -> None
    State = Callable[[DerivationGoal], None]

    derivation : Derivation

    outputsToBuild : List[OutputName] = []
    def tryToSubstituteOutputs(self) -> None:
        # Will fill `outputsToBuild` with the name of the outputs that couldn’t
        # be substituted.
        ...
        state = self.outputsSubstituted

    def outputsSubstituted(self) -> None:
        if (outputsToBuild) == []:
            state = 0 # Well that’s not really valid in python, but nevermind
        else:
            state = self.doBuild

    def doBuild(self) -> None:
        ...
        state = self.registerOutputPaths

    newRealisations : List[Realisation] = []
    def registerOutputPaths(self) -> None:
        # sets `newRealisations`
        ...
        state = self.registerRealisations

    def registerRealisations(self) -> None:
        ...
        self.amDone()

As can be seen above, this “state machine” isn’t very modular since everything has to be in the same scope. Besides, this is also a minefield, as most States rely on some implicit state (yes, the wording is a bit tricky) because it’s going to assume that a previous State has set the right fields of the class. Failure to meet this requirements will generally result at best in a segfault, at worst in some nonsensical behavior. Also because of this reliance to an ambient state, sharing code between two different Goal types is much more complex.

This means that the relatively complex logic needed to build content-addressed derivations has been pretty painful to implement. In a nutshell, it works by having a single goal DerivationGoal, implementing both the buildDerivation and buildResolvedDerivation functions above, and that will behave as one or the other depending on the shape of the given derivation. If the derivation doesn’t have the shape of a resolved one, then right at the end, tryToSubstituteOutputs will resolve the derivation, declare a dependency on a new DerivationGoal (with the resolved derivation, this time), and switch to a new resolvedFinished state that will

Query the database for the newly registered realisations (since they couldn’t be passed explicitly between the two goals)
Implement the registerRealisationsFromResolved function described above. This obviously also meant adding a bunch of new fields in the DerivationGoal class to handle all the CA-specific values that had to be passed around between goals.

Outside of the build loop

So far, we’ve been deliberately ignoring everything happening at instantiation time, and focused on what’s happening at build time. Let’s now look at some of the plumbing.

At the interface between the Nix evaluator and the builder lie the “derivations”. These are the data-structures that are returned by the evaluation and which describe in low-level terms how to build store paths.

Amongst other things, a derivation contains

A set of inputs
A build script
A list of outputs

In an input-addressed world, these outputs are plain store paths (computed by hashing all the rest of the derivation). This isn’t the case in an input-addressed world (as the output paths can only be computed once the build is done). But we still need a way to refer to these outputs in the derivation. In particular, we want to export each output path as an environment variable before the start of the build, so that we can do for example echo Foo > $out to create a file as the out output of the derivation. To be able to do that, we assign to each output a “placeholder”, a long chain of characters (which is also computed from a hash of the rest of the derivation so that it is deterministic but can’t be guessed a priori). Right before the build, this placeholder will be replaced by the temporary output used for the build.

More generally, the fact that we don’t know the output paths in advance led to some global changes in the code base, as a lot of places assumed otherwise. For example, the logic for nix build was something along the lines of:

class Buildable:
    """Represents a derivation that can be built """
    drvPath : StorePath
    outputs : Map[str,StorePath]

buildable : Buildable = evalCode(input)
buildBuildable(buildable)
symlink(buildable.outputs["result"], "result")

The Buildable class obviously can’t represent unbuilt content-addressed derivations, so it had to be changed. In a first step, we changed the type of the outputs field to Map[str,Optional[StorePath]] to take into account the fact that for some derivations in some contexts, we don’t know the output paths. This change isn’t the best one semantically speaking, since in most places we should know statically whether the output paths are known or not. But it had the advantage of being very simple to implement and get to work with input-addressed derivations (just add some indirection whenever we access the output field as for input-addressed derivations the field will always be set). And from that, we could progressively migrate to take into account content-addressed derivations by making sure that we weren’t blindly dereferencing the optional in places where it was potentially empty.

Then, we could go one step further and replace this class by two different ones:

class Buildable:
    """Something that we might want to build"""
    drvPath: StorePath
    outputNames: Set[str]

class Built:
    """Something that we have built"""
    drvPath: StorePath
    outputs : Map[str,StorePath]

The only ways to create a Built value being to either build a Buildable or query the database to (optionnally) get a prebuilt version of it, we now have our safety back, and we can be sure that internally, Nix will never try to access the path of an unbuilt derivation output.

Another issue we have to deal with, is that Nix is (or can be used as) a distributed system: The most common setup involves a client and a daemon, and distributed builds and binary caches can mean that several versions of the tool can have to work together. Obviously, this means that different versions have to play well together (as much as the intersection of the features of each allows), and the introduction of content-addressed derivations isn’t an excuse for breaking this.

Amongst other things, this means that

If the daemon or a remote builder doesn’t handle CA derivations, then things must either work or fail graciously. In the case of the remote builder, we reused a trick from the work of recursive-nix, which is that content-addressed derivations will only be sent to builders which advertise the ca-derivations feature. That way a network of Nix machines can freely mix ca-aware and non-ca-aware machines.
Remote caches must also transparently support both content-addressed and input-addressed derivations. Speaking in terms of http binary caches, this was actually quite natural, because as far as remote-caching is concerned, content-addressed derivations build on top of already existing mechanisms, and only requires adding a (totally separate) endpoint to query for the realisations. So the client is free to ignore the substituter if it doesn’t provide that endpoint, and conversely if the client doesn’t query that endpoint… well nothing happens, but that’s precisely what we want.

Conclusion

Overall, implementing this was quite a bumpy ride, but also an awesome occasion to dig into the internals of Nix (and try to improve them by the way). Hopefully, this post was also an occasion for you to join me in this wild ride, and give you some taste of how your favorite package manager works internally. And maybe even make you want to contribute to it?

If this makes you wonder whether we could do better, the answer is yes, we just need more contributors to be able to tackle this sort of deep refactorings↩
Which also gives me a nice escape hatch so that I don’t have to admit that I don’t really understand this framework↩

Behind the scenes

Théophane Hufschmitt

Théophane is a Software Engineer and self-proclaimed Nix guru. He lives in a small house surrounded by awesome castles in the Loire Valley. When he’s not taking care of his four sons or playing some music, you might find him working.

Tech Group

Nix

Sustainable software development with Nix

If you enjoyed this article, you might be interested in joining the Tweag team.

This article is licensed under a Creative Commons Attribution 4.0 International license.

← The Varieties of the Haskelling Experience Type-checking plugins, Part II: GHC's constraint solver →