Although the feature is still marked as experimental, the release of Nix 2.4 marks the entry of content-addressed derivations in a released Nix version. This is something to celebrate, but also an occasion to look back and see what this addition means, implementation wise.
This new feature required indeed some deep changes both in the high-level algorithm describing the build loop (and also as a consequence in its low-level details) as well as in the “plumbing” code around it.
Note: Because it touches the implementation, this post will be more technical than the others, and will also assume that you’re somewhat familiar with what content-addressed derivations means, and how it works. If you haven’t done it already, I invite you to read the previous blog articles on the topic (1, 2 and 3), and if you’re really curious, the corresponding RFC.
High-level overview of the build loop
The first − and most obvious − change is that the flow of the build loop itself had to be adapted to take into account this new way of realising derivations.
Before looking at what changed, let’s try to understand how things used to work. What happens (or rather happened) when someone runs
The first thing of course, is that the Nix expression evaluator will happily evaluate this
default.nix file. We won’t detail this process here, suffice it to say that this will return a derivation which represents the thing that we want to build.
This derivation will be passed on to the “build loop”, whose behavior is roughly the following (I’m using a Python-like pseudocode, but the actual implementation is in C++):
def buildDerivation(derivation : Derivation) -> None: remainingOutputs = tryToSubstituteOutputs(derivation) if (remainingOutputs == ): return buildInputs() doBuild() registerOutputPaths()
In plain english, what happens is that the builder will first try to substitute the outputs of the derivation from the configured binary caches. If some couldn’t be substituted, then it will recursively build all the inputs of the derivation, build the derivation itself and eventually register its output paths in the database. Simple and straightforward (although as you might imagine the devil is in the details, like always).
So what changes with content-addressed derivations? Quite a bunch of things in fact.
Building resolved derivations
The first big change is that for early-cutoff to work, we don’t really want to build the derivation that’s given to us, but rather its resolved version, which is the same derivation but where all the inputs are replaced by their content-addressed path (this is explained more in details in the corresponding RFC).
buildDerivation should actually be redefined as
def buildDerivation(derivation : Derivation) -> Set[Realisation]: remainingOutputs = tryToSubstituteOutputs(derivation) if (remainingOutputs == ): return buildInputs() return buildResolvedDerivation(Derivation.resolve(derivation))
buildResolvedDerivation : ResolvedDerivation -> Set[Realisation] is the one that will do most of the job
Calculated output paths
Another change is that while the output paths used to be a given, they are now a product of the build (it’s the whole point of content-addressed derivations). It means that:
- We must register what path each output maps to in the database, and
- This mapping must be passed to any piece of code that needs to access the output paths as it can’t just be inferred anymore.
The first point is handled by a new function
registerRealisation : Realisation -> (), where a
Realisation associates a given derivation output (
foo.drv!out) to a given store path (
For the second point, we must change the
registerOutputPaths function a bit: the way
doBuild works is that it will build everything in a predetermined temporary location. Then
registerOutputPaths will do all the magic to move these paths to their final content-addressed location, as described in my post about self-references. Eventually, this function will return a set of
Realisation materializing the newly built derivation outputs.
buildResolvedDerivation now looks like:
def buildResolvedDerivation(derivation : ResolvedDerivation) -> Set[Realisation]: # Maybe a substituter didn’t know about the original derivation, # but knows about the resolved one, so let’s try substituting # again remainingOutputs = tryToSubstituteOutputs(derivation) if (remainingOutputs == ): return # No need to try building the inputs as by definition # a resolved derivation already has all its inputs available doBuild() newRealisations = registerOutputPaths() for realisation in newRealisations: registerRealisation(realisation) return newRealisations
In addition to the two points above, we must also register the new realisations for the original (non-resolved) derivation, meaning that
buildDerivation should take the set of
buildResolvedDerivation provides, and register them as if they were its own build outputs. Something like:
def buildDerivation(derivation : Derivation) -> Set[Realisation]: remainingOutputs = tryToSubstituteOutputs(derivation) if (remainingOutputs == ): return buildInputs() newRealisations = buildResolvedDerivation( Derivation.resolve(derivation)) registerRealisationsFromResolved(newRealisations) return newRealisations
Mixing content-addressed and non-content-addressed
These changes are obviously making the build process slightly more involved, but in a way this is only making explicit some stuff that was implicitly passed.
The actual build loop is actually even more complex than that right now (even at this level of abstraction), because there’s something else to add on top of this: since content-addressed derivations are an experimental features, this new build loop must only be followed when the feature is enabled. So there’s a bunch of
if (settings.isExperimentalFeatureEnabled(Xp::CaDerivations)) in a bunch of places.
I won’t write the full algorithm, because it’s more messy than interesting, and it’s (hopefully 🤞) only temporary as the feature is supposed to get out of its experimental state one way or another at some point.
The build scheduling system
Getting more into the details and closer to the actual implementation, things don’t look that nicely structured, mostly because everything is plagued with a weird cps-like style that seems to come out of nowhere.
The reason for this style is that all this process has to be asynchronous, so that the internal scheduler can properly supervise an arbitrary number of concurrent builds and keep the user informed as to what’s going on. And because this part of the code base predates all the nice async frameworks that we can have right now, it uses its own framework, which indeed does the job but does impose a cost in terms of maintainability and readability of the code1.
Explaining this mechanism in details would be too long to fit in this post2, so I’ll just give a quick overview of the bits that really matter here.
At its core, this framework is a cooperative scheduling system. Each high-level task is represented by a subclass of
Goal, which is some kind of state machine.
The interface of a
Goal, is essentially:
class Goal: # type State = Goal -> None State = Callable[[Goal], None] state : State dependencies : [Goal] # Called when a goal wants to say that it doesn’t have more to do def amDone(self) -> None: ...
The idea is that
state should point to a method of the goal (which represents the next thing to do for the goal).
The method pointed to by
state should eventually return, and set
state to a new method representing the next thing to run (so morally yielding the control back to the scheduler).
On top of that, a scheduler will look at all the goals, and will schedule all the goals with no dependencies by calling their
state method, until they call
amDone to signal that they can be destroyed.
A consequence of this design is that there’s no function inside a goal, just “actions” (the
States) that take no argument except the goal itself and return nothing.
So any information that must be passed to or returned from a
Sate must be present in a global environment (with respect to the current goal), meaning that it must be a field of the class (which is neither memory-efficient nor nice to work with, but has the advantage of making the whole scheduling system much simpler).
Likewise, no information can be passed from a goal to another, except by side-effects.
For example, a simplified version of the interface of the
DerivationGoal class (the one representing for the
buildDerivation function above) would be something like:
class DerivationGoal(Goal): # type State = DerivationGoal -> None State = Callable[[DerivationGoal], None] derivation : Derivation outputsToBuild : List[OutputName] =  def tryToSubstituteOutputs(self) -> None: # Will fill `outputsToBuild` with the name of the outputs that couldn’t # be substituted. ... state = self.outputsSubstituted def outputsSubstituted(self) -> None: if (outputsToBuild) == : state = 0 # Well that’s not really valid in python, but nevermind else: state = self.doBuild def doBuild(self) -> None: ... state = self.registerOutputPaths newRealisations : List[Realisation] =  def registerOutputPaths(self) -> None: # sets `newRealisations` ... state = self.registerRealisations def registerRealisations(self) -> None: ... self.amDone()
As can be seen above, this “state machine” isn’t very modular since everything has to be in the same scope.
Besides, this is also a minefield, as most
States rely on some implicit state (yes, the wording is a bit tricky) because it’s going to assume that a previous
State has set the right fields of the class.
Failure to meet this requirements will generally result at best in a segfault, at worst in some nonsensical behavior.
Also because of this reliance to an ambient state, sharing code between two different
Goal types is much more complex.
This means that the relatively complex logic needed to build content-addressed derivations has been pretty painful to implement.
In a nutshell, it works by having a single goal
DerivationGoal, implementing both the
buildResolvedDerivation functions above, and that will behave as one or the other depending on the shape of the given derivation.
If the derivation doesn’t have the shape of a resolved one, then right at the end,
tryToSubstituteOutputs will resolve the derivation, declare a dependency on a new
DerivationGoal (with the resolved derivation, this time), and switch to a new
resolvedFinished state that will
- Query the database for the newly registered realisations (since they couldn’t be passed explicitly between the two goals)
- Implement the
registerRealisationsFromResolvedfunction described above. This obviously also meant adding a bunch of new fields in the
DerivationGoalclass to handle all the CA-specific values that had to be passed around between goals.
Outside of the build loop
So far, we’ve been deliberately ignoring everything happening at instantiation time, and focused on what’s happening at build time. Let’s now look at some of the plumbing.
At the interface between the Nix evaluator and the builder lie the “derivations”. These are the data-structures that are returned by the evaluation and which describe in low-level terms how to build store paths.
Amongst other things, a derivation contains
- A set of inputs
- A build script
- A list of outputs
In an input-addressed world, these outputs are plain store paths (computed by hashing all the rest of the derivation).
This isn’t the case in an input-addressed world (as the output paths can only be computed once the build is done).
But we still need a way to refer to these outputs in the derivation.
In particular, we want to export each output path as an environment variable before the start of the build, so that we can do for example
echo Foo > $out to create a file as the
out output of the derivation.
To be able to do that, we assign to each output a “placeholder”, a long chain of characters (which is also computed from a hash of the rest of the derivation so that it is deterministic but can’t be guessed a priori).
Right before the build, this placeholder will be replaced by the temporary output used for the build.
More generally, the fact that we don’t know the output paths in advance led to some global changes in the code base, as a lot of places assumed otherwise.
For example, the logic for
nix build was something along the lines of:
class Buildable: """Represents a derivation that can be built """ drvPath : StorePath outputs : Map[str,StorePath] buildable : Buildable = evalCode(input) buildBuildable(buildable) symlink(buildable.outputs["result"], "result")
Buildable class obviously can’t represent unbuilt content-addressed derivations, so it had to be changed.
In a first step, we changed the type of the
outputs field to
Map[str,Optional[StorePath]] to take into account the fact that for some derivations in some contexts, we don’t know the output paths.
This change isn’t the best one semantically speaking, since in most places we should know statically whether the output paths are known or not.
But it had the advantage of being very simple to implement and get to work with input-addressed derivations (just add some indirection whenever we access the
output field as for input-addressed derivations the field will always be set).
And from that, we could progressively migrate to take into account content-addressed derivations by making sure that we weren’t blindly dereferencing the optional in places where it was potentially empty.
Then, we could go one step further and replace this class by two different ones:
class Buildable: """Something that we might want to build""" drvPath: StorePath outputNames: Set[str] class Built: """Something that we have built""" drvPath: StorePath outputs : Map[str,StorePath]
The only ways to create a
Built value being to either build a
Buildable or query the database to (optionnally) get a prebuilt version of it, we now have our safety back, and we can be sure that internally, Nix will never try to access the path of an unbuilt derivation output.
Another issue we have to deal with, is that Nix is (or can be used as) a distributed system: The most common setup involves a client and a daemon, and distributed builds and binary caches can mean that several versions of the tool can have to work together. Obviously, this means that different versions have to play well together (as much as the intersection of the features of each allows), and the introduction of content-addressed derivations isn’t an excuse for breaking this.
Amongst other things, this means that
- If the daemon or a remote builder doesn’t handle CA derivations, then things must either work or fail graciously.
In the case of the remote builder, we reused a trick from the work of
recursive-nix, which is that content-addressed derivations will only be sent to builders which advertise the
ca-derivationsfeature. That way a network of Nix machines can freely mix ca-aware and non-ca-aware machines.
- Remote caches must also transparently support both content-addressed and input-addressed derivations. Speaking in terms of http binary caches, this was actually quite natural, because as far as remote-caching is concerned, content-addressed derivations build on top of already existing mechanisms, and only requires adding a (totally separate) endpoint to query for the realisations. So the client is free to ignore the substituter if it doesn’t provide that endpoint, and conversely if the client doesn’t query that endpoint… well nothing happens, but that’s precisely what we want.
Overall, implementing this was quite a bumpy ride, but also an awesome occasion to dig into the internals of Nix (and try to improve them by the way). Hopefully, this post was also an occasion for you to join me in this wild ride, and give you some taste of how your favorite package manager works internally. And maybe even make you want to contribute to it?