This is my first post about content-addressability in Nix — a long-awaited feature that is hopefully coming soon! In this post I will show you how this feature will improve the Nix infrastructure. I’ll come back in another post to explain the technical challenges of adding content-addressability to Nix.

Nix has a wonderful model for handling packages. Because each derivation is stored under (aka addressed by) a unique name, multiple versions of the same library can coexist on the same system without issues: each version of the library has a distinct name, as far as Nix is concerned.

What’s more, if openssl is upgraded in Nixpkgs, Nix knows that all the packages that depend on openssl (i.e., almost everything) must be rebuilt, if only so that they point at the name of the new openssl version. This way, a Nix installation will never feature a package built for one version of openssl, but dynamically linked against another: as a user, it means that you will never have an undefined symbol error. Hurray!

The input-addressed store

How does Nix achieve this feat? The idea is that the name of a package is derived from all of its inputs (that is, the complete list of dependencies, as well as the package description). So if you change the git tag from which openssl is fetched, the name changes, if the name of openssl changes, then the name of any package which has openssl in its dependencies changes.

However this can be very pessimistic: even changes that aren’t semantically meaningful can imply mass rebuilding and downloading. As a slightly extreme example, this merge-request on Nixpkgs makes a tiny change to the way openssl is built. It doesn’t actually change openssl, yet requires rebuilding an insane amount of packages. Because, as far as Nix is concerned, all these packages have different names, hence are different packages. In reality, though, they weren’t.

Nevertheless, the cost of the rebuild has to be born by the Nix infrastructure: Hydra builds all packages to populate the cache, and all the newly built packages must be stored. It costs both time, and money (in cpu power, and storage space).

Unnecessary rebuilds?

Most distributions, by default, don’t rebuild packages when their dependencies change, and have a (more-or-less automated) process to detect changes that require rebuilding reverse dependencies. For example, Debian tries to detect ABI changes automatically and Fedora has a more manual process. But Nix doesn’t.

The issue is that the notion of a “breaking change” is a very fuzzy one. Should we follow Debian and consider that only ABI changes are breaking? This criterion only applies for shared libraries, and as the Debian policy acknowledges, only for “well-behaved” programs. So if we follow this criterion, there’s still need for manual curation, which is precisely what Nix tries to avoid.

The content-addressed model

Quite happily, there is a criterion to avoid many useless rebuilds without sacrificing correctness: detecting when changes in a package (or one of its dependencies) yields the exact same output. That might seem like an edge case, but the openssl example above (and many others) shows that there’s a practical application to it. As another example, go depends on perl for its tests, so an upgrade of perl requires rebuilding all the Go packages in Nixpkgs, although it most likely doesn’t change the output of the go derivation.

But, for Nix to recognise that a package is not a new package, the new, unchanged, openssl or go packages must have the same name as the old version. Therefore, the name of a package must not be derived from its inputs which have changed, but, instead, it should be derived from the content of the compiled package. This is called content addressing.

Content addressing is how you can be sure that when you and a colleague at the other side of the world type git checkout 7cc16bb8cd38ff5806e40b32978ae64d54023ce0 you actually have the exact same content in your tree. Git commits are content addressed, therefore the name 7cc16bb8cd38ff5806e40b32978ae64d54023ce0 refers to that exact tree.

Yet another example of content-addressed storage is IPFS. In IPFS storage files can be stored in any number of computers, and even moved from computer to computer. The content-derived name is used as a way to give an intrinsic name to a file, regardless of where it is stored.

In fact, even the particular use case that we are discussing here - avoiding recompilation when a rebuilt dependency hasn’t changed - can be found in various build systems such as Bazel. In build systems, such recompilation avoidance is sometimes known as the early cutoff optimization − see the build systems a la carte paper for example).

So all we need to do is to move the Nix store from an input-addressed model to a content-addressed model, as used by many tools already, and we will be able to save a lot of storage space and CPU usage, by rebuilding many fewer packages. Nixpkgs contributors will see their CI time improved. It could also allow serving a binary cache over IPFS.

Well, like many things with computers, this is actually way harder than it sounds (which explains why this hasn’t already been done despite being discussed nearly 15 years ago in the original paper), but we now believe that there’s a way forward… more on that in a later post.

Conclusion

A content-addressed store for Nix would help reduce the insane load that Hydra has to sustain. While content-addressing is a common technique both in distributed systems and build systems (Nix is both!), getting to the point where it was feasible to integrate content-addressing in Nix has been a long journey.

In a future post, I’ll explain why it was so hard, and how we finally managed to propose a viable design for a content-addressed Nix.