Type-safe data processing pipelines

27 April 2023 — by Georg Rudoy

Computing is all about transforming data. A wide variety of domains, such as multimedia, securities trading or compilers, allow decomposing the corresponding transformations into a sequence of well-defined steps. Moreover, these steps can be combined in different ways, perhaps omitting some or changing the order of others, producing different data processing pipelines tailored to a particular task at hand.

Consider your favorite compiler: it consists of a lexical and syntactical analyzer, a possible preprocessor, some optimization passes (which can be turned on or off), various optional static analysis features, a code generator, and so on. Yet, some of these steps might depend on each other, and some warnings might only make sense before certain other optimizations have happened.

In this post, we look at one way of representing those dependencies. Namely, dependencies are encoded in the types, allowing compile-time checking and serving as the code documentation.

The reader is assumed to be somewhat familiar with the DataKinds and TypeFamilies extensions, but we will review some peculiarities.

Suppose we are writing a compiler for a System F-based language, with the following transformations:

Prenex form conversion — moving all universal quantifiers to the front (so that, for example, ∀a. [a] → ∀b. (a → b) → [b] becomes ∀a. ∀b. [a] → (a → b) → [b]).
Eta-expansion — turning partially applied functions to fully applied ones under a λ-abstraction (so map f becomes λ xs. map f xs).
Monomorphization — applying the types to the polymorphic functions at compile-time (turning map (+ 1) [1, 2, 3] into mapInt (+ 1) [1, 2, 3], where mapInt only works with Ints).
Defunctionalization — replacing functional arguments in higher-order functions with tags denoting the context of those functions (Wikipedia has a good article on this).

These are expressed as functions:

prenex :: Program -> Program
etaExpand :: Program -> Program
monomorphize :: Program -> Program
defunctionalize :: Program -> Program

where Program is the type of (abstract representations of) programs we are compiling.

Prenex conversion and eta-expansion do not have any requirements on the input program. Monomorphization, conversely, is more easily expressed if the input is assumed to be in prenex form, and defunctionalization assumes the input is eta-expanded and monomorphized. But this is not obvious at all from the types above, so our best bet would be to document the requirements in the comments and hope the next person will read them. Even then, however, GHC will not complain if we write myPipeline = monomorphize . etaExpand despite it being incorrect, as we did not put the input into the prenex form!

How can we express these transformations to avoid missing any necessary steps? One way of doing this is via…

Type-level tags

We assume some extensions are enabled:

{-# LANGUAGE ConstraintKinds #-}
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE PolyKinds #-}
{-# LANGUAGE StandaloneKindSignatures #-}
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE TypeOperators #-}
{-# LANGUAGE UndecidableInstances #-}

Then, we introduce a type Xform' (the ' at the end will be handy later):

import Data.Kind (Type)

type Tags = [Type]

type Xform' :: Tags -> Tags -> Type -> Type -> Type
newtype Xform' requires provides a b = Xform (a -> b)

so Xform' requires provides a b is really a function a -> b under the hood, but it also carries around in its type two lists: its requirements on the input, and the guarantees it provides.

A transformation usually does not change the underlying type, so we also introduce a handy alias:

type Xform requires provides a = Xform' requires provides a a

Then the transformations above get the following types:

data Prenex
prenex :: Xform '[] '[Prenex] Program

data EtaExpanded
etaExpand :: Xform '[] '[EtaExpanded] Program

data Monomorphized
monomorphize :: Xform '[Prenex] '[Monomorphized] Program

data Defunctionalized
defunctionalize :: Xform '[EtaExpanded, Monomorphized] '[Defunctionalized] Program

Note that each tag (Prenex, EtaExpanded and so on) is a data type that is defined separately, so the set of tags is open to extension.

In case you are curious about the ' syntax, the tick before the lists is the DataKinds-enabled syntax to promote constructors to the “type level”. So, while [] is a term of type [a], '[] is a type of kind [a], and '[Prenex] is a type of kind [Type]. Compare that to [Prenex], which is a type of kind Type. We will also see the : constructor promoted to the type level, as ':, later.

If you would like to play around with this, ghci comes in handy, with the :t and :k commands. Try entering :k [Prenex] and :k '[Prenex], for example!

Alright, we’ve already got the documentation at the type level! Now let’s see how to compose these transformations.

Composition

We cannot use the usual function composition operator . as the types will not match, and we also need to keep track of the requires and provides of the composite transformation.

So let’s say we have two transformations, x1 and x2, having requirements r1 and r2 respectively, and asserting guarantees p1 and p2 about their outputs, respectively. What is the requirement of the composition? It is all that x1 requires, plus all that x2 requires and that is not satisfied by x1. That is, symbolically, the set r1 ∪ (r2 \ p1). What does the composition provide? There are some choices here, which we will discuss later, but for definiteness we stick with the union of p1 and p2.

So, in code, the composition operator looks like this (replacing \ with more Data.List-ish \\):

(>:>) ::
  Xform' r1 p1 a b ->
  Xform' r2 p2 b c ->
  Xform' (r1 ∪ (r2 \\ p1)) (p1 ∪ p2) a c
Xform f1 >:> Xform f2 = Xform $ f2 . f1

All the dependency tracking magic happens in ∪ and \\, which must be type-level functions. We can use type families to implement them.

The union ∪ is as straightforward as its term-level counterpart ++, modulo the type family syntax:

type family (∪) (xs :: [a]) (ys :: [a]) :: [a] where
  '[] ∪ ys = ys
  (x ': xs) ∪ ys = x ': (xs ∪ ys)

It is not strictly a set-union; it does not remove duplicates from the resulting list. That can also be done, but at the cost of compilation time, and having duplicates in our particular case is fine.

For “set difference” \\ we need a helper that removes all occurrences of a given value from the list, and here things get interesting:

type family Del (xs :: [a]) (d :: a) :: [a] where
  Del '[] _ = '[]
  Del (d ': xs) d = Del xs d
  Del (x ': xs) d = x ': Del xs d

This looks almost like its term-level analog, except… that second equation looks funny, and we also do not need anything like the term-level Eq constraint anywhere!

Unlike their term-level counterparts, type family equations can be non-linear: the second equation reduces types of the form Del (d ': xs) e when d and e are syntactically (that is, structurally) equal. Otherwise, if they are not syntactically equal, the term is reduced using the third equation. By the way, this might remind you of unification in the style of logic programming, and rightfully so!

Having this helper, \\ is easy to implement:

type family (\\) (xs :: [a]) (ys :: [a]) :: [a] where
  xs \\ '[] = xs
  xs \\ (y ': ys) = Del xs y \\ ys

Having all these type families, the above definition of >:> will also type check!

Running the transformation

Now that we have learned to write and combine the transformations, we just need to run them. This is also quite straightforward — we can only run a transformation which has all its dependencies satisfied, so the requires list must be empty:

xform :: Xform' [] provides a b -> a -> b
xform (Xform f) = f

We might even not export the constructor of Xform from the module where we are defining all this, forcing users to use the xform with the lowercase x and ensuring there are no unmet requirements.

This works fine as long as the transformation has all its requirements provided. Otherwise, the error message is hard to decipher:

Pipeline.hs:75:19: error:
    * Couldn't match type: '[Prenex, EtaExpanded]
                     with: '[]
      Expected: Xform'
                  '[]
                  '[Monomorphized, Defunctionalized]
                  Program
                  Program
        Actual: Xform'
                  '[Prenex, EtaExpanded]
                  '[Monomorphized, Defunctionalized]
                  Program
                  Program

It is possible to make sense of it, but can we do better? Luckily, GHC has supported custom error messages for some time, so the answer is yes! The way it works is that every time GHC sees a TypeError (from GHC.TypeLits) somewhere in the types, it produces the corresponding type error.

So let’s introduce a type family ReqsSatisfied of the kind Constraint (another nice feature of GHC), which evaluates to a type error if there are any missing requirements:

type family ReqsSatisfied (requirements :: [tag]) :: Constraint where
  ReqsSatisfied '[] = () ~ ()
  ReqsSatisfied reqs = TypeError ('Text "Unsatisfied transformation pipeline dependencies: " ':<>: 'ShowType reqs)

The first equation matches the case when there are no (unsatisfied) requirements, and the type family then evaluates to the trivial constraint stating that the unit type is equal to itself (the syntax a ~ b means that the type a must be equal to type b). The second equation matches all other cases, where Text, :<>: and ShowType in its right-hand side come from GHC.TypeLits.

If we stick this constraint into the xform function:

xform :: (ReqsSatisfied requires) => Xform' requires provides a b -> a -> b

then GHC will complain with a more readable error message:

    * Unsatisfied transformation pipeline dependencies: '[Prenex,
                                                          EtaExpanded]

when the dependencies are unsatisfied.

Now we can ship this module to users without expecting them to be type family experts!

Design decisions and shortcomings

There are other ways to solve the problem, and the proposed one is not without trade-offs.

Tags or strings

In the above, the different transformations are denoted by different types (data Prenex, for instance). This means that if prenex and monomorphize are defined in separate modules, then monomorphize needs to import the module with the Prenex tag — otherwise it will not be able to name it!

Another approach is to use type-level string literals, so our Tags becomes:

type Tags = [Symbol]    -- Symbol from GHC.TypeLits

and our monomorphize will be:

monomorphize :: Xform '["Prenex"] '["Monomorphized"] Program

not needing the Prenex or Monomorphized tags.

Generally, stringly-typed code seems to be frowned upon in the software engineering community, but in this case this is OK: all the possible errors will be caught at compile-time anyway.

The only thing worth noting is that a string "Prenex" will be the same no matter which module it is written in, while the same sequence of tokens data Prenex written in different modules defines two different types. Depending on your particular use case, either behavior might be the desired one!

Keep or delete

Remember that we calculated the guarantees provided by the composition of f1 :: XForm r1 p1 a and f2 :: XForm r2 p2 a as p1 ∪ p2. This obviously only holds if running f2 does not invalidate any of the assertions in p1, and we assume this is the case for any pair of transformations we are working with. Otherwise, the matters are significantly complicated since every transformation effectively needs to know about any other one and whether its invariants are broken.

Note that we cannot keep just the guarantees provided by the later transformation f2, as this effectively means we always lose information about the earlier f1. And even if we re-do f1, we will analogously lose information about f2 — a vicious cycle with no end!

`Xform` or tagged `Program`

Another approach is to tag the Program type itself, so we would have:

prenex :: Program xs -> Program (Prenex ': xs)

defunctionalize :: (EtaExpanded ∊ xs, Monomorphized ∊ xs) => Program xs -> Program (Defunctionalized ': xs)

where ∊ is a Constraint-kinded type family along the lines of ∪ and \\. So, requirements go to the left of =>, and the guarantees go to the return type — some pleasant symmetry here!

In a sense, this approach specifies the behaviors even more directly in types, and the requirements/guarantees check of the composition (which can be done with the good old . in this case) is delegated to the GHC constraint solver, so we do not need to write ∪ and \\ ourselves. On the other hand, the rest of our compiler needs to be made polymorphic in xs of the Program xs.

If the Program type is used in many places, especially if it is a member of some other product types, then either xs needs to be propagated there as well, or existential types need to come into play. Depending on how much code needs to be changed and whether the rest of the compiler will benefit from having the requirements on the Program stated in its types, this might be either a good or a bad thing.

Overall, this choice is essentially a matter of context, taste, and stylistic judgment.

Lack of checks inside the transformations

In the above, if we omit any of the requirements in, say:

defunctionalize :: Xform '[EtaExpanded, Monomorphized] '[Defunctionalized] Program

then its body still compiles. GHC does not check it to see whether all the assumptions are really expressed in the type. Unfortunately, this is only truly achievable with full dependent types, but indexing the Program type with this list of tags and making different constructors available depending on a particular list might get you reasonably far.

This approach falls quite outside of the scope of this post, but the trees that grow pattern is highly relevant and might be a good source of inspiration here as well.

Run-time composition is hard

All these checks happen at compile-time, which has its merits, but it also means that composing the transformations at run time (think about the motivating example with run-time CLI switches controlling optimization passes) is hard. There are also several ways of solving this, but this post is getting long already, so let’s leave this as a cliffhanger!

Conclusion

To sum up, we have looked at how we can make data transformation pipelines more type safe: the types help ensure all the prerequisites are met for each step in a pipeline. We have looked at some relatively advanced features of the GHC type system and how they help us achieve type safety in a real-world use case.

We have also considered some other alternatives, as well as some shortcomings of the proposed solution.

Behind the scenes

Georg Rudoy

Tech Group

High Assurance Software

Automating the verification of correct behaviour in complex systems.

Tech Group

Programming Languages and Compilers

Research, create, improve and maintain programming languages and their tooling to enhance developer productivity and to deliver reliable, maintainable, correct and performant software with minimum effort.

If you enjoyed this article, you might be interested in joining the Tweag team.

This article is licensed under a Creative Commons Attribution 4.0 International license.

← Processing medical images at scale on the cloud Announcing Skyscope →