Computing is all about transforming data. A wide variety of domains, such as multimedia, securities trading or compilers, allow decomposing the corresponding transformations into a sequence of well-defined steps. Moreover, these steps can be combined in different ways, perhaps omitting some or changing the order of others, producing different data processing pipelines tailored to a particular task at hand.

Consider your favorite compiler: it consists of a lexical and syntactical analyzer, a possible preprocessor, some optimization passes (which can be turned on or off), various optional static analysis features, a code generator, and so on. Yet, some of these steps might depend on each other, and some warnings might only make sense before certain other optimizations have happened.

In this post, we look at one way of representing those dependencies. Namely, dependencies are encoded in the types, allowing compile-time checking and serving as the code documentation.

The reader is assumed to be somewhat familiar with the `DataKinds`

and `TypeFamilies`

extensions,
but we will review some peculiarities.

Suppose we are writing a compiler for a System F-based language, with the following transformations:

- Prenex form conversion — moving all universal quantifiers to the front (so that, for example,
`∀a. [a] → ∀b. (a → b) → [b]`

becomes`∀a. ∀b. [a] → (a → b) → [b]`

). - Eta-expansion — turning partially applied functions to fully applied ones under a λ-abstraction (so
`map f`

becomes`λ xs. map f xs`

). - Monomorphization — applying the types to the polymorphic functions at compile-time
(turning
`map (+ 1) [1, 2, 3]`

into`mapInt (+ 1) [1, 2, 3]`

, where`mapInt`

only works with`Int`

s). - Defunctionalization — replacing functional arguments in higher-order functions with tags denoting the context of those functions (Wikipedia has a good article on this).

These are expressed as functions:

```
prenex :: Program -> Program
etaExpand :: Program -> Program
monomorphize :: Program -> Program
defunctionalize :: Program -> Program
```

where `Program`

is the type of (abstract representations of) programs we are compiling.

Prenex conversion and eta-expansion do not have any requirements on the input program.
Monomorphization, conversely, is more easily expressed if the input is assumed to be in prenex form,
and defunctionalization assumes the input is eta-expanded and monomorphized.
But this is not obvious at all from the types above,
so our best bet would be to document the requirements in the comments
and hope the next person will read them.
Even then, however, GHC will not complain if we write `myPipeline = monomorphize . etaExpand`

despite it being incorrect, as we did not put the input into the prenex form!

How can we express these transformations to avoid missing any necessary steps? One way of doing this is via…

## Type-level tags

We assume some extensions are enabled:

```
{-# LANGUAGE ConstraintKinds #-}
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE PolyKinds #-}
{-# LANGUAGE StandaloneKindSignatures #-}
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE TypeOperators #-}
{-# LANGUAGE UndecidableInstances #-}
```

Then, we introduce a type `Xform'`

(the `'`

at the end will be handy later):

```
import Data.Kind (Type)
type Tags = [Type]
type Xform' :: Tags -> Tags -> Type -> Type -> Type
newtype Xform' requires provides a b = Xform (a -> b)
```

so `Xform' requires provides a b`

is really a function `a -> b`

under the hood,
but it also carries around in its type two lists:
its requirements on the input, and the guarantees it provides.

A transformation usually does not change the underlying type, so we also introduce a handy alias:

`type Xform requires provides a = Xform' requires provides a a`

Then the transformations above get the following types:

```
data Prenex
prenex :: Xform '[] '[Prenex] Program
data EtaExpanded
etaExpand :: Xform '[] '[EtaExpanded] Program
data Monomorphized
monomorphize :: Xform '[Prenex] '[Monomorphized] Program
data Defunctionalized
defunctionalize :: Xform '[EtaExpanded, Monomorphized] '[Defunctionalized] Program
```

Note that each tag (`Prenex`

, `EtaExpanded`

and so on) is a data type
that is defined separately, so the set of tags is open to extension.

In case you are curious about the `'`

syntax,
the tick before the lists
is the `DataKinds`

-enabled syntax to promote constructors to the “type level”.
So, while `[]`

is a *term* of *type* `[a]`

,
`'[]`

is a *type* of *kind* `[a]`

,
and `'[Prenex]`

is a *type* of *kind* `[Type]`

.
Compare that to `[Prenex]`

, which is a *type* of *kind* `Type`

.
We will also see the `:`

constructor promoted to the type level, as `':`

, later.

If you would like to play around with this, `ghci`

comes in handy,
with the `:t`

and `:k`

commands.
Try entering `:k [Prenex]`

and `:k '[Prenex]`

, for example!

Alright, we’ve already got the documentation at the type level! Now let’s see how to compose these transformations.

## Composition

We cannot use the usual function composition operator `.`

as the types will not match,
and we also need to keep track of the `requires`

and `provides`

of the composite transformation.

So let’s say we have two transformations, `x1`

and `x2`

, having requirements `r1`

and `r2`

respectively,
and asserting guarantees `p1`

and `p2`

about their outputs, respectively.
What is the requirement of the composition?
It is all that `x1`

requires, plus all that `x2`

requires *and* that is not satisfied by `x1`

.
That is, symbolically, the set `r1 ∪ (r2 \ p1)`

.
What does the composition provide?
There are some choices here, which we will discuss later,
but for definiteness we stick with the union of `p1`

and `p2`

.

So, in code, the composition operator looks like this (replacing `\`

with more `Data.List`

-ish `\\`

):

```
(>:>) ::
Xform' r1 p1 a b ->
Xform' r2 p2 b c ->
Xform' (r1 ∪ (r2 \\ p1)) (p1 ∪ p2) a c
Xform f1 >:> Xform f2 = Xform $ f2 . f1
```

All the dependency tracking magic happens in `∪`

and `\\`

, which must be type-level functions.
We can use type families to implement them.

The union `∪`

is as straightforward as its term-level counterpart `++`

, modulo the type family syntax:

```
type family (∪) (xs :: [a]) (ys :: [a]) :: [a] where
'[] ∪ ys = ys
(x ': xs) ∪ ys = x ': (xs ∪ ys)
```

It is not strictly a set-union; it does not remove duplicates from the resulting list. That can also be done, but at the cost of compilation time, and having duplicates in our particular case is fine.

For “set difference” `\\`

we need a helper that removes all occurrences of a given value from the list,
and here things get interesting:

```
type family Del (xs :: [a]) (d :: a) :: [a] where
Del '[] _ = '[]
Del (d ': xs) d = Del xs d
Del (x ': xs) d = x ': Del xs d
```

This looks almost like its term-level analog, except… that second equation looks funny,
and we also do not need anything like the term-level `Eq`

constraint anywhere!

Unlike their term-level counterparts, type family equations can be
non-linear:
the second equation reduces types of the form `Del (d ': xs) e`

when `d`

and `e`

are syntactically (that is, structurally) equal.
Otherwise, if they are not syntactically equal,
the term is reduced using the third equation.
By the way, this might remind you of unification in the style of logic programming,
and rightfully so!

Having this helper, `\\`

is easy to implement:

```
type family (\\) (xs :: [a]) (ys :: [a]) :: [a] where
xs \\ '[] = xs
xs \\ (y ': ys) = Del xs y \\ ys
```

Having all these type families, the above definition of `>:>`

will also type check!

## Running the transformation

Now that we have learned to write and combine the transformations, we just need to run them.
This is also quite straightforward — we can only run a transformation which has all its dependencies satisfied,
so the `requires`

list must be empty:

```
xform :: Xform' [] provides a b -> a -> b
xform (Xform f) = f
```

We might even *not* export the constructor of `Xform`

from the module where we are defining all this,
forcing users to use the `xform`

with the lowercase `x`

and ensuring there are no unmet requirements.

This works fine as long as the transformation has all its requirements provided. Otherwise, the error message is hard to decipher:

```
Pipeline.hs:75:19: error:
* Couldn't match type: '[Prenex, EtaExpanded]
with: '[]
Expected: Xform'
'[]
'[Monomorphized, Defunctionalized]
Program
Program
Actual: Xform'
'[Prenex, EtaExpanded]
'[Monomorphized, Defunctionalized]
Program
Program
```

It is possible to make sense of it, but can we do better?
Luckily, GHC has supported
custom error messages
for some time, so the answer is yes!
The way it works is that every time GHC sees a `TypeError`

(from `GHC.TypeLits`

)
somewhere in the types, it produces the corresponding type error.

So let’s introduce a type family `ReqsSatisfied`

of the kind `Constraint`

(another nice feature of GHC),
which evaluates to a type error if there are any missing requirements:

```
type family ReqsSatisfied (requirements :: [tag]) :: Constraint where
ReqsSatisfied '[] = () ~ ()
ReqsSatisfied reqs = TypeError ('Text "Unsatisfied transformation pipeline dependencies: " ':<>: 'ShowType reqs)
```

The first equation matches the case when there are no (unsatisfied) requirements,
and the type family then evaluates to the trivial constraint
stating that the unit type is equal to itself
(the syntax `a ~ b`

means that the type `a`

must be equal to type `b`

).
The second equation matches all other cases,
where `Text`

, `:<>:`

and `ShowType`

in its right-hand side come from `GHC.TypeLits`

.

If we stick this constraint into the `xform`

function:

`xform :: (ReqsSatisfied requires) => Xform' requires provides a b -> a -> b`

then GHC will complain with a more readable error message:

```
* Unsatisfied transformation pipeline dependencies: '[Prenex,
EtaExpanded]
```

when the dependencies are unsatisfied.

Now we can ship this module to users without expecting them to be type family experts!

## Design decisions and shortcomings

There are other ways to solve the problem, and the proposed one is not without trade-offs.

### Tags or strings

In the above, the different transformations are denoted by different types (`data Prenex`

, for instance).
This means that if `prenex`

and `monomorphize`

are defined in separate modules,
then `monomorphize`

needs to import the module with the `Prenex`

tag — otherwise it will not be able to name it!

Another approach is to use type-level string literals, so our `Tags`

becomes:

`type Tags = [Symbol] -- Symbol from GHC.TypeLits`

and our `monomorphize`

will be:

`monomorphize :: Xform '["Prenex"] '["Monomorphized"] Program`

not needing the `Prenex`

or `Monomorphized`

tags.

Generally, stringly-typed code seems to be frowned upon in the software engineering community, but in this case this is OK: all the possible errors will be caught at compile-time anyway.

The only thing worth noting is that a string `"Prenex"`

will be the same
no matter which module it is written in,
while the same sequence of tokens `data Prenex`

written in different modules
defines two different types.
Depending on your particular use case, either behavior might be the desired one!

### Keep or delete

Remember that we calculated the guarantees provided by the
composition of `f1 :: XForm r1 p1 a`

and `f2 :: XForm r2 p2 a`

as `p1 ∪ p2`

.
This obviously only holds if running `f2`

does not invalidate any of the assertions in `p1`

,
and we assume this is the case for any pair of transformations we are working with.
Otherwise, the matters are significantly complicated
since every transformation effectively needs to know about any other one
and whether its invariants are broken.

Note that we cannot keep just the guarantees provided by the later transformation `f2`

,
as this effectively means we always lose information about the earlier `f1`

.
And even if we re-do `f1`

, we will analogously lose information about `f2`

—
a vicious cycle with no end!

`Xform`

or tagged `Program`

Another approach is to tag the `Program`

type itself, so we would have:

```
prenex :: Program xs -> Program (Prenex ': xs)
defunctionalize :: (EtaExpanded ∊ xs, Monomorphized ∊ xs) => Program xs -> Program (Defunctionalized ': xs)
```

where `∊`

is a `Constraint`

-kinded type family along the lines of `∪`

and `\\`

.
So, requirements go to the left of `=>`

, and the guarantees go to the return type —
some pleasant symmetry here!

In a sense, this approach specifies the behaviors even more directly in types,
and the requirements/guarantees check of the composition
(which can be done with the good old `.`

in this case)
is delegated to the GHC constraint solver,
so we do not need to write `∪`

and `\\`

ourselves.
On the other hand, the rest of our compiler
needs to be made polymorphic in `xs`

of the `Program xs`

.

If the `Program`

type is used in many places,
especially if it is a member of some other product types,
then either `xs`

needs to be propagated there as well,
or existential types need to come into play.
Depending on how much code needs to be changed
and whether the rest of the compiler will benefit
from having the requirements on the `Program`

stated in its types,
this might be either a good or a bad thing.

Overall, this choice is essentially a matter of context, taste, and stylistic judgment.

### Lack of checks inside the transformations

In the above, if we omit any of the requirements in, say:

`defunctionalize :: Xform '[EtaExpanded, Monomorphized] '[Defunctionalized] Program`

then its body still compiles.
GHC does not check it to see
whether all the assumptions are really expressed in the type.
Unfortunately, this is only truly achievable with full dependent types,
but indexing the `Program`

type with this list of tags
and making different constructors available depending on a particular list
might get you reasonably far.

This approach falls quite outside of the scope of this post, but the trees that grow pattern is highly relevant and might be a good source of inspiration here as well.

### Run-time composition is hard

All these checks happen at compile-time, which has its merits, but it also means that composing the transformations at run time (think about the motivating example with run-time CLI switches controlling optimization passes) is hard. There are also several ways of solving this, but this post is getting long already, so let’s leave this as a cliffhanger!

## Conclusion

To sum up, we have looked at how we can make data transformation pipelines more type safe: the types help ensure all the prerequisites are met for each step in a pipeline. We have looked at some relatively advanced features of the GHC type system and how they help us achieve type safety in a real-world use case.

We have also considered some other alternatives, as well as some shortcomings of the proposed solution.

## About the author

If you enjoyed this article, you might be interested in joining the Tweag team.