a tutorialMaking two garbage collectors be good neighbours

(using linear types)Parallelising your array code

12 October 2017

|*This is the fourth post in a series about array programming in Haskell — you might be interested in the first, second, and third, too.*

In the previous post of this series, we explored the basic, down-to-earth, index-based array interfaces that have their roots in the Haskell standard libraries. We also discussed the need for *strictness*, *unboxing*, and a *two-phase* initialisation strategy, freezing mutable structures after they have been populated. With those basics out of the way, we can now explore the design of higher-level, *collection-oriented* array interfaces whose efficient implementation critically relies on those basic techniques.

The vector package is probably the most popular array library in Haskell, and it is a prime example of this approach. The package, implemented by Roman Leshchinskiy, is a spin off from the Data Parallel Haskell project . Specifically, Data Parallel Haskell is organised as multiple layers of array libraries, and vector is a generalisation of what used to be the lowest layer: sequential, int-indexed arrays combined with a powerful array-fusion framework that makes the composition of successive collective operations efficient.

Collective operations that process all elements of an array in a single operation, such as `map`

, `fold`

, `zip`

, `filter`

, and so forth, fit more naturally into a functional style than the index-based loops that originated in FORTRAN. This is not just to favour combinator-based code, but also because index-based, one-element-at-a-time processing quickly leads to the use of mutable arrays. Indexed-based operations are sometimes inevitable, but they ought to be used sparingly; often we only need to extract individual or only a few elements of an array. In cases, where random data-driven access is required, collective index-based operations often suffice — such as, for example, `backpermute`

:

```
backpermute (fromList ['a', 'b', 'c', 'd']) (fromList [0, 3, 2, 3, 1, 0])
= fromList ['a', 'd', 'c', 'd', 'b', 'a']
```

The function `backpermute`

returns a result of the same length as its second argument, where the integer values of the second argument are used to determine the index positions of the elements to pick from the first argument. (The function `fromList`

turns a list into a `Vector`

.)

Consider the following pipeline of collective vector operations, where we assume `import Data.Vector as V`

to disambiguate `vector`

operations from the standard list operations of the same name:

```
V.map fst
$ V.filter (\t -> snd t > 0)
$ V.zip points cs
```

First, we combine a vector of `points :: Vector (Double, Double)`

with a vector `cs :: Vector Double`

by way of `V.zip`

to obtain a value of type `Vector ((Double, Double), Double)`

. Then, we `V.filter`

to only retain those `((Double, Double), Double)`

triples whose last component is strictly positive. Finally, we project the points out of the filtered vector. Overall, this pipeline drops all `points`

whose corresponding (index-wise) `cs`

value is zero or negative.

While this is a compact functional specification, its naive implementation would be rather inefficient. Each of the three collective operations creates a new immutable vector. All this, only to finally get a vector containing a subset of the elements of `points`

. The main problem with this is not that it is wasteful to allocate those intermediate arrays, but that initialising and then consuming those intermediates generates a lot of memory traffic — this is rather unfortunate as memory bandwidth is the most severe limiting factor of current hardware.

Hence, it is absolutely crucial that a library, such as `vector`

, performs *array fusion*, a compiler optimisation that eliminates intermediate arrays and combines multiple successive array traversals into one. In fact, the `vector`

package, compiles the above code consisting of `V.map`

, `V.filter`

, and `V.zip`

into a single traversal without any intermediates. It achieves this using a fusion framework known as stream fusion.

The `vector`

package is architected in a flexible manner with generic type classes `Vector`

(for immutable vectors) and `MVector`

(for mutable vectors), defined in `Data.Vector.Generic`

. `Vector`

is a subclass of `MVector`

, as the latter provides mutating methods on top of the pure ones.

The modules `Data.Vector`

and `Data.Vector.Unboxed`

instantiate the generic framework with concrete boxed and unboxed vector types — both again providing an immutable and a mutable flavour. In addition, there is variant of unboxed vectors, called *storable*, in `Data.Vector.Storable`

, which ties in with the `Storable`

class of the foreign function interface and is useful for interoperability with C code.

Due to this generic set up, array algorithms can be implemented generically and, then, used with all of the concrete instances. It is also perfectly possible to extend `vector`

with your own flavour of arrays; for example, for interoperability with another array library.

One of the algorithms included in the benchmarks that are part of `vector`

is the quickhull algorithm for computing the convex hull of a set of points. Quickhull is —just like quicksort— a split-based divide-and-conquer algorithm. It begins by cutting the set of points into two by drawing a line between the leftmost and rightmost point in the set. Then, it recursively operates on these two sets of points. In each recursive invocation, the point furthest from the last dividing line gives rise to two new lines (using the two end points of that last dividing line). This is illustrated by the following animation from Wikimedia Commons.

Given a `Vector`

of x/y-coordinates, we can implement the initial step of quickhull, namely determining the left and rightmost point, as follows:

```
import Data.Vector.Unboxed as V
quickhull :: Vector (Double, Double) -> Vector (Double, Double)
quickhull points = hsplit points pmin pmax V.++ hsplit points pmax pmin
where
xs = V.map fst points
pmin = points V.! V.minIndex xs
pmax = points V.! V.maxIndex xs
```

The functions `V.minIndex`

and `V.maxIndex`

return the index position of the smallest and largest element of the argument vector, respectively (here, of the x-coordinates `xs`

of all points). We use those to extract the leftmost (`pmin`

) and rightmost (`pmax`

) point. When invoking the recursive component of quickhull —here, implemented by `hsplit`

— we pass the points and the dividing line in the form of its starting and end point. As the latter are swapped in the two recursive calls, one call processes the points above the dividing line and the other call processes the points below that line.

The function `hsplit`

, first, computes the distance `cs`

of all `points`

to the line from `p1`

to `p2`

. Second, it uses the pipeline of collective operations that we discussed above to remove all `points`

that are below that line. The remaining points, `packed`

, are passed into the two recursive calls together with `pm`

, the point that is furthest from the line.

```
hsplit :: (Ord b, Num b, Unbox b)
=> Vector (b, b) -> (b, b) -> (b, b) -> Vector (b, b)
hsplit points p1 p2
| V.length packed < 2 = p1 `V.cons` packed
| otherwise = hsplit packed p1 pm V.++ hsplit packed pm p2
where
cs = V.map (\p -> cross p p1 p2) points
packed = V.map fst
$ V.filter (\t -> snd t > 0)
$ V.zip points cs
pm = points V.! V.maxIndex cs
-- cross p p1 p2 = distance of p from line between p1 and p2
cross :: Num a => (a, a) -> (a, a) -> (a, a) -> a
cross (x,y) (x1,y1) (x2,y2) = (x1-x)*(y2-y) - (y1-y)*(x2-x)
```

The function `cross`

that we use to compute the distance of a point `p`

from the line between points `p1`

and `p2`

returns a signed distance; i.e., a positive distance value is for a point above the line and a negative distance value for a point below the line. This is crucial for `V.filter (\t -> snd t > 0)`

to remove all points located on or below the line.

The `vector`

package optimises the quickhull code rather well and gets within about a factor of two of the performance of hand-optimised C code. However, there are two kinds of obstacles that may prevent stream fusion and other fusion systems from optimising code as expected: (1) the interaction with other compiler optimisations and (2) the strength of the fusion system.

Concerning Point (1), the interaction of multiple compiler optimisations and finding the right order in which to apply them are a topic for PhD theses as well as a black art of skilled engineers. In the Glasgow Haskell Compiler (GHC), it is especially *inlining*, *let floating*, *strictness analysis*, and various forms of *rewriting* and *specialisation* that can have a huge impact on the effectiveness of fusion. The details are beyond the scope of this post, but some of it is covered in the rather extensive GHC optimization and fusion.

Concerning Point (2), practical fusion systems generally cannot fuse all fusable programs. In other words, given a particular fusion system, it will spot and optimise certain types of computations that can be fused. However, there are other types of computations that can also be fused, but this particular fusion system will not be able to recognise and exploit those opportunities.

In fact, the algorithmic structure of quickhull does expose a specific limit of `vector`

’s array fusion framework (i.e., stream fusion for arrays). That limit is the reason why `vector`

can only get within about a factor of two of the performance of hand-optimised C code for quickhull. The details of this limitation as well as a proposal to overcome it are explored in the paper Data flow fusion with series expressions in Haskell (PDF). The paper also contains concrete benchmark results, including results for quickhull. Finally, the latest insights into overcoming this specific kind of limitation are described in Machine fusion: merging merges, more or less (PDF).

The `vector`

package is a tool that should be in every Haskell developers tool box. It is versatile, efficient, and widely used. Consequently, there is a range of add ons; in particular, vector-algorithms, which implements search and sorting algorithms, is quite popular.

The `vector`

package works hard to achieve excellent sequential performance. However, it, by design, ignores the option of improving performance by parallelisation. This will be the topic of the next post in this series.