a tutorialMaking two garbage collectors be good neighbours

(using linear types)Parallelising your array code

16 November 2017

|*This is the fifth post in a series about array programming in Haskell — you might be interested in the first, second, third, and fourth, too.*

A recurring theme in array programming is performance. After all, many algorithms in numerical computing and data science are computationally intensive. Once the sequential implementation of an array program has been fully optimised, the natural next step is to use one or multiple forms of parallelism to achieve further performance improvements. This can be parallelism within one computational core (SIMD parallelism), multicore parallelism, or distributed multi-machine parallelism. Unfortunately, at this point matters become much more complicated, because parallel programming comes with its own set of serious challenges.

In this post, we will focus on multicore parallelism for computations operating on multi-dimensional arrays. In other words, in relation to the `vector`

package, which we discussed in the last post, we have two new ingredients. Firstly, instead of *one-dimensional* Int-indexed arrays, we have *multi-dimensional* Int-indexed arrays. Secondly, the collective operations provided on these arrays come with parallel implementations. In fact, the library API is designed to favour collective operations that have good parallel implementations. Similarly, the move to explicitly multi-dimensional arrays is motivated by being able to provide parallel implementations that take the array shape into account, wherever that is an advantage.

To make matters concrete, we will discuss the `Repa`

library. Internally it uses many of the same techniques as `vector`

, including *strictness*, *unboxing*, and a *two-phase* initialisation strategy. However, it uses a second array fusion strategy in addition to `vector`

’s *stream fusion*. More precisely, `Repa`

internally uses `vector`

to represent plain boxed and unboxed arrays and to execute sequential computations on those, which still benefit from stream fusion. However, `Repa`

introduces additional array representations, such as *delayed arrays*, to also achieve fusion across parallel computations.

This additional complication is necessary as stream fusion, by itself, tends to turn parallel into sequential code. In other words, one of the challenges of high-performance parallel array implementations that are built on collective operations is *to apply fusion while preserving parallelism*. To really get good performance, we need to simultaneously optimize along two orthogonal dimensions: get more done simultaneously, by parallelizing, but also make each sequential unit of work run faster.

A second consequence of targeting a parallelisation-friendly API is a very limited use of mutable arrays. Mutable structures generally interact badly with concurrency and parallelism, opening the door to a whole range of hard to diagnose faults. In fact, the focus on immutable arrays for parallel programming is *one of the most compelling conceptual improvements of functional over imperative parallel array programming*. (To be precise, `Repa`

’s API does provide access to the mutable array structures used to implement two-phase initialisation, but it is usually not necessary to use them directly.)

The obvious structure for indexing multi-dimensional Int-indexed arrays are tuples of `Int`

s. However, they come with two severe drawbacks: (1) they force us to fix the dimensionality of all functions over arrays and (2) they are not sufficient to characterise operations on lower-dimensional subarrays of an array (e.g., a two-dimensional plane within a three-dimensional cube).

As an example of the first drawback, consider a fold function that given a three-dimensional cube, reduces it along, say, the x-axis to a two-dimensional plane of sums. The only difference of that operation compared to a fold that sums a two-dimensional plane across one axis to a one-dimensional vector is the number of dimensions that we do not reduce along. Now, we could have a family of fold functions (`fold1`

, `fold2`

, and so on), one for each possible dimension of argument array. But that is hardly satisfactory.

Instead, Repa uses a custom datatype for indexing. Index types are built from the infix constructor `(:.)`

and the constant `Z`

, representing a zero-dimensional array (which is the special cases of a singleton array). For example, the type of two-dimensional indices is `Z :. Int :. Int`

and one of its values is `Z :. 3 :. 5`

. By using a type variable instead of `Z`

, we can denote indices with a particular minimum dimensionality. For instance, `sh :. Int`

has at least one dimension, but it might have more, depending on how the type variable `sh`

is instantiated — in any case, instances of `sh`

need to be drawn from the class `Shape`

. On the basis of this index representation, we can capture the entire family of multi-dimensional fold functions in a single type:

```
foldS :: (Shape sh, Source r a, Unbox a)
=> (a -> a -> a) -> a -> Array r (sh :. Int) a -> Array U sh a
```

The function `foldS`

implements a sequential, multi-dimensional reduction; hence, the `S`

suffix. It gets three arguments:

`a -> a -> a`

is the type of the binary reduction function, which needs to be associative,`a`

is the reduction function’s neutral (i.e, together they form a monoid), and`Array e (sh :. Int) a`

is an at least one-dimensional array of elements of type`a`

, which the type constraint`Unbox a`

requires to be a type that has an associated unboxed representation.

Finally, the result of type `Array U sh a`

has one dimension less than the argument array, but contains elements of the same type `a`

. This leaves us with wondering about the meaning of the first type argument of `Array`

— `r`

and `U`

, respectively— as well as the type constraint `Source r a`

.

The first type argument of `Array`

determines the array *representation*. The available representations include boxed (`V`

) and unboxed (`U`

) representations, but also *delayed* (`D`

) and *cursored* (`C`

) representations. The latter are guaranteed to be removed by fusion, but can lead to the superfluous recomputation of array elements that are used more than once. Repa makes the choice of representation explicit to place it under programmer control — experience shows that compiler heuristics for automatic representation selection tend to be fragile and unreliable.

A consequence of a representation that is fused away, such as delayed `D`

and cursored `C`

, is that it can only be a data `Source`

of a computation. Hence, the type class of the same name provides elementary array access functions for arrays. The opposite, a `Target`

, provides the functionality to fill an array as part of two-phase initialisation and is only available to *manifest* representations, such as the boxed `V`

and unboxed `U`

representation. A manifest representation is one which, in contrast to a fused-away delayed representation, is actually stored in memory.

In addition to concrete representations, Repa representation tags can also include meta information, such as the *interleaving* hint `I`

. An array tagged `I U`

uses an unboxed interleaved representation, which improves parallel load balancing in parallel computations where the amount of work strongly varies between different regions in the parallel array. A standard example is computing a Mandelbrot set, where black pixels are significantly more expensive than others.

As we saw above with `foldS`

, Repa follows the convention of adding an `S`

to sequential array operations. Similarly, it uses a `P`

as a suffix for parallel functions. For example, we have

```
foldP :: (Shape sh, Source r a, Unbox a, Monad m)
=> (a -> a -> a) -> a -> Array r (sh :. Int) a -> m (Array U sh a)
```

for the parallel version of fold. The distinction between sequential and parallel functions is an important one, since Repa does not support nested parallelism. That is, a parallel function (e.g., `foldP`

) cannot use another parallel function as an argument (e.g., as the combination function).

In addition to the suffix, the parallel fold distinguishes itself from the sequential by the use of a not further specified monad. The purpose of this monad is to ensure the one-by-one execution of pipelines of parallel computations. This is important to prevent inadvertent nesting of parallel computations as Haskell is a lazy language and we might otherwise feed a suspended (i.e., not yet evaluated) parallel computation into another parallel computation.

As a simple example of a parallel computation, consider the multiplication of two matrices `arr`

and `brr`

of type `Array U DIM2 Double`

(two-dimensional, unboxed arrays), where `type DIM2 = Z :. Int :. Int`

:

```
mmultP :: Monad m
=> Array U DIM2 Double
-> Array U DIM2 Double
-> m (Array U DIM2 Double)
mmultP arr brr
= do trr <- transpose2P brr
computeP (fromFunction (Z :. h1 :. w2) dotp)
where
(Z :. h1 :. _) = extent arr
(Z :. _ :. w2) = extent brr
dotp ix = sumAllS $
zipWith (*)
(slice arr (Any :. (row ix) :. All))
(slice trr (Any :. (col ix) :. All))
```

We assume the existence of a helper function `transpose2P`

, which transposes a matrix in parallel — for example, by using Repa’s `backpermute`

function. Then, we generate the manifest result array by computing all elements of `fromFunction (Z :. h1 :. w2) dotp`

in parallel with `computeP`

. The shape (i.e., the size of the dimensions) of the result is `h1`

times `w2`

, and `fromFunction`

turns a function, which takes an array index to the corresponding array element , into a delayed array:

```
fromFunction :: sh -> (sh -> a) -> Array D sh a
```

At each index `ix`

of the resulting array, we evaluate `dotp`

, which only involves a sequential computation. It’s sequential nature is important for two reasons. Firstly, as mentioned, Repa does not support nested parallelism, so the computations on each result array index triggered by `computeP`

in parallel may themselves not be parallel. Secondly, the work complexity of matrix multiplication is *n*^3 — that is the number of scalar multiplications that need to be performed. Performing them all in parallel would lead to (a) too much and (b) too fine-grained parallelism. Both too much parallelism and parallel workloads that are each too little work lead to bad performance as they result in too much administrative overhead.

In contrast, the sequential computation performed by `dotp`

obtains a row of the matrix `arr`

and a column of `brr`

(actually, a row of the transposed `brr`

, which is `trr`

) with `slice`

, which extracts an entire subarray from an array. Then, it multiples the row and column pointwise with `zipWith (*)`

and sums up the products with `sumAllS`

, where

```
zipWith :: (Shape sh, Source r1 a, Source r2 b)
=> (a -> b -> c) -> Array r1 sh a -> Array r2 sh b -> Array D sh c
sumAllS :: (Shape sh, Source r a, Num a) => Array r sh a -> a
```

This example highlights how reasoning about the decomposition of an algorithm into parallel and sequential components is crucial for good parallel performance. This is assisted by Repa’s clear distinction between sequential and parallel operations.

Repa went through three major iterations before arriving at the current interface. The underlying concepts are described and supported by benchmarks in the papers Regular, shape-polymorphic, parallel arrays in Haskell, Efficient Parallel Stencil Convolution in Haskell, and Guiding Parallel Array Fusion with Indexed Types, respectively. In addition, Data Flow Fusion with Series Expressions in Haskell proposes a further improvement to the fusion system. However, this has not been integrated into the main package.