Parallelising your array code

16 November 2017 — by Manuel M T Chakravarty

This is the fifth post in a series about array programming in Haskell — you might be interested in the first, second, third, and fourth, too.

A recurring theme in array programming is performance. After all, many algorithms in numerical computing and data science are computationally intensive. Once the sequential implementation of an array program has been fully optimised, the natural next step is to use one or multiple forms of parallelism to achieve further performance improvements. This can be parallelism within one computational core (SIMD parallelism), multicore parallelism, or distributed multi-machine parallelism. Unfortunately, at this point matters become much more complicated, because parallel programming comes with its own set of serious challenges.

In this post, we will focus on multicore parallelism for computations operating on multi-dimensional arrays. In other words, in relation to the vector package, which we discussed in the last post, we have two new ingredients. Firstly, instead of one-dimensional Int-indexed arrays, we have multi-dimensional Int-indexed arrays. Secondly, the collective operations provided on these arrays come with parallel implementations. In fact, the library API is designed to favour collective operations that have good parallel implementations. Similarly, the move to explicitly multi-dimensional arrays is motivated by being able to provide parallel implementations that take the array shape into account, wherever that is an advantage.

To make matters concrete, we will discuss the Repa library. Internally it uses many of the same techniques as vector, including strictness, unboxing, and a two-phase initialisation strategy. However, it uses a second array fusion strategy in addition to vector’s stream fusion. More precisely, Repa internally uses vector to represent plain boxed and unboxed arrays and to execute sequential computations on those, which still benefit from stream fusion. However, Repa introduces additional array representations, such as delayed arrays, to also achieve fusion across parallel computations.

This additional complication is necessary as stream fusion, by itself, tends to turn parallel into sequential code. In other words, one of the challenges of high-performance parallel array implementations that are built on collective operations is to apply fusion while preserving parallelism. To really get good performance, we need to simultaneously optimize along two orthogonal dimensions: get more done simultaneously, by parallelizing, but also make each sequential unit of work run faster.

A second consequence of targeting a parallelisation-friendly API is a very limited use of mutable arrays. Mutable structures generally interact badly with concurrency and parallelism, opening the door to a whole range of hard to diagnose faults. In fact, the focus on immutable arrays for parallel programming is one of the most compelling conceptual improvements of functional over imperative parallel array programming. (To be precise, Repa’s API does provide access to the mutable array structures used to implement two-phase initialisation, but it is usually not necessary to use them directly.)

Multiple dimensions

The obvious structure for indexing multi-dimensional Int-indexed arrays are tuples of Ints. However, they come with two severe drawbacks: (1) they force us to fix the dimensionality of all functions over arrays and (2) they are not sufficient to characterise operations on lower-dimensional subarrays of an array (e.g., a two-dimensional plane within a three-dimensional cube).

As an example of the first drawback, consider a fold function that given a three-dimensional cube, reduces it along, say, the x-axis to a two-dimensional plane of sums. The only difference of that operation compared to a fold that sums a two-dimensional plane across one axis to a one-dimensional vector is the number of dimensions that we do not reduce along. Now, we could have a family of fold functions (fold1, fold2, and so on), one for each possible dimension of argument array. But that is hardly satisfactory.

Instead, Repa uses a custom datatype for indexing. Index types are built from the infix constructor (:.) and the constant Z, representing a zero-dimensional array (which is the special cases of a singleton array). For example, the type of two-dimensional indices is Z :. Int :. Int and one of its values is Z :. 3 :. 5. By using a type variable instead of Z, we can denote indices with a particular minimum dimensionality. For instance, sh :. Int has at least one dimension, but it might have more, depending on how the type variable sh is instantiated — in any case, instances of sh need to be drawn from the class Shape. On the basis of this index representation, we can capture the entire family of multi-dimensional fold functions in a single type:

foldS :: (Shape sh, Source r a, Unbox a)
      => (a -> a -> a) -> a -> Array r (sh :. Int) a -> Array U sh a

The function foldS implements a sequential, multi-dimensional reduction; hence, the S suffix. It gets three arguments:

a -> a -> a is the type of the binary reduction function, which needs to be associative,
a is the reduction function’s neutral (i.e, together they form a monoid), and
Array e (sh :. Int) a is an at least one-dimensional array of elements of type a, which the type constraint Unbox a requires to be a type that has an associated unboxed representation.

Finally, the result of type Array U sh a has one dimension less than the argument array, but contains elements of the same type a. This leaves us with wondering about the meaning of the first type argument of Array — r and U, respectively— as well as the type constraint Source r a.

Indexed arrays

The first type argument of Array determines the array representation. The available representations include boxed (V) and unboxed (U) representations, but also delayed (D) and cursored (C) representations. The latter are guaranteed to be removed by fusion, but can lead to the superfluous recomputation of array elements that are used more than once. Repa makes the choice of representation explicit to place it under programmer control — experience shows that compiler heuristics for automatic representation selection tend to be fragile and unreliable.

A consequence of a representation that is fused away, such as delayed D and cursored C, is that it can only be a data Source of a computation. Hence, the type class of the same name provides elementary array access functions for arrays. The opposite, a Target, provides the functionality to fill an array as part of two-phase initialisation and is only available to manifest representations, such as the boxed V and unboxed U representation. A manifest representation is one which, in contrast to a fused-away delayed representation, is actually stored in memory.

In addition to concrete representations, Repa representation tags can also include meta information, such as the interleaving hint I. An array tagged I U uses an unboxed interleaved representation, which improves parallel load balancing in parallel computations where the amount of work strongly varies between different regions in the parallel array. A standard example is computing a Mandelbrot set, where black pixels are significantly more expensive than others.

Parallelism

As we saw above with foldS, Repa follows the convention of adding an S to sequential array operations. Similarly, it uses a P as a suffix for parallel functions. For example, we have

foldP :: (Shape sh, Source r a, Unbox a, Monad m)
      => (a -> a -> a) -> a -> Array r (sh :. Int) a -> m (Array U sh a)

for the parallel version of fold. The distinction between sequential and parallel functions is an important one, since Repa does not support nested parallelism. That is, a parallel function (e.g., foldP) cannot use another parallel function as an argument (e.g., as the combination function).

In addition to the suffix, the parallel fold distinguishes itself from the sequential by the use of a not further specified monad. The purpose of this monad is to ensure the one-by-one execution of pipelines of parallel computations. This is important to prevent inadvertent nesting of parallel computations as Haskell is a lazy language and we might otherwise feed a suspended (i.e., not yet evaluated) parallel computation into another parallel computation.

Parallel matrix multiplication

As a simple example of a parallel computation, consider the multiplication of two matrices arr and brr of type Array U DIM2 Double (two-dimensional, unboxed arrays), where type DIM2 = Z :. Int :. Int:

mmultP :: Monad m
       => Array U DIM2 Double
       -> Array U DIM2 Double
       -> m (Array U DIM2 Double)
mmultP arr brr
  = do trr <- transpose2P brr
       computeP (fromFunction (Z :. h1 :. w2) dotp)
  where
    (Z :. h1  :. _)  = extent arr
    (Z :. _   :. w2) = extent brr

    dotp ix = sumAllS $
                zipWith (*)
                  (slice arr (Any :. (row ix) :. All))
                  (slice trr (Any :. (col ix) :. All))

We assume the existence of a helper function transpose2P, which transposes a matrix in parallel — for example, by using Repa’s backpermute function. Then, we generate the manifest result array by computing all elements of fromFunction (Z :. h1 :. w2) dotpin parallel with computeP. The shape (i.e., the size of the dimensions) of the result is h1 times w2, and fromFunction turns a function, which takes an array index to the corresponding array element , into a delayed array:

fromFunction :: sh -> (sh -> a) -> Array D sh a

At each index ix of the resulting array, we evaluate dotp, which only involves a sequential computation. It’s sequential nature is important for two reasons. Firstly, as mentioned, Repa does not support nested parallelism, so the computations on each result array index triggered by computeP in parallel may themselves not be parallel. Secondly, the work complexity of matrix multiplication is n^3 — that is the number of scalar multiplications that need to be performed. Performing them all in parallel would lead to (a) too much and (b) too fine-grained parallelism. Both too much parallelism and parallel workloads that are each too little work lead to bad performance as they result in too much administrative overhead.

In contrast, the sequential computation performed by dotp obtains a row of the matrix arr and a column of brr (actually, a row of the transposed brr, which is trr) with slice, which extracts an entire subarray from an array. Then, it multiples the row and column pointwise with zipWith (*) and sums up the products with sumAllS, where

zipWith :: (Shape sh, Source r1 a, Source r2 b)
        => (a -> b -> c) -> Array r1 sh a -> Array r2 sh b -> Array D sh c
sumAllS :: (Shape sh, Source r a, Num a) => Array r sh a -> a

This example highlights how reasoning about the decomposition of an algorithm into parallel and sequential components is crucial for good parallel performance. This is assisted by Repa’s clear distinction between sequential and parallel operations.

About the author

Manuel M T Chakravarty

If you enjoyed this article, you might be interested in joining the Tweag team.

This article is licensed under a Creative Commons Attribution 4.0 International license.

← Nix on the Windows Subsystem for Linux Making two garbage collectors be good neighbours (using linear types) →