Here You See the Small Porcupine Perched in Its Tree, Preparing and Crunching Some Data with Me

30 October 2019 — by Yves Parès

Porcupines are large rodents with coats of sharp spines, or quills, that protect them against predators. The term covers two families of animals: the Old World porcupines of family Hystricidae, and the New World porcupines of family Erethizontidae.

The New World porcupines […] live in wooded areas and can climb trees, where some species spend their entire lives.

Porcupines have a relatively high longevity and had held the record for being the longest-living rodent, with one individual living to 27 years, until the record was broken in 2002 by a naked mole-rat living to 28 years. (source)

So, long story short, Porcupine is a framework aimed at making long-lived (in the sense that they are robust, shareable, and reusable), portable and customizable data pipelines. That is, Porcupine provides a team writing a data-hungry analytics application the tools to write a directed graph of tasks, where each task can depend on a tree of resources (think “a filesystem tree”). Porcupine does this while remaining oblivious of the actual physical locations of these resources, of their format, and of the protocols used to address them (that’s the “portable” part), and can expose any configuration parameter to the outside world (that’s the “customizable” part).

Porcupine provides tools for three different professions: software developers, data scientists, and DevOps engineers. But by itself, porcupine-core (the main package) is neither a data science nor a DevOps tool. Instead, it implements a central, principled basis for data applications to be used by people with these different skill sets. It unites their worlds by cleanly separating them: the software developer can stay focused on data serialization and managing different data sources, the data scientist can stay focused on analytics, while the DevOps engineer can see, administer, and modify the entire pipeline’s configuration and inputs/outputs.

I’m going to focus on the developer part of the API in this blog post. For a broader overview of Porcupine (notably from an analyst’s point of view), you can watch my talk at the latest Haskell Exchange.

Abstracting over data sources and sinks

Cleanly separating sources of data from the code that processes them while maintaining memory consumption guarantees is something that has already been extensively studied in Haskell, including on this very blog. Porcupine relies on streaming quite extensively. So let’s see where these streams originate and end in Porcupine:

class ( MonadMask m, MonadIO m, TypedLocation (LocOf l) ) =>
    LocationAccessor m (l::Symbol) where
  data LocOf l :: *  -- Backend-specific resource identifier type
  locExists :: LocOf l -> m Bool
  writeBSS :: LocOf l -> BSS.ByteString m r -> m r  -- Write a stream of bytes to a location
  readBSS :: LocOf l -> m (BSS.ByteString m ())     -- Obtain a stream of bytes from a location

This LocationAccessor class is a slightly simplified version of the one actually in use, but it shows the main idea. The LocationAccessor is, in the end, a monad m, but since the same final monad stack can act as several location accessors, we also need this type-level String (the Symbol) to disambiguate which implementation we are targeting at a given moment. So each instance defines a new backend for reading and writing byte streams of data.
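
As a small illustration of how these methods fit together, here is a sketch of a helper (not part of porcupine's API, just written against the simplified class above) that streams the bytes behind one location into another, as long as both accessors run in the same monad m:

copyLoc :: (LocationAccessor m l, LocationAccessor m l')
        => LocOf l -> LocOf l' -> m ()
copyLoc src dst = do
  bytes <- readBSS src   -- a BSS.ByteString m (): bytes are pulled lazily, in m
  writeBSS dst bytes     -- consuming the stream writes the chunks as they arrive

Because LocOf is a data family (and therefore injective), the concrete LocOf l argument is enough to pick the right backend's implementation.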

This is the instance for local resources (i.e., local files):

instance (MonadResource m, MonadMask m) => LocationAccessor m "resource" where
  newtype LocOf "resource" = L URL
    deriving (ToJSON, TypedLocation)
  ...

Here we declare that "resource" is a LocationAccessor in any monad that can provide MonadResource. The LocOf type is a data family associated with the class. This gives every backend the capacity to declare its own resource identifier type as long as this type satisfies the TypedLocation constraint (mostly, that this type is representable in JSON and that it contains some notion of “filetype”, like an extension). A filepath-based or URL-based backend (like "resource") can just wrap the URL type provided by Porcupine. The other location accessors currently provided are "aws" (to read and write S3 objects) and "http", and other backends could be added in their specific packages in the future.
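
To make the shape of a backend concrete, here is a hypothetical (and deliberately useless) accessor that accepts any location, discards all writes, and reads back empty streams. Everything about it, starting with the "null" name, is made up for illustration; it simply mirrors the structure of the "resource" instance above (and needs the same language extensions), assuming the same newtype-deriving trick works for its wrapped URL:

-- Purely illustrative, not a real porcupine backend.
instance (MonadIO m, MonadMask m) => LocationAccessor m "null" where
  newtype LocOf "null" = NullLoc URL
    deriving (ToJSON, TypedLocation)
  locExists _ = pure True
  writeBSS _ stream = BSS.effects stream  -- run the stream's effects, discard the bytes
  readBSS _ = pure BSS.empty              -- always read back an empty stream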

Tasks

Let’s now take a little detour to talk about the tasks. The main type here is PTask, which is the frontier between the developer part of the API and that of the data scientist. Here’s how you could write a simple task in Porcupine:

oneTask :: (LogThrow m) => PTask m () ()
oneTask =
  loadData myInput >>> arr processData >>> writeData myOutput

PTask m i o means that our task runs on some base monad m, takes i as an input, and returns o as an output. The (>>>) operator sequentially composes two tasks, and arr lifts a pure function to a PTask. So oneTask fetches data from myInput, feeds it into processData, whose result is in turn fed to writeData to store the result in myOutput. The loadData and writeData calls are the ones which internally pull and write byte streams via the LocationAccessor typeclass.

The type of oneTask indicates that our task needs to run on a base monad that supports throwing exceptions and logging. Ultimately, it's the same m as the one we had in the instances of LocationAccessor: it is the final monad in which our application runs. However, you can notice that the constraints are not the same: our task doesn't have to care at all that, in the end, m needs to implement MonadResource and MonadMask if we want to access local files. Tasks and sources are decoupled.

The myInput and myOutput bits are called VirtualFiles. They are the frontier between the data scientist part of the API and that of the DevOps engineer. They identify and expose a resource to the outside world so that this resource can be mapped to an actual physical resource before the pipeline runs. Both myInput and myOutput have a virtual path decided by the user (say, for the sake of simplicity, "/myInput" and "/myOutput"), and they contain a collection of possible serialization and deserialization functions, which are selected at execution time depending on the type of file to which the VirtualFile is mapped.
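
To give an idea of what such declarations look like, here is a sketch of myInput and myOutput using the dataSource/dataSink helpers found in porcupine's examples (OverloadedStrings assumed). InputData and OutputData are made-up stand-ins for whatever types processData consumes and produces, and JSON serials are assumed for simplicity:

myInput :: DataSource InputData
myInput = dataSource ["myInput"]                  -- virtual path "/myInput"
                     (somePureDeserial JSONSerial)

myOutput :: DataSink OutputData
myOutput = dataSink ["myOutput"]                  -- virtual path "/myOutput"
                    (somePureSerial JSONSerial)

In a real pipeline, myInput would also carry a CSV deserialization function, so that the default mapping to a .csv file shown below works.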

We have shown here the sequential composition of tasks with (>>>), but you can also compose them in parallel with (***). You can also use arrow notation for syntactic sugar akin to monadic do notation.
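
For instance, a pipeline with two inputs could be written either way; myOtherInput and combineData are hypothetical, introduced just for this sketch:

-- Parallel composition with (***):
twoInputsTask :: (LogThrow m) => PTask m () ()
twoInputsTask =
      arr (\() -> ((), ()))                         -- duplicate the (trivial) input
  >>> (loadData myInput *** loadData myOtherInput)  -- load both sources in parallel
  >>> arr (uncurry combineData)
  >>> writeData myOutput

-- The same pipeline with arrow notation (requires the Arrows extension):
twoInputsTask' :: (LogThrow m) => PTask m () ()
twoInputsTask' = proc () -> do
  a <- loadData myInput      -< ()
  b <- loadData myOtherInput -< ()
  writeData myOutput         -< combineData a b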

Configuration and execution

So what happens when oneTask finally runs? This is done via a call to runPipelineTask:

main = runPipelineTask
       (FullConfig "simple-stuff" "myconfig.yaml" "." ())
       (baseContexts "")
       oneTask
       ()

We are not going to describe every parameter here, but this is where everything gets glued together. The baseContexts call is where the final monad is created, by composing together the various capabilities our application will need: it is where the implementations of logging and MonadResource come from. So it provides the "resource" location accessor, and you can stack extra location accessors like "http" or "aws" on top of it.
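
Stacking an extra accessor looks roughly like the following; the #http <-- useHTTP record syntax is taken from the porcupine-http examples in the repository and should be treated here as an assumption rather than a spec:

main = runPipelineTask
       (FullConfig "simple-stuff" "myconfig.yaml" "." ())
       (  #http <-- useHTTP   -- adds the "http" LocationAccessor (porcupine-http)
       :& baseContexts "")    -- logging, MonadResource and the "resource" accessor
       oneTask
       ()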

Once compiled, this pipeline can output its default configuration in the myconfig.yaml file. It looks like this:

locations:
  /: .
  /myInput: _.csv
  /myOutput: _.json

This gives you a view of the Porcupine Tree (the tree of resources) required by oneTask. Here you can see the virtual paths that we mentioned before, associated with mappings. runPipelineTask is able to expose these virtual paths to the outside world via this configuration file; no extra work is needed. For concision, mappings can start with an underscore token, in which case the physical resource's full path is derived from the mapping of the root virtual path and from the mapped virtual path itself. Here, the root virtual path is mapped to ".", the current working directory. So the configuration above tells the pipeline to look for myInput in ./myInput.csv, and to write myOutput to ./myOutput.json.
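
In other words, the underscore shorthand above resolves to the same physical locations as spelling the mappings out explicitly:

locations:
  /: .
  /myInput: ./myInput.csv
  /myOutput: ./myOutput.json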

So what happens if we change the configuration? For instance, if we say:

locations:
  /: .
  /myInput: http://someapi.org/api/v1/input.csv
  /myOutput: _.json

We mentioned that runPipelineTask puts together a stack of LocationAccessors, and that each of these accessors has an associated LocOf datatype which is parsable from JSON (or YAML). For every physical path or URL mapped to a virtual path in the YAML config, Porcupine goes through the stack of LocationAccessors and takes the first one that can parse this physical location. If some location cannot be parsed by any of the LocationAccessors in place, Porcupine stops with an error before trying to run anything. The same happens if a VirtualFile is mapped to a file type for which it doesn't know any serialization/deserialization function (say, if the end user mapped it to a *.json file but the only serialization function we know is for CSV). This way, you get an early failure with an error message that tells you which mapping is problematic, and your pipeline doesn't even start running. This detection is invaluable for long pipelines that run CPU-intensive operations for minutes or hours: without Porcupine's early failure, that computation time would be wasted should the program crash because of an invalid URL to write the result to.

Closing words

Porcupine is built on top of our funflow open source library. funflow is also an arrow-based framework that gives us cacheable tasks and a remote cache. Porcupine has been developed as part of a project funded by Novadiscovery, which uses it internally to express systems biology simulation pipelines.

porcupine-core is released at the same time as two companion libraries: reader-soup, which provides an automated ReaderT pattern and is the machinery runPipelineTask uses to create the Reader monad the application runs in; and docrecords, an extensible records library based on vinyl that lets us express records of parameters with docstrings and default values, which the pipeline can then expose to the outside world.

I hope you enjoyed this quick introduction to Porcupine’s API for developers. Please have a look at the examples on the GitHub repository, such as these for the "http" location accessor, and watch my Haskell Exchange talk to learn more. We have a Gitter for Porcupine’s users and developers where any issues or questions are welcome. Please hop in and say hi! :)

