Parser combinators are one of the prides of the Haskell community. They’re a craft that we continue to polish to this day. Yet, there’s something unsatisfactory about parser combinators.
See, when I write a parser, I frequently write a pretty-printer as well1, and the pretty-printer is almost the same as the parser. This makes maintenance harder, if only because parser and pretty-printer have to be kept in sync. Besides, it simply feels like unnecessary duplication.
This blog post is the story of the latest developments in the quest for more general grammar combinators—or as Mathieu Boespflug and I have been calling them, format descriptors—and how it led me to publish a new library, called Pup. For further reading, you can also check the paper that Mathieu and I wrote about it for Olivier Danvy’s festschrift, held at the ICFP/SPLASH conference in Singapore last October.
A library of format descriptors
The Pup library lets you write format descriptors in a familiar style. In fact, Pup uses existing parser combinator libraries as a backend. But in addition to parsing, these format descriptors let you pretty-print the code (using an existing pretty-printing library as a backend).
An example is better than a long explanation, so let me show you a format descriptor for simplified S-expressions. It lets you parse expressions such as this one:
((a-fun 57 :tag) "a string" (list 1 "somewhat" :long "list of mixed types"
(list "at least" 3) (list "but maybe" :more)))and then pretty-print the result (or any abstract syntax tree, really), which looks like this:
((a-fun 57 :tag)
"a string"
(list
1
"somewhat"
:long
"list of mixed types"
(list "at least" 3)
(list "but maybe" :more)))I didn’t implement any parsing or printing algorithm. Parsing is done by Megaparsec and pretty-printing by Prettyprinter. These are the only backends that I have implemented so far—it’s a new library, but the library is written to be easily extensible to other parsing and pretty-printing backends. What Pup gives you is the surface syntax of the format descriptor DSL.
So here’s our example format descriptor. In the rest of the blog post I’ll
explain what’s going on (in particular why there are those labels like
#SSymb and #":" rather than constructors SSymb and (:)):
sexpr =
group (nest 2 (#SList <* try ("(") <* space <*> try sexpr `sepBy` space1 <* space <* ")"))
<|> #SSymb <*> try symbol
<|> #SInt <*> try nat
<|> #SStr <* try ("\"") <*> takeWhileP Nothing (/= '"') <* "\""
where
symbol = #":" <*> symbol_lead <*> many (try symbol_other)
symbol_lead = oneOf (':' : ['a' .. 'z'] ++ ['A' .. 'Z'])
symbol_other = oneOf (['a' .. 'z'] ++ ['A' .. 'Z'] ++ ['0' .. '9'] ++ ['-'])
data SExpr
= SList [SExpr]
| SSymb String
| SStr Text
| SInt Int
deriving (Generic, Show, Eq)
-- For completeness parse the example string, then pretty-print the result
do
let str =
-- Did you know that GHC has multi-line strings now?
"""
((a-fun 57 :tag) "a string" (list 1 "somewhat" :long "list of mixed types"
(list "at least" 3) (list "but maybe" :more)))
"""
case parse sexpr "<test>" str of
Left e -> putStrLn e
Right expr -> do
let doc = Maybe.fromJust (Pup.print sexpr expr)
let str' = Prettyprinter.renderStrict $ Prettyprinter.layoutPretty Prettyprinter.defaultLayoutOptions doc
putStrLn $ Text.unpack str'The sexpr format descriptor is almost what you would write with Megaparsec.
There are two main differences
- The second line reads
#SSymb <*> try symbol. In Megaparsec you would seeSymb <$> try symbol. We call#SSymba lead. It is in charge not only of building anSSymbwhen parsing, but also the reading out of anSSymbwhen printing. This is also why we have to use the applicative<*>instead of the functorial<$>. The#SSymblead is automatically generated because our type implementsGeneric. - There are pretty-printing combinators
group (nest 2 (…)). These are Wadler-Leijen type combinators as implemented in the Prettyprinter library. They’re responsible for breaking lines and inserting indentation. You’ll need some of these since a standard grammar can tell you what can be parsed, but not what is pretty.
Other than that, if you’re familiar with Megaparsec, you should feel right at
home. You even have access to the monadic >>= if you need it (and yes, it
pretty-prints).
A word of caution: the Pup library helps you to make sure that the output of the pretty-printer can be reparsed to the same AST, but it doesn’t guarantee it. It’s probably unreasonable to hope for parsers and pretty-printers to be inverses by construction with only Haskell’s type system (barring some very drastic restrictions on possible parsers).
With that said. I think the library is cool, and you can go use it today! I hope you enjoy it, and please let me know if you’re missing anything. And if you want to know more about what makes the library tick: read on.
Tuple troubles
The type of the sexpr format descriptor is maybe not what you’re expecting:
sexpr :: Pup' (SExpr -> r) r SExprWhy are there three type arguments rather than one? What is this r type
variable about? These are where Pup’s secret sauce lies.
Fundamentally, a parser for the type a is something of the form String -> a,
and a printer something of the form a -> String. Parsers are covariant
functors and printers are contravariant functors. So our bidirectional format
descriptor should be something like P0 a = (a -> String, String -> a), which
is neither a covariant nor a contravariant functor. The only map-like
transformation that P0 supports is something like
isomap :: (a -> b, b -> a) -> P0 a -> P0 bWe also need something to chain parsers. To that effect, we can
introduce an andThen combinator (we’d really need to enrich the P0 type to be
able to do that, but let’s pretend):
andThen :: P0 a -> P0 b -> P0 (a, b)Together this adds up to a rather so-so experience.
p0 :: P0 (A, (B, (C, D)))
p0 = foo `andThen` bar `andThen` baz `andThen` buzz
p0' :: P0 (A, B, C, D)
p0' = isomap (\(a, (b, (c, d))) -> (a, b, c, d), \(a, b, c, d) -> (a, (b, (c, d)))) p0Part of what makes this style uncomfortable is the deep nesting of product. And when you get to reshuffle them, due to the need of passing two functions at the same time, it adds a lot of visual boilerplate. This boilerplate would probably dominate the parsing logic. This is what in our paper we call “tuple troubles”.
In the case of covariant functors, there’s a ready-made solution: use an
applicative functor interface. It’s equivalent to pairing, it just works much
better in Haskell. But P0 isn’t a functor, and there’s really little we can do
to improve on this interface.
In order to address these shortcomings let’s use a classic technique and decorrelate the type of what is parsed from the type of what is printed2.
type P1 a b = (a -> String, String -> b)A P1 a b can parse a b and print an a. Of course we are generally
interested in using a P1 a a, but we now have the ability to locally change
one and not the other. So that we can use:
-- If we know how to parse a `b` we can parse a `c`
rmap :: (b -> c) -> P1 a b -> P1 a c
-- If we know how to print an `a` we can print a `d`
lmap :: (d -> a) -> P1 a b -> P1 d bIn Haskell parlance, P1 is a profunctor.
And now, since P1 a is a functor for any a, we can also equip it with an
applicative functor structure. The paper Composing Bidirectional Programs
Monadically even shows how to equip P1 a with a monadic structure. Pup
uses their technique for monadic parsing. Now our example reads as follows:
p1 :: P1 (A, B, C, D) (A, B, C, D)
p1 = (,,,) <$> lmap fst foo <*> lmap snd bar <*> lmap thd baz <*> lmap fth buzzThis looks much more reasonable, the expression has the familiar structure of
applicative expressions. But the actual parsing logic is still drowned in all
these lmap (and I’ve been quite generous here, as I’ve invented projections
fst…fth for 4-tuples). We really need a better way to combine printers.
Indexing with a stack
To improve on the profunctor style, we need a better way to combine the contravariant side.
The style I propose, in the Pup library, looks like the following
p2 :: P2 (A -> B -> C -> D -> r) r (A, B, C, D)
p2 = (,,,) <$> foo <*> bar <*> baz <*> buzzI’ll come back to the definition of P2 in a moment. But in the meantime, here’s
how I want you to read the type:
- In some descriptor
descr :: P2 r r' a,ais the return type, andrandr'represent a stack of arguments. Withrrepresenting the shape of the stack beforedescr, andr'the state of the stack afterdescr. - A stack of the form
A -> B -> C -> D -> ris a stack with at least four elements, and whose topmost elements are of typeA,B,C, andDin that order. The rest of the stack is unknown, which is represented by a type variabler. - The format descriptor
foo, for instance, has an argument of typeA(used when printing). To obtain this argument,foopops the topmost element of the argument stack, sofoohas typefoo :: P2 (A -> r) r A.
This usage pattern of passing a different argument to each format descriptor in a sequence, as how it’s naturally done by a stack, appears to be by far the most common in format descriptors, so Pup optimises for it.
You might have noticed the catch, though: the type arguments r and r' vary
during a sequence of actions. This means that P2 isn’t quite an applicative
functor and <*>, above, isn’t the combinator we’re used to. Instead P2 is
an indexed applicative functor (and possibly an indexed monad too).
The type of <*> for indexed applicatives f is:
(<*>) :: f r r' (a -> b) -> f r' r'' a -> f r r'' bOn the other hand, <$> is the usual <$> from the Functor type class.
Specialised to an indexed applicative f, its type is:
(<$>) :: (a -> b) -> f r r' a -> f r r' bWith this established, we can see how the type of individual syntax descriptors can be sequenced:
foo :: P2 (A -> r1) r1 A
bar :: P2 (B -> r2) r2 B
-- `r1` is universally quantified
-- so we can unify `r1` with `B -> r2`
q :: P2 (A -> B -> r2) r2 (A, B)
q :: (,) <$> foo <*> barSequencing an action which pops an A and an action which pops a B yields an
action which pops both an A and a B. Precisely what we needed. Notice how,
crucially, r1 in foo and r2 in bar are universally quantified variables,
so that they can be refined when a subsequent value requires a value from the
stack.
The existing support for indexed applicatives and monads isn’t very good (mostly
it consists of the somewhat minimalistic and pretty dusty indexed
library). This is why I’m also releasing a new, modern,
indexed-applicative-and-monads companion library Stacked, which makes use of
modern GHC facilities such as quantified constraints and qualified do. Pup is
built on top of Stacked.
The last piece of the puzzle is what we’ve been calling “leads”, which are
actions which neither print nor parse anything, but serve as a bidirectional
version of constructors. For types which implement the Generic type class,
leads are derived automatically. And with that we can complete our example:
#"(,,,)" :: P2 ((a, b, c, d) -> r) (a -> b -> c -> d -> r) (a -> b -> c -> d -> (a, b, c, d))
p2' :: P2 ((A, B, C, D) -> r) r (A, B, C, D)
p2' = #"(,,,)" <*> foo <*> bar <*> baz <*> buzzThis uses the OverloadedLabels extension. Equivalently,
you can use lead @"(,,,)" instead of #"(,,,)".
A note on divisible
There’s an existing structure on contravariant functors which is analogous to
applicative functors on covariant functors: the Divisible.
However, it doesn’t help us with our goal in this section, since the divisible
structure is already used implicitly in P1’s instance of Applicative.
In fact, here’s a theorem (the proof is left as an exercise to the reader): let d and f be a
divisible and an applicative functor, respectively, then t a b = (d a, f b) is
applicative (with respect to b).
Implemented with continuations
This concludes the motivation behind the library’s design. You can treat the fact
that we use the function arrow (->) as a stack constructor as an
implementation detail. But if you’re interested, let’s now get into how we can
possibly implement P2.
The idea goes back to Olivier Danvy’s Functional
unparsing, where Danvy presents a straightforward implementation of
combinators for printf-like format strings using continuations. The idea was
that just like the original printf, all the arguments to be formatted were just
additional argument passed to the printf function. This is the technique we
used. With that in mind, P2 can be defined as:
type P2 r1 r2 b = ((String -> r2) -> r1, String -> b)
-- Compare:
-- P2 (a -> r) r b = ((String -> r) -> a -> r, String -> b)
-- P1 a b = (a -> String, String -> b)With this definition, we naturally have a stack defined in terms.
Final touches
Now, this isn’t the definition of the real Pup' type. This definition of P2
can’t deal with data types with several constructors. To deal with type
constructors, we need a way for parsers to signal failures and for failures to be handled. And,
maybe surprisingly, we need a way for printers to signal failures, and for
those failures to be caught. Even though we typically think of pretty-printers as
total, we are going to build them piecewise, constructor by constructor, and
those partial printers can fail (another way to think about it is that a format
descriptor for a parser which can only parse into some of the constructors of a
type, can only print from those same constructors).
Failures are handled by the (<|>) combinator which we can see in the
S-expression example. These are reminiscent of the Alternative and MonadPlus
classes from the Base library. But because we are dealing with indexed
applicatives and monads, it’s a different combinator, which is provided, in
the Stacked library, by the Additive class:
class Additive a where
empty :: a
(<|>) :: a -> a -> aThe standard way to add failures in Haskell is the Maybe monad. We can, in
fact, use the Maybe monad in the parser to signal failure (in the actual Pup
library, we simply reuse the failure mechanism from the parsing backend). But it
doesn’t work for the pretty-printer, as the Maybe monad would interfere with the
continuation-based stacks (this is explained in more details in §5.2 of the
paper).
As it turns out, it would seem that the only way to add failures and handlers to the pretty-printer is to add a second continuation. So our type will look like this:
type P3 r1 r2 b = ((String -> r2 -> r2) -> r1 -> r1, String -> Maybe b)That’s about it. As you can see, the types get pretty hairy. But it’s kept decently under control by defining the right abstractions (of course, indexed applicatives and monads are such abstractions, but we also propose some new ones). In particular, the most central abstraction in Pup’s design is the Stacked type class. All this is explained in our paper (spoiler, because I’m quite happy with this one: it features comonads).
In conclusion
Of course Pup is pretty fresh, and as such there will be some rough edges. But I think it’s already quite useful. I expect it to be straightforward to convert a parser implemented using Megaparsec to Pup. So if you have one of those, you should consider it.
Alternatively, you can check Mathieu’s proof-of-concept library cassette, whose current design is described in our paper as well. Cassette’s format descriptors combine better than Pup’s. But cassette doesn’t support monadic parsing, or using existing parsers (yet?).
But really, the techniques that we’ve developed for Pup, ostensibly for parsing and pretty-printing format descriptors, can presumably be used in many bidirectional programming situations. In fact, I assume that you can often replace profunctors by stacked applicatives. It’s just completely unexplored. I’m curious to hear your thoughts about that.
-
A pretty-printer, by the way, is not the same thing as a formatter. A pretty-printer takes an abstract syntax tree and turns it into something human-readable. A formatter takes text written by a human and normalises it according to some styling rule. This blog post is about pretty-printing not formatting. For our take on formatting see Ormolu and Topiary.↩
-
Here are some Haskell libraries which use this decoupling trick
- tomland a bidirectional Toml parsing library
- autodocodec a bidirectional Json parsing library
- distributors a generic bidirectional parsing library
The same idea can even be found in pure mathematics in the concept of dinatural transformation.↩
Behind the scenes
Arnaud is Tweag's former head of R&D and former blog chief editor. Currently based in Tokyo, Japan, he shares his time at Modus between promoting open source as a school of software engineering and his research on programming languages (very much including linear types).
If you enjoyed this article, you might be interested in joining the Tweag team.