In my functional programming course to Master Students of Telecom Nancy, I like to use parsing as an example of monadic programming, relying on the megaparsec library. My only concern with megaparsec is that its official tutorial is long: at the time I’m writing, it’s 15000 words long.
Unlike the official megaparsec tutorial, this blog post is intended to be smaller, and is aimed at an audience with only a basic understanding of Haskell and monadic programming.
All the Haskell material from this blogpost is available on our GitHub: https://github.com/tweag/minimal-megaparsec-tutorial. You can fork this repository to get a full-fledged setup (including CI and Haskell Language Server support) for experimenting with a megaparsec parser 🚀
Running example
My running example is a parser for a domain-specific language that I designed for the class. This language uses primitive drawing commands to represent ASCII art roguelike maps. It looks like this:
HLine 0 0 8; HLine 0 4 8; VLine 0 0 5; VLine 7 0 2; VLine 7 3 2
HLine 8 1 2; HLine 8 3 2
HLine 10 0 6; HLine 10 4 6; VLine 10 0 2; VLine 10 3 2; VLine 15 0 5
Start 2 1
Cell 13 3 ~; Cell 14 3 ~; Cell 14 2 ~
Here, HLine x y len
and VLine x y len
draw horizontal and vertical walls respectively.
The Start x y
command marks the player’s starting point and Cell x y ~
places special terrain.
Roguelike maps typically consist of rectangular rooms and connecting corridors, where
walls are shown as #
, water as ~
, and walkable spaces as dots (.
)
For example, the snippet above draws a map with two connected rooms.
The room on the left contains the player’s start location (>
), while some water appears in the lower right corner of the room on the right:
######## ######
#.>....####....#
#.............~#
#......####..~~#
######## ######
Walkable floor cells are omitted from the domain-specific language, as they can be inferred by computing the set of cells reachable from the starting point. In implementations of roguelikes, maps like this one are translated into an array of arrays of symbols, with some symbols being walkable (e.g. dot cells and water cells) and some symbols being blockers (walls). The top-level array is then used to compute possible moves and collisions.
The Parsec
monad
To use megaparsec, we define our main monad type using the Parsec e s a
type.
It has three arguments:
- The type of errors returned by the parser,
- the type of stream accepted as input by the parser, and
- the type of data returned upon successful parsing of an input stream.
For a simple parser, we define:
- The error type to be
Text
, for simplicity. In a production parser, you would use a structured error type, that distinguishes the different error cases; so that you can handle them differently. - The input stream to be
Text
, because this is the most idiomatic choice in the Haskell ecosystem:
import Data.Text (Text)
import Text.Megaparsec
type Error = Text
type Input = Text
-- | @Parser a@ is a parser that accepts @Text@ as input and returns an @a@ upon
-- successful parsing.
type Parser a = Parsec Error Input a
Our first parser
Parsers are built from
primitive combinators (e.g. lookAhead
, notFollowedBy
, end of file eof
)
and combinators derived from them (e.g. oneOf
, anySingle
, satisfy
).
These combinators are designed to consume a few symbols, not complex structures (more on this later).
Combinators return parsers in any MonadParsec
monad,
which means that they have a signature where the head is MonadParsec e s m => ...
and the return type is of the form m a
1.
In our context, it suffices to know that m a
is instantiated to Parser a
, so we can
use these combinators for our parsers.
Let’s parse the different kinds of symbols we usually find in ASCII art roguelike maps,
using the anySingle
function, which parses a single token. In our case, since the input type
is Text
, the type of tokens is Char
(see the ShareInput
case of
Stream’s documentation,
as well as the instances of Stream
):
-- | A symbol in the map of an ASCII roguelike
data Symbol
= -- | A wall, depicted by a # character
Wall
| -- | A water cell, depicted by a ~ character
Water
deriving (Eq, Show)
-- | A parser for the symbol of a single cell. Used in 'parseElement' below.
parseSymbol :: Parser Symbol
parseSymbol = do
c <- anySingle
case c of
'#' -> return Wall
'~' -> return Water
_ -> fail $ "Unknown symbol: " <> [c] -- See below for how to avoid this case altogether (in parseLineElement)
Parser combinators
By virtue of MonadParsecs
being monads, parsers can be built using functions that are common in monadic Haskell code
(including functions from Functor
, Applicative
, etc.). Let’s demonstrate this
to build a parser for more advanced roguelike map constructs:
data Element
= -- | Horizontal wall, starting at @(x,y)@ with @length@ cells (ending at @(x+length-1,y)@)
HorizontalLine Int Int Int
| -- | Vertical wall, starting at @(x,y)@ with @length@ cells (ending at @(x,y+length-1)@)
VerticalLine Int Int Int
| -- | A cell at @(x,y)@ with a symbol
Cell Int Int Symbol
| -- | The starting point of the player
Start Int Int
deriving (Eq, Show)
The parser for the HorizontalLine
and VerticalLine
cases can be written as follows:
import Control.Monad (void)
import Control.Monad.Extra (when)
import Text.Megaparsec.Char
import Text.Megaparsec.Char.Lexer
parseLineElement :: Parser Element
parseLineElement = do
constructor <- choice [string "HLine" >> return HorizontalLine, string "VLine" >> return VerticalLine]
space1 -- One or more space
x <- decimal
space1
y <- decimal
space1
len <- decimal
when (len < 1) $ fail $ "Length must be greater than 0, but got " <> show len
return $ constructor x y len
The first two lines either parse the string HLine
or the string VLine
and use the
choice
function to encode the two possibilities. Also, because each line in a do
block encodes a step
in the computation, writing monadic parsers is natural: each line consumes some of the input,
until enough is consumed to return the desired value. Another example of using a regular monadic function
is to use when
to stop parsing when an incorrect value is consumed.
Running parsers
Since our parser takes Text
as input, it can be tested in a pure context.
Megaparsec provides the runParser
function for this. To be able to print errors of our parser, our error type must be an instance of ShowErrorComponent
;
and then we can define a convenient runMyParser
function that returns either an error or the parsed value:
import Data.Text (pack, unpack)
-- | Instance required for 'runMyParser'
instance ShowErrorComponent Error where
showErrorComponent = unpack
-- | A variant of megaparsec's 'runParser', instantiated to our context.
-- Successfully parses an @a@ or returns an error message.
runMyParser :: Parser a -> Input -> Either Text a
runMyParser parser input =
case runParser parser "" input of
Left err -> Left $ pack $ errorBundlePretty err
Right x -> Right x
Parsing expressions, lists, etc.
Megaparsec not only provides building blocks for parsing tokens and combining parsers. It also provides parsers for common constructs found in programming languages and domain-specific languages, such as expressions and lists. Megaparsec does this by relying on the parser-combinators package.
I don’t want to go into the details of parsing expressions here (e.g. parsing 1 + 2 - 3
…), but let me emphasize that it is
a bad idea to write your own expression parser. Instead, think about what kind of operators you need and encode them, using the
Operator
type.
List parsing, on the other hand, is done with various sep… functions. In our case of roguelike maps, we allow different elements to be separated by a semicolon, or by one or more newlines. This is encoded as follows:
parseElements :: Parser [Element]
parseElements = parseElement `sepBy1` separator
where
separator = do
hspace -- Optional horizontal (non-newline) space
choice [void $ char ';', void $ some eol] -- Either a single ';' or many newlines
hspace
parseElement :: Parse Element
parseElement = choice [parseLineElement, parseStart, parseCell]
where
parseStart = do
void $ string "Start"
space1
(x, y) <- parseCoord
return $ Start x y
parseCell = do
void $ string "Cell"
space1
(x, y) <- parseCoord
space1
symbol <- parseSymbol
return $ Cell x y symbol
parseCoord = do
x <- decimal
space1
y <- decimal
return (x, y)
Conclusion
We’ve presented how to parse simple constructs using megaparsec
and how to run our parsers.
This blog post is less than 1500 words long: mission accomplished presenting megaparsec in a shorter way than the official tutorial 🥳
If you want to use the code from this blog post as a starting point, feel free to clone https://github.com/tweag/minimal-megaparsec-tutorial. And once your project is moving away from a minimal viable product, head over to megaparsec’s official tutorial to learn about more advanced ways to use megaparsec!
- This is an instance of the monad transformer pattern.↩
About the author
Clément is a Director of Engineering, leading the Build Systems department. He studied Computer Science at Telecom Nancy and received his PhD from Université Nice Sophia Antipolis, where he proved multithreaded programs using linear logic. His technical background includes functional programming, compilers, provers, distributed systems, and build systems.
If you enjoyed this article, you might be interested in joining the Tweag team.