Deciphering Haskell's applicative and monadic parsers

This post follows the construction of parsers described in Graham Hutton's "Programming in Haskell" (2nd edition). It's my attempt to work through chapter 13 in this book and understand the details of applicative and monadic combination of parsers presented therein.

Basic definitions for the Parser type

A parser parameterized on some type a is:

newtype Parser a = P (String -> [(a,String)])

It's a function taking a String and returning a list of (a,String) pairs, where a is a value of the parameterized type and String is (by convention) the unparsed remainder of the input. The returned list is potentially empty, which signals a failure in parsing [1]. It might have made more sense to define Parser as a type alias for the function, but types can't be made into instances of typeclasses; therefore, we use netwype with a dummy constructor named P.

With this Parser type, the act of actually parsing a string is expressed with the following helper function. It's not strictly necessary, but it helps make code cleaner by hiding P from users of the parser.

parse :: Parser a -> String -> [(a,String)]
parse (P p) inp = p inp

The most basic parsing primitive plucks off the first character from a given string:

item :: Parser Char
item = P (\inp -> case inp of
                    []      -> []
                    (x:xs)  -> [(x,xs)])

Here's how it works in practice:

> parse item "foo"
[('f',"oo")]
> parse item "f"
[('f',"")]
> parse item ""
[]

Parser as a Functor

We'll start by making Parser an instance of Functor:

instance Functor Parser where
  -- fmap :: (a -> b) -> Parser a -> Parser b
  fmap g p = P (\inp -> case parse p inp of
                          []        -> []
                          [(v,out)] -> [(g v,out)])

With fmap we can create a new parser from an existing parser, with a function applied to the parser's output. For example:

> parse (fmap toUpper item) "foo"
[('F',"oo")]
> parse (fmap toUpper item) ""
[]

Let's check that the functor laws work for this definition. The first law:

fmap id = id

Is fairly obvious when we substitute id for g in the definition of fmap. We get:

fmap id p = P (\inp -> case parse p inp of
                        []        -> []
                        [(v,out)] -> [(id v,out)])

Which takes the parse result of p and passes it through without modification. In other words, it's equivalent to p itself, and hence the first law holds.

Verifying the second law:

fmap (g . h) = fmap g . fmap h

... is similarly straightforward and is left as an exercise to the reader.

While it's not obvious why a Functor instance for Parser is useful in its own right, it's actually required to make Parser into an Applicative, and also when combining parsers using applicative style.

Parser as an Applicative

Consider parsing conditional expressions in a fictional language:

if <expr> then <expr> else <expr>

To parse such expressions we'd like to say:

Parse the token if
Parse an <expr>
Parse the token then
Parse an <expr>
Parse the token else
Parse an <expr>
If all of this was successful, combine all the parsed expressions into some sort of result, like an AST node.

Such sequences, along with alternation (an expression is either this or that) are two of the critical basic blocks of constructing non-trivial parsers. Let's see a popular way to accomplish this in Haskell (for a complete example demonstrating how to construct a parser for this particular conditional expression, see the last section in this post).

Parser combinators is a popular technique for constructing complex parsers from simpler parsers, by means of higher-order functions. In Haskell, one of the ways in which parsers can be elegantly combined is using applicative style. Here's the Applicative instance for Parser.

instance Applicative Parser where
  -- pure :: a -> Parser a
  pure v = P (\inp -> [(v,inp)])

  -- <*> :: Parser (a -> b) -> Parser a -> Parser b
  pg <*> px = P (\inp -> case parse pg inp of
                            []        -> []
                            [(g,out)] -> parse (fmap g px) out)

Recall how we created a parser that applied toUpper to its result using fmap? We can now do the same in applicative style:

> parse (pure toUpper <*> item) "foo"
[('F',"oo")]

Let's see why this works. While not too exciting on its own, this application of a single-argument function is a good segue to more complicated use cases.

Looking at the Applicative instance, pure toUpper translates to P (\inp -> [(toUpper,inp)] - a parser that passes its input through unchanged, returning toUpper as a result. Now, substituting item into the definition of <*> we get:

pg <*> item = P (\inp -> case parse pg inp of
                            []        -> []
                            [(g,out)] -> parse (fmap g item) out)

... pg is (pure toUpper), the parsing of which always succeeds, returning
    [(toUpper,inp)]

pg <*> item = P (\inp -> parse (fmap toUpper item) inp)

In other words, this is exactly the example we had for Functor by fmap-ing toUpper onto item.

The more interesting case is applying functions with multiple parameters. Here's how we define a parser that parses three items from the input, dropping the middle result:

dropMiddle :: Parser (Char,Char)
dropMiddle =
  pure selector <*> item <*> item <*> item
  where selector x y z = (x,z)

Following the application of nested <*> operators is tricky because it builds a run-time chain of functions referring to other functions. This chain is only collapsed when the parser is used to actually parse some input, so it is necessary to keep a lot of context "on the fly". To better understand how this works, we can break the definition of dropMiddle into parts as follows (since <*> is left-associative):

dropMiddle =
  ((pure selector <*> item) <*> item) <*> item
  where selector x y z = (x,z)

Applying the first <*>:

pg <*> item = P (\inp -> case parse pg inp of
                            []        -> []
                            [(g,out)] -> parse (fmap g item) out)

... pg is (pure selector), the parsing of which always succeeds, returning
    [(selector,inp)]

pg <*> item = P (\inp -> parse (fmap selector item) inp)  --= app1

Let's call this parser app1 and apply the second <*> in the sequence.

app1 <*> item = P (\inp -> case parse app1 inp of
                            []        -> []
                            [(g,out)] -> parse (fmap g item) out)  --= app2

We'll call this app2 and move on. Similarly, applying the third <*> in the sequence produces:

app2 <*> item = P (\inp -> case parse app2 inp of
                            []        -> []
                            [(g,out)] -> parse (fmap g item) out)

This is dropMiddle. It's a chain of parsers expressed as a compbination of higher-order functions (closures, actually).

To see how this combined parser actually parses input, let's trace through the execution of:

> parse dropMiddle "pumpkin"
[(('p','m'),"pkin")]

dropMiddle is app2 <*> item, so we have:

-- parse dropMiddle

parse P (\inp -> case parse app2 inp of
                   []         -> []
                   [(g,out)]  -> parse (fmap g item) out)
      "pumpkin"

.. substituting "pumpkin" into inp

case parse app2 "pumpkin" of
 []         -> []
 [(g,out)]  -> parse (fmap g item) out

Now parse app2 "pumpkin" is going to be invoked; app2 is app1 <*> item:

-- parse app2

case parse app1 "pumpkin" of
 []         -> []
 [(g,out)]  -> parse (fmap g item) out

Similarly, we get to parse app1 "pumpkin":

-- parse app1

parse (fmap selector item) "pumpkin"

.. following the definition of fmap

parse P (\inp -> case parse item inp of
                  []        -> []
                  [(v,out)] -> [(selector v,out)])
      "pumpkin"

.. Since (parse item "pumpkin") returns [('p',"umpkin")], we get:

[(selector 'p',"umpkin")]

Now going back to parse app2, knowing what parse app1 "pumpkin" returns:

parse (fmap (selector 'p') item) "umpkin"

.. following the definition of fmap

parse P (\inp -> case parse item inp of
                  []        -> []
                  [(v,out)] -> [(selector 'p' v,out)])
      "umpkin"

[(selector 'p' 'u',"mpkin")]

Finally, dropMiddle:

app2 <*> item = P (\inp -> case parse app2 inp of
                            []        -> []
                            [(g,out)] -> parse (fmap g item) out)

.. Since (parse app2 "pumpkin") returns [(selector 'p' 'u',"mpkin")]

parse (fmap (selector 'p' "u") item) "mpkin"

.. If we follow the definition of fmap again, we'll get:

[(selector 'p' 'u' 'm',"pkin")]

This is the final result of applying dropMiddle to "pumpkin", and when selector is invoked we get [(('p','m'),"pkin")], as expected.

Parser as a Monad

Parsers can also be expressed and combined using monadic style. Here's the Monad instance for Parser:

instance Monad Parser where
  -- return :: a -> Parser a
  return = pure

  -- (>>=) :: Parser a -> (a -> Parser b) -> Parser b
  p >>= f = P (\inp -> case parse p inp of
                          []        -> []
                          [(v,out)] -> parse (f v) out)

Let's take the simple example of applying toUpper to item again, this time using monadic operators:

> parse (item >>= (\x -> return $ toUpper x)) "foo"
[('F',"oo")]

Substituting in the definition of >>=:

item >>= (\x -> return $ toUpper x) =
  P (\inp -> case parse item inp of
                []        -> []
                [(v,out)] -> parse (return $ toUpper v) out)

... if item succeeds, this is a parser that will always succeed with
    the upper-cased result of item

When writing in monadic style, however, we won't typically be using the >>= operator explicitly; instead, we'll use the do notation. Recall that in the general multi-parameter case, this:

m1 >>= \x1 ->
  m2 >>= \x2 ->
    ...
      mn >>= \xn -> f x1 x2 ... xn

Is equivalent to this:

do x1 <- m1
   x2 <- m2
   ...
   xn <- mn
   f x1 x2 ... xn

So we can also rewrite our example as:

> parse (do x <- item; return $ toUpper x) "foo"
[('F',"oo")]

The do notation starts looking much more attractive for multiple parameters, however. Here's dropMiddle in monadic style written directly [2]:

dropMiddleM :: Parser (Char,Char)
dropMiddleM = item >>= \x ->
                item >>= \_ ->
                  item >>= \z -> return (x,z)

And now rewritten using do:

dropMiddleM' :: Parser (Char,Char)
dropMiddleM' =
  do  x <- item
      item
      z <- item
      return (x,z)

Let's do a detailed breakdown of what's happening here to better understand the monadic sequencing mechanics. I'll be using the direct style (dropMiddleM) to unravel the applications of >>=:

item >>= \x ->
  item >>= \_ ->
    item >>= \z -> return (x,z)

.. applying the first >>=, calling the right-hand side rhsX

P (\inp -> case parse item inp of
              []        -> []
              [(v,out)] -> parse (rhsX v) out)

.. the result of parsing the first item is passed in as the argument to rhsX,
   which then returns the next application of >>=; As usual, we acknowledge
   the error propagation and ignore it for simplicity.

P (\inp -> case parse item inp of
              []        -> []
              [(v,out)] -> parse (rhsY v) out)

... and similarly for rhsZ; the final result is invoking "parse return (x,z)"
    where x is the result of parsing the first item and z the result of
    parsing the third.

A complete example

As a complete example, I've expanded the parser grammar found in the book to support conditional expressions. The full example is available here. Recall that wa want to parse expressions of the form:

if <expr> then <expr> else <expr>

This is the monadic parser [3]:

ifexpr :: Parser Int
ifexpr = do symbol "if"
            cond <- expr
            symbol "then"
            thenExpr <- expr
            symbol "else"
            elseExpr <- expr
            return (if cond == 0 then elseExpr else thenExpr)

And this is the equivalent applicative version (<$> is just an infix synonym for fmap):

ifexpr' :: Parser Int
ifexpr' =
  selector <$> symbol "if" <*> expr
           <*> symbol "then" <*> expr
           <*> symbol "else" <*> expr
  where selector _ cond _ t _ e = if cond == 0 then e else t

Which one is better? It's really a matter of personal taste. Since both the monadic and applicative styles deal in Parsers, they can be freely mixed and combined.

[1]	Failures could also be signaled by using `Maybe`, but a list lets us express multiple results (for example a string that can be parsed in multiple ways). We're not going to be using multiple results in this article, but it's good to keep this option open.

[2]	We could also use the monadic operator `>>` for statements that don't create a new assignment, but using `>>=` everywhere for consistency makes it a bit easier to understand.

[3]	The return value of this parser is `Int`, because it evaluates the parsed expression on the fly - this technique is called Syntax Directed Translation in the Dragon book. Note also that the conditional clauses are evaluated eagerly, which is valid only when no side effects are present.