Eli Bendersky's website - Recursive descent parsing

Ungrammar in Go and resilient parsing

2023-07-08T06:12:00-07:00

It won't be news to the readers of this blog that I have some interest in compiler front-ends. So when I heard about a new(-ish) DSL for concrete syntax trees (CST), I couldn't resist playing with it a bit.

Ungrammar is used in rust-analyzer to define and access a CST for Rust. This blog post by its creator provides many more details. According to the author, Ungrammar is "the ASDL for concrete syntax trees". This sounded interesting, since I've dabbled in ASDL in the past, and also have experience with similar techniques for defining pycparser ASTs.

The result is go-ungrammar, a re-implementation of Ungrammar in Go. The input is an Ungrammar file defining some CST; for example, here's a simple calculator language:

Program = Stmt*

Stmt = AssignStmt | Expr

AssignStmt = 'set' 'ident' '=' Expr

Expr =
    Literal
  | UnaryExpr
  | ParenExpr
  | BinExpr

UnaryExpr = op:('+' | '-') Expr

ParenExpr = '(' Expr ')'

BinExpr = lhs:Expr op:('+' | '-' | '*' | '/' | '%') rhs:Expr

Literal = 'int_literal' | 'ident'

Ungrammar looks a bit like EBNF, but not quite (hence the name "ungrammar"). It's much simpler because it doesn't need to concern itself with precedence, ambiguities and so on, and it leaves all the (often complex) lexical rules to the lexer. It simply defines a tree that can be used to represent a parsed language. It also differs from an AST in that it preserves all tokens, including delimiters and other syntax elements. This is useful for tools like language servers that need a full-fidelity representation of the source code.
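
To make the "preserves all tokens" point concrete, here is a minimal Python sketch of what a CST for the input `set x = (1 + 2)` might look like under the calculator Ungrammar above. The `Token` and `Node` classes and the `tokens` helper are hypothetical names made up for this illustration; they are not part of go-ungrammar or rust-analyzer, and the sketch skips the intermediate `Stmt` and `Expr` alternation nodes for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    kind: str   # token name as written in the Ungrammar, e.g. 'ident' or '('
    text: str   # the exact source text of the token


@dataclass
class Node:
    kind: str                                     # rule name, e.g. 'ParenExpr'
    children: list = field(default_factory=list)  # Nodes and Tokens, in source order


def tokens(tree):
    """Return the text of every token under tree, left to right."""
    if isinstance(tree, Token):
        return [tree.text]
    return [t for child in tree.children for t in tokens(child)]


# Hand-built CST for: set x = (1 + 2)
cst = Node('AssignStmt', [
    Token('set', 'set'),
    Token('ident', 'x'),
    Token('=', '='),
    Node('ParenExpr', [
        Token('(', '('),
        Node('BinExpr', [
            Node('Literal', [Token('int_literal', '1')]),
            Token('+', '+'),
            Node('Literal', [Token('int_literal', '2')]),
        ]),
        Token(')', ')'),
    ]),
])

print(tokens(cst))  # ['set', 'x', '=', '(', '1', '+', '2', ')']
```

An AST for the same input would typically drop the `set` keyword, the `=` and the parentheses; in the CST they all remain reachable from the tree.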

Implementation notes

go-ungrammar uses a classical hand-written lexical analyzer and a recursive descent parser. Just for fun, I spent more time on error recovery than strictly necessary for such a simple input language. The lexer never gives up when encountering nonsensical input; it simply emits an ERROR token and keeps going. The parser doesn't quit on the first error either; instead, it collects all the errors it encounters and tries to recover from each one (the synchronize() method in the parser code). As an example of this in action, consider this faulty Ungrammar input:

foo = @
bar = ( joe
x = y

At first glance, there are at least a couple of issues here:

@ is not a valid Ungrammar token
The ( in the second rule is unterminated; as all programmers know, unterminated grouping elements spell trouble because the compiler can get easily confused until it finds a valid terminator

When go-ungrammar runs it will report an error that looks like this:

1:7: unknown token starting with '@' (and 2 more errors)

The concrete error type returned by the parser collects all the errors, so we can iterate over them and display them all:

1:7: unknown token starting with '@'
2:1: expected rule, got bar
3:1: expected ')', got x

The parser recovers after the first error expecting to see the RHS (right-hand-side) for the foo rule, but doesn't find any. This is a good place to discuss parser recovery. The Ungrammar language has a significant ambiguity:

foo = bar baz = barn

Are bar baz the RHS sequence for rule foo, or is baz = the beginning of a new rule? Note that the language is whitespace-insensitive, so this really does come up; just look at the example calculator Ungrammar above - this is encountered on pretty much any new rule.

The way go-ungrammar resolves the ambiguity is by using a NODE = lookahead, deciding it's the beginning of a new rule (NODE is an Ungrammar term for "plain identifier").

Back to our recovery example: the second error is the parser complaining that it expected some rule after foo = but found none. The @ was reported and skipped, and an empty RHS is invalid in Ungrammar, so the parser complains that it found a new rule definition instead of the RHS for an existing rule. At this point it re-synchronizes and parses the bar = rule. Then it runs into the third error - the ( is unterminated. Still, the parser recovers and keeps going.

Even with all these errors, the parser will produce a partial result - a tree equivalent to this input:

bar = joe
x = y

For foo there was simply nothing to parse. For bar, the parser reported the missing ) but parsed the contents anyway. It then fully recovered and was able to parse x = y properly. Being able to parse incomplete input and produce partial trees is very important for error recovery, and especially for tools like language servers that need to be resilient in the presence of partial input the user is busy typing in.
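
The recovery strategy described in this section can be sketched in a few dozen lines of Python. This is a simplified illustration under my own assumptions, not a port of go-ungrammar: it only understands names, `=` and parentheses, it flattens groups into a plain list of names, and its function names (`lex`, `parse`, `at_rule_start`) and error messages are invented for the example. What it does share with the description above is the shape of the approach: the lexer emits ERROR tokens instead of stopping, the parser collects errors in a list, and a `NODE =` lookahead is used both to end a rule and to re-synchronize.

```python
import re

# A name, one of the punctuation tokens, or any other non-space character.
TOKEN_RE = re.compile(r"\s*(?:([A-Za-z_]\w*)|([=()])|(\S))")


def lex(src):
    """Tokenize src; unknown characters become ERROR tokens instead of aborting."""
    toks = []
    for name, punct, bad in TOKEN_RE.findall(src):
        if name:
            toks.append(("NODE", name))
        elif punct:
            toks.append((punct, punct))
        else:
            toks.append(("ERROR", bad))
    toks.append(("EOF", ""))
    return toks


def parse(src):
    """Return (rules, errors); rules maps each rule name to its RHS names."""
    toks, pos, rules, errors = lex(src), 0, {}, []

    def at_rule_start(i):
        # The `NODE =` lookahead: a name followed by '=' begins a new rule.
        return toks[i][0] == "NODE" and toks[i + 1][0] == "="

    while toks[pos][0] != "EOF":
        if not at_rule_start(pos):
            errors.append(f"expected rule, got {toks[pos][1]!r}")
            pos += 1
            # Synchronize: skip ahead to the next plausible rule start.
            while toks[pos][0] != "EOF" and not at_rule_start(pos):
                pos += 1
            continue

        name, items, depth = toks[pos][1], [], 0
        pos += 2  # consume the rule name and '='
        while toks[pos][0] != "EOF" and not at_rule_start(pos):
            kind, text = toks[pos]
            if kind == "NODE":
                items.append(text)
            elif kind == "(":
                depth += 1
            elif kind == ")" and depth > 0:
                depth -= 1
            else:  # ERROR token, stray ')' or stray '='
                errors.append(f"unexpected {text!r} in rule {name}")
            pos += 1

        if depth > 0:
            errors.append(f"expected ')' in rule {name}")
        if items:
            rules[name] = items  # keep the partial result even after errors
        else:
            errors.append(f"rule {name} has an empty right-hand side")
    return rules, errors


rules, errors = parse("foo = @\nbar = ( joe\nx = y")
print(rules)   # {'bar': ['joe'], 'x': ['y']}
for e in errors:
    print(e)
```

On the faulty input from this section the sketch also reports three errors and keeps the `bar` and `x` rules, though the wording and positions of its messages differ from go-ungrammar's.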

I enjoyed coding this resilient parser; while it's probably overkill for a language as simple as Ungrammar, it's a good kata for frontend construction.

Deciphering Haskell's applicative and monadic parsers

2017-11-27T05:28:00-08:00

This post follows the construction of parsers described in Graham Hutton's "Programming in Haskell" (2nd edition). It's my attempt to work through chapter 13 in this book and understand the details of applicative and monadic combination of parsers presented therein.

Basic definitions for the Parser type

A parser parameterized on some type a is:

newtype Parser a = P (String -> [(a,String)])

It's a function taking a String and returning a list of (a,String) pairs, where a is a value of the parameterized type and String is (by convention) the unparsed remainder of the input. The returned list is potentially empty, which signals a failure in parsing [1]. It might have made more sense to define Parser as a type alias for the function, but type aliases can't be made into instances of typeclasses in standard Haskell; therefore, we use newtype with a dummy constructor named P.

With this Parser type, the act of actually parsing a string is expressed with the following helper function. It's not strictly necessary, but it helps make code cleaner by hiding P from users of the parser.

parse :: Parser a -> String -> [(a,String)]
parse (P p) inp = p inp

The most basic parsing primitive plucks off the first character from a given string:

item :: Parser Char
item = P (\inp -> case inp of
                    []      -> []
                    (x:xs)  -> [(x,xs)])

Here's how it works in practice:

> parse item "foo"
[('f',"oo")]
> parse item "f"
[('f',"")]
> parse item ""
[]

Parser as a Functor

We'll start by making Parser an instance of Functor:

instance Functor Parser where
  -- fmap :: (a -> b) -> Parser a -> Parser b
  fmap g p = P (\inp -> case parse p inp of
                          []        -> []
                          [(v,out)] -> [(g v,out)])

With fmap we can create a new parser from an existing parser, with a function applied to the parser's output. For example:

> parse (fmap toUpper item) "foo"
[('F',"oo")]
> parse (fmap toUpper item) ""
[]

Let's check that the functor laws work for this definition. The first law:

fmap id = id

Is fairly obvious when we substitute id for g in the definition of fmap. We get:

fmap id p = P (\inp -> case parse p inp of
                        []        -> []
                        [(v,out)] -> [(id v,out)])

Which takes the parse result of p and passes it through without modification. In other words, it's equivalent to p itself, and hence the first law holds.

Verifying the second law:

fmap (g . h) = fmap g . fmap h

... is similarly straightforward and is left as an exercise to the reader.
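
As a sanity check (not a proof), both laws can be spot-checked on a few inputs. The sketch below transliterates `parse`, `item` and `fmap` into Python, with a parser modeled as a plain function from a string to a list of (value, rest) pairs. The Python names and the sample functions `g` and `h` are my own choices for illustration; the list comprehension in `fmap` also handles more than one result, which the Haskell `case` above doesn't need to.

```python
def parse(p, inp):
    """Run parser p on inp; mirrors the Haskell helper of the same name."""
    return p(inp)


def item(inp):
    """Consume one character; fail (empty list) on empty input."""
    return [(inp[0], inp[1:])] if inp else []


def fmap(g, p):
    """Build a parser that applies g to the result of p."""
    return lambda inp: [(g(v), out) for (v, out) in parse(p, inp)]


def identity(x):
    return x


def compose(g, h):
    return lambda x: g(h(x))


samples = ["foo", "f", ""]

# First law: fmap id = id
law1 = all(parse(fmap(identity, item), s) == parse(item, s) for s in samples)

# Second law: fmap (g . h) = fmap g . fmap h
g = str.upper


def h(c):
    return c * 2


law2 = all(
    parse(fmap(compose(g, h), item), s) == parse(fmap(g, fmap(h, item)), s)
    for s in samples
)

print(law1, law2)  # True True
```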

While it's not obvious why a Functor instance for Parser is useful in its own right, it's actually required to make Parser into an Applicative, and also when combining parsers using applicative style.

Parser as an Applicative

Consider parsing conditional expressions in a fictional language:

if <expr> then <expr> else <expr>

To parse such expressions we'd like to say:

Parse the token if
Parse an <expr>
Parse the token then
Parse an <expr>
Parse the token else
Parse an <expr>
If all of this was successful, combine all the parsed expressions into some sort of result, like an AST node.

Such sequences, along with alternation (an expression is either this or that), are two of the critical building blocks for constructing non-trivial parsers. Let's see a popular way to accomplish this in Haskell (for a complete example demonstrating how to construct a parser for this particular conditional expression, see the last section in this post).

Parser combinators are a popular technique for constructing complex parsers from simpler parsers, by means of higher-order functions. In Haskell, one of the ways in which parsers can be elegantly combined is using applicative style. Here's the Applicative instance for Parser.

instance Applicative Parser where
  -- pure :: a -> Parser a
  pure v = P (\inp -> [(v,inp)])

  -- <*> :: Parser (a -> b) -> Parser a -> Parser b
  pg <*> px = P (\inp -> case parse pg inp of
                            []        -> []
                            [(g,out)] -> parse (fmap g px) out)

Recall how we created a parser that applied toUpper to its result using fmap? We can now do the same in applicative style:

> parse (pure toUpper <*> item) "foo"
[('F',"oo")]

Let's see why this works. While not too exciting on its own, this application of a single-argument function is a good segue to more complicated use cases.

Looking at the Applicative instance, pure toUpper translates to P (\inp -> [(toUpper,inp)]) - a parser that passes its input through unchanged, returning toUpper as a result. Now, substituting item into the definition of <*> we get:

pg <*> item = P (\inp -> case parse pg inp of
                            []        -> []
                            [(g,out)] -> parse (fmap g item) out)

... pg is (pure toUpper), the parsing of which always succeeds, returning
    [(toUpper,inp)]

pg <*> item = P (\inp -> parse (fmap toUpper item) inp)

In other words, this is exactly the example we had for Functor by fmap-ing toUpper onto item.

The more interesting case is applying functions with multiple parameters. Here's how we define a parser that parses three items from the input, dropping the middle result:

dropMiddle :: Parser (Char,Char)
dropMiddle =
  pure selector <*> item <*> item <*> item
  where selector x y z = (x,z)

Following the application of nested <*> operators is tricky because it builds a run-time chain of functions referring to other functions. This chain is only collapsed when the parser is used to actually parse some input, so it is necessary to keep a lot of context "on the fly". To better understand how this works, we can break the definition of dropMiddle into parts as follows (since <*> is left-associative):

dropMiddle =
  ((pure selector <*> item) <*> item) <*> item
  where selector x y z = (x,z)

Applying the first <*>:

pg <*> item = P (\inp -> case parse pg inp of
                            []        -> []
                            [(g,out)] -> parse (fmap g item) out)

... pg is (pure selector), the parsing of which always succeeds, returning
    [(selector,inp)]

pg <*> item = P (\inp -> parse (fmap selector item) inp)  --= app1

Let's call this parser app1 and apply the second <*> in the sequence.

app1 <*> item = P (\inp -> case parse app1 inp of
                            []        -> []
                            [(g,out)] -> parse (fmap g item) out)  --= app2

We'll call this app2 and move on. Similarly, applying the third <*> in the sequence produces:

app2 <*> item = P (\inp -> case parse app2 inp of
                            []        -> []
                            [(g,out)] -> parse (fmap g item) out)

This is dropMiddle. It's a chain of parsers expressed as a combination of higher-order functions (closures, actually).

To see how this combined parser actually parses input, let's trace through the execution of:

> parse dropMiddle "pumpkin"
[(('p','m'),"pkin")]

dropMiddle is app2 <*> item, so we have:

-- parse dropMiddle

parse P (\inp -> case parse app2 inp of
                   []         -> []
                   [(g,out)]  -> parse (fmap g item) out)
      "pumpkin"

.. substituting "pumpkin" into inp

case parse app2 "pumpkin" of
 []         -> []
 [(g,out)]  -> parse (fmap g item) out

Now parse app2 "pumpkin" is going to be invoked; app2 is app1 <*> item:

-- parse app2

case parse app1 "pumpkin" of
 []         -> []
 [(g,out)]  -> parse (fmap g item) out

Similarly, we get to parse app1 "pumpkin":

-- parse app1

parse (fmap selector item) "pumpkin"

.. following the definition of fmap

parse P (\inp -> case parse item inp of
                  []        -> []
                  [(v,out)] -> [(selector v,out)])
      "pumpkin"

.. Since (parse item "pumpkin") returns [('p',"umpkin")], we get:

[(selector 'p',"umpkin")]

Now going back to parse app2, knowing what parse app1 "pumpkin" returns:

parse (fmap (selector 'p') item) "umpkin"

.. following the definition of fmap

parse P (\inp -> case parse item inp of
                  []        -> []
                  [(v,out)] -> [(selector 'p' v,out)])
      "umpkin"

[(selector 'p' 'u',"mpkin")]

Finally, dropMiddle:

app2 <*> item = P (\inp -> case parse app2 inp of
                            []        -> []
                            [(g,out)] -> parse (fmap g item) out)

.. Since (parse app2 "pumpkin") returns [(selector 'p' 'u',"mpkin")]

parse (fmap (selector 'p' 'u') item) "mpkin"

.. If we follow the definition of fmap again, we'll get:

[(selector 'p' 'u' 'm',"pkin")]

This is the final result of applying dropMiddle to "pumpkin", and when selector is invoked we get [(('p','m'),"pkin")], as expected.
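
For readers who find closures easier to follow in a strict language, the same chain can be mimicked in Python. This is only an analogy under my own naming: `ap` stands in for `<*>`, a parser is a plain function from a string to a list of (value, rest) pairs, and `selector` has to be curried by hand because Python functions aren't curried automatically.

```python
def item(inp):
    """Consume one character; fail (empty list) on empty input."""
    return [(inp[0], inp[1:])] if inp else []


def pure(v):
    """A parser that consumes nothing and yields v."""
    return lambda inp: [(v, inp)]


def fmap(g, p):
    """A parser that applies g to the result of p."""
    return lambda inp: [(g(v), out) for (v, out) in p(inp)]


def ap(pg, px):
    """Stand-in for <*>: run pg, then fmap its function result over px."""
    def run(inp):
        return [res for (g, out) in pg(inp) for res in fmap(g, px)(out)]
    return run


def selector(x):
    # Hand-curried three-argument function that drops the middle argument.
    return lambda y: lambda z: (x, z)


# ((pure selector <*> item) <*> item) <*> item
app1 = ap(pure(selector), item)
app2 = ap(app1, item)
drop_middle = ap(app2, item)

print(drop_middle("pumpkin"))  # [(('p', 'm'), 'pkin')]
print(drop_middle("pu"))       # [] (the third item fails)
```

Nothing is parsed while `app1`, `app2` and `drop_middle` are being built; the nested closures only run once `drop_middle` is applied to an input string, which matches the trace above.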

Parser as a Monad

Parsers can also be expressed and combined using monadic style. Here's the Monad instance for Parser:

instance Monad Parser where
  -- return :: a -> Parser a
  return = pure

  -- (>>=) :: Parser a -> (a -> Parser b) -> Parser b
  p >>= f = P (\inp -> case parse p inp of
                          []        -> []
                          [(v,out)] -> parse (f v) out)

Let's take the simple example of applying toUpper to item again, this time using monadic operators:

> parse (item >>= (\x -> return $ toUpper x)) "foo"
[('F',"oo")]

Substituting in the definition of >>=:

item >>= (\x -> return $ toUpper x) =
  P (\inp -> case parse item inp of
                []        -> []
                [(v,out)] -> parse (return $ toUpper v) out)

... if item succeeds, this is a parser that will always succeed with
    the upper-cased result of item

When writing in monadic style, however, we won't typically be using the >>= operator explicitly; instead, we'll use the do notation. Recall that in the general multi-parameter case, this:

m1 >>= \x1 ->
  m2 >>= \x2 ->
    ...
      mn >>= \xn -> f x1 x2 ... xn

Is equivalent to this:

do x1 <- m1
   x2 <- m2
   ...
   xn <- mn
   f x1 x2 ... xn

So we can also rewrite our example as:

> parse (do x <- item; return $ toUpper x) "foo"
[('F',"oo")]

The do notation starts looking much more attractive for multiple parameters, however. Here's dropMiddle in monadic style written directly [2]:

dropMiddleM :: Parser (Char,Char)
dropMiddleM = item >>= \x ->
                item >>= \_ ->
                  item >>= \z -> return (x,z)

And now rewritten using do:

dropMiddleM' :: Parser (Char,Char)
dropMiddleM' =
  do  x <- item
      item
      z <- item
      return (x,z)

Let's do a detailed breakdown of what's happening here to better understand the monadic sequencing mechanics. I'll be using the direct style (dropMiddleM) to unravel the applications of >>=:

item >>= \x ->
  item >>= \_ ->
    item >>= \z -> return (x,z)

.. applying the first >>=, calling the right-hand side rhsX

P (\inp -> case parse item inp of
              []        -> []
              [(v,out)] -> parse (rhsX v) out)

.. the result of parsing the first item is passed in as the argument to rhsX,
   which then returns the next application of >>=; As usual, we acknowledge
   the error propagation and ignore it for simplicity.

P (\inp -> case parse item inp of
              []        -> []
              [(v,out)] -> parse (rhsY v) out)

... and similarly for rhsZ; the final result is invoking "parse return (x,z)"
    where x is the result of parsing the first item and z the result of
    parsing the third.
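
The same unraveling can be replayed in Python, again as an analogy rather than a translation of the book's code. Here `bind` stands in for `>>=` and `unit` for `return` (which is a keyword in Python); both names are my own, and a parser is modeled as a function from a string to a list of (value, rest) pairs.

```python
def item(inp):
    """Consume one character; fail (empty list) on empty input."""
    return [(inp[0], inp[1:])] if inp else []


def unit(v):
    """Stand-in for return/pure: succeed with v without consuming input."""
    return lambda inp: [(v, inp)]


def bind(p, f):
    """Stand-in for >>=: run p, pass its result to f, run the parser f returns."""
    return lambda inp: [res for (v, out) in p(inp) for res in f(v)(out)]


# item >>= \x -> item >>= \_ -> item >>= \z -> return (x,z)
drop_middle_m = bind(item, lambda x:
                bind(item, lambda _:
                bind(item, lambda z: unit((x, z)))))

print(drop_middle_m("pumpkin"))  # [(('p', 'm'), 'pkin')]
print(drop_middle_m("p"))        # [] (the second item fails)
```

Each lambda plays the role of one of the right-hand sides (rhsX, rhsY, rhsZ): it receives the value parsed so far and returns the parser for the rest of the sequence, so `x` is still in scope when the innermost `unit((x, z))` is built.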

A complete example

As a complete example, I've expanded the parser grammar found in the book to support conditional expressions. The full example is available here. Recall that we want to parse expressions of the form:

if <expr> then <expr> else <expr>

This is the monadic parser [3]:

ifexpr :: Parser Int
ifexpr = do symbol "if"
            cond <- expr
            symbol "then"
            thenExpr <- expr
            symbol "else"
            elseExpr <- expr
            return (if cond == 0 then elseExpr else thenExpr)

And this is the equivalent applicative version (<$> is just an infix synonym for fmap):

ifexpr' :: Parser Int
ifexpr' =
  selector <$> symbol "if" <*> expr
           <*> symbol "then" <*> expr
           <*> symbol "else" <*> expr
  where selector _ cond _ t _ e = if cond == 0 then e else t

Which one is better? It's really a matter of personal taste. Since both the monadic and applicative styles deal in Parsers, they can be freely mixed and combined.

[1]	Failures could also be signaled by using `Maybe`, but a list lets us express multiple results (for example a string that can be parsed in multiple ways). We're not going to be using multiple results in this article, but it's good to keep this option open.

[2]	We could also use the monadic operator `>>` for statements that don't create a new assignment, but using `>>=` everywhere for consistency makes it a bit easier to understand.

[3]	The return value of this parser is `Int`, because it evaluates the parsed expression on the fly - this technique is called Syntax Directed Translation in the Dragon book. Note also that the conditional clauses are evaluated eagerly, which is valid only when no side effects are present.

Parsing expressions by precedence climbing

2012-08-02T05:48:43-07:00

I've written previously about the problem recursive descent parsers have with expressions, especially when the language has multiple levels of operator precedence.

There are several ways to attack this problem. The Wikipedia article on operator-precedence parsers mentions three algorithms: Shunting Yard, top-down operator precedence (TDOP) and precedence climbing. I have already covered Shunting Yard and TDOP in this blog. Here I aim to present the third method (and the one that actually ends up being used a lot in practice) - precedence climbing.

Precedence climbing - what it aims to achieve

It's not necessary to be familiar with the other algorithms for expression parsing in order to understand precedence climbing. In fact, I think that precedence climbing is the simplest of them all. To explain it, I want to first present what the algorithm is trying to achieve. After this, I will explain how it does this, and finally will present a fully functional implementation in Python.

So the basic goal of the algorithm is the following: treat an expression as a bunch of nested sub-expressions, where each sub-expression has in common the lowest precedence level of the operators it contains.

Here's a simple example:

2 + 3 * 4 * 5 - 6

Assuming that the precedence of + (and -) is 1 and the precedence of * (and /) is 2, we have:

2 + 3 * 4 * 5 - 6

|---------------|   : prec 1
    |-------|       : prec 2

The sub-expression multiplying the three numbers has a minimal precedence of 2. The sub-expression spanning the whole original expression has a minimal precedence of 1.

Here's a more complex example, adding a power operator ^ with precedence 3:

2 + 3 ^ 2 * 3 + 4

|---------------|   : prec 1
    |-------|       : prec 2
    |---|           : prec 3

Associativity

Binary operators, in addition to precedence, also have the concept of associativity. Simply put, left associative operators stick to the left stronger than to the right; right associative operators vice versa.

Some examples. Since addition is left associative, this:

2 + 3 + 4

Is equivalent to this:

(2 + 3) + 4

On the other hand, power (exponentiation) is right associative. This:

2 ^ 3 ^ 4

Is equivalent to this:

2 ^ (3 ^ 4)

The precedence climbing algorithm also needs to handle associativity correctly.
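
Python's own operators give a quick way to see why the grouping matters. The snippet uses subtraction instead of addition for the left-associative case, because with addition both groupings produce the same value; Python spells exponentiation as `**`, and it groups right to left.

```python
# Subtraction is left associative: a - b - c means (a - b) - c.
left = 10 - 4 - 3
print(left, (10 - 4) - 3, 10 - (4 - 3))     # 3 3 9

# Exponentiation (** in Python) is right associative: a ** b ** c means a ** (b ** c).
right = 2 ** 3 ** 2
print(right, 2 ** (3 ** 2), (2 ** 3) ** 2)  # 512 512 64
```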

Nested parenthesized sub-expressions

Finally, we all know that parentheses can be used to explicitly group sub-expressions, beating operator precedence. So the following expression computes the addition before the multiplication:

2 * (3 + 5) * 7

As we'll see, the algorithm has a special provision to cleverly handle nested sub-expressions.

Precedence climbing - how it actually works

First let's define some terms. Atoms are either numbers or parenthesized expressions. Expressions consist of atoms connected by binary operators [1]. Note how these two terms are mutually dependent. This is normal in the land of grammars and parsers.

The algorithm is operator-guided. Its fundamental step is to consume the next atom and look at the operator following it. If the operator has precedence lower than the lowest acceptable for the current step, the algorithm returns. Otherwise, it calls itself in a loop to handle the sub-expression. In pseudo-code, it looks like this [2]:

compute_expr(min_prec):
  result = compute_atom()

  while cur token is a binary operator with precedence >= min_prec:
    prec, assoc = precedence and associativity of current token
    if assoc is left:
      next_min_prec = prec + 1
    else:
      next_min_prec = prec
    rhs = compute_expr(next_min_prec)
    result = compute operator(result, rhs)

  return result

Each recursive call here handles a sequence of operator-connected atoms sharing the same minimal precedence.

An example

To get a feel for how the algorithm works, let's start with an example:

2 + 3 ^ 2 * 3 + 4

I recommend following the execution of the algorithm through this expression on paper. The computation is kicked off by calling compute_expr(1), because 1 is the minimal operator precedence among all operators we've defined. Here is the "call tree" the algorithm produces for this expression:

* compute_expr(1)                # Initial call on the whole expression
  * compute_atom() --> 2
  * compute_expr(2)              # Loop entered, operator '+'
    * compute_atom() --> 3
    * compute_expr(3)
      * compute_atom() --> 2
      * result --> 2             # Loop not entered for '*' (prec < '^')
    * result = 3 ^ 2 --> 9
    * compute_expr(3)
      * compute_atom() --> 3
      * result --> 3             # Loop not entered for '+' (prec < '*')
    * result = 9 * 3 --> 27
  * result = 2 + 27 --> 29
  * compute_expr(2)              # Loop entered, operator '+'
    * compute_atom() --> 4
    * result --> 4               # Loop not entered - end of expression
  * result = 29 + 4 --> 33
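
The call tree can be checked mechanically with a stripped-down version of the pseudo-code. The sketch below is deliberately simplified and is not the implementation presented later in this article: it assumes tokens are separated by spaces, supports no parentheses, and records the min_prec argument of every compute_expr call in a list (the `evaluate` wrapper, `trace`, `OPS` and `APPLY` are names invented for this illustration).

```python
# (precedence, associativity) per operator, as defined in this article.
OPS = {'+': (1, 'LEFT'), '-': (1, 'LEFT'), '*': (2, 'LEFT'),
       '/': (2, 'LEFT'), '^': (3, 'RIGHT')}
APPLY = {'+': lambda a, b: a + b, '-': lambda a, b: a - b,
         '*': lambda a, b: a * b, '/': lambda a, b: a // b,
         '^': lambda a, b: a ** b}


def evaluate(text):
    """Evaluate a space-separated expression; return (value, min_prec trace)."""
    tokens, pos, trace = text.split(), 0, []

    def compute_expr(min_prec):
        nonlocal pos
        trace.append(min_prec)     # one entry per call, in call order
        result = int(tokens[pos])  # compute_atom: numbers only in this sketch
        pos += 1
        while pos < len(tokens) and OPS[tokens[pos]][0] >= min_prec:
            op = tokens[pos]
            prec, assoc = OPS[op]
            pos += 1               # consume the operator
            rhs = compute_expr(prec + 1 if assoc == 'LEFT' else prec)
            result = APPLY[op](result, rhs)
        return result

    return compute_expr(1), trace


value, trace = evaluate("2 + 3 ^ 2 * 3 + 4")
print(value, trace)  # 33 [1, 2, 3, 3, 2]
```

The trace lists the calls in the order they appear in the tree above: the initial compute_expr(1), then compute_expr(2) for the first +, two compute_expr(3) calls for ^ and *, and a final compute_expr(2) for the second +.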

Handling precedence

Note that the algorithm makes one recursive call per binary operator. Some of these calls are short lived - they will only consume an atom and return it because the while loop is not entered (this happens on the second 2, as well as on the second 3 in the example expression above). Some are longer lived. The initial call to compute_expr will compute the whole expression.

The while loop is the essential ingredient here. It's the thing that makes sure that the current compute_expr call handles all consecutive operators with the given minimal precedence before exiting.

Handling associativity

In my opinion, one of the coolest aspects of this algorithm is the simple and elegant way it handles associativity. It's all in that condition that either sets the minimal precedence for the next call to the current one, or current one plus one.

Here's how this works. Assume we have this sub-expression somewhere:

8 * 9 * 10

  ^
  |

The arrow marks where the compute_expr call is, having entered the while loop. prec is 2. Since the associativity of * is left, next_min_prec is set to 3. The recursive call to compute_expr(3), after consuming an atom (9), sees the next * token, the second one in the expression.

Since the precedence of * is 2, while min_prec is 3, the while loop never runs and the call returns. So the original compute_expr will get to handle the second multiplication, not the internal call. Essentially, this means that the expression is grouped as follows:

(8 * 9) * 10

Which is exactly what we want from left associativity.

In contrast, for this expression:

8 ^ 9 ^ 10

The precedence of ^ is 3, and since it's right associative, the min_prec for the recursive call stays 3. This will mean that the recursive call will consume the next ^ operator before returning to the original compute_expr, grouping the expression as follows:

8 ^ (9 ^ 10)
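
To see the grouping directly rather than infer it from the numeric result, the loop can be made to build a fully parenthesized string instead of computing a value. This is a small sketch under the same simplifying assumptions as before (space-separated tokens, no parentheses in the input); the `group` function is a hypothetical helper, not part of the implementation shown later.

```python
# (precedence, associativity) per operator, as defined in this article.
OPS = {'+': (1, 'LEFT'), '-': (1, 'LEFT'), '*': (2, 'LEFT'),
       '/': (2, 'LEFT'), '^': (3, 'RIGHT')}


def group(text):
    """Return text with explicit parentheses showing how it gets grouped."""
    tokens, pos = text.split(), 0

    def expr(min_prec):
        nonlocal pos
        lhs = tokens[pos]  # an atom, kept as text
        pos += 1
        while pos < len(tokens) and OPS[tokens[pos]][0] >= min_prec:
            op = tokens[pos]
            prec, assoc = OPS[op]
            pos += 1
            # Left associative: the recursive call must not take an operator
            # of the same precedence. Right associative: it may.
            rhs = expr(prec + 1 if assoc == 'LEFT' else prec)
            lhs = f"({lhs} {op} {rhs})"
        return lhs

    return expr(1)


print(group("8 * 9 * 10"))  # ((8 * 9) * 10)
print(group("8 ^ 9 ^ 10"))  # (8 ^ (9 ^ 10))
```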

Handling sub-expressions

The algorithm pseudo-code presented above doesn't explain how parenthesized sub-expressions are handled. Consider this expression:

2000 * (4 - 3) / 100

It's not clear how the while loop can handle this. The answer is compute_atom. When it sees a left paren, it knows that a sub-expression will follow, so it calls compute_expr on the sub expression (which lasts until the matching right paren), and returns its result as the result of the atom. So compute_expr is oblivious to the existence of sub-expressions.

Finally, in order to stay short the pseudo-code leaves some interesting details out. What follows is a full implementation of the algorithm that fills all the gaps.

A Python implementation

Here is a Python implementation of expression parsing by precedence climbing. It's kept short for simplicity, but can be easily expanded to cover a more real-world language of expressions. The following sections present the code in small chunks. The whole code is available here.

I'll start with a small tokenizer class that breaks text into tokens and keeps a state. The grammar is very simple: numeric expressions, the basic arithmetic operators +, -, *, /, ^ and parens - (, ).

from collections import namedtuple
import re

Tok = namedtuple('Tok', 'name value')


class Tokenizer(object):
    """ Simple tokenizer object. The cur_token attribute holds the current
        token (Tok). Call get_next_token() to advance to the
        next token. cur_token is None before the first token is
        taken and after the source ends.
    """
    TOKPATTERN = re.compile(r"\s*(?:(\d+)|(.))")

    def __init__(self, source):
        self._tokgen = self._gen_tokens(source)
        self.cur_token = None

    def get_next_token(self):
        """ Advance to the next token, and return it.
        """
        try:
            self.cur_token = next(self._tokgen)
        except StopIteration:
            self.cur_token = None
        return self.cur_token

    def _gen_tokens(self, source):
        for number, operator in self.TOKPATTERN.findall(source):
            if number:
                yield Tok('NUMBER', number)
            elif operator == '(':
                yield Tok('LEFTPAREN', '(')
            elif operator == ')':
                yield Tok('RIGHTPAREN', ')')
            else:
                yield Tok('BINOP', operator)

Next, compute_atom:

def compute_atom(tokenizer):
    tok = tokenizer.cur_token
    # Check for the end of the source first: None has no .name attribute.
    if tok is None:
        parse_error('source ended unexpectedly')
    elif tok.name == 'LEFTPAREN':
        tokenizer.get_next_token()
        val = compute_expr(tokenizer, 1)
        # cur_token is None if the source ended before the closing paren.
        if (tokenizer.cur_token is None
                or tokenizer.cur_token.name != 'RIGHTPAREN'):
            parse_error('unmatched "("')
        tokenizer.get_next_token()
        return val
    elif tok.name == 'BINOP':
        parse_error('expected an atom, not an operator "%s"' % tok.value)
    else:
        assert tok.name == 'NUMBER'
        tokenizer.get_next_token()
        return int(tok.value)

It handles true atoms (numbers in our case), as well as parenthesized sub-expressions.

Here is compute_expr itself, which is very close to the pseudo-code shown above:

# For each operator, a (precedence, associativity) pair.
OpInfo = namedtuple('OpInfo', 'prec assoc')

OPINFO_MAP = {
    '+':    OpInfo(1, 'LEFT'),
    '-':    OpInfo(1, 'LEFT'),
    '*':    OpInfo(2, 'LEFT'),
    '/':    OpInfo(2, 'LEFT'),
    '^':    OpInfo(3, 'RIGHT'),
}

def compute_expr(tokenizer, min_prec):
    atom_lhs = compute_atom(tokenizer)

    while True:
        cur = tokenizer.cur_token
        if (cur is None or cur.name != 'BINOP'
                        or OPINFO_MAP[cur.value].prec < min_prec):
            break

        # Inside this loop the current token is a binary operator
        assert cur.name == 'BINOP'

        # Get the operator's precedence and associativity, and compute a
        # minimal precedence for the recursive call
        op = cur.value
        prec, assoc = OPINFO_MAP[op]
        next_min_prec = prec + 1 if assoc == 'LEFT' else prec

        # Consume the current token and prepare the next one for the
        # recursive call
        tokenizer.get_next_token()
        atom_rhs = compute_expr(tokenizer, next_min_prec)

        # Update lhs with the new value
        atom_lhs = compute_op(op, atom_lhs, atom_rhs)

    return atom_lhs

The only difference is that this code makes token handling more explicit. It basically follows the usual "recursive-descent protocol". Each recursive call has the current token available in tokenizer.cur_token, and makes sure to consume all the tokens it has handled (by calling tokenizer.get_next_token()).

One additional small piece is missing. compute_op simply performs the arithmetic computation for the supported binary operators:

def compute_op(op, lhs, rhs):
    lhs = int(lhs); rhs = int(rhs)
    if op == '+':   return lhs + rhs
    elif op == '-': return lhs - rhs
    elif op == '*': return lhs * rhs
    elif op == '/': return lhs // rhs  # integer division keeps results as ints
    elif op == '^': return lhs ** rhs
    else:
        parse_error('unknown operator "%s"' % op)

In the real world - Clang

Precedence climbing is used in real-world tools. One example is Clang, the C/C++/ObjC front-end. Clang's parser is hand-written recursive descent, and it uses precedence climbing for efficient parsing of expressions. If you're interested in seeing the code, it's Parser::ParseExpression in lib/Parse/ParseExpr.cpp [3]. This method plays the role of compute_expr. The role of compute_atom is played by Parser::ParseCastExpression.

Other resources

Here are some resources I found useful while writing this article:

The Wikipedia page for Operator-precedence parsing.
The article by Keith Clarke (PDF), one of the early inventors of the technique.
This page by Theodore Norvell, about parsing expressions by recursive descent.
The Clang source code (exact locations given in the previous section).

Update (2016-11-02): Andy Chu notes that precedence climbing and TDOP are pretty much the same algorithm, formulated a bit differently. I tend to agree, and also note that Shunting Yard is again the same algorithm, except that the explicit recursion is replaced by a stack.

[1]	There are a couple of simplifications made here on purpose. First, I assume only numeric expressions. Identifiers that represent variables can also be viewed as atoms. Second, I ignore unary operators. These are quite easy to incorporate into the algorithm by also treating them as atoms. I leave them out for succinctness.

[2]	In this article I present a parser that computes the result of a numeric expression on-the-fly. Modifying it for accumulating the result into some kind of a parse tree is trivial.

[3]	Clang's source code is constantly in flux. This information is correct at least for the date the article was written.

How Clang handles the type / variable name ambiguity of C/C++

2012-07-05T19:35:22-07:00

My previous articles on the context sensitivity and ambiguity of the C/C++ grammar (one, two, three) can probably make me sound pessimistic about the prospect of correctly parsing C/C++, which couldn't be farther from the truth. My gripe is not with the grammar itself (although I admit it's needlessly complex), it's with the inability of Yacc-generated LALR(1) parsers to parse it without considerable hacks. As I've mentioned numerous times before, industrial-strength compilers for C/C++ exist after all, so they do manage to somehow parse these languages.

One of the newest, and in my eyes the most exciting of C/C++ compilers is Clang. Originally developed by Apple as a front-end to LLVM, it's been a vibrant open-source project for the past couple of years with participation from many companies and individuals (although Apple remains the main driving force in the community). Clang, similarly to LLVM, features a modular library-based design and a very clean C++ code-base. Clang's parser is hand-written, based on a standard recursive-descent parsing algorithm.

In this post I want to explain how Clang manages to overcome the ambiguities I mentioned in the previous articles.

No lexer hack

There is no "lexer hack" in Clang. Information flows in a single direction - from the lexer to the parser, not back. How is this managed?

The thing is that the Clang lexer doesn't distinguish between user-defined types and other identifiers. All are marked with the identifier token.

For this code:

typedef int mytype;
mytype bb;

The Clang parser encounters the following tokens (-dump-tokens):

typedef 'typedef'   [StartOfLine]   Loc=<z.c:1:1>
int 'int'           [LeadingSpace]  Loc=<z.c:1:9>
identifier 'mytype' [LeadingSpace]  Loc=<z.c:1:13>
semi ';'                            Loc=<z.c:1:19>
identifier 'mytype' [StartOfLine]   Loc=<z.c:2:1>
identifier 'bb'     [LeadingSpace]  Loc=<z.c:2:8>
semi ';'                            Loc=<z.c:2:10>
eof ''                              Loc=<z.c:4:1>

Note how mytype is always reported as an identifier, both before and after Clang figures out it's actually a user-defined type.

Figuring out what's a type

So if the Clang lexer always reports mytype as an identifier, how does the parser figure out when it is actually a type? By keeping a symbol table.

Well, actually it's not the parser that keeps the symbol table, it's Sema. Sema is the Clang module responsible for semantic analysis and AST construction. It gets invoked from the parser through a generic "actions" interface, which in theory could serve a different client. Although conceptually the parser and Sema are coupled, the actions interface provides a clean separation in the code. The parser is responsible for driving the parsing process, and Sema is responsible for handling semantic information. In this particular case, the symbol table is semantic information, so it's handled by Sema.

To follow this process through, we'll start in Parser::ParseDeclarationSpecifiers [1]. In the C/C++ grammar, type names are part of the "specifiers" in a declaration (that also include things like extern or inline), and following the "recursive-descent protocol", Clang will usually feature a parsing method per grammar rule. When this method encounters an identifier (tok::identifier), it asks Sema whether it's actually a type by calling Actions.getTypeName [2].

Sema::getTypeName calls Sema::LookupName to do the actual name lookup. For C, name lookup rules are relatively simple - you just climb the lexical scope stack the code belongs to, trying to find a scope that defines the name as a type. I've mentioned before that all names in C (including type names) obey lexical scoping rules. With this mechanism, Clang implements the required nested symbol table. Note that this symbol table is queried by Clang in places where a type is actually expected and allowed, not only in declarations. For example, it's also done to disambiguate function calls from casts in some cases.

How does a type actually get into this table, though?

When the parser is done parsing a typedef (and any declarator, for that matter), it calls Sema::ActOnDeclarator. When the latter notices a new typedef and makes sure everything about it is kosher (e.g. it does not re-define a name in the same scope), it adds the new name to the symbol table at the current scope.
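The lookup and registration steps described above can be sketched with a toy nested symbol table. This is only an illustration of the idea, not Clang's code: the `ScopeStack` class and its method names are hypothetical, and real name lookup in Sema handles far more cases.

```python
class ScopeStack:
    """Toy nested symbol table; a stand-in for what Sema tracks, not Clang's API."""

    def __init__(self):
        self.scopes = [{}]  # the innermost scope is the last element

    def push(self):
        self.scopes.append({})

    def pop(self):
        self.scopes.pop()

    def declare(self, name, kind):
        # Mirrors the check described in the text: no re-definition in the same scope.
        scope = self.scopes[-1]
        if name in scope:
            raise ValueError("redefinition of %r in the same scope" % name)
        scope[name] = kind

    def is_type_name(self, name):
        # Climb from the innermost scope outwards; the first hit wins.
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name] == "typedef"
        return False


table = ScopeStack()
table.declare("mytype", "typedef")    # typedef int mytype;
table.push()                          # entering a nested scope
outer_is_type = table.is_type_name("mytype")
table.declare("mytype", "variable")   # a local declaration shadows the typedef
shadowed_is_type = table.is_type_name("mytype")
table.pop()
restored_is_type = table.is_type_name("mytype")
```

Because the lookup walks outward from the innermost scope, a local declaration can hide an outer typedef, and the typedef becomes visible again once the inner scope is popped.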

In Clang's code this whole process looks very clean and intuitive, but in a generated LALR(1) parser it would be utterly impossible, because leaving out the special token for type names and merging it with identifier would create tons of unresolvable reduce-reduce conflicts in the grammar. This is why Yacc-based parsers require a lexer hack to handle this issue.

Class-wide declarations in C++

In the previous post I mentioned how C++ makes this type lookup problem much more difficult by forcing declarations inside a class to be visible throughout the class, even in code that appears before them. Here's a short reminder:

int aa(int arg) {
    return arg;
}

class C {
    int foo(int bb) {
        return (aa)(bb);
    }

    typedef int aa;
};

In this code, even though the typedef appears after foo, the parser must figure out that (aa)(bb) is a cast of bb to type aa, and not the function call aa(bb).

We've seen how Clang can manage to figure out that aa is a type. However, when it parses foo it hasn't even seen the typedef yet, so how does that work?

Delayed parsing of inline method bodies

To solve the problem described above, Clang employs a clever technique. When parsing an inline member function declaration/definition, it does full parsing and semantic analysis of the declaration, leaving the definition for later.

Specifically, the body of an inline method definition is lexed and the tokens are kept in a special buffer for later (this is done by Parser::ParseCXXInlineMethodDef). Once the parser has finished parsing the class, it calls Parser::ParseLexedMethodDefs that does the actual parsing and semantic analysis of the saved method bodies. At this point, all the types declared inside the class are available, so the parser can correctly disambiguate wherever required.
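To make the two-phase idea concrete, here is a minimal sketch under heavy simplifying assumptions. Class members arrive as pre-split tuples, a method body is just a list of token strings, and the only question asked of a body is whether `(aa)(bb)` is a cast or a call. The names `classify` and `parse_class_body` are hypothetical and have nothing to do with Clang's actual functions.

```python
def classify(body, type_names):
    """Decide whether a body of the form ( X ) ( Y ) is a cast or a call."""
    looks_parenthesized = len(body) >= 3 and body[0] == "(" and body[2] == ")"
    if looks_parenthesized and body[1] in type_names:
        return "cast"
    return "call"


def parse_class_body(members):
    type_names = set()
    saved_bodies = []  # method bodies are stashed here, not parsed yet

    # First pass: handle declarations, keep method bodies for later.
    for member in members:
        if member[0] == "typedef":
            type_names.add(member[1])
        elif member[0] == "method":
            _, name, body = member
            saved_bodies.append((name, body))

    # Second pass: the class is complete, so every class-level typedef is known.
    return {name: classify(body, type_names) for name, body in saved_bodies}


foo_body = ["(", "aa", ")", "(", "bb", ")"]
members = [("method", "foo", foo_body), ("typedef", "aa")]

delayed = parse_class_body(members)   # body examined after the typedef is seen
eager = classify(foo_body, set())     # body examined before the typedef is seen
```

Parsing the body eagerly would treat `(aa)(bb)` as a call, since `aa` is not yet known as a type at that point; deferring the body until the class has been fully read yields the cast interpretation.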

Annotation tokens

Although the above is enough to understand how Clang approaches the problem, I want to mention another trick it uses to make parsing more efficient in some cases.

The Sema::getTypeName method mentioned earlier can be costly. It performs a lookup in a set of nested scopes, which may be expensive if the scopes are deeply nested and a name is not actually a type (which is probably most often the case). It's alright (and inevitable!) to do this lookup once, but Clang would like to avoid repeating it for the same token when it backtracks trying to parse a statement in a different way.

A word on what "backtracks" means in this context. Recursive descent parsers are naturally (by their very structure) backtracking. That is, they may try a number of different ways to parse a single grammatical production (be that a statement, an expression, a declaration, or whatever), before finding an approach that succeeds. In this process, the same token may need to be queried more than once.

To avoid this, Clang has special "annotation tokens" it inserts into the token stream. The mechanism is used for other things as well, but in our case we're interested in the tok::annot_typename token. What happens is that the first time the parser encounters a tok::identifier and figures out it's a type, this token gets replaced by tok::annot_typename. The next time the parser encounters this token, it won't have to look up whether it's a type once again, because it's no longer a generic tok::identifier [3].
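The caching effect of annotation tokens can be illustrated with a small sketch. Everything here is hypothetical scaffolding (the `Token` and `TypeOracle` classes, the `token_is_type` helper); only the token kind names are borrowed from the text. The oracle stands in for the costly scope lookup and counts how often it is consulted.

```python
class Token:
    def __init__(self, kind, text):
        self.kind = kind
        self.text = text


class TypeOracle:
    """Stand-in for the expensive scope lookup; counts how often it is asked."""

    def __init__(self, type_names):
        self.type_names = set(type_names)
        self.lookups = 0

    def is_type_name(self, text):
        self.lookups += 1
        return text in self.type_names


def token_is_type(tokens, index, oracle):
    tok = tokens[index]
    if tok.kind == "annot_typename":
        return True  # already annotated, no lookup needed
    if tok.kind != "identifier":
        return False
    if oracle.is_type_name(tok.text):
        # Replace only this instance of the token in the stream.
        tokens[index] = Token("annot_typename", tok.text)
        return True
    return False


oracle = TypeOracle(["mytype"])
stream = [Token("identifier", "mytype"), Token("identifier", "bb")]
first = token_is_type(stream, 0, oracle)    # performs the lookup
second = token_is_type(stream, 0, oracle)   # re-visit after "backtracking"
```

After the two calls the oracle has been consulted only once, and only the first token in the stream has changed kind; other identifiers are left untouched.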

Disclaimer and conclusion

It's important to keep in mind that the cases examined in this post do not represent the full complexity of the C++ grammar. In C++, constructs like qualified names (foo::bar::baz) and templates complicate matters considerably. However, I just wanted to focus on the cases I specifically discussed in previous posts, explaining how Clang addresses them.

To conclude, we've seen how Clang's recursive descent parser manages some of the ambiguities of the C/C++ grammar. For a task that complex, it's inevitable for the code to become non-trivial [4]. That said, Clang has actually managed to keep its code-base relatively clean and logically structured, while at the same time sticking to its aggressive performance goals. Someone with a general understanding of how front-ends work shouldn't require more than a few hours of immersion in the Clang code-base to be able to answer questions about "how does it do that".

[1]	As a rule, all `Parser` code lives in `lib/Parse` in the Clang source tree. `Sema` code lives in `lib/Sema`.

[2]	Here and later I'll skip a lot of details and variations, focusing only on the path I want to use in the example.

[3]	It's very important to note that only this instance of the token in the token stream is replaced. The next instance may have already become a type (or we may have even changed the scope), so it wouldn't be semantically correct to reason about it.

[4]	That Clang parses Objective-C and various extensions like CUDA or OpenCL in the same code-base doesn't help in this respect.

Top-Down operator precedence (Pratt) parsing

2010-01-02T17:08:12-08:00

Introduction

Recursive-descent parsers have always interested me, and in the past year and a half I wrote a few articles on the topic. Here they are in chronological order:

The third article describes a method that combines RD parsing with a different algorithm for parsing expressions to achieve better results. This method is actually used in the real world, for example in GCC and Parrot (source).

An alternative parsing algorithm was discovered by Vaughan Pratt in 1973. Called Top Down Operator Precedence, it shares some features with the modified RD parser, but promises to simplify the code, as well as provide better performance. Recently it was popularized again by Douglas Crockford in his article, and employed by him in JSLint to parse JavaScript.

I encountered Crockford's article in the Beautiful Code book, but found it hard to understand. I could follow the code, but had a hard time grasping why the thing works. Recently I became interested in the topic again, tried to read the article once more, and again was stumped. Finally, by reading Pratt's original paper and Fredrik Lundh's excellent Python-based piece [1], I understood the algorithm.

So this article is my usual attempt to explain the topic to myself, making sure that when I forget how it works in a couple of months, I will have a simple way of remembering.

The fundamentals

Top down operator precedence parsing (TDOP from now on) is based on a few fundamental principles:

A "binding power" mechanism to handle precedence levels
A means of implementing different functionality of tokens depending on their position relative to their neighbors - prefix or infix.
As opposed to classic RD, where semantic actions are associated with grammar rules (BNF), TDOP associates them with tokens.

Binding power

Operator precedence and associativity is a fundamental topic to be handled by parsing techniques. TDOP handles this issue by assigning a "binding power" to each token it parses.

Consider a substring AEB where A takes a right argument, B a left, and E is an expression. Does E associate with A or with B? We define a numeric binding power for each operator. The operator with the higher binding power "wins" - gets E associated with it. Let's examine the expression:

1 + 2 * 4

Here it is once again with A, E, B identified:

1 + 2 * 4
  ^ ^ ^
  A E B

If we want to express the convention of multiplication having a higher precedence than addition, let's define the binding power (bp) of * to be 20 and that of + to be 10 (the numbers are arbitrary, what's important is that bp(*) > bp(+)). Thus, by the definition we've made above, the 2 will be associated with *, since its binding power is higher than that of +.

Prefix and infix operators

To parse the traditional infix-notation expression languages [2], we have to differentiate between the prefix form and infix form of tokens. The best example is the minus operator (-). In its infix form it is subtraction:

a = b - c  # a is b minus c

In its prefix form, it is negation:

a = -b   # a has b's magnitude but an opposite sign

To accommodate this difference, TDOP allows for different treatment of tokens in prefix and infix contexts. In TDOP terminology the handler of a token as prefix is called nud (for "null denotation") and the handler of a token as infix is called led (for "left denotation").

The TDOP algorithm

Here's a basic TDOP parser:

def expression(rbp=0):
    global token
    t = token
    token = next()
    left = t.nud()
    while rbp < token.lbp:
        t = token
        token = next()
        left = t.led(left)

    return left

class literal_token(object):
    def __init__(self, value):
        self.value = int(value)
    def nud(self):
        return self.value

class operator_add_token(object):
    lbp = 10
    def led(self, left):
        right = expression(10)
        return left + right

class operator_mul_token(object):
    lbp = 20
    def led(self, left):
        return left * expression(20)

class end_token(object):
    lbp = 0

We only have to augment it with some support code consisting of a simple tokenizer [3] and the parser driver:

import re
token_pat = re.compile(r"\s*(?:(\d+)|(.))")

def tokenize(program):
    for number, operator in token_pat.findall(program):
        if number:
            yield literal_token(number)
        elif operator == "+":
            yield operator_add_token()
        elif operator == "*":
            yield operator_mul_token()
        else:
            raise SyntaxError('unknown operator: %s' % operator)
    yield end_token()

def parse(program):
    global token, next
    next = tokenize(program).next
    token = next()
    return expression()

And we have a complete parser and evaluator for arithmetic expressions involving addition and multiplication.

Now let's figure out how it actually works. Note that the token classes have several attributes (not all classes have all kinds of attributes):

lbp - the left binding power of the operator. For an infix operator, it tells us how strongly the operator binds to the argument at its left.
nud - this is the prefix handler we talked about. In this simple parser it exists only for the literals (the numbers)
led - the infix handler.

The key to enlightenment here is to notice how the expression function works, and how the operator handlers call it, passing in a binding power.

When expression is called, it is given rbp, the right binding power of the operator that called it. It consumes tokens until it meets a token whose left binding power is equal to or lower than rbp. In other words, it collects all the tokens that bind together before returning to the operator that called it.

Handlers of operators call expression to process their arguments, providing it with their binding power to make sure it gets just the right tokens from the input.

Let's see, for example, how this parser handles the expression:

3 + 1 * 2 * 4 + 5

Here's the call trace of the parser's functions when parsing this expression:

<<expression with rbp 0>>
    <<literal nud = 3>>
    <<led of "+">>
    <<expression with rbp 10>>
       <<literal nud = 1>>
       <<led of "*">>
       <<expression with rbp 20>>
          <<literal nud = 2>>
       <<led of "*">>
       <<expression with rbp 20>>
          <<literal nud = 4>>
    <<led of "+">>
    <<expression with rbp 10>>
       <<literal nud = 5>>

The following diagram shows the calls made to expression on various recursion levels:

The arrows show the tokens on which each execution of expression works, and the number above them is the rbp given to expression for this call.

Apart from the initial call (with rbp=0) which spans the whole input, expression is called after each operator (by its led handler) to collect the right-side argument. As the diagram clearly shows, the binding power mechanism makes sure expression doesn't go "too far" - only as far as the precedence of the invoking operator allows. The best place to see it is the long arrow after the first +, that collects all the multiplication terms (they must be grouped together due to the higher precedence of *) and returns before the second + is encountered (when the precedence ceases being higher than its rbp).

Another way to look at it: at any point in the execution of the parser, there's an instance of expression for each precedence level that is active at that moment. This instance awaits the results of the higher-precedence instance and keeps going, until it has to stop itself and return its result to its caller.

If you understand this example, you understand TDOP parsing. All the rest is really just more of the same.
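To see the grouping rather than just the final number, here is a self-contained Python 3 variant of the same loop that builds nested tuples instead of computing values. It is a sketch and not the article's code: the `Parser` class, the `LBP` table and the tuple representation are my own choices, while the binding powers (10 for +, 20 for *) match the ones used above.

```python
import re

TOKEN_RE = re.compile(r"\s*(?:(\d+)|(.))")
LBP = {"+": 10, "*": 20}  # same binding powers as in the text


def tokenize(text):
    for number, op in TOKEN_RE.findall(text):
        if number:
            yield ("num", int(number))
        elif op in LBP:
            yield ("op", op)
        else:
            raise SyntaxError("unknown operator: %s" % op)
    yield ("end", None)


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.token = next(self.tokens)

    def advance(self):
        tok = self.token
        # Keep returning the end token once the input is exhausted.
        self.token = next(self.tokens, ("end", None))
        return tok

    def lbp(self):
        kind, value = self.token
        return LBP[value] if kind == "op" else 0

    def expression(self, rbp=0):
        kind, value = self.advance()
        if kind != "num":
            raise SyntaxError("expected a number")
        left = value  # the "nud" of a literal
        while rbp < self.lbp():
            _, op = self.advance()
            right = self.expression(LBP[op])  # the "led": parse the right operand
            left = (op, left, right)
        return left


tree = Parser("3 + 1 * 2 * 4 + 5").expression()
print(tree)
```

The resulting tuple is `('+', ('+', 3, ('*', ('*', 1, 2), 4)), 5)`, which mirrors the call trace shown earlier: both multiplications are collected by the call made after the first +, and the second + is handled back at the top level.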

Enhancing the parser

The parser presented so far is very rudimentary, so let's enhance it to be more realistic. First of all, what about unary operators?

As I've mentioned in the section on prefix and infix operators, TDOP makes an explicit distinction between the two, encoding it in the difference between the nud and led methods. Adding the subtraction operator handler [4]:

class operator_sub_token(object):
    lbp = 10
    def nud(self):
        return -expression(100)
    def led(self, left):
        return left - expression(10)

nud handles the unary (prefix) form of minus. It has no left argument (since it's prefix), and it negates its right argument. The binding power passed into expression is high, since unary minus has a high precedence (higher than multiplication). led handles the infix case similarly to the handlers of + and *.

Now we can handle expressions like:

3 - 2 + 4 * -5

And get a correct result (-19).

How about right-associative operators? Let's implement exponentiation (using the caret sign ^). To make the operation right-associative, we want the parser to treat subsequent exponentiation operators as sub-expressions of the first one. We can do that by calling expression in the handler of exponentiation with a rbp lower than the lbp of exponentiation:

class operator_pow_token(object):
    lbp = 30
    def led(self, left):
        return left ** expression(30 - 1)

When expression gets to the next ^ in its loop, it will find that rbp < token.lbp still holds, so it won't return the result right away but will collect the value of the sub-expression first.
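The effect of that one-off difference in rbp can be checked with a stripped-down sketch. It assumes a pre-tokenized list containing only integers and the ^ operator, and the `evaluate` helper with its `pow_rbp` parameter is hypothetical; it exists only to compare the two choices side by side.

```python
POW_LBP = 30  # left binding power of ^, as in the text


def evaluate(tokens, pow_rbp):
    """Evaluate integers separated by '^', passing pow_rbp on from the led step."""
    pos = 0

    def expression(rbp=0):
        nonlocal pos
        left = tokens[pos]  # nud: an integer literal
        pos += 1
        # '^' is the only operator here, so any remaining token has lbp 30.
        while pos < len(tokens) and rbp < POW_LBP:
            pos += 1  # consume '^'
            left = left ** expression(pow_rbp)
        return left

    return expression()


tokens = [2, "^", 3, "^", 2]
right_assoc = evaluate(tokens, pow_rbp=POW_LBP - 1)  # groups as 2 ^ (3 ^ 2)
left_assoc = evaluate(tokens, pow_rbp=POW_LBP)       # groups as (2 ^ 3) ^ 2
```

With rbp set to 29 the inner call keeps consuming the second ^, giving 512; with rbp set to 30 it returns immediately and the outer loop applies the second ^, giving 64.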

And how about grouping with parentheses? Since each token can execute actions in TDOP, this can be easily handled by adding actions to the ( token.

class operator_lparen_token(object):
    lbp = 0
    def nud(self):
        expr = expression()
        match(operator_rparen_token)
        return expr

class operator_rparen_token(object):
    lbp = 0

Where match is the usual RD primitive:

def match(tok=None):
    global token
    if tok and tok != type(token):
        raise SyntaxError('Expected %s' % tok)
    token = next()

Note that ( has lbp=0, meaning that it doesn't bind to any token on its left. It is treated as a prefix, and its nud collects an expression after the (, right until ) is encountered (which stops the expression parser since it also has lbp=0). Then it consumes the ) itself and returns the result of the expression [5].

Here's the code for the complete parser, handling addition, subtraction, multiplication, division, exponentiation and grouping by parentheses:

import re

token_pat = re.compile(r"\s*(?:(\d+)|(.))")

def tokenize(program):
    for number, operator in token_pat.findall(program):
        if number:
            yield literal_token(number)
        elif operator == "+":
            yield operator_add_token()
        elif operator == "-":
            yield operator_sub_token()
        elif operator == "*":
            yield operator_mul_token()
        elif operator == "/":
            yield operator_div_token()
        elif operator == "^":
            yield operator_pow_token()
        elif operator == '(':
            yield operator_lparen_token()
        elif operator == ')':
            yield operator_rparen_token()
        else:
            raise SyntaxError('unknown operator: %s' % operator)
    yield end_token()


def match(tok=None):
    global token
    if tok and tok != type(token):
        raise SyntaxError('Expected %s' % tok)
    token = next()


def parse(program):
    global token, next
    next = tokenize(program).next
    token = next()
    return expression()


def expression(rbp=0):
    global token
    t = token
    token = next()
    left = t.nud()
    while rbp < token.lbp:
        t = token
        token = next()
        left = t.led(left)
    return left

class literal_token(object):
    def __init__(self, value):
        self.value = int(value)
    def nud(self):
        return self.value

class operator_add_token(object):
    lbp = 10
    def nud(self):
        return expression(100)
    def led(self, left):
        right = expression(10)
        return left + right

class operator_sub_token(object):
    lbp = 10
    def nud(self):
        return -expression(100)
    def led(self, left):
        return left - expression(10)

class operator_mul_token(object):
    lbp = 20
    def led(self, left):
        return left * expression(20)

class operator_div_token(object):
    lbp = 20
    def led(self, left):
        return left / expression(20)

class operator_pow_token(object):
    lbp = 30
    def led(self, left):
        return left ** expression(30 - 1)

class operator_lparen_token(object):
    lbp = 0
    def nud(self):
        expr = expression()
        match(operator_rparen_token)
        return expr

class operator_rparen_token(object):
    lbp = 0

class end_token(object):
    lbp = 0

Sample usage:

>>> parse('3 * (2 + -4) ^ 4')
48

Closing words

When people consider parsing methods to implement, the debate usually goes between hand-coded RD parsers, auto-generated LL(k) parsers, or auto-generated LR parsers. TDOP is another alternative [6]. It's an original and unusual parsing method that can handle complex grammars (not limited to expressions), is relatively easy to code, and is quite fast.

What makes TDOP fast is that it doesn't need deep recursive descents to parse expressions - only a couple of calls per token are required, no matter how the grammar looks. If you trace the token actions in the example parser I presented in this article, you'll notice that on average, expression and one nud or led method are called per token, and that's about it. Fredrik Lundh compares the performance of TDOP with several other methods in his article, and gets very favorable results.

[1]	Which is also the source for most of the code in this article - so the copyright is Fredrik Lundh's

[2]	Like C, Java, Python. An example of a language that doesn't have infix notation is Lisp, which has prefix notation for expressions.

[3]	This tokenizer just recognizes numbers and single-character operators.

[4]	Note that to allow our parser to actually recognize `-`, an appropriate dispatcher should be added to the `tokenize` function - this is left as an exercise to the reader.

[5]	Quiz: is it useful having a `led` handler for a left paren as well? Hint: how would you implement function calls?

[6]	By the way, I have no idea where to categorize it on the LL/LR scale. Any ideas?

A recursive descent parser with an infix expression evaluator

2009-03-20T18:01:09-07:00

Last week I wrote about some of the inherent problems of recursive-descent parsers. An elegant solution to the operator associativity problem was shown, but another problem remained - and that is of the unwieldy handling of expressions, mainly performance-wise.

Here I want to present one alternative to the pure-RD approach, and that is intermixing RD with another parsing method.

The code

I'll begin by pointing to the code for this article. It contains several Python files and a readme.txt explaining what is what. Throughout the article I'll present short snippets from the code, but it's encouraged to run it on your own. The code is self-contained and only requires Python (version 2.5) to run.

Extending the grammar

To illuminate some of the points I'm presenting better, I've greatly extended the EBNF grammar we'll be parsing. Here's the new grammar (taken from the top of the rd_parser_ebnf.py in the code .zip):

# EBNF:
#
# <stmt>        : <assign_stmt>
#               | <if_stmt>
#               | <cmp_expr>
#
# <assign_stmt> : set <id> = <cmp_expr>
#
## Note 'else' binds to the innermost 'if', like in C
#
# <if_stmt>     : if <cmp_expr> then <stmt> [else <stmt>]
#
# <cmp_expr>    : <bitor_expr> [== <bitor_expr>]
#               | <bitor_expr> [!= <bitor_expr>]
#               | <bitor_expr> [> <bitor_expr>]
#               | <bitor_expr> [< <bitor_expr>]
#               | <bitor_expr> [>= <bitor_expr>]
#               | <bitor_expr> [<= <bitor_expr>]
#
# <bitor_expr>  : <bitxor_expr> {| <bitxor_expr>}
#
# <bitxor_expr> : <bitand_expr> {^ <bitand_expr>}
#
# <bitand_expr> : <shift_expr> {& <shift_expr>}
#
# <shift_expr>  : <arith_expr> {<< <arith_expr>}
#               | <arith_expr> {>> <arith_expr>}
#
# <arith_expr>  : <term> {+ <term>}
#               | <term> {- <term>}
#
# <term>        : <power> {* <power>}
#               | <power> {/ <power>}
#
# <power>       : <factor> ** <power>
#               | <factor>
#
# <factor>      : <id>
#               | <number>
#               | - <factor>
#               | ( <cmp_expr> )
#
# <id>          : [a-zA-Z_]\w*
# <number>      : \d+

As you can see, this simple calculator is starting to approach a real programming language, as it supports a plethora of mathematical and logical expressions, as well as conditional statements (if ... then ... else) and assignments. I've added a simplistic "prompt" so you can experiment with the calculator from the command line:

D:\zzz\rd_parser_calc>rd_parser_ebnf.py -p
Welcome to the calculator. Press Ctrl+C to exit.
--> set x = 2 + 2 * 3
8
--> set y = (x - 1) * (x - 2)
42
--> if y > x then set y = x else set x = y
8
--> x
8
--> y
8
--> x ** ((y - 10) * -3)
262144
--> ... Thanks for using the calculator.

Note that since a separate expression "level" is required for each precedence, the resulting code is somewhat repetitive. I'll get back to this point later on.

Evaluating infix expressions

An alternative method of evaluating expressions is required, then. Luckily, such a need arose early enough (in the 1950s and 60s, when the first compilers and interpreters were constructed) and some luminaries examined this problem in detail. In particular, Edsger W. Dijkstra proposed an efficient and intuitive algorithm for converting from infix notation to RPN, called the Shunting Yard algorithm.

I will not describe the algorithm here, as it's been done several times already. If the Wikipedia article is not enough, here's another good source (which I've actually used as the basis for my implementation).

The algorithm employs two stacks to resolve the precedence dilemmas of infix notation. One stack is for storing operators of relatively low precedence that await results from computations with higher precedence. The other stack keeps the result accumulated so far. The result can either be a RPN expression, an AST or just the computed result (a number) of the computation.
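For readers who want a quick feel for the two stacks before opening the linked descriptions, here is a minimal sketch that computes a number directly. It is not the code from the article's .zip: it assumes a pre-tokenized list, handles only binary operators without parentheses, skips the sentinel, and the `OPS` table with its precedence values is illustrative.

```python
import operator

# operator -> (precedence, right_assoc, function); values are illustrative
OPS = {
    "+": (40, False, operator.add),
    "-": (40, False, operator.sub),
    "*": (50, False, operator.mul),
    "**": (70, True, operator.pow),
}


def shunting_yard_eval(tokens):
    """Evaluate a flat list of ints and operator strings."""
    op_stack = []   # operators waiting for higher-precedence work to finish
    res_stack = []  # results accumulated so far

    def reduce_top():
        _, _, func = OPS[op_stack.pop()]
        right = res_stack.pop()
        left = res_stack.pop()
        res_stack.append(func(left, right))

    for tok in tokens:
        if isinstance(tok, int):
            res_stack.append(tok)
            continue
        prec, right_assoc, _ = OPS[tok]
        # Apply every waiting operator that binds at least as tightly
        # (strictly tighter if the incoming operator is right-associative).
        while op_stack:
            top_prec = OPS[op_stack[-1]][0]
            if top_prec > prec or (top_prec == prec and not right_assoc):
                reduce_top()
            else:
                break
        op_stack.append(tok)

    while op_stack:
        reduce_top()
    return res_stack[-1]


mixed = shunting_yard_eval([2, "+", 2, "*", 3])
left_assoc = shunting_yard_eval([5, "-", 1, "-", 2])
right_assoc = shunting_yard_eval([2, "**", 3, "**", 2])
```

The three sample inputs evaluate to 8, 2 and 512 respectively, covering precedence, left associativity and right associativity.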

In my code, the file rd_parser_infix_expr.py implements a hybrid parser, using Shunting Yard to evaluate expressions and a top-level RD parser for statements and combining everything together. It's instructive to examine the implementation and see how things fit together.

The grammar this parser accepts is exactly the same as the pure RD EBNF parser presented earlier. The statements (assign_stmt, if_stmt, and stmt) are evaluated by traditional RD, but getting deeper into expressions is done with an "infix evaluator", the gateway to which is the _infix_eval method [1]:

def _infix_eval(self):
    """ Run the infix evaluator and return the result.
    """
    self.op_stack = []
    self.res_stack = []

    self.op_stack.append(self._sentinel)
    self._infix_eval_expr()
    return self.res_stack[-1]

This method prepares the Shunting Yard stacks and begins evaluating the expression, terminating with returning its results.

Note that the connection to the RD parser is seamless. When _infix_eval is called, it assumes that the current token is the beginning of an expression (just like any RD rule), and consumes as many tokens as required to parse the full expression before returning the result.

The rest of the implementation (the _infix_eval_expr, _infix_eval_atom, _push_op and _pop_op methods) is pretty much a word by word translation of the algorithm described in this article into Python.

Adding expressions

Here's a big advantage of this hybrid parser: adding new expressions and/or changing precedence levels is much simpler and requires far less code. In the pure RD parser, the operators and their precedences are determined by the structure of recursive calls between methods. Adding a new operator requires a new method, as well as modifying some of the other methods [2]. Changing the precedence of some operator is also troublesome and requires moving around lots of code.

Not so in the infix expression parser. Once the Shunting Yard machinery is in place, all we have to do to add new operators or modify existing ones is update the _ops table:

_ops = {
    'u-':   Op('unary -', operator.neg, 90, unary=True),
    '**':   Op('**', operator.pow, 70, right_assoc=True),
    '*':    Op('*', operator.mul, 50),
    '/':    Op('/', operator.div, 50),
    '+':    Op('+', operator.add, 40),
    '-':    Op('-', operator.sub, 40),
    '<<':   Op('<<', operator.lshift, 35),
    '>>':   Op('>>', operator.rshift, 35),
    '&':    Op('&', operator.and_, 30),
    '^':    Op('^', operator.xor, 29),
    '|':    Op('|', operator.or_, 28),
    '>':    Op('>', operator.gt, 20),
    '>=':   Op('>=', operator.ge, 20),
    '<':    Op('<', operator.lt, 20),
    '<=':   Op('<=', operator.le, 20),
    '==':   Op('==', operator.eq, 15),
    '!=':   Op('!=', operator.ne, 15),
}

I also find this table much more descriptive in the sense of understanding how the operators relate to one another than the parallel 9 methods required to implement them in the pure RD version (rd_parser_ebnf.py).

Performance

Now here is the funny thing. My initial motivation for examining the infix expression hybrid was the allegedly poor performance of the RD parser for parsing expressions (as described in the previous article). But the performance hasn't improved! In fact, the new hybrid parser is a bit slower than the pure RD parser!

And the annoying thing is that it's entirely unclear to me how to optimize it, since profiling shows that the runtime divides rather evenly between the various methods of the algorithm. Yes, the pure RD parser requires the full precedence-chain of methods called for each single terminal, but the infix version has more method calls in total.

If anything, this has been a lesson in optimization, as profiling initially showed that the vast majority of the time is spent in the lexer [3]. So I've managed to optimize my lexer (by precompiling all its regexes into a single large one using alternation), which greatly reduced the runtime.
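The single-regex trick can be sketched as follows. This is an assumption-laden illustration of the technique rather than the lexer shipped with the article: the token names, the `TOKEN_SPEC` list and the operator set are made up for the example, and keywords such as `set` are simply reported as identifiers.

```python
import re

# (token name, pattern); longer operators come before their one-character prefixes
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID", r"[a-zA-Z_]\w*"),
    ("OP", r"\*\*|<<|>>|[<>=!]=|[-+*/&|^<>=()]"),
    ("SKIP", r"\s+"),
]

# One big alternation of named groups, compiled once.
MASTER_RE = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))


def tokenize(text):
    pos = 0
    tokens = []
    while pos < len(text):
        m = MASTER_RE.match(text, pos)
        if m is None:
            raise SyntaxError("unexpected character %r at %d" % (text[pos], pos))
        if m.lastgroup != "SKIP":
            # lastgroup names the alternative that matched
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


tokens = tokenize("set x = 2 ** 3 >= y")
```

The regex engine tries the alternatives in a single pass at each position, so the Python-level loop runs once per token instead of once per candidate pattern per token.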

Conclusion

This article has presented an alternative to the pure recursive-descent parser. The hybrid parser developed here combines RD with infix expression evaluation using the Shunting Yard algorithm.

We've seen that the new code is more manageable for operator-rich grammars. If even more operators are to be added (such as the full set of operators supported by C), they are much simpler to add to this parser, and the operator table is a single place summarizing the operators, their associativities and precedences, which makes the parser more readable.

However, this has not made the parser any faster. The pure-RD implementation is lean enough to be efficient even when the grammar consists of many precedence levels. This is an important lesson in optimization - it's difficult to assess the relative runtimes of complex chunks of code in advance, without actually trying them out and profiling them.

[1]	It would be a swell idea to read the description of the algorithm and have an intuitive understanding of it from this point and on in the article.

[2]	Suppose we had no multiplication and division and had to add the `term` rule. In addition to writing the code for the new rule, we must modify the `arith_expr` rule to now call `term` instead of `power`.

[3]	Which makes lots of sense, as it's well known that lexing/tokenization is usually the most time consuming stage of parsing. This is because the lexer has to examine every single character of the input, while the parser above it works on the level of whole tokens.

Some problems of recursive descent parsers

2009-03-14T11:24:39-07:00

Reminder - recursive descent (RD) parsers

Here's an article I wrote on the subject a few months ago. It provides a good introduction on how RD parsers are constructed and what grammars they can parse.

Here I want to focus on a couple of problems with the RD parser developed in that article, and propose solutions.

Problem #1: operator associativity

If you recall from the previous article, the expr rule of the parser looks like this (BNF notation):

<expr>    : <term> + <expr>
          | <term> - <expr>
          | <term>

It's built this way (expr on the right-hand side of the expression, term on the left-hand side), to avoid left-recursion in the grammar, which can crash a RD parser by sending it wheeling in an infinite loop.

But as I hinted in the footnotes (and some readers caught on in the comments), this injects an associativity problem into the grammar. Let's see why.

Wikipedia is much better than me at explaining what operator associativity is, so I'll assume you've read and understood it.

In short, however, left associativity of the minus operator means that 5 - 1 - 2 = (5 - 1) - 2 and not 5 - (1 - 2) (which returns a different result).

But if you run 5 - 1 - 2 in the parser with the above BNF for expr, you'll get 6 instead of 2. So what went wrong?

The problem is in the grammar definition (BNF) itself. The way the expr rule is defined makes it inherently right-associative instead of left-associative. The hierarchy of the rules implicitly defines their associativity, because it defines what will be grouped together. To understand it better, perhaps the code implementing the expr rule will help:

def _expr(self):
    lval = self._term()

    if self.cur_token.type == '+':
        self._match('+')
        op = lambda a, b: a + b
    elif self.cur_token.type == '-':
        self._match('-')
        op = lambda a, b: a - b
    else:
        print 'returning lval = %s' % lval
        return lval

    rval = self._expr()
    print 'lval = %s, rval = %s, res = %s' % (
        lval, rval, op(lval, rval))
    return op(lval, rval)

Note that the first term is parsed, and then the rule recursively calls itself for the next one. So the expression is being built from right to left, and this causes its right-associativity.

As you can see, I've added a couple of printouts to better show what's going on. When run on the expression 5 - 1 - 2, this prints:

returning lval = 2
lval = 1, rval = 2, res = -1
lval = 5, rval = -1, res = 6

We clearly see the problem here. The actual returns are done from right to left because of the recursion.

Note that this grammar evaluates addition, multiplication, subtraction and division in a right-associative way. This causes problems for both subtraction and division, but not for addition and multiplication, because these operations compute the same whether right-to-left or left-to-right [1].
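
To make the difference concrete, here is a small Python 3 sketch (independent of the parser code) that groups the same operands both ways. The helper names `fold_left` and `fold_right` are made up for this illustration.

```python
from functools import reduce
import operator


def fold_left(op, values):
    """Group as ((a op b) op c): left-associative."""
    return reduce(op, values)


def fold_right(op, values):
    """Group as (a op (b op c)): right-associative."""
    result = values[-1]
    for v in reversed(values[:-1]):
        result = op(v, result)
    return result


values = [5, 1, 2]
# Subtraction depends on the grouping...
print(fold_left(operator.sub, values), fold_right(operator.sub, values))  # prints: 2 6
# ...while addition does not.
print(fold_left(operator.add, values), fold_right(operator.add, values))  # prints: 8 8
```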

A solution for the associativity problem

I suppose the problem can be solved by rewriting the BNF rules in some sophisticated way that makes them both left-associative and not left-recursive [2], but I'll pick another way.

BNF is somewhat limiting, since it doesn't really allow many options when defining rules. All the rules must have a very strict structure, and if you want to customize something you must resort to defining sub-rules and referencing them recursively.

Enter EBNF. It was developed to fix some of the deficiencies of plain BNF. One of those is the addition of repetition of sub-rules. For instance, we can write the expr rule in EBNF as follows:

<expr>    : <term> {+ <term>}
          | <term> {- <term>}

Note the braces { ... }. In EBNF, these mean "repeated 0 or more times". This is still a LL(1) grammar, but now it's expressed a bit more comfortably. Such a representation is very suitable for coding, because the repetition can be expressed naturally with a loop.

Here's a re-implementation of the expr rule using this idiom:

def _expr(self):
    lval = self._term()

    while ( self.cur_token.type == '+' or
            self.cur_token.type == '-'):
        if self.cur_token.type == '+':
            self._match('+')
            lval += self._term()
        elif self.cur_token.type == '-':
            self._match('-')
            lval -= self._term()

    return lval

Note the while loop "eating up" all successive terms in the expression and accumulating the result in the expected left-to-right manner. Now the computation 5 - 1 - 2 will correctly produce 2.
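
For readers who want to try the loop idiom without the full parser and lexer, here is a stripped-down, self-contained Python 3 sketch. It is not the code from the download: `evaluate` is a hypothetical helper, terms are plain integers, and there is no error handling.

```python
import re


def evaluate(text):
    """Evaluate integers joined by + and -, grouping left to right."""
    tokens = re.findall(r"\d+|[+-]", text)
    pos = 0

    def term():
        # In this sketch a term is just a single integer token.
        nonlocal pos
        value = int(tokens[pos])
        pos += 1
        return value

    lval = term()
    # Keep consuming "<op> <term>" pairs, folding each into lval.
    while pos < len(tokens) and tokens[pos] in ("+", "-"):
        op = tokens[pos]
        pos += 1
        if op == "+":
            lval += term()
        else:
            lval -= term()
    return lval
```

Here `evaluate("5 - 1 - 2")` returns 2. Note that the loop accepts any mix of plus and minus signs, so `evaluate("5 - 1 + 2")` returns 6.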

The code

This is a good place to refer to the code. In here you will find the source of both the old (BNF-based) parser and the new (EBNF-based) one, along with the lexer module that implements the tokenizer. Each of the parsers is self contained and can be used separately. Note that they were developed and tested with Python 2.5.

Right-associative operators

Some operators are inherently right-associative. Exponentiation, for example. 2^3^2 = 2^(3^2) = 512, and not (2^3)^2 (which equals 64).

We can leave these operators defined as before, using a recursive rule that naturally results in right-associativity. Here's the code of the power rule that was added to the EBNF-based parser to support exponentiation:

# <power>   : <factor> ** <power>
#           | <factor>
#
def _power(self):
    lval = self._factor()

    if self.cur_token.type == '**':
        self._match('**')
        lval **= self._power()

    return lval
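
The same recursive shape can be tried in isolation. This is a minimal Python 3 sketch under simplifying assumptions: the input is a pre-split token list, a factor is just an integer, and the function returns the next position explicitly instead of keeping parser state.

```python
def power(tokens, pos=0):
    """Parse <factor> ** <power> | <factor>; return (value, next_pos)."""
    lval = int(tokens[pos])  # <factor> is a plain integer in this sketch
    pos += 1
    if pos < len(tokens) and tokens[pos] == "**":
        # Recurse for the whole right-hand side before applying the operator.
        rval, pos = power(tokens, pos + 1)
        lval **= rval
    return lval, pos
```

Calling `power("2 ** 3 ** 2".split())` returns `(512, 5)`: the rightmost `3 ** 2` is computed first, and only then is 2 raised to the result.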

Intermission

We now have a correct recursive descent parser that uses EBNF-based rules to parse expressions with the desired associativity for each operator. This parser can be readily employed to parse simple languages - it is production-use ready. The next "problem" I present only has to do with the parser's efficiency, so it is probably of no concern unless performance is crucial.

Problem #2: efficiency

There's an inherent performance problem with recursive-descent parsers when dealing with expressions. This problem stems from the need to define operator precedence, and in RD parsers the only way to define this precedence is by using recursive sub-rules. For example (from the EBNF-based code):

<expr>    : <term> {+ <term>}
          | <term> {- <term>}
<term>    : <power> {* <power>}
          | <power> {/ <power>}

The nesting of these rules defines the relative precedence of addition and multiplication. It tells the parser: between plus signs, dive into the expression and collect all sub-terms connected by multiply signs. In other words, it tells it to group the expression: 5 + 2 * 2 as 5 + (2 * 2) and not as (5 + 2) * 2.

To see the problem this nesting causes, I've inserted simple printouts into each of the expr, term, power and factor rules to show which functions get called while parsing. Let's see what happens when the trivial expression 42 is parsed:

expr called with NUMBER(42) at 0
term called with NUMBER(42) at 0
power called with NUMBER(42) at 0
factor called with NUMBER(42) at 0

Yikes!!! 4 function calls just to parse the single-token input 42! Unfortunately, while this problem may look simple on the surface, it is not. There's simply no other way to express precedence in RD parsers - you have to use nested rules, and this nesting turns out to be inefficient for parsing expressions.
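
The effect is easy to reproduce. Below is a compact Python 3 sketch of the same four-level rule chain that records which rule functions get called. It is a toy written for this illustration (regex tokenization, no error handling), not the parser from the download.

```python
import re


def parse(text):
    """Evaluate text and return (value, list of rule names called)."""
    tokens = re.findall(r"\d+|\*\*|[-+*/()]", text)
    pos = 0
    calls = []

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def factor():
        calls.append("factor")
        tok = peek()
        advance()
        if tok == "(":
            value = expr()
            advance()  # skip the closing ")"
            return value
        return int(tok)

    def power():
        calls.append("power")
        lval = factor()
        if peek() == "**":
            advance()
            lval **= power()
        return lval

    def term():
        calls.append("term")
        lval = power()
        while peek() in ("*", "/"):
            op = peek()
            advance()
            rval = power()
            lval = lval * rval if op == "*" else lval / rval
        return lval

    def expr():
        calls.append("expr")
        lval = term()
        while peek() in ("+", "-"):
            op = peek()
            advance()
            rval = term()
            lval = lval + rval if op == "+" else lval - rval
        return lval

    return expr(), calls
```

`parse("42")` returns `(42, ['expr', 'term', 'power', 'factor'])`, and the five-token input `5 + 2 * 2` already takes 9 rule calls. Every extra precedence level adds one more call per operand.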

The solution to this problem is to use a hybrid parser instead of a pure RD one. Some algorithms were developed to efficiently parse infix expressions. This article provides a good survey. One such algorithm can be combined with RD to provide a general-purpose parser for both expressions and higher programming language constructs.

In a future article I will discuss an implementation of such a parser.

[1]	To be more precise, addition and multiplication are associative binary operators in the mathematical sense.

[2]	But I'm too lazy to look for such a way at the moment. Let me know if you find it.

Recursive descent, LL and predictive parsers

2008-09-26T12:29:10-07:00

Introduction

Although I've written some recursive-descent (RD) parsers by hand, the theory behind them eluded me for some time. I had a good understanding of the theory behind bottom-up LR parsers, and have used tools (like Yacc and PLY) to generate LALR parsers for languages, but I didn't really dig into the books about LL.

This week I've finally decided to understand what's going on. I tried to write a simple RD parser in Python (previously I've written RD parsers in C++ and Lisp), and ran into a problem which got me thinking hard about LL parsers. So, I've opened the Dragon Book, and now I know much more about LL(1), LL(k), predictive, recursive-descent parsers with and without backtracking, and what's between them.

This article is a summary of my findings, written for myself to read in a few months when I forget it :-)

Recursive descent parsers

From Wikipedia:

A recursive descent parser is a top-down parser built from a set of mutually-recursive procedures (or a non-recursive equivalent) where each such procedure usually implements one of the production rules of the grammar. Thus the structure of the resulting program closely mirrors that of the grammar it recognizes.

RD parsers are the most general form of top-down parsing, and the most popular type of parsers to write by hand. However, being so general, they have several problems, like requiring backtracking (which is difficult to code correctly and efficiently).

Usually, it is enough to use less general and powerful parsers for all practical needs, like parsing programming languages (and domain specific languages). This is where LL parsers come in.

LL parsers

An LL parser is a top-down parser for a subset of the context-free grammars. It parses the input from Left to right, and constructs a Leftmost derivation of the sentence (hence LL, compared with LR parser). The class of grammars which are parsable in this way is known as the LL grammars.

LL parsers are further classified by the amount of lookahead they need. LL(1) parsers require 1 token of lookahead, LL(k) require k, and so on. Usually, LL(1) is enough for most practical needs.

LL parsers are also called predictive, because it's possible to predict the exact path to take from a certain number of lookahead symbols, without backtracking.

The example

This week I tried to construct a RD parser for this simple calculator grammar:

<expr>      :=  <term> + <expr>
            |   <term> - <expr>
            |   <term>
<term>      :=  <factor> * <term>
            |   <factor> / <term>
            |   <factor>
<factor>    :=  <number>
            |   <id>
            |   ( <expr> )
<number>    :=  \d+
<id>        :=  [a-zA-Z_]\w*

This grammar is LL(1) and hence parseable by a simple predictive parser with a single token lookahead. However, I then tried to add the following rule to allow input of commands into an interactive calculator prompt:

<command>   :=  <expr>
            |   <id> = <expr>

With this rule added, the grammar is no longer LL(1), because looking at the first token I can't say which one of the two options of <command> it is: both can start with an <id>. In order to differentiate between an assignment and a single expression, I must see the = token, and for this I need to see 2 tokens forward, and not just one. So, this grammar becomes LL(2).
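
The two-token decision can be sketched in a few lines of Python 3. The function and regex below are assumptions made for this illustration and are not taken from the downloadable parser. Note that `findall` silently skips characters it doesn't recognize, which is acceptable only in a sketch.

```python
import re

TOKEN_RE = re.compile(r"\s*(\d+|[A-Za-z_]\w*|[-+*/()=])")
ID_RE = re.compile(r"[A-Za-z_]\w*")


def tokenize(text):
    """Split text into number, identifier and operator tokens."""
    return TOKEN_RE.findall(text)


def classify_command(tokens):
    """Pick a <command> alternative by peeking at the first two tokens."""
    starts_with_id = bool(tokens) and ID_RE.fullmatch(tokens[0]) is not None
    # One token is not enough: 'x' can start either alternative.
    if starts_with_id and len(tokens) > 1 and tokens[1] == "=":
        return "assignment"
    return "expression"
```

For example, `classify_command(tokenize("x = 5 + 2"))` returns `"assignment"`, while `classify_command(tokenize("x + 5"))` returns `"expression"`, even though both inputs begin with the same token.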

LL(2) grammars are much more difficult to code by hand than LL(1) grammars, and they are also much more difficult to turn into code automatically by parser generators. This is probably why for most languages LL(1) suffices.

LL parser generators

Unlike LR parsers, for which everyone uses parser generators [1], LL parsers are commonly written by hand. It even appears that some of the most popular compilers (such as GCC) use hand-written RD parsers to parse whole languages like C. As with anything, you get maximal flexibility and efficiency when you hand-code something, as you're not constrained by the limitations of the tools and libraries you're using.

Indeed, writing a simple predictive parser as a set of mutually recursive routines is simple, and can also be very educational. If you have a very small parsing task to perform, perhaps you'll be better off hand-coding a RD parser.

However, automatic tools for generating LL parsers exist. The most popular are probably ANTLR and Boost.Spirit. I haven't tried them, but both are widely used to write complex parsers. Both have a clear advantage over hand-written parsers - they can generate parsers with any lookahead length, guessing the required length from the grammar. Hand-written parsers, as I mentioned earlier, get much more complex for any k > 1.

Left recursion

Had my expr rule been written like this:

<expr>      :=  <expr> + <term>
            |   <expr> - <term>
            |   <term>

It would have been left recursive, because the non-terminal expr appears as the first (leftmost) symbol in its own production. Since RD parsers work top-down, to recognize <expr> it has to first recognize <expr>, but for that it again has to recognize <expr> and so on, ad infinitum. This infinite recursion is the reason why RD parsers can't handle left recursion.
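
A naive Python 3 transcription of the left-recursive rule shows the failure directly. This is a deliberately broken sketch: the function calls itself before consuming any token, so everything after the first line is never reached.

```python
def expr(tokens, pos=0):
    """Direct transcription of <expr> := <expr> + <term> | <expr> - <term>."""
    # To recognize <expr> we first try to recognize <expr>, at the same
    # position, without having consumed any input.
    lval, pos = expr(tokens, pos)
    op = tokens[pos]
    rval = int(tokens[pos + 1])
    return (lval + rval if op == "+" else lval - rval), pos + 2


try:
    expr(["1", "+", "2"])
    outcome = "parsed"
except RecursionError:
    outcome = "RecursionError"

print(outcome)  # prints: RecursionError
```

Python stops the runaway recursion once its recursion limit is hit. In a language without such a limit, the same parser would overflow the stack.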

Left recursion can also be indirect:

<a>   :=  <b> <x>
      |   <c>
<b>   :=  <a> <y>
      |   <d>

Here we can have the infinite derivation: <a> -> <b> <x> -> <a> <y> <x> and so on.

Techniques exist to remove left recursion from some grammars. For more information see this. The grammar shown in the example above had left-recursion removed from it [2].
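
One such technique is the textbook transformation for immediate left recursion: a rule A := A α | β is rewritten as A := β A' and A' := α A' | ε. The Python 3 sketch below applies it to a single rule. The function name, the list-of-symbols representation and the `_tail` suffix are assumptions chosen for this illustration. It is also not necessarily how the grammar in this article was derived, since that grammar uses a right-recursive form instead.

```python
def remove_immediate_left_recursion(name, alternatives):
    """Rewrite one rule; each alternative is a list of symbol names.

    Returns a dict mapping rule names to their new alternatives.
    An empty list stands for the empty production (epsilon).
    """
    # Split into A := A alpha (keeping only alpha) and A := beta.
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == name]
    others = [alt for alt in alternatives if not alt or alt[0] != name]
    if not recursive:
        return {name: alternatives}  # nothing to do

    tail = name + "_tail"  # hypothetical name for the helper rule A'
    return {
        name: [alt + [tail] for alt in others],
        tail: [alt + [tail] for alt in recursive] + [[]],
    }


rules = remove_immediate_left_recursion(
    "expr", [["expr", "+", "term"], ["expr", "-", "term"], ["term"]])
```

For the left-recursive `expr` rule above, `rules` becomes `expr := term expr_tail` and `expr_tail := + term expr_tail | - term expr_tail | ε`. Indirect left recursion needs an extra substitution step first, which this sketch doesn't cover.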

Code

A simple recursive descent parser for a calculator, written in Python, can be downloaded here. It also includes a fairly generic Lexer class that implements regex-based tokenization of a string.

[1]	Since `LR` parsers are table-based, and the tables are too tedious and unwieldy to write by hand.

[2]	Which, however, has left it with a slight operator associativity problem. Finding it is left as an exercise for the reader.

Parse::RecDescent vs. YACC

2004-01-29T15:08:00-08:00

Parse::RecDescent (RD) looks like the best parsing option in Perl for me, for two reasons. First, it is very lightweight - only one .pm file to carry around. Second, I like recursive descent parsing :-) RD parsing is, IMHO, easier to visualize and understand. Looking at the grammar (BNF) it is immediately obvious how each rule will be parsed given the input. This is very nice for grammar debugging.

Yesterday was my first serious experience with the RD module (historically, I did a lot of Yacc (in C), and coded some simple recursive descent parsers by hand). The module works nicely, and is easy to learn and understand. Some notable differences from Yacc:

Integrated lexing. Very nice! It looks much more natural this way, and there's no need for extra headache with Lex linkage. Tokens are defined as simple regex rules in the grammar itself.
Some little things that make life easier and more pleasant. For example, the rule quantifiers (s), (s?) etc.
Left recursion problem. Hits blatantly when arithmetic expressions must be parsed. A different mindset must be employed when comparing with Yacc.

Additionally, RD has a very useful trace option that traces the parsing and lets you see where things went wrong with the grammar.