Parsing expressions by precedence climbing
August 2nd, 2012 at 5:48 amI’ve written previously about the problem recursive descent parsers have with expressions, especially when the language has multiple levels of operator precedence.
There are several ways to attack this problem. The Wikipedia article on operatorprecedence parsers mentions three algorithms: Shunting Yard, topdown operator precedence (TDOP) and precedence climbing. I have already covered Shunting Yard and TDOP in this blog. Here I aim to present the third method (and the one that actually ends up being used a lot in practice) – precedence climbing.
Precedence climbing – what it aims to achieve
It’s not necessary to be familiar with the other algorithms for expression parsing in order to understand precedence climbing. In fact, I think that precedence climbing is the simplest of them all. To explain it, I want to first present what the algorithm is trying to achieve. After this, I will explain how it does this, and finally will present a fully functional implementation in Python.
So the basic goal of the algorithm is the following: treat an expression as a bunch of nested subexpressions, where each subexpression has in common the lowest precedence level of the the operators it contains.
Here’s a simple example:
2 + 3 * 4 * 5  6
Assuming that the precedence of + (and ) is 1 and the precedence of * (and /) is 2, we have:
2 + 3 * 4 * 5  6
 : prec 1
 : prec 2
The subexpression multiplying the three numbers has a minimal precedence of 2. The subexpression spanning the whole original expression has a minimal precedence of 1.
Here’s a more complex example, adding a power operator ^ with precedence 3:
2 + 3 ^ 2 * 3 + 4
 : prec 1
 : prec 2
 : prec 3
Associativity
Binary operators, in addition to precedence, also have the concept of associativity. Simply put, left associative operators stick to the left stronger than to the right; right associative operators vice versa.
Some examples. Since addition is left associative, this:
2 + 3 + 4
Is equivalent to this:
(2 + 3) + 4
On the other hand, power (exponentiation) is right associative. This:
2 ^ 3 ^ 4
Is equivalent to this:
2 ^ (3 ^ 4)
The precedence climbing algorithm also needs to handle associativity correctly.
Nested parenthesized subexpressions
Finally, we all know that parentheses can be used to explicitly group subexpressions, beating operator precedence. So the following expression computes the addition before the multiplication:
2 * (3 + 5) * 7
As we’ll see, the algorithm has a special provision to cleverly handle nested subexpressions.
Precedence climbing – how it actually works
First let’s define some terms. Atoms are either numbers or parenthesized expressions. Expressions consist of atoms connected by binary operators [1]. Note how these two terms are mutually dependent. This is normal in the land of grammars and parsers.
The algorithm is operatorguided. Its fundamental step is to consume the next atom and look at the operator following it. If the operator has precedence lower than the lowest acceptable for the current step, the algorithm returns. Otherwise, it calls itself in a loop to handle the subexpression. In pseudocode, it looks like this [2]:
compute_expr(min_prec):
result = compute_atom()
while cur token is a binary operator with precedence >= min_prec:
prec, assoc = precedence and associativity of current token
if assoc is left:
next_min_prec = prec + 1
else:
next_min_prec = prec
rhs = compute_expr(next_min_prec)
result = compute operator(result, rhs)
return result
Each recursive call here handles a sequence of operatorconnected atoms sharing the same minimal precedence.
An example
To get a feel for how the algorithm works, let’s start with an example:
2 + 3 ^ 2 * 3 + 4
It’s recommended to follow the execution of the algorithm through this expression with, on paper. The computation is kicked off by calling compute_expr(1), because 1 is the minimal operator precedence among all operators we’ve defined. Here is the "call tree" the algorithm produces for this expression:
* compute_expr(1) # Initial call on the whole expression
* compute_atom() > 2
* compute_expr(2) # Loop entered, operator '+'
* compute_atom() > 3
* compute_expr(3)
* compute_atom() > 2
* result > 2 # Loop not entered for '*' (prec < '^')
* result = 3 ^ 2 > 9
* compute_expr(3)
* compute_atom() > 3
* result > 3 # Loop not entered for '+' (prec < '*')
* result = 9 * 3 > 27
* result = 2 + 27 > 29
* compute_expr(2) # Loop entered, operator '+'
* compute_atom() > 4
* result > 4 # Loop not entered  end of expression
* result = 29 + 4 > 33
Handling precedence
Note that the algorithm makes one recursive call per binary operator. Some of these calls are short lived – they will only consume an atom and return it because the while loop is not entered (this happens on the second 2, as well as on the second 3 in the example expression above). Some are longer lived. The initial call to compute_expr will compute the whole expression.
The while loop is the essential ingredient here. It’s the thing that makes sure that the current compute_expr call handles all consecutive operators with the given minimal precedence before exiting.
Handling associativity
In my opinion, one of the coolest aspects of this algorithm is the simple and elegant way it handles associativity. It’s all in that condition that either sets the minimal precedence for the next call to the current one, or current one plus one.
Here’s how this works. Assume we have this subexpression somewhere:
8 * 9 * 10
^

The arrow marks where the compute_expr call is, having entered the while loop. prec is 2. Since the associativity of * is left, next_min_prec is set to 3. The recursive call to compute_expr(3), after consuming an atom, sees the next * token:
8 * 9 * 10
^

Since the precedence of * is 2, while min_prec is 3, the while loop never runs and the call returns. So the original compute_expr will get to handle the second multiplication, not the internal call. Essentially, this means that the expression is grouped as follows:
(8 * 9) * 10
Which is exactly what we want from left associativity.
In contrast, for this expression:
8 ^ 9 ^ 10
The precedence of ^ is 3, and since it’s right associative, the min_prec for the recursive call stays 3. This will mean that the recursive call will consume the next ^ operator before returning to the original compute_expr, grouping the expression as follows:
8 ^ (9 ^ 10)
Handling subexpressions
The algorithm pseudocode presented above doesn’t explain how parenthesized subexpressions are handled. Consider this expression:
2000 * (4  3) / 100
It’s not clear how the while loop can handle this. The answer is compute_atom. When it sees a left paren, it knows that a subexpression will follow, so it calls compute_expr on the sub expression (which lasts until the matching right paren), and returns its result as the result of the atom. So compute_expr is oblivious to the existence of subexpressions.
Finally, in order to stay short the pseudocode leaves some interesting details out. What follows is a full implementation of the algorithm that fills all the gaps.
A Python implementation
Here is a Python implementation of expression parsing by precedence climbing. It’s kept short for simplicity, but can be be easily expanded to cover a more realworld language of expressions. The following sections present the code in small chunks. The whole code is available here.
I’ll start with a small tokenizer class that breaks text into tokens and keeps a state. The grammar is very simple: numeric expressions, the basic arithmetic operators +, , *, /, ^ and parens – (, ).
Tok = namedtuple('Tok', 'name value')
class Tokenizer(object):
""" Simple tokenizer object. The cur_token attribute holds the current
token (Tok). Call get_next_token() to advance to the
next token. cur_token is None before the first token is
taken and after the source ends.
"""
TOKPATTERN = re.compile("\s*(?:(\d+)(.))")
def __init__(self, source):
self._tokgen = self._gen_tokens(source)
self.cur_token = None
def get_next_token(self):
""" Advance to the next token, and return it.
"""
try:
self.cur_token = self._tokgen.next()
except StopIteration:
self.cur_token = None
return self.cur_token
def _gen_tokens(self, source):
for number, operator in self.TOKPATTERN.findall(source):
if number:
yield Tok('NUMBER', number)
elif operator == '(':
yield Tok('LEFTPAREN', '(')
elif operator == ')':
yield Tok('RIGHTPAREN', ')')
else:
yield Tok('BINOP', operator)
Next, compute_atom:
def compute_atom(tokenizer):
tok = tokenizer.cur_token
if tok.name == 'LEFTPAREN':
tokenizer.get_next_token()
val = compute_expr(tokenizer, 1)
if tokenizer.cur_token.name != 'RIGHTPAREN':
parse_error('unmatched "("')
tokenizer.get_next_token()
return val
elif tok is None:
parse_error('source ended unexpectedly')
elif tok.name == 'BINOP':
parse_error('expected an atom, not an operator "%s"' % tok.value)
else:
assert tok.name == 'NUMBER'
tokenizer.get_next_token()
return int(tok.value)
It handles true atoms (numbers in our case), as well as parenthesized subexpressions.
Here is compute_expr itself, which is very close to the pseudocode shown above:
# For each operator, a (precedence, associativity) pair.
OpInfo = namedtuple('OpInfo', 'prec assoc')
OPINFO_MAP = {
'+': OpInfo(1, 'LEFT'),
'': OpInfo(1, 'LEFT'),
'*': OpInfo(2, 'LEFT'),
'/': OpInfo(2, 'LEFT'),
'^': OpInfo(3, 'RIGHT'),
}
def compute_expr(tokenizer, min_prec):
atom_lhs = compute_atom(tokenizer)
while True:
cur = tokenizer.cur_token
if (cur is None or cur.name != 'BINOP'
or OPINFO_MAP[cur.value].prec < min_prec):
break
# Inside this loop the current token is a binary operator
assert cur.name == 'BINOP'
# Get the operator's precedence and associativity, and compute a
# minimal precedence for the recursive call
op = cur.value
prec, assoc = OPINFO_MAP[op]
next_min_prec = prec + 1 if assoc == 'LEFT' else prec
# Consume the current token and prepare the next one for the
# recursive call
tokenizer.get_next_token()
atom_rhs = compute_expr(tokenizer, next_min_prec)
# Update lhs with the new value
atom_lhs = compute_op(op, atom_lhs, atom_rhs)
return atom_lhs
The only difference is that this code makes token handling more explicit. It basically follows the usual "recursivedescent protocol". Each recursive call has the current token available in tokenizer.cur_tok, and makes sure to consume all the tokens it has handled (by calling tokenizer.get_next_token()).
One additional small piece is missing. compute_op simply performs the arithmetic computation for the supported binary operators:
def compute_op(op, lhs, rhs):
lhs = int(lhs); rhs = int(rhs)
if op == '+': return lhs + rhs
elif op == '': return lhs  rhs
elif op == '*': return lhs * rhs
elif op == '/': return lhs / rhs
elif op == '^': return lhs ** rhs
else:
parse_error('unknown operator "%s"' % op)
In the real world – Clang
Precedence climbing is being used in real world tools. One example is Clang, the C/C++/ObjC frontend. Clang’s parser is handwritten recursive descent, and it uses precedence climbing for efficient parsing of expressions. If you’re interested to see the code, it’s Parser::ParseExpression in lib/Parse/ParseExpr.cpp [3]. This method plays the role of compute_expr. The role of compute_atom is played by Parser::ParseCastExpression.
Other resources
Here are some resources I found useful while writing this article:
 The Wikipedia page for Operatorprecedence parsing.
 The article by Keith Clarke (PDF), who apparently came up with the technique.
 This page by Theodore Norvell, about parsing expressions by recursive descent.
 The Clang source code (exact locations given in the previous section).
[1]  There are a couple of simplifications made here on purpose. First, I assume only numeric expressions. Identifiers that represent variables can also be viewed as atoms. Second, I ignore unary operators. These are quite easy to incorporate into the algorithm by also treating them as atoms. I leave them out for succinctness. 
[2]  In this article I present a parser that computes the result of a numeric expression onthefly. Modifying it for accumulating the result into some kind of a parse tree is trivial. 
[3]  Clang’s source code is constantly in flow. This information is correct at least for the date the article was written. 
Related posts:
August 2nd, 2012 at 10:37
I was thinking about precedence parsing and came up this method. Instead of using binary tree, you use a terenary tree. Left, right child are interpreted as they are – but the third leg is scope. Instead of merging the subexpression into the initial tree you attach it to the scope leg. This simplifies a lot of things and makes it more flexible – different subexpressions and infix, prefix, postfix operators come for free. You can see the thing in action at : http://egonelbre.com/js/bigram/
August 3rd, 2012 at 09:32
I found your example a little confusing. I think a stepby step animation would be needed to really explain this clearly. Nevertheless, this algorithm seems nice and simple (just the way I like them). I just happen to be implementing a parser for my own programming language and am giving this a try as we speak. I just hope I can fit the ternary “?” operator in there without too much trouble!
August 4th, 2012 at 05:16
Maxime,
The ternary operator would need special casing to be fitted into this algorithm, but that shouldn’t be too hard.
August 4th, 2012 at 20:15
Actually if you treat the terenary operator as binary operator(s) “C ?( T ): F” or “C ? ( T : F )”, it can be quite easy to fit into the current system. The first version means that ?(…): becomes an binary operator over C and F. In the second version ? is an operator between C and ( T : F ), and : is just a separator operator. Although in postprocessing you must check the validity of using the operator.
Examples for the rules : https://gist.github.com/3259101 .
August 5th, 2012 at 21:08
Farther along in my implementation, I seem to have found an annoying complication. I’m trying to parse a Clike language with unary prefix operators (e.g.: ++x). The temptation would be to parse that in the parseAtom portion of the code, as is done for parenthesized expressions.
The problem is that the precedence doesn’t work in our favor here. In C/C++/JavaScript, the *postfix* operator ++ has higher precedence than the prefix one. Furthermore, the membership operator (i.e.: a.b) has higher precedence as well. This means that ++a.b should parse to ++(a.b) and not (++a).b. I’m guessing I will have to pass along min_prec to parseAtom to fix this problem.
Anyways, maybe you’d like to see my code once complete. It’s a whole handwritten parser for a language that will be similar to JavaScript, all written in D. I will probably put it on github.
August 6th, 2012 at 04:24
Maxime,
I’m not sure this is necessary. Both prefix/postfix pre/decrements and the membership dot bind tighter than all binary operators. So
compute_atom
can handle them without knowing the precedence level currently tracked for binary operators. The relative precedences between them are handled implicitly in the callscompute_atom
is making. This is how it’s done in Clang to parse actual C++.August 15th, 2012 at 13:12
This is fascinating . Now i get to go gut my parser and try this out… i’m only concerned about out it might interact with unary ops (i, too, an working on an ECMAish grammar).
August 18th, 2012 at 03:42
The link to Theodore Norvell is broken for me.
I have used precedence climbing to great effect, probably via http://www.engr.mun.ca/~theo/Misc/exp_parsing.htm which cites Martin Richards not Keith Clarke as the origin of this technique.
August 18th, 2012 at 04:07
Pete,
Fixed the link, thanks.
August 20th, 2013 at 14:31
Thanks a lot! I’ve been stuck on expression parsing for days and this is actually the most clear tutorial on precedence climbing ( or expression parsing on a whole) i could find.
you can follow my project on github if you like: https://github.com/njbrown/dragonscript