Eli Bendersky's website - Recursive descent parsinghttps://eli.thegreenplace.net/2023-07-08T13:12:59-07:00Ungrammar in Go and resilient parsing2023-07-08T06:12:00-07:002023-07-08T13:12:59-07:00Eli Benderskytag:eli.thegreenplace.net,2023-07-08:/2023/ungrammar-in-go-and-resilient-parsing/<p>It won't be news to the readers of this blog that I have <a class="reference external" href="https://github.com/eliben/pycparser">some interest</a> in
<a class="reference external" href="https://eli.thegreenplace.net/tag/compilation">compiler</a>
<a class="reference external" href="https://eli.thegreenplace.net/tag/recursive-descent-parsing">front-ends</a>.
So when I heard about a new(-ish) DSL for
<a class="reference external" href="https://en.wikipedia.org/wiki/Parse_tree">concrete syntax trees</a> (CST), I
couldn't resist playing with it a bit.</p>
<p><a class="reference external" href="https://github.com/rust-analyzer/ungrammar/tree/master">Ungrammar</a> is used
in <tt class="docutils literal"><span class="pre">rust-analyzer</span></tt> to define and access a CST for Rust.
<a class="reference external" href="https://rust-analyzer.github.io/blog/2020/10/24/introducing-ungrammar.html">This blog post</a>
by its creator provides much more detail. According to the author, Ungrammar
is "the ASDL for concrete syntax trees". This sounded interesting,
since I've <a class="reference external" href="https://eli.thegreenplace.net/2014/06/04/using-asdl-to-describe-asts-in-compilers">dabbled in ASDL in the past</a>,
and also have experience with similar techniques for defining
<a class="reference external" href="https://github.com/eliben/pycparser">pycparser ASTs</a>.</p>
<p>The result is <a class="reference external" href="https://github.com/eliben/go-ungrammar">go-ungrammar</a>,
a re-implementation of Ungrammar in Go. The input is an Ungrammar file defining
some CST; for example, here's a simple calculator language:</p>
<div class="highlight"><pre><span></span>Program = Stmt*
Stmt = AssignStmt | Expr
AssignStmt = 'set' 'ident' '=' Expr
Expr =
Literal
| UnaryExpr
| ParenExpr
| BinExpr
UnaryExpr = op:('+' | '-') Expr
ParenExpr = '(' Expr ')'
BinExpr = lhs:Expr op:('+' | '-' | '*' | '/' | '%') rhs:Expr
Literal = 'int_literal' | 'ident'
</pre></div>
<p>Ungrammar looks a bit like EBNF, but not <em>quite</em> (hence the name "ungrammar").
It's much simpler because it doesn't need to concern itself with precedence,
ambiguities and so on, also leaving all the (often complex) lexical rules to the
lexer. It simply defines a <em>tree</em> that can be used to represent the parsed
language. It also differs from an AST in that it preserves all tokens, including
delimiters and other syntax elements. This is useful for tools like language
servers that need a full-fidelity representation of the source code.</p>
<div class="section" id="implementation-notes">
<h2>Implementation notes</h2>
<p><tt class="docutils literal"><span class="pre">go-ungrammar</span></tt> uses a classical <a class="reference external" href="https://github.com/eliben/go-ungrammar/blob/main/lexer.go">hand-written lexical analyzer</a>
and a <a class="reference external" href="https://github.com/eliben/go-ungrammar/blob/main/parser.go">recursive
descent parser</a>.
Just for fun, I spent more time on error recovery than strictly necessary for
such a simple input language. The lexer <a class="reference external" href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">never gives up</a> when encountering nonsensical
input; it simply emits an <tt class="docutils literal">ERROR</tt> token and keeps going. The parser doesn't
quit on the first error either; instead, it collects all the errors it
encounters and tries to recover from each one (the <tt class="docutils literal">synchronize()</tt> method in
the parser code). As an example of this in action, consider this faulty
Ungrammar input:</p>
<div class="highlight"><pre><span></span>foo = @
bar = ( joe
x = y
</pre></div>
<p>At first glance, there are at least a couple of issues here:</p>
<ul class="simple">
<li><tt class="docutils literal">@</tt> is not a valid Ungrammar token</li>
<li>The <tt class="docutils literal">(</tt> in the second rule is unterminated; as all programmers know,
unterminated grouping elements spell trouble because the compiler can easily get
confused until it finds a valid terminator</li>
</ul>
<p>When <tt class="docutils literal"><span class="pre">go-ungrammar</span></tt> runs, it reports an error that looks like this:</p>
<div class="highlight"><pre><span></span>1:7: unknown token starting with '@' (and 2 more errors)
</pre></div>
<p>The <a class="reference external" href="https://github.com/eliben/go-ungrammar/blob/main/errorlist.go">concrete error type</a> returned by
the parser collects all the errors, so we can iterate over them and display them
all:</p>
<div class="highlight"><pre><span></span>1:7: unknown token starting with '@'
2:1: expected rule, got bar
3:1: expected ')', got x
</pre></div>
<p>After the first error, the parser recovers expecting to see the RHS
(right-hand-side) for the <tt class="docutils literal">foo</tt> rule, but doesn't find one. This is a good
place to discuss parser recovery. The Ungrammar language has a significant
ambiguity:</p>
<div class="highlight"><pre><span></span>foo = bar baz = barn
</pre></div>
<p>Are <tt class="docutils literal">bar baz</tt> the RHS sequence for rule <tt class="docutils literal">foo</tt>, or is <tt class="docutils literal">baz =</tt> the beginning
of a new rule? Note that the language is whitespace-insensitive, so this
ambiguity really does come up; just look at the example calculator Ungrammar
above - it arises at the start of pretty much every new rule.</p>
<p>The way <tt class="docutils literal"><span class="pre">go-ungrammar</span></tt> resolves the ambiguity is with a <tt class="docutils literal">NODE =</tt>
lookahead: on seeing this sequence, it decides that a new rule is beginning
(<tt class="docutils literal">NODE</tt> is the Ungrammar term for "plain identifier").</p>
<p>Back to our recovery example: the second error is the parser complaining that
it expected some rule after <tt class="docutils literal">foo =</tt> but found none; an empty RHS is invalid
in Ungrammar and the <tt class="docutils literal">@</tt> was reported and skipped. So the parser complains
that it found a new rule definition instead of the RHS for an existing rule.
At this point it re-synchronizes and parses the <tt class="docutils literal">bar =</tt> rule. Then it runs into
the third error - the <tt class="docutils literal">(</tt> is unterminated. Still, the parser recovers and
keeps going.</p>
<p>Even with all these errors, the parser will produce a partial result - a tree
equivalent to this input:</p>
<div class="highlight"><pre><span></span>bar = joe
x = y
</pre></div>
<p>For <tt class="docutils literal">foo</tt> there was simply nothing to parse. For <tt class="docutils literal">bar</tt>, the parser reported
the missing <tt class="docutils literal">)</tt> but parsed the contents anyway. It then fully recovered and
was able to parse <tt class="docutils literal">x = y</tt> properly. Being able to parse incomplete input and
produce partial trees is very important for error recovery, and especially for
tools like language servers that need to be resilient in the presence of partial
input the user is busy typing in.</p>
<p>I enjoyed coding this resilient parser; while it's probably overkill for
a language as simple as Ungrammar, it's a good kata for frontend construction.</p>
</div>
Deciphering Haskell's applicative and monadic parsers2017-11-27T05:28:00-08:002022-10-04T14:08:24-07:00Eli Benderskytag:eli.thegreenplace.net,2017-11-27:/2017/deciphering-haskells-applicative-and-monadic-parsers/<p>This post follows the construction of parsers described in <a class="reference external" href="http://www.cs.nott.ac.uk/~pszgmh/pih.html">Graham Hutton's
"Programming in Haskell" (2nd edition)</a>. It's my attempt to work through
chapter 13 in this book and understand the details of applicative and monadic
combination of parsers presented therein.</p>
<div class="section" id="basic-definitions-for-the-parser-type">
<h2>Basic definitions for the Parser type</h2>
<p>A parser parameterized on some type <tt class="docutils literal">a</tt> is:</p>
<div class="highlight"><pre><span></span><span class="kr">newtype</span><span class="w"> </span><span class="kt">Parser</span><span class="w"> </span><span class="n">a</span><span class="w"> </span><span class="ow">=</span><span class="w"> </span><span class="kt">P</span><span class="w"> </span><span class="p">(</span><span class="kt">String</span><span class="w"> </span><span class="ow">-></span><span class="w"> </span><span class="p">[(</span><span class="n">a</span><span class="p">,</span><span class="kt">String</span><span class="p">)])</span><span class="w"></span>
</pre></div>
<p>It's a function taking a <tt class="docutils literal">String</tt> and returning a list of <tt class="docutils literal">(a,String)</tt>
pairs, where <tt class="docutils literal">a</tt> is a value of the parameterized type and <tt class="docutils literal">String</tt> is (by
convention) the unparsed remainder of the input. The returned list is
potentially empty, which signals a failure in parsing <a class="footnote-reference" href="#footnote-1" id="footnote-reference-1">[1]</a>. It might have made
more sense to define <tt class="docutils literal">Parser</tt> as a <tt class="docutils literal">type</tt> alias for the function, but
<tt class="docutils literal">type</tt>s can't be made into instances of typeclasses; therefore, we use
<tt class="docutils literal">newtype</tt> with a dummy constructor named <tt class="docutils literal">P</tt>.</p>
<p>With this <tt class="docutils literal">Parser</tt> type, the act of actually parsing a string is expressed
with the following helper function. It's not strictly necessary, but it helps
make code cleaner by hiding <tt class="docutils literal">P</tt> from users of the parser.</p>
<div class="highlight"><pre><span></span><span class="nf">parse</span><span class="w"> </span><span class="ow">::</span><span class="w"> </span><span class="kt">Parser</span><span class="w"> </span><span class="n">a</span><span class="w"> </span><span class="ow">-></span><span class="w"> </span><span class="kt">String</span><span class="w"> </span><span class="ow">-></span><span class="w"> </span><span class="p">[(</span><span class="n">a</span><span class="p">,</span><span class="kt">String</span><span class="p">)]</span><span class="w"></span>
<span class="nf">parse</span><span class="w"> </span><span class="p">(</span><span class="kt">P</span><span class="w"> </span><span class="n">p</span><span class="p">)</span><span class="w"> </span><span class="n">inp</span><span class="w"> </span><span class="ow">=</span><span class="w"> </span><span class="n">p</span><span class="w"> </span><span class="n">inp</span><span class="w"></span>
</pre></div>
<p>The most basic parsing primitive plucks off the first character from a given
string:</p>
<div class="highlight"><pre><span></span><span class="nf">item</span><span class="w"> </span><span class="ow">::</span><span class="w"> </span><span class="kt">Parser</span><span class="w"> </span><span class="kt">Char</span><span class="w"></span>
<span class="nf">item</span><span class="w"> </span><span class="ow">=</span><span class="w"> </span><span class="kt">P</span><span class="w"> </span><span class="p">(</span><span class="nf">\</span><span class="n">inp</span><span class="w"> </span><span class="ow">-></span><span class="w"> </span><span class="kr">case</span><span class="w"> </span><span class="n">inp</span><span class="w"> </span><span class="kr">of</span><span class="w"></span>
<span class="w"> </span><span class="kt">[]</span><span class="w"> </span><span class="ow">-></span><span class="w"> </span><span class="kt">[]</span><span class="w"></span>
<span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="kt">:</span><span class="n">xs</span><span class="p">)</span><span class="w"> </span><span class="ow">-></span><span class="w"> </span><span class="p">[(</span><span class="n">x</span><span class="p">,</span><span class="n">xs</span><span class="p">)])</span><span class="w"></span>
</pre></div>
<p>Here's how it works in practice:</p>
<div class="highlight"><pre><span></span>> parse item "foo"
[('f',"oo")]
> parse item "f"
[('f',"")]
> parse item ""
[]
</pre></div>
</div>
<div class="section" id="parser-as-a-functor">
<h2>Parser as a Functor</h2>
<p>We'll start by making <tt class="docutils literal">Parser</tt> an instance of <tt class="docutils literal">Functor</tt>:</p>
<div class="highlight"><pre><span></span><span class="kr">instance</span><span class="w"> </span><span class="kt">Functor</span><span class="w"> </span><span class="kt">Parser</span><span class="w"> </span><span class="kr">where</span><span class="w"></span>
<span class="w"> </span><span class="c1">-- fmap :: (a -> b) -> Parser a -> Parser b</span><span class="w"></span>
<span class="w"> </span><span class="n">fmap</span><span class="w"> </span><span class="n">g</span><span class="w"> </span><span class="n">p</span><span class="w"> </span><span class="ow">=</span><span class="w"> </span><span class="kt">P</span><span class="w"> </span><span class="p">(</span><span class="nf">\</span><span class="n">inp</span><span class="w"> </span><span class="ow">-></span><span class="w"> </span><span class="kr">case</span><span class="w"> </span><span class="n">parse</span><span class="w"> </span><span class="n">p</span><span class="w"> </span><span class="n">inp</span><span class="w"> </span><span class="kr">of</span><span class="w"></span>
<span class="w"> </span><span class="kt">[]</span><span class="w"> </span><span class="ow">-></span><span class="w"> </span><span class="kt">[]</span><span class="w"></span>
<span class="w"> </span><span class="p">[(</span><span class="n">v</span><span class="p">,</span><span class="n">out</span><span class="p">)]</span><span class="w"> </span><span class="ow">-></span><span class="w"> </span><span class="p">[(</span><span class="n">g</span><span class="w"> </span><span class="n">v</span><span class="p">,</span><span class="n">out</span><span class="p">)])</span><span class="w"></span>
</pre></div>
<p>With <tt class="docutils literal">fmap</tt> we can create a new parser from an existing parser, with a
function applied to the parser's output. For example:</p>
<div class="highlight"><pre><span></span>> parse (fmap toUpper item) "foo"
[('F',"oo")]
> parse (fmap toUpper item) ""
[]
</pre></div>
<p>Let's check that the functor laws work for this definition. The first law:</p>
<div class="highlight"><pre><span></span>fmap id = id
</pre></div>
<p>Is fairly obvious when we substitute <tt class="docutils literal">id</tt> for <tt class="docutils literal">g</tt> in the definition of
<tt class="docutils literal">fmap</tt>. We get:</p>
<div class="highlight"><pre><span></span>fmap id p = P (\inp -> case parse p inp of
[] -> []
[(v,out)] -> [(id v,out)])
</pre></div>
<p>Which takes the parse result of <tt class="docutils literal">p</tt> and passes it through without
modification. In other words, it's equivalent to <tt class="docutils literal">p</tt> itself, and hence the
first law holds.</p>
<p>Verifying the second law:</p>
<div class="highlight"><pre><span></span>fmap (g . h) = fmap g . fmap h
</pre></div>
<p>... is similarly straightforward and is left as an exercise to the reader.</p>
<p>While it's not obvious why a <tt class="docutils literal">Functor</tt> instance for <tt class="docutils literal">Parser</tt> is useful in
its own right, it's actually required to make <tt class="docutils literal">Parser</tt> into an
<tt class="docutils literal">Applicative</tt>, and also when combining parsers using applicative style.</p>
</div>
<div class="section" id="parser-as-an-applicative">
<h2>Parser as an Applicative</h2>
<p>Consider parsing conditional expressions in a fictional language:</p>
<div class="highlight"><pre><span></span>if &lt;expr&gt; then &lt;expr&gt; else &lt;expr&gt;
</pre></div>
<p>To parse such expressions we'd like to say:</p>
<ul class="simple">
<li>Parse the token <tt class="docutils literal">if</tt></li>
<li>Parse an &lt;expr&gt;</li>
<li>Parse the token <tt class="docutils literal">then</tt></li>
<li>Parse an &lt;expr&gt;</li>
<li>Parse the token <tt class="docutils literal">else</tt></li>
<li>Parse an &lt;expr&gt;</li>
<li>If all of this was successful, combine all the parsed expressions into some
sort of result, like an AST node.</li>
</ul>
<p>Such sequences, along with alternation (an expression is either <em>this</em> or
<em>that</em>), are two of the critical building blocks for constructing non-trivial
parsers. Let's see a popular way to accomplish this in Haskell (for a complete
example demonstrating how to construct a parser for this particular conditional
expression, see the last section in this post).</p>
<p><a class="reference external" href="https://en.wikipedia.org/wiki/Parser_combinator">Parser combinators</a> are a
popular technique for constructing complex parsers from simpler ones, by
means of higher-order functions. In Haskell, one of the ways in which parsers
can be elegantly combined is using applicative style. Here's the <tt class="docutils literal">Applicative</tt>
instance for <tt class="docutils literal">Parser</tt>.</p>
<div class="highlight"><pre><span></span><span class="kr">instance</span><span class="w"> </span><span class="kt">Applicative</span><span class="w"> </span><span class="kt">Parser</span><span class="w"> </span><span class="kr">where</span><span class="w"></span>
<span class="w"> </span><span class="c1">-- pure :: a -> Parser a</span><span class="w"></span>
<span class="w"> </span><span class="n">pure</span><span class="w"> </span><span class="n">v</span><span class="w"> </span><span class="ow">=</span><span class="w"> </span><span class="kt">P</span><span class="w"> </span><span class="p">(</span><span class="nf">\</span><span class="n">inp</span><span class="w"> </span><span class="ow">-></span><span class="w"> </span><span class="p">[(</span><span class="n">v</span><span class="p">,</span><span class="n">inp</span><span class="p">)])</span><span class="w"></span>
<span class="w"> </span><span class="c1">-- <*> :: Parser (a -> b) -> Parser a -> Parser b</span><span class="w"></span>
<span class="w"> </span><span class="n">pg</span><span class="w"> </span><span class="o"><*></span><span class="w"> </span><span class="n">px</span><span class="w"> </span><span class="ow">=</span><span class="w"> </span><span class="kt">P</span><span class="w"> </span><span class="p">(</span><span class="nf">\</span><span class="n">inp</span><span class="w"> </span><span class="ow">-></span><span class="w"> </span><span class="kr">case</span><span class="w"> </span><span class="n">parse</span><span class="w"> </span><span class="n">pg</span><span class="w"> </span><span class="n">inp</span><span class="w"> </span><span class="kr">of</span><span class="w"></span>
<span class="w"> </span><span class="kt">[]</span><span class="w"> </span><span class="ow">-></span><span class="w"> </span><span class="kt">[]</span><span class="w"></span>
<span class="w"> </span><span class="p">[(</span><span class="n">g</span><span class="p">,</span><span class="n">out</span><span class="p">)]</span><span class="w"> </span><span class="ow">-></span><span class="w"> </span><span class="n">parse</span><span class="w"> </span><span class="p">(</span><span class="n">fmap</span><span class="w"> </span><span class="n">g</span><span class="w"> </span><span class="n">px</span><span class="p">)</span><span class="w"> </span><span class="n">out</span><span class="p">)</span><span class="w"></span>
</pre></div>
<p>Recall how we created a parser that applied <tt class="docutils literal">toUpper</tt> to its result using
<tt class="docutils literal">fmap</tt>? We can now do the same in applicative style:</p>
<div class="highlight"><pre><span></span>> parse (pure toUpper <*> item) "foo"
[('F',"oo")]
</pre></div>
<p>Let's see why this works. While not too exciting on its own, this application of
a single-argument function is a good segue to more complicated use cases.</p>
<p>Looking at the <tt class="docutils literal">Applicative</tt> instance, <tt class="docutils literal">pure toUpper</tt> translates to
<tt class="docutils literal">P (\inp <span class="pre">-></span> [(toUpper,inp)])</tt> - a parser that passes its input through
unchanged, returning <tt class="docutils literal">toUpper</tt> as a result. Now, substituting <tt class="docutils literal">item</tt> into
the definition of <tt class="docutils literal"><*></tt> we get:</p>
<div class="highlight"><pre><span></span>pg <*> item = P (\inp -> case parse pg inp of
[] -> []
[(g,out)] -> parse (fmap g item) out)
... pg is (pure toUpper), the parsing of which always succeeds, returning
[(toUpper,inp)]
pg <*> item = P (\inp -> parse (fmap toUpper item) inp)
</pre></div>
<p>In other words, this is exactly the example we had for <tt class="docutils literal">Functor</tt> by
<tt class="docutils literal">fmap</tt>-ing <tt class="docutils literal">toUpper</tt> onto <tt class="docutils literal">item</tt>.</p>
<p>The more interesting case is applying functions with multiple parameters. Here's
how we define a parser that parses three items from the input, dropping the
middle result:</p>
<div class="highlight"><pre><span></span><span class="nf">dropMiddle</span><span class="w"> </span><span class="ow">::</span><span class="w"> </span><span class="kt">Parser</span><span class="w"> </span><span class="p">(</span><span class="kt">Char</span><span class="p">,</span><span class="kt">Char</span><span class="p">)</span><span class="w"></span>
<span class="nf">dropMiddle</span><span class="w"> </span><span class="ow">=</span><span class="w"></span>
<span class="w"> </span><span class="n">pure</span><span class="w"> </span><span class="n">selector</span><span class="w"> </span><span class="o"><*></span><span class="w"> </span><span class="n">item</span><span class="w"> </span><span class="o"><*></span><span class="w"> </span><span class="n">item</span><span class="w"> </span><span class="o"><*></span><span class="w"> </span><span class="n">item</span><span class="w"></span>
<span class="w"> </span><span class="kr">where</span><span class="w"> </span><span class="n">selector</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="n">z</span><span class="w"> </span><span class="ow">=</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">z</span><span class="p">)</span><span class="w"></span>
</pre></div>
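<p>A small aside: since <tt class="docutils literal">pure selector <*> item</tt> is just <tt class="docutils literal">fmap selector item</tt>, idiomatic Haskell often spells such chains with <tt class="docutils literal"><$></tt> (infix <tt class="docutils literal">fmap</tt>) instead of a leading <tt class="docutils literal">pure</tt>. Here's an equivalent variant, with the earlier definitions restated so the snippet compiles on its own:</p>

```haskell
newtype Parser a = P (String -> [(a, String)])

parse :: Parser a -> String -> [(a, String)]
parse (P p) inp = p inp

item :: Parser Char
item = P (\inp -> case inp of
                    []     -> []
                    (x:xs) -> [(x, xs)])

instance Functor Parser where
  fmap g p = P (\inp -> case parse p inp of
                          []        -> []
                          [(v,out)] -> [(g v, out)])

instance Applicative Parser where
  pure v = P (\inp -> [(v, inp)])
  pg <*> px = P (\inp -> case parse pg inp of
                           []        -> []
                           [(g,out)] -> parse (fmap g px) out)

-- Same parser as dropMiddle, written with <$> instead of pure/<*>.
dropMiddle' :: Parser (Char, Char)
dropMiddle' = selector <$> item <*> item <*> item
  where selector x _ z = (x, z)
```

<p>As with the original, <tt class="docutils literal">parse dropMiddle' "pumpkin"</tt> evaluates to <tt class="docutils literal"><span class="pre">[(('p','m'),"pkin")]</span></tt>.</p>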
<p>Following the application of nested <tt class="docutils literal"><*></tt> operators is tricky because it
builds a run-time chain of functions referring to other functions. This chain
is only collapsed when the parser is used to actually <tt class="docutils literal">parse</tt> some input, so
it's necessary to keep a lot of context in mind. To better understand how
this works, we can break the definition of <tt class="docutils literal">dropMiddle</tt> into parts as follows
(since <tt class="docutils literal"><*></tt> is left-associative):</p>
<div class="highlight"><pre><span></span><span class="nf">dropMiddle</span><span class="w"> </span><span class="ow">=</span><span class="w"></span>
<span class="w"> </span><span class="p">((</span><span class="n">pure</span><span class="w"> </span><span class="n">selector</span><span class="w"> </span><span class="o"><*></span><span class="w"> </span><span class="n">item</span><span class="p">)</span><span class="w"> </span><span class="o"><*></span><span class="w"> </span><span class="n">item</span><span class="p">)</span><span class="w"> </span><span class="o"><*></span><span class="w"> </span><span class="n">item</span><span class="w"></span>
<span class="w"> </span><span class="kr">where</span><span class="w"> </span><span class="n">selector</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="n">z</span><span class="w"> </span><span class="ow">=</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">z</span><span class="p">)</span><span class="w"></span>
</pre></div>
<p>Applying the first <tt class="docutils literal"><*></tt>:</p>
<div class="highlight"><pre><span></span>pg <*> item = P (\inp -> case parse pg inp of
[] -> []
[(g,out)] -> parse (fmap g item) out)
... pg is (pure selector), the parsing of which always succeeds, returning
[(selector,inp)]
pg <*> item = P (\inp -> parse (fmap selector item) inp) --= app1
</pre></div>
<p>Let's call this parser <tt class="docutils literal">app1</tt> and apply the second <tt class="docutils literal"><*></tt> in the sequence.</p>
<div class="highlight"><pre><span></span>app1 <*> item = P (\inp -> case parse app1 inp of
[] -> []
[(g,out)] -> parse (fmap g item) out) --= app2
</pre></div>
<p>We'll call this <tt class="docutils literal">app2</tt> and move on. Similarly, applying the third <tt class="docutils literal"><*></tt> in
the sequence produces:</p>
<div class="highlight"><pre><span></span>app2 <*> item = P (\inp -> case parse app2 inp of
[] -> []
[(g,out)] -> parse (fmap g item) out)
</pre></div>
<p>This is <tt class="docutils literal">dropMiddle</tt>. It's a chain of parsers expressed as a combination of
higher-order functions (closures, actually).</p>
<p>To see how this combined parser actually parses input, let's trace through the
execution of:</p>
<div class="highlight"><pre><span></span>> parse dropMiddle "pumpkin"
[(('p','m'),"pkin")]
</pre></div>
<p><tt class="docutils literal">dropMiddle</tt> is <tt class="docutils literal">app2 <*> item</tt>, so we have:</p>
<div class="highlight"><pre><span></span>-- parse dropMiddle
parse P (\inp -> case parse app2 inp of
[] -> []
[(g,out)] -> parse (fmap g item) out)
"pumpkin"
.. substituting "pumpkin" into inp
case parse app2 "pumpkin" of
[] -> []
[(g,out)] -> parse (fmap g item) out
</pre></div>
<p>Now <tt class="docutils literal">parse app2 "pumpkin"</tt> is going to be invoked; <tt class="docutils literal">app2</tt> is <tt class="docutils literal">app1 <*>
item</tt>:</p>
<div class="highlight"><pre><span></span>-- parse app2
case parse app1 "pumpkin" of
[] -> []
[(g,out)] -> parse (fmap g item) out
</pre></div>
<p>Similarly, we get to <tt class="docutils literal">parse app1 "pumpkin"</tt>:</p>
<div class="highlight"><pre><span></span>-- parse app1
parse (fmap selector item) "pumpkin"
.. following the definition of fmap
parse P (\inp -> case parse item inp of
[] -> []
[(v,out)] -> [(selector v,out)])
"pumpkin"
.. Since (parse item "pumpkin") returns [('p',"umpkin")], we get:
[(selector 'p',"umpkin")]
</pre></div>
<p>Now going back to <tt class="docutils literal">parse app2</tt>, knowing what <tt class="docutils literal">parse app1 "pumpkin"</tt> returns:</p>
<div class="highlight"><pre><span></span>parse (fmap (selector 'p') item) "umpkin"
.. following the definition of fmap
parse P (\inp -> case parse item inp of
[] -> []
[(v,out)] -> [(selector 'p' v,out)])
"umpkin"
[(selector 'p' 'u',"mpkin")]
</pre></div>
<p>Finally, <tt class="docutils literal">dropMiddle</tt>:</p>
<div class="highlight"><pre><span></span>app2 <*> item = P (\inp -> case parse app2 inp of
[] -> []
[(g,out)] -> parse (fmap g item) out)
.. Since (parse app2 "pumpkin") returns [(selector 'p' 'u',"mpkin")]
parse (fmap (selector 'p' 'u') item) "mpkin"
.. If we follow the definition of fmap again, we'll get:
[(selector 'p' 'u' 'm',"pkin")]
</pre></div>
<p>This is the final result of applying <tt class="docutils literal">dropMiddle</tt> to "pumpkin", and when
<tt class="docutils literal">selector</tt> is invoked we get <tt class="docutils literal"><span class="pre">[(('p','m'),"pkin")]</span></tt>, as expected.</p>
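<p>The same applicative machinery scales up to the conditional-expression example from the beginning of this section. The sketch below is mine, not the book's: <tt class="docutils literal">sat</tt>, <tt class="docutils literal">char</tt> and <tt class="docutils literal">string</tt> are plausible helper combinators (Hutton develops similar ones later in the chapter), and <tt class="docutils literal">expr</tt> is a toy stand-in that parses a single lowercase letter rather than a real expression:</p>

```haskell
newtype Parser a = P (String -> [(a, String)])

parse :: Parser a -> String -> [(a, String)]
parse (P p) inp = p inp

item :: Parser Char
item = P (\inp -> case inp of
                    []     -> []
                    (x:xs) -> [(x, xs)])

instance Functor Parser where
  fmap g p = P (\inp -> case parse p inp of
                          []        -> []
                          [(v,out)] -> [(g v, out)])

instance Applicative Parser where
  pure v = P (\inp -> [(v, inp)])
  pg <*> px = P (\inp -> case parse pg inp of
                           []        -> []
                           [(g,out)] -> parse (fmap g px) out)

-- Parse a character satisfying a predicate; fails (empty list) otherwise.
sat :: (Char -> Bool) -> Parser Char
sat ok = P (\inp -> case parse item inp of
                      [(x, out)] | ok x -> [(x, out)]
                      _                 -> [])

char :: Char -> Parser Char
char c = sat (== c)

-- Match a literal string, e.g. a keyword like "if".
string :: String -> Parser String
string = traverse char

-- Toy stand-ins for an AST node and an expression parser.
data Cond = Cond Char Char Char deriving Show

expr :: Parser Char
expr = sat (`elem` ['a'..'z'])

-- Sequence the six steps from the bullet list, keeping the three exprs.
cond :: Parser Cond
cond = pure (\_ c _ t _ e -> Cond c t e)
         <*> string "if " <*> expr
         <*> string " then " <*> expr
         <*> string " else " <*> expr
```

<p>With these definitions, <tt class="docutils literal">parse cond "if a then b else c"</tt> evaluates to <tt class="docutils literal"><span class="pre">[(Cond 'a' 'b' 'c',"")]</span></tt>, and any failed step makes the whole parse fail, exactly as the sequencing recipe demands.</p>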
</div>
<div class="section" id="parser-as-a-monad">
<h2>Parser as a Monad</h2>
<p>Parsers can also be expressed and combined using monadic style. Here's the
<tt class="docutils literal">Monad</tt> instance for <tt class="docutils literal">Parser</tt>:</p>
<div class="highlight"><pre><span></span><span class="kr">instance</span><span class="w"> </span><span class="kt">Monad</span><span class="w"> </span><span class="kt">Parser</span><span class="w"> </span><span class="kr">where</span><span class="w"></span>
<span class="w"> </span><span class="c1">-- return :: a -> Parser a</span><span class="w"></span>
<span class="w"> </span><span class="n">return</span><span class="w"> </span><span class="ow">=</span><span class="w"> </span><span class="n">pure</span><span class="w"></span>
<span class="w"> </span><span class="c1">-- (>>=) :: Parser a -> (a -> Parser b) -> Parser b</span><span class="w"></span>
<span class="w"> </span><span class="n">p</span><span class="w"> </span><span class="o">>>=</span><span class="w"> </span><span class="n">f</span><span class="w"> </span><span class="ow">=</span><span class="w"> </span><span class="kt">P</span><span class="w"> </span><span class="p">(</span><span class="nf">\</span><span class="n">inp</span><span class="w"> </span><span class="ow">-></span><span class="w"> </span><span class="kr">case</span><span class="w"> </span><span class="n">parse</span><span class="w"> </span><span class="n">p</span><span class="w"> </span><span class="n">inp</span><span class="w"> </span><span class="kr">of</span><span class="w"></span>
<span class="w"> </span><span class="kt">[]</span><span class="w"> </span><span class="ow">-></span><span class="w"> </span><span class="kt">[]</span><span class="w"></span>
<span class="w"> </span><span class="p">[(</span><span class="n">v</span><span class="p">,</span><span class="n">out</span><span class="p">)]</span><span class="w"> </span><span class="ow">-></span><span class="w"> </span><span class="n">parse</span><span class="w"> </span><span class="p">(</span><span class="n">f</span><span class="w"> </span><span class="n">v</span><span class="p">)</span><span class="w"> </span><span class="n">out</span><span class="p">)</span><span class="w"></span>
</pre></div>
<p>Let's take the simple example of applying <tt class="docutils literal">toUpper</tt> to <tt class="docutils literal">item</tt> again, this
time using monadic operators:</p>
<div class="highlight"><pre><span></span>> parse (item >>= (\x -> return $ toUpper x)) "foo"
[('F',"oo")]
</pre></div>
<p>Substituting in the definition of <tt class="docutils literal">>>=</tt>:</p>
<div class="highlight"><pre><span></span>item >>= (\x -> return $ toUpper x) =
P (\inp -> case parse item inp of
[] -> []
[(v,out)] -> parse (return $ toUpper v) out)
... if item succeeds, this is a parser that will always succeed with
the upper-cased result of item
</pre></div>
<p>When writing in monadic style, however, we won't typically be using the <tt class="docutils literal">>>=</tt>
operator explicitly; instead, we'll use the <tt class="docutils literal">do</tt> notation. Recall that in the
general multi-parameter case, this:</p>
<div class="highlight"><pre><span></span>m1 >>= \x1 ->
m2 >>= \x2 ->
...
mn >>= \xn -> f x1 x2 ... xn
</pre></div>
<p>Is equivalent to this:</p>
<div class="highlight"><pre><span></span>do x1 <- m1
x2 <- m2
...
xn <- mn
f x1 x2 ... xn
</pre></div>
<p>So we can also rewrite our example as:</p>
<div class="highlight"><pre><span></span>> parse (do x <- item; return $ toUpper x) "foo"
[('F',"oo")]
</pre></div>
<p>The <tt class="docutils literal">do</tt> notation starts looking much more attractive for multiple parameters,
however. Here's <tt class="docutils literal">dropMiddle</tt> in monadic style written directly <a class="footnote-reference" href="#footnote-2" id="footnote-reference-2">[2]</a>:</p>
<div class="highlight"><pre><span></span>dropMiddleM :: Parser (Char,Char)
dropMiddleM = item >>= \x ->
item >>= \_ ->
item >>= \z -> return (x,z)
</pre></div>
<p>And now rewritten using <tt class="docutils literal">do</tt>:</p>
<div class="highlight"><pre><span></span>dropMiddleM' :: Parser (Char,Char)
dropMiddleM' =
do x <- item
item
z <- item
return (x,z)
</pre></div>
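<p>As an aside, the same machinery can be sketched in Python, modeling a parser as a function from an input string to a list of (value, remainder) pairs. This is just an illustrative sketch, not code from the book; the names <tt class="docutils literal">item</tt>, <tt class="docutils literal">pure</tt> and <tt class="docutils literal">bind</tt> mirror the Haskell definitions above:</p>

```python
# Illustrative sketch: a parser is a function from an input string to a
# list of (value, remainder) pairs; an empty list signals failure.

def item(inp):
    """Consume a single character, failing on empty input."""
    return [(inp[0], inp[1:])] if inp else []

def pure(v):
    """Succeed with v without consuming input (Haskell's return)."""
    return lambda inp: [(v, inp)]

def bind(p, f):
    """Sequence parser p with f, a function from a value to a parser (>>=)."""
    def parser(inp):
        results = p(inp)
        if not results:
            return []       # propagate failure, as in the Haskell case match
        v, out = results[0]
        return f(v)(out)
    return parser

# dropMiddleM transcribed: parse three characters, keep the first and third.
drop_middle = bind(item, lambda x:
              bind(item, lambda _:
              bind(item, lambda z: pure((x, z)))))

print(drop_middle("abc"))   # -> [(('a', 'c'), '')]
print(drop_middle("ab"))    # -> [] (not enough input)
```

<p>Here <tt class="docutils literal">bind</tt> plays the role of <tt class="docutils literal">>>=</tt> and <tt class="docutils literal">pure</tt> the role of <tt class="docutils literal">return</tt>; Python's lack of <tt class="docutils literal">do</tt> notation is what makes the explicit nesting unavoidable.</p>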
<p>Let's do a detailed breakdown of what's happening here to better understand the
monadic sequencing mechanics. I'll be using the direct style (<tt class="docutils literal">dropMiddleM</tt>)
to unravel the applications of <tt class="docutils literal">>>=</tt>:</p>
<div class="highlight"><pre><span></span>item >>= \x ->
item >>= \_ ->
item >>= \z -> return (x,z)
.. applying the first >>=, calling the right-hand side rhsX
P (\inp -> case parse item inp of
[] -> []
[(v,out)] -> parse (rhsX v) out)
.. the result of parsing the first item is passed in as the argument to rhsX,
which then returns the next application of >>=; as usual, we acknowledge
the error propagation and ignore it for simplicity.
P (\inp -> case parse item inp of
[] -> []
[(v,out)] -> parse (rhsY v) out)
... and similarly for rhsZ; the final result is invoking "parse return (x,z)"
where x is the result of parsing the first item and z the result of
parsing the third.
</pre></div>
</div>
<div class="section" id="a-complete-example">
<h2>A complete example</h2>
<p>As a complete example, I've expanded the parser grammar found in the book to
support conditional expressions. The full example is <a class="reference external" href="https://github.com/eliben/code-for-blog/blob/master/2017/haskell-parsers/exprparser.hs">available here</a>.
Recall that we want to parse expressions of the form:</p>
<div class="highlight"><pre><span></span>if <expr> then <expr> else <expr>
</pre></div>
<p>This is the monadic parser <a class="footnote-reference" href="#footnote-3" id="footnote-reference-3">[3]</a>:</p>
<div class="highlight"><pre><span></span><span class="nf">ifexpr</span><span class="w"> </span><span class="ow">::</span><span class="w"> </span><span class="kt">Parser</span><span class="w"> </span><span class="kt">Int</span><span class="w"></span>
<span class="nf">ifexpr</span><span class="w"> </span><span class="ow">=</span><span class="w"> </span><span class="kr">do</span><span class="w"> </span><span class="n">symbol</span><span class="w"> </span><span class="s">"if"</span><span class="w"></span>
<span class="w"> </span><span class="n">cond</span><span class="w"> </span><span class="ow"><-</span><span class="w"> </span><span class="n">expr</span><span class="w"></span>
<span class="w"> </span><span class="n">symbol</span><span class="w"> </span><span class="s">"then"</span><span class="w"></span>
<span class="w"> </span><span class="n">thenExpr</span><span class="w"> </span><span class="ow"><-</span><span class="w"> </span><span class="n">expr</span><span class="w"></span>
<span class="w"> </span><span class="n">symbol</span><span class="w"> </span><span class="s">"else"</span><span class="w"></span>
<span class="w"> </span><span class="n">elseExpr</span><span class="w"> </span><span class="ow"><-</span><span class="w"> </span><span class="n">expr</span><span class="w"></span>
<span class="w"> </span><span class="n">return</span><span class="w"> </span><span class="p">(</span><span class="kr">if</span><span class="w"> </span><span class="n">cond</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="mi">0</span><span class="w"> </span><span class="kr">then</span><span class="w"> </span><span class="n">elseExpr</span><span class="w"> </span><span class="kr">else</span><span class="w"> </span><span class="n">thenExpr</span><span class="p">)</span><span class="w"></span>
</pre></div>
<p>And this is the equivalent applicative version (<tt class="docutils literal"><$></tt> is just an infix
synonym for <tt class="docutils literal">fmap</tt>):</p>
<div class="highlight"><pre><span></span><span class="nf">ifexpr'</span><span class="w"> </span><span class="ow">::</span><span class="w"> </span><span class="kt">Parser</span><span class="w"> </span><span class="kt">Int</span><span class="w"></span>
<span class="nf">ifexpr'</span><span class="w"> </span><span class="ow">=</span><span class="w"></span>
<span class="w"> </span><span class="n">selector</span><span class="w"> </span><span class="o"><$></span><span class="w"> </span><span class="n">symbol</span><span class="w"> </span><span class="s">"if"</span><span class="w"> </span><span class="o"><*></span><span class="w"> </span><span class="n">expr</span><span class="w"></span>
<span class="w"> </span><span class="o"><*></span><span class="w"> </span><span class="n">symbol</span><span class="w"> </span><span class="s">"then"</span><span class="w"> </span><span class="o"><*></span><span class="w"> </span><span class="n">expr</span><span class="w"></span>
<span class="w"> </span><span class="o"><*></span><span class="w"> </span><span class="n">symbol</span><span class="w"> </span><span class="s">"else"</span><span class="w"> </span><span class="o"><*></span><span class="w"> </span><span class="n">expr</span><span class="w"></span>
<span class="w"> </span><span class="kr">where</span><span class="w"> </span><span class="n">selector</span><span class="w"> </span><span class="kr">_</span><span class="w"> </span><span class="n">cond</span><span class="w"> </span><span class="kr">_</span><span class="w"> </span><span class="n">t</span><span class="w"> </span><span class="kr">_</span><span class="w"> </span><span class="n">e</span><span class="w"> </span><span class="ow">=</span><span class="w"> </span><span class="kr">if</span><span class="w"> </span><span class="n">cond</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="mi">0</span><span class="w"> </span><span class="kr">then</span><span class="w"> </span><span class="n">e</span><span class="w"> </span><span class="kr">else</span><span class="w"> </span><span class="n">t</span><span class="w"></span>
</pre></div>
<p>Which one is better? It's really a matter of personal taste. Since both the
monadic and applicative styles deal in <tt class="docutils literal">Parser</tt>s, they can be freely mixed
and combined.</p>
<hr class="docutils" />
<table class="docutils footnote" frame="void" id="footnote-1" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#footnote-reference-1">[1]</a></td><td>Failures could also be signaled by using <tt class="docutils literal">Maybe</tt>, but a list lets us
express multiple results (for example a string that can be parsed in
multiple ways). We're not going to be using multiple results in this
article, but it's good to keep this option open.</td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="footnote-2" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#footnote-reference-2">[2]</a></td><td>We could also use the monadic operator <tt class="docutils literal">>></tt> for statements that
don't create a new assignment, but using <tt class="docutils literal">>>=</tt> everywhere for
consistency makes it a bit easier to understand.</td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="footnote-3" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#footnote-reference-3">[3]</a></td><td>The return value of this parser is <tt class="docutils literal">Int</tt>, because it evaluates the
parsed expression on the fly - this technique is called <em>Syntax Directed
Translation</em> in the Dragon book. Note also that the conditional clauses
are evaluated eagerly, which is valid only when no side effects are
present.</td></tr>
</tbody>
</table>
</div>
Parsing expressions by precedence climbing2012-08-02T05:48:43-07:002023-06-30T23:16:27-07:00Eli Benderskytag:eli.thegreenplace.net,2012-08-02:/2012/08/02/parsing-expressions-by-precedence-climbing
<p>I've written <a class="reference external" href="https://eli.thegreenplace.net/2009/03/14/some-problems-of-recursive-descent-parsers/">previously</a> about the problem recursive descent parsers have with expressions, especially when the language has multiple levels of operator precedence.</p>
<p>There are several ways to attack this problem. The Wikipedia article on <a class="reference external" href="http://en.wikipedia.org/wiki/Operator-precedence_parser">operator-precedence parsers</a> mentions three algorithms: Shunting Yard, top-down operator precedence (TDOP) and precedence climbing. I have already covered <a class="reference external" href="https://eli.thegreenplace.net/2009/03/20/a-recursive-descent-parser-with-an-infix-expression-evaluator/">Shunting Yard</a> and <a class="reference external" href="https://eli.thegreenplace.net/2010/01/02/top-down-operator-precedence-parsing/">TDOP</a> in this blog. Here I aim to present the third method (and the one that actually ends up being used a lot in practice) - precedence climbing.</p>
<div class="section" id="precedence-climbing-what-it-aims-to-achieve">
<h3>Precedence climbing - what it aims to achieve</h3>
<p>It's not necessary to be familiar with the other algorithms for expression parsing in order to understand precedence climbing. In fact, I think that precedence climbing is the simplest of them all. To explain it, I want to first present what the algorithm is trying to achieve. After this, I will explain how it does this, and finally will present a fully functional implementation in Python.</p>
<p>So the basic goal of the algorithm is the following: treat an expression as a bunch of nested sub-expressions, where each sub-expression is characterized by the lowest precedence level of the operators it contains.</p>
<p>Here's a simple example:</p>
<div class="highlight" style="background: #ffffff"><pre style="line-height: 125%">2 + 3 * 4 * 5 - 6
</pre></div>
<p>Assuming that the precedence of <tt class="docutils literal">+</tt> (and <tt class="docutils literal">-</tt>) is 1 and the precedence of <tt class="docutils literal">*</tt> (and <tt class="docutils literal">/</tt>) is 2, we have:</p>
<div class="highlight" style="background: #ffffff"><pre style="line-height: 125%">2 + 3 * 4 * 5 - 6
|---------------| : prec 1
|-------| : prec 2
</pre></div>
<p>The sub-expression multiplying the three numbers has a minimal precedence of 2. The sub-expression spanning the whole original expression has a minimal precedence of 1.</p>
<p>Here's a more complex example, adding a power operator <tt class="docutils literal">^</tt> with precedence 3:</p>
<div class="highlight" style="background: #ffffff"><pre style="line-height: 125%">2 + 3 ^ 2 * 3 + 4
|---------------| : prec 1
|-------| : prec 2
|---| : prec 3
</pre></div>
<div class="section" id="associativity">
<h4>Associativity</h4>
<p>Binary operators, in addition to precedence, also have the concept of <em>associativity</em>. Simply put, <em>left associative</em> operators stick to the left stronger than to the right; <em>right associative</em> operators vice versa.</p>
<p>Some examples. Since addition is left associative, this:</p>
<div class="highlight" style="background: #ffffff"><pre style="line-height: 125%">2 + 3 + 4
</pre></div>
<p>Is equivalent to this:</p>
<div class="highlight" style="background: #ffffff"><pre style="line-height: 125%">(2 + 3) + 4
</pre></div>
<p>On the other hand, power (exponentiation) is right associative. This:</p>
<div class="highlight" style="background: #ffffff"><pre style="line-height: 125%">2 ^ 3 ^ 4
</pre></div>
<p>Is equivalent to this:</p>
<div class="highlight" style="background: #ffffff"><pre style="line-height: 125%">2 ^ (3 ^ 4)
</pre></div>
<p>The precedence climbing algorithm also needs to handle associativity correctly.</p>
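<p>Since Python's own <tt class="docutils literal">**</tt> operator happens to be right associative while <tt class="docutils literal">+</tt> is left associative, the groupings above can be sanity-checked directly in an interpreter:</p>

```python
# Python's + is left associative and ** (exponentiation) is right
# associative, so adding explicit parentheses the "natural" way is a no-op,
# while grouping the other way changes the value.
assert 2 + 3 + 4 == (2 + 3) + 4
assert 2 ** 3 ** 4 == 2 ** (3 ** 4)
assert 2 ** 3 ** 4 != (2 ** 3) ** 4
print("associativity checks pass")
```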
</div>
<div class="section" id="nested-parenthesized-sub-expressions">
<h4>Nested parenthesized sub-expressions</h4>
<p>Finally, we all know that parentheses can be used to explicitly group sub-expressions, beating operator precedence. So the following expression computes the addition <em>before</em> the multiplication:</p>
<div class="highlight" style="background: #ffffff"><pre style="line-height: 125%">2 * (3 + 5) * 7
</pre></div>
<p>As we'll see, the algorithm has a special provision to cleverly handle nested sub-expressions.</p>
</div>
</div>
<div class="section" id="precedence-climbing-how-it-actually-works">
<h3>Precedence climbing - how it actually works</h3>
<p>First let's define some terms. <em>Atoms</em> are either numbers or parenthesized expressions. <em>Expressions</em> consist of atoms connected by binary operators <a class="footnote-reference" href="#id4" id="id1">[1]</a>. Note how these two terms are mutually dependent. This is normal in the land of grammars and parsers.</p>
<p>The algorithm is <em>operator-guided</em>. Its fundamental step is to consume the next atom and look at the operator following it. If the operator has precedence lower than the lowest acceptable for the current step, the algorithm returns. Otherwise, it calls itself in a loop to handle the sub-expression. In pseudo-code, it looks like this <a class="footnote-reference" href="#id5" id="id2">[2]</a>:</p>
<div class="highlight" style="background: #ffffff"><pre style="line-height: 125%">compute_expr(min_prec):
result = compute_atom()
while cur token is a binary operator with precedence >= min_prec:
prec, assoc = precedence and associativity of current token
if assoc is left:
next_min_prec = prec + 1
else:
next_min_prec = prec
rhs = compute_expr(next_min_prec)
result = compute operator(result, rhs)
return result
</pre></div>
<p>Each recursive call here handles a sequence of operator-connected atoms sharing the same minimal precedence.</p>
<div class="section" id="an-example">
<h4>An example</h4>
<p>To get a feel for how the algorithm works, let's start with an example:</p>
<div class="highlight" style="background: #ffffff"><pre style="line-height: 125%">2 + 3 ^ 2 * 3 + 4
</pre></div>
<p>It's recommended to follow the execution of the algorithm through this expression on paper. The computation is kicked off by calling <tt class="docutils literal">compute_expr(1)</tt>, because 1 is the minimal operator precedence among all operators we've defined. Here is the "call tree" the algorithm produces for this expression:</p>
<div class="highlight" style="background: #ffffff"><pre style="line-height: 125%">* compute_expr(1) # Initial call on the whole expression
* compute_atom() --> 2
* compute_expr(2) # Loop entered, operator '+'
* compute_atom() --> 3
* compute_expr(3)
* compute_atom() --> 2
* result --> 2 # Loop not entered for '*' (prec < '^')
* result = 3 ^ 2 --> 9
* compute_expr(3)
* compute_atom() --> 3
* result --> 3 # Loop not entered for '+' (prec < '*')
* result = 9 * 3 --> 27
* result = 2 + 27 --> 29
* compute_expr(2) # Loop entered, operator '+'
* compute_atom() --> 4
* result --> 4 # Loop not entered - end of expression
* result = 29 + 4 --> 33
</pre></div>
</div>
<div class="section" id="handling-precedence">
<h4>Handling precedence</h4>
<p>Note that the algorithm makes one recursive call per binary operator. Some of these calls are short lived - they will only consume an atom and return it because the <tt class="docutils literal">while</tt> loop is not entered (this happens on the second 2, as well as on the second 3 in the example expression above). Some are longer lived. The initial call to <tt class="docutils literal">compute_expr</tt> will compute the whole expression.</p>
<p>The <tt class="docutils literal">while</tt> loop is the essential ingredient here. It's the thing that makes sure that the current <tt class="docutils literal">compute_expr</tt> call handles all consecutive operators with the given minimal precedence before exiting.</p>
</div>
<div class="section" id="handling-associativity">
<h4>Handling associativity</h4>
<p>In my opinion, one of the coolest aspects of this algorithm is the simple and elegant way it handles associativity. It's all in that condition that sets the minimal precedence for the next call either to the current one, or to the current one plus one.</p>
<p>Here's how this works. Assume we have this sub-expression somewhere:</p>
<div class="highlight" style="background: #ffffff"><pre style="line-height: 125%">8 * 9 * 10
^
|
</pre></div>
<p>The arrow marks where the <tt class="docutils literal">compute_expr</tt> call is, having entered the <tt class="docutils literal">while</tt> loop. <tt class="docutils literal">prec</tt> is 2. Since the associativity of <tt class="docutils literal">*</tt> is left, <tt class="docutils literal">next_min_prec</tt> is set to 3. The recursive call to <tt class="docutils literal">compute_expr(3)</tt>, after consuming an atom, sees the next <tt class="docutils literal">*</tt> token:</p>
<div class="highlight" style="background: #ffffff"><pre style="line-height: 125%">8 * 9 * 10
^
|
</pre></div>
<p>Since the precedence of <tt class="docutils literal">*</tt> is 2, while <tt class="docutils literal">min_prec</tt> is 3, the <tt class="docutils literal">while</tt> loop never runs and the call returns. So the original <tt class="docutils literal">compute_expr</tt> will get to handle the second multiplication, not the internal call. Essentially, this means that the expression is grouped as follows:</p>
<div class="highlight" style="background: #ffffff"><pre style="line-height: 125%">(8 * 9) * 10
</pre></div>
<p>Which is exactly what we want from left associativity.</p>
<p>In contrast, for this expression:</p>
<div class="highlight" style="background: #ffffff"><pre style="line-height: 125%">8 ^ 9 ^ 10
</pre></div>
<p>The precedence of <tt class="docutils literal">^</tt> is 3, and since it's right associative, the <tt class="docutils literal">min_prec</tt> for the recursive call stays 3. This will mean that the recursive call <em>will</em> consume the next <tt class="docutils literal">^</tt> operator before returning to the original <tt class="docutils literal">compute_expr</tt>, grouping the expression as follows:</p>
<div class="highlight" style="background: #ffffff"><pre style="line-height: 125%">8 ^ (9 ^ 10)
</pre></div>
</div>
<div class="section" id="handling-sub-expressions">
<h4>Handling sub-expressions</h4>
<p>The algorithm pseudo-code presented above doesn't explain how parenthesized sub-expressions are handled. Consider this expression:</p>
<div class="highlight" style="background: #ffffff"><pre style="line-height: 125%">2000 * (4 - 3) / 100
</pre></div>
<p>It's not clear how the <tt class="docutils literal">while</tt> loop can handle this. The answer is <tt class="docutils literal">compute_atom</tt>. When it sees a left paren, it knows that a sub-expression will follow, so it calls <tt class="docutils literal">compute_expr</tt> on the sub-expression (which lasts until the matching right paren), and returns its result as the result of the atom. So <tt class="docutils literal">compute_expr</tt> is oblivious to the existence of sub-expressions.</p>
<p>Finally, in order to stay short, the pseudo-code leaves some interesting details out. What follows is a full implementation of the algorithm that fills all the gaps.</p>
</div>
</div>
<div class="section" id="a-python-implementation">
<h3>A Python implementation</h3>
<p>Here is a Python implementation of expression parsing by precedence climbing. It's kept short for simplicity, but can be easily expanded to cover a more real-world language of expressions. The following sections present the code in small chunks. The whole code is <a class="reference external" href="https://github.com/eliben/code-for-blog/blob/master/2012/rd_infix_precedence.py">available here</a>.</p>
<p>I'll start with a small tokenizer class that breaks text into tokens and keeps a state. The grammar is very simple: numeric expressions, the basic arithmetic operators <tt class="docutils literal">+, <span class="pre">-,</span> *, /, ^</tt> and parens - <tt class="docutils literal">(, )</tt>.</p>
<div class="highlight" style="background: #ffffff"><pre style="line-height: 125%">Tok = namedtuple(<span style="color: #7f007f">'Tok'</span>, <span style="color: #7f007f">'name value'</span>)
<span style="color: #00007f; font-weight: bold">class</span> <span style="color: #00007f">Tokenizer</span>(<span style="color: #00007f">object</span>):
<span style="color: #7f007f">""" Simple tokenizer object. The cur_token attribute holds the current</span>
<span style="color: #7f007f"> token (Tok). Call get_next_token() to advance to the</span>
<span style="color: #7f007f"> next token. cur_token is None before the first token is</span>
<span style="color: #7f007f"> taken and after the source ends.</span>
<span style="color: #7f007f"> """</span>
    TOKPATTERN = re.compile(<span style="color: #7f007f">r"\s*(?:(\d+)|(.))"</span>)
<span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">__init__</span>(<span style="color: #00007f">self</span>, source):
<span style="color: #00007f">self</span>._tokgen = <span style="color: #00007f">self</span>._gen_tokens(source)
<span style="color: #00007f">self</span>.cur_token = <span style="color: #00007f">None</span>
<span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">get_next_token</span>(<span style="color: #00007f">self</span>):
<span style="color: #7f007f">""" Advance to the next token, and return it.</span>
<span style="color: #7f007f"> """</span>
<span style="color: #00007f; font-weight: bold">try</span>:
<span style="color: #00007f">self</span>.cur_token = <span style="color: #00007f">self</span>._tokgen.next()
<span style="color: #00007f; font-weight: bold">except</span> StopIteration:
<span style="color: #00007f">self</span>.cur_token = <span style="color: #00007f">None</span>
<span style="color: #00007f; font-weight: bold">return</span> <span style="color: #00007f">self</span>.cur_token
<span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">_gen_tokens</span>(<span style="color: #00007f">self</span>, source):
<span style="color: #00007f; font-weight: bold">for</span> number, operator <span style="color: #0000aa">in</span> <span style="color: #00007f">self</span>.TOKPATTERN.findall(source):
<span style="color: #00007f; font-weight: bold">if</span> number:
<span style="color: #00007f; font-weight: bold">yield</span> Tok(<span style="color: #7f007f">'NUMBER'</span>, number)
<span style="color: #00007f; font-weight: bold">elif</span> operator == <span style="color: #7f007f">'('</span>:
<span style="color: #00007f; font-weight: bold">yield</span> Tok(<span style="color: #7f007f">'LEFTPAREN'</span>, <span style="color: #7f007f">'('</span>)
<span style="color: #00007f; font-weight: bold">elif</span> operator == <span style="color: #7f007f">')'</span>:
<span style="color: #00007f; font-weight: bold">yield</span> Tok(<span style="color: #7f007f">'RIGHTPAREN'</span>, <span style="color: #7f007f">')'</span>)
<span style="color: #00007f; font-weight: bold">else</span>:
<span style="color: #00007f; font-weight: bold">yield</span> Tok(<span style="color: #7f007f">'BINOP'</span>, operator)
</pre></div>
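<p>A quick way to get a feel for what the tokenizing regex produces (the sample input here is mine): <tt class="docutils literal">findall</tt> yields one pair per token, with exactly one of the two capture groups non-empty:</p>

```python
import re

# The same pattern the tokenizer uses: skip whitespace, then capture either
# a run of digits (group 1) or any other single character (group 2).
TOKPATTERN = re.compile(r"\s*(?:(\d+)|(.))")

print(TOKPATTERN.findall("2 + (34 * 5)"))
# -> [('2', ''), ('', '+'), ('', '('), ('34', ''), ('', '*'),
#     ('5', ''), ('', ')')]
```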
<p>Next, <tt class="docutils literal">compute_atom</tt>:</p>
<div class="highlight" style="background: #ffffff"><pre style="line-height: 125%"><span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">compute_atom</span>(tokenizer):
tok = tokenizer.cur_token
<span style="color: #00007f; font-weight: bold">if</span> tok.name == <span style="color: #7f007f">'LEFTPAREN'</span>:
tokenizer.get_next_token()
val = compute_expr(tokenizer, <span style="color: #007f7f">1</span>)
<span style="color: #00007f; font-weight: bold">if</span> tokenizer.cur_token.name != <span style="color: #7f007f">'RIGHTPAREN'</span>:
parse_error(<span style="color: #7f007f">'unmatched "("'</span>)
tokenizer.get_next_token()
<span style="color: #00007f; font-weight: bold">return</span> val
<span style="color: #00007f; font-weight: bold">elif</span> tok <span style="color: #0000aa">is</span> <span style="color: #00007f">None</span>:
parse_error(<span style="color: #7f007f">'source ended unexpectedly'</span>)
<span style="color: #00007f; font-weight: bold">elif</span> tok.name == <span style="color: #7f007f">'BINOP'</span>:
parse_error(<span style="color: #7f007f">'expected an atom, not an operator "%s"'</span> % tok.value)
<span style="color: #00007f; font-weight: bold">else</span>:
<span style="color: #00007f; font-weight: bold">assert</span> tok.name == <span style="color: #7f007f">'NUMBER'</span>
tokenizer.get_next_token()
<span style="color: #00007f; font-weight: bold">return</span> <span style="color: #00007f">int</span>(tok.value)
</pre></div>
<p>It handles true atoms (numbers in our case), as well as parenthesized sub-expressions.</p>
<p>Here is <tt class="docutils literal">compute_expr</tt> itself, which is very close to the pseudo-code shown above:</p>
<div class="highlight" style="background: #ffffff"><pre style="line-height: 125%"><span style="color: #007f00"># For each operator, a (precedence, associativity) pair.</span>
OpInfo = namedtuple(<span style="color: #7f007f">'OpInfo'</span>, <span style="color: #7f007f">'prec assoc'</span>)
OPINFO_MAP = {
<span style="color: #7f007f">'+'</span>: OpInfo(<span style="color: #007f7f">1</span>, <span style="color: #7f007f">'LEFT'</span>),
<span style="color: #7f007f">'-'</span>: OpInfo(<span style="color: #007f7f">1</span>, <span style="color: #7f007f">'LEFT'</span>),
<span style="color: #7f007f">'*'</span>: OpInfo(<span style="color: #007f7f">2</span>, <span style="color: #7f007f">'LEFT'</span>),
<span style="color: #7f007f">'/'</span>: OpInfo(<span style="color: #007f7f">2</span>, <span style="color: #7f007f">'LEFT'</span>),
<span style="color: #7f007f">'^'</span>: OpInfo(<span style="color: #007f7f">3</span>, <span style="color: #7f007f">'RIGHT'</span>),
}
<span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">compute_expr</span>(tokenizer, min_prec):
atom_lhs = compute_atom(tokenizer)
<span style="color: #00007f; font-weight: bold">while</span> <span style="color: #00007f">True</span>:
cur = tokenizer.cur_token
<span style="color: #00007f; font-weight: bold">if</span> (cur <span style="color: #0000aa">is</span> <span style="color: #00007f">None</span> <span style="color: #0000aa">or</span> cur.name != <span style="color: #7f007f">'BINOP'</span>
<span style="color: #0000aa">or</span> OPINFO_MAP[cur.value].prec < min_prec):
<span style="color: #00007f; font-weight: bold">break</span>
<span style="color: #007f00"># Inside this loop the current token is a binary operator</span>
<span style="color: #00007f; font-weight: bold">assert</span> cur.name == <span style="color: #7f007f">'BINOP'</span>
<span style="color: #007f00"># Get the operator's precedence and associativity, and compute a</span>
<span style="color: #007f00"># minimal precedence for the recursive call</span>
op = cur.value
prec, assoc = OPINFO_MAP[op]
next_min_prec = prec + <span style="color: #007f7f">1</span> <span style="color: #00007f; font-weight: bold">if</span> assoc == <span style="color: #7f007f">'LEFT'</span> <span style="color: #00007f; font-weight: bold">else</span> prec
<span style="color: #007f00"># Consume the current token and prepare the next one for the</span>
<span style="color: #007f00"># recursive call</span>
tokenizer.get_next_token()
atom_rhs = compute_expr(tokenizer, next_min_prec)
<span style="color: #007f00"># Update lhs with the new value</span>
atom_lhs = compute_op(op, atom_lhs, atom_rhs)
<span style="color: #00007f; font-weight: bold">return</span> atom_lhs
</pre></div>
<p>The only difference is that this code makes token handling more explicit. It basically follows the usual "recursive-descent protocol". Each recursive call has the current token available in <tt class="docutils literal">tokenizer.cur_token</tt>, and makes sure to consume all the tokens it has handled (by calling <tt class="docutils literal">tokenizer.get_next_token()</tt>).</p>
<p>One additional small piece is missing. <tt class="docutils literal">compute_op</tt> simply performs the arithmetic computation for the supported binary operators:</p>
<div class="highlight" style="background: #ffffff"><pre style="line-height: 125%"><span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">compute_op</span>(op, lhs, rhs):
lhs = <span style="color: #00007f">int</span>(lhs); rhs = <span style="color: #00007f">int</span>(rhs)
<span style="color: #00007f; font-weight: bold">if</span> op == <span style="color: #7f007f">'+'</span>: <span style="color: #00007f; font-weight: bold">return</span> lhs + rhs
<span style="color: #00007f; font-weight: bold">elif</span> op == <span style="color: #7f007f">'-'</span>: <span style="color: #00007f; font-weight: bold">return</span> lhs - rhs
<span style="color: #00007f; font-weight: bold">elif</span> op == <span style="color: #7f007f">'*'</span>: <span style="color: #00007f; font-weight: bold">return</span> lhs * rhs
<span style="color: #00007f; font-weight: bold">elif</span> op == <span style="color: #7f007f">'/'</span>: <span style="color: #00007f; font-weight: bold">return</span> lhs / rhs
<span style="color: #00007f; font-weight: bold">elif</span> op == <span style="color: #7f007f">'^'</span>: <span style="color: #00007f; font-weight: bold">return</span> lhs ** rhs
<span style="color: #00007f; font-weight: bold">else</span>:
parse_error(<span style="color: #7f007f">'unknown operator "%s"'</span> % op)
</pre></div>
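<p>To tie the chunks together, here's a condensed, self-contained (Python 3) variant of the same code, with a couple of checks at the end. Note one small deviation from the listing above: it uses <tt class="docutils literal">//</tt> for division so that results stay integral under Python 3:</p>

```python
import re
from collections import namedtuple

Tok = namedtuple('Tok', 'name value')
OpInfo = namedtuple('OpInfo', 'prec assoc')

OPINFO_MAP = {
    '+': OpInfo(1, 'LEFT'), '-': OpInfo(1, 'LEFT'),
    '*': OpInfo(2, 'LEFT'), '/': OpInfo(2, 'LEFT'),
    '^': OpInfo(3, 'RIGHT'),
}

class Tokenizer:
    """Breaks source into Toks; cur_token holds the current token."""
    TOKPATTERN = re.compile(r"\s*(?:(\d+)|(.))")

    def __init__(self, source):
        self._tokgen = self._gen_tokens(source)
        self.cur_token = None
        self.get_next_token()   # prime the first token

    def get_next_token(self):
        self.cur_token = next(self._tokgen, None)
        return self.cur_token

    def _gen_tokens(self, source):
        for number, operator in self.TOKPATTERN.findall(source):
            if number:
                yield Tok('NUMBER', number)
            elif operator == '(':
                yield Tok('LEFTPAREN', '(')
            elif operator == ')':
                yield Tok('RIGHTPAREN', ')')
            else:
                yield Tok('BINOP', operator)

def compute_atom(tok):
    """An atom is a number or a parenthesized sub-expression."""
    cur = tok.cur_token
    if cur is None:
        raise SyntaxError('source ended unexpectedly')
    if cur.name == 'LEFTPAREN':
        tok.get_next_token()
        val = compute_expr(tok, 1)
        if tok.cur_token is None or tok.cur_token.name != 'RIGHTPAREN':
            raise SyntaxError('unmatched "("')
        tok.get_next_token()
        return val
    if cur.name != 'NUMBER':
        raise SyntaxError('expected an atom, not "%s"' % cur.value)
    tok.get_next_token()
    return int(cur.value)

def compute_op(op, lhs, rhs):
    # // instead of / keeps division integral under Python 3.
    fns = {'+': lambda a, b: a + b, '-': lambda a, b: a - b,
           '*': lambda a, b: a * b, '/': lambda a, b: a // b,
           '^': lambda a, b: a ** b}
    return fns[op](lhs, rhs)

def compute_expr(tok, min_prec):
    result = compute_atom(tok)
    while (tok.cur_token is not None and tok.cur_token.name == 'BINOP'
           and OPINFO_MAP[tok.cur_token.value].prec >= min_prec):
        op = tok.cur_token.value
        prec, assoc = OPINFO_MAP[op]
        next_min_prec = prec + 1 if assoc == 'LEFT' else prec
        tok.get_next_token()
        result = compute_op(op, result, compute_expr(tok, next_min_prec))
    return result

def compute(source):
    return compute_expr(Tokenizer(source), 1)

print(compute('2 + 3 ^ 2 * 3 + 4'))      # -> 33
print(compute('2000 * (4 - 3) / 100'))   # -> 20
print(compute('2 ^ 3 ^ 2'))              # -> 512, since ^ is right associative
```

<p>The first check reproduces the call-tree example traced earlier; the last one confirms right associativity of <tt class="docutils literal">^</tt>.</p>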
</div>
<div class="section" id="in-the-real-world-clang">
<h3>In the real world - Clang</h3>
<p>Precedence climbing is being used in real-world tools. One example is <a class="reference external" href="http://clang.llvm.org/">Clang</a>, the C/C++/ObjC front-end. Clang's parser is hand-written recursive descent, and it uses precedence climbing for efficient parsing of expressions. If you're interested in seeing the code, it's <tt class="docutils literal"><span class="pre">Parser::ParseExpression</span></tt> in <tt class="docutils literal">lib/Parse/ParseExpr.cpp</tt> <a class="footnote-reference" href="#id6" id="id3">[3]</a>. This method plays the role of <tt class="docutils literal">compute_expr</tt>. The role of <tt class="docutils literal">compute_atom</tt> is played by <tt class="docutils literal"><span class="pre">Parser::ParseCastExpression</span></tt>.</p>
</div>
<div class="section" id="other-resources">
<h3>Other resources</h3>
<p>Here are some resources I found useful while writing this article:</p>
<ul class="simple">
<li>The Wikipedia page for <a class="reference external" href="http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-precedence parsing</a>.</li>
<li>The <a class="reference external" href="http://antlr.org/papers/Clarke-expr-parsing-1986.pdf">article by Keith Clarke</a> (PDF), one of the early inventors of the technique.</li>
<li><a class="reference external" href="http://www.engr.mun.ca/~theo/Misc/exp_parsing.htm">This page</a> by Theodore Norvell, about parsing expressions by recursive descent.</li>
<li>The Clang source code (exact locations given in the previous section).</li>
</ul>
<p>
<i><b>Update (2016-11-02):</b> Andy Chu <a href="http://www.oilshell.org/blog/2016/11/01.html">notes</a>
that precedence climbing and <a href="https://eli.thegreenplace.net/2010/01/02/top-down-operator-precedence-parsing">TDOP</a>
are pretty much the same algorithm, formulated a bit differently. I tend to agree,
and also note that <a href="https://eli.thegreenplace.net/2009/03/20/a-recursive-descent-parser-with-an-infix-expression-evaluator">Shunting Yard</a>
is again the same algorithm, except that the explicit recursion is replaced by
a stack.</i>
</p>
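<p>To make that correspondence concrete, here is a miniature Shunting Yard evaluator (my own illustrative sketch, not code from any of the linked articles); the explicit operator stack does the job of the recursive <tt class="docutils literal">expression</tt> calls:</p>

```python
# Minimal shunting-yard evaluator (an illustrative sketch): the explicit
# operator stack replaces the recursion of precedence climbing / TDOP.
import operator

OPS = {'+': (10, operator.add), '-': (10, operator.sub),
       '*': (20, operator.mul), '/': (20, operator.truediv)}

def evaluate(tokens):
    operands, operators = [], []

    def apply_top():
        _, fn = OPS[operators.pop()]
        rhs, lhs = operands.pop(), operands.pop()
        operands.append(fn(lhs, rhs))

    for tok in tokens:
        if tok in OPS:
            # Pop while the stack top binds at least as tightly (left-assoc),
            # mirroring expression() returning when lbp stops exceeding rbp.
            while operators and OPS[operators[-1]][0] >= OPS[tok][0]:
                apply_top()
            operators.append(tok)
        else:
            operands.append(int(tok))
    while operators:
        apply_top()
    return operands[0]

print(evaluate('3 + 1 * 2 * 4 + 5'.split()))  # 16
```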
<img class="align-center" src="https://eli.thegreenplace.net/images/hline.jpg" style="width: 320px; height: 5px;" />
<table class="docutils footnote" frame="void" id="id4" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id1">[1]</a></td><td>There are a couple of simplifications made here on purpose. First, I assume only numeric expressions. Identifiers that represent variables can also be viewed as atoms. Second, I ignore unary operators. These are quite easy to incorporate into the algorithm by also treating them as atoms. I leave them out for succinctness.</td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id5" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id2">[2]</a></td><td>In this article I present a parser that computes the result of a numeric expression on-the-fly. Modifying it for accumulating the result into some kind of a parse tree is trivial.</td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id6" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id3">[3]</a></td><td>Clang's source code is constantly in flow. This information is correct at least for the date the article was written.</td></tr>
</tbody>
</table>
</div>
How Clang handles the type / variable name ambiguity of C/C++2012-07-05T19:35:22-07:002023-02-04T15:35:51-08:00Eli Benderskytag:eli.thegreenplace.net,2012-07-05:/2012/07/05/how-clang-handles-the-type-variable-name-ambiguity-of-cc
<p>My previous articles on the context sensitivity and ambiguity of the C/C++ grammar (<a class="reference external" href="https://eli.thegreenplace.net/2007/11/24/the-context-sensitivity-of-cs-grammar/">one</a>, <a class="reference external" href="https://eli.thegreenplace.net/2011/05/02/the-context-sensitivity-of-cs-grammar-revisited/">two</a>, <a class="reference external" href="https://eli.thegreenplace.net/2012/06/28/the-type-variable-name-ambiguity-in-c/">three</a>) can probably make me sound pessimistic about the prospect of correctly parsing C/C++, which couldn't be farther from the truth. My gripe is not with the grammar itself (although I admit it's needlessly complex), it's with the inability of Yacc-generated LALR(1) parsers to parse it without considerable hacks. As I've mentioned numerous times before, industrial-strength compilers for C/C++ exist after all, so they do manage to somehow parse these languages.</p>
<p>One of the newest, and in my eyes the most exciting of C/C++ compilers is <a class="reference external" href="http://clang.llvm.org/">Clang</a>. Originally developed by Apple as a front-end to LLVM, it's been a vibrant open-source project for the past couple of years with participation from many companies and individuals (although Apple remains the main driving force in the community). Clang, similarly to LLVM, features a modular library-based design and a very clean C++ code-base. Clang's parser is hand-written, based on a standard recursive-descent parsing algorithm.</p>
<p>In this post I want to explain how Clang manages to overcome the ambiguities I mentioned in the previous articles.</p>
<div class="section" id="no-lexer-hack">
<h3>No lexer hack</h3>
<p>There is no "lexer hack" in Clang. Information flows in a single direction - from the lexer to the parser, not back. How is this managed?</p>
<p>The thing is that the Clang lexer doesn't distinguish between user-defined types and other identifiers. All are marked with the <tt class="docutils literal">identifier</tt> token.</p>
<p>For this code:</p>
<div class="highlight" style="background: #ffffff"><pre style="line-height: 125%"><span style="color: #00007f; font-weight: bold">typedef</span> <span style="color: #00007f; font-weight: bold">int</span> mytype;
mytype bb;
</pre></div>
<p>The Clang parser encounters the following tokens (<tt class="docutils literal"><span class="pre">-dump-tokens</span></tt>):</p>
<div class="highlight" style="background: #ffffff"><pre style="line-height: 125%">typedef 'typedef' [StartOfLine] Loc=<z.c:1:1>
int 'int' [LeadingSpace] Loc=<z.c:1:9>
identifier 'mytype' [LeadingSpace] Loc=<z.c:1:13>
semi ';' Loc=<z.c:1:19>
identifier 'mytype' [StartOfLine] Loc=<z.c:2:1>
identifier 'bb' [LeadingSpace] Loc=<z.c:2:8>
semi ';' Loc=<z.c:2:10>
eof '' Loc=<z.c:4:1>
</pre></div>
<p>Note how <tt class="docutils literal">mytype</tt> is always reported as an identifier, both before and after Clang figures out it's actually a user-defined type.</p>
</div>
<div class="section" id="figuring-out-what-s-a-type">
<h3>Figuring out what's a type</h3>
<p>So if the Clang lexer always reports <tt class="docutils literal">mytype</tt> as an identifier, how does the parser figure out when it is actually a type? By keeping a symbol table.</p>
<p>Well, actually it's not the parser that keeps the symbol table, it's <tt class="docutils literal">Sema</tt>. <tt class="docutils literal">Sema</tt> is the Clang module responsible for semantic analysis and AST construction. It gets invoked from the parser through a generic "actions" interface, which in theory could serve a different client. Although conceptually the parser and <tt class="docutils literal">Sema</tt> are coupled, the actions interface provides a clean separation in the code. The parser is responsible for driving the parsing process, and <tt class="docutils literal">Sema</tt> is responsible for handling semantic information. In this particular case, the symbol table <em>is</em> semantic information, so it's handled by <tt class="docutils literal">Sema</tt>.</p>
<p>To follow this process through, we'll start in <tt class="docutils literal"><span class="pre">Parser::ParseDeclarationSpecifiers</span></tt> <a class="footnote-reference" href="#id5" id="id1">[1]</a>. In the C/C++ grammar, type names are part of the "specifiers" in a declaration (that also include things like <tt class="docutils literal">extern</tt> or <tt class="docutils literal">inline</tt>), and following the "recursive-descent protocol", Clang will usually feature a parsing method per grammar rule. When this method encounters an identifier (<tt class="docutils literal"><span class="pre">tok::identifier</span></tt>), it asks <tt class="docutils literal">Sema</tt> whether it's actually a type by calling <tt class="docutils literal">Actions.getTypeName</tt> <a class="footnote-reference" href="#id6" id="id2">[2]</a>.</p>
<p><tt class="docutils literal"><span class="pre">Sema::getTypeName</span></tt> calls <tt class="docutils literal"><span class="pre">Sema::LookupName</span></tt> to do the actual name lookup. For C, name lookup rules are relatively simple - you just climb the lexical scope stack the code belongs to, trying to find a scope that defines the name as a type. I've <a class="reference external" href="https://eli.thegreenplace.net/2011/05/02/the-context-sensitivity-of-cs-grammar-revisited/">mentioned before</a> that all names in C (including type names) obey lexical scoping rules. With this mechanism, Clang implements the required nested symbol table. Note that this symbol table is queried by Clang in places where a type is actually expected and allowed, not only in declarations. For example, it's also done to disambiguate function calls from casts in some cases.</p>
<p>How does a type actually get into this table, though?</p>
<p>When the parser is done parsing a <tt class="docutils literal">typedef</tt> (and any declarator, for that matter), it calls <tt class="docutils literal"><span class="pre">Sema::ActOnDeclarator</span></tt>. When the latter notices a new <tt class="docutils literal">typedef</tt> and makes sure everything about it is kosher (e.g. it does not re-define a name in the same scope), it adds the new name to the symbol table at the current scope.</p>
<p>In Clang's code this whole process looks very clean and intuitive, but in a generated LALR(1) parser it would be utterly impossible, because leaving out the special token for type names and merging it with <tt class="docutils literal">identifier</tt> would create a ton of unresolvable reduce-reduce conflicts in the grammar. This is why Yacc-based parsers require a lexer hack to handle this issue.</p>
</div>
<div class="section" id="class-wide-declarations-in-c">
<h3>Class-wide declarations in C++</h3>
<p>In the <a class="reference external" href="https://eli.thegreenplace.net/2012/06/28/the-type-variable-name-ambiguity-in-c/">previous post</a> I mentioned how C++ makes this type lookup problem much more difficult by forcing declarations inside a class to be visible throughout the class, even in code that appears before them. Here's a short reminder:</p>
<div class="highlight" style="background: #ffffff"><pre style="line-height: 125%"><span style="color: #00007f; font-weight: bold">int</span> aa(<span style="color: #00007f; font-weight: bold">int</span> arg) {
<span style="color: #00007f; font-weight: bold">return</span> arg;
}
<span style="color: #00007f; font-weight: bold">class</span> <span style="color: #00007f">C</span> {
<span style="color: #00007f; font-weight: bold">int</span> foo(<span style="color: #00007f; font-weight: bold">int</span> bb) {
<span style="color: #00007f; font-weight: bold">return</span> (aa)(bb);
}
<span style="color: #00007f; font-weight: bold">typedef</span> <span style="color: #00007f; font-weight: bold">int</span> aa;
};
</pre></div>
<p>In this code, even though the <tt class="docutils literal">typedef</tt> appears after <tt class="docutils literal">foo</tt>, the parser must figure out that <tt class="docutils literal"><span class="pre">(aa)(bb)</span></tt> is a cast of <tt class="docutils literal">bb</tt> to type <tt class="docutils literal">aa</tt>, and not the function call <tt class="docutils literal">aa(bb)</tt>.</p>
<p>We've seen how Clang can manage to figure out that <tt class="docutils literal">aa</tt> is a type. However, when it parses <tt class="docutils literal">foo</tt> it hasn't even <em>seen</em> the <tt class="docutils literal">typedef</tt> yet, so how does that work?</p>
</div>
<div class="section" id="delayed-parsing-of-inline-method-bodies">
<h3>Delayed parsing of inline method bodies</h3>
<p>To solve the problem described above, Clang employs a clever technique. When parsing an inline member function declaration/definition, it does full parsing and semantic analysis of the <em>declaration</em>, leaving the <em>definition</em> for later.</p>
<p>Specifically, the body of an inline method definition is <em>lexed</em> and the tokens are kept in a special buffer for later (this is done by <tt class="docutils literal"><span class="pre">Parser::ParseCXXInlineMethodDef</span></tt>). Once the parser has finished parsing the class, it calls <tt class="docutils literal"><span class="pre">Parser::ParseLexedMethodDefs</span></tt> that does the actual parsing and semantic analysis of the saved method bodies. At this point, all the types declared inside the class are available, so the parser can correctly disambiguate wherever required.</p>
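<p>The technique itself is independent of Clang's code base. Here is a small illustrative Python sketch of the idea (all names here are hypothetical): buffer the raw tokens of method bodies while the class is being parsed, and only parse them once the class - and therefore its symbol table - is complete:</p>

```python
# Illustrative sketch of delayed parsing: member-function bodies are saved
# as raw token lists while the class is parsed, and parsed only afterwards,
# when every member declaration (including later typedefs) is known.
def parse_class(member_decls, scope):
    pending_bodies = []
    for decl in member_decls:
        if decl['kind'] == 'typedef':
            scope.add(decl['name'])                     # declaration seen now
        elif decl['kind'] == 'method':
            pending_bodies.append(decl['body_tokens'])  # lexed, not parsed
    # The class is complete: all member types are now in scope.
    return [parse_body(tokens, scope) for tokens in pending_bodies]

def parse_body(tokens, scope):
    # Stand-in for real parsing: classify each identifier now that the
    # full class scope is available.
    return [(tok, 'type' if tok in scope else 'ident') for tok in tokens]

scope = set()
members = [
    {'kind': 'method', 'body_tokens': ['aa', 'bb']},  # body of foo: (aa)(bb)
    {'kind': 'typedef', 'name': 'aa'},                # typedef appears later
]
print(parse_class(members, scope))  # [[('aa', 'type'), ('bb', 'ident')]]
```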
</div>
<div class="section" id="annotation-tokens">
<h3>Annotation tokens</h3>
<p>Although the above is enough to understand how Clang approaches the problem, I want to mention another trick it uses to make parsing more efficient in some cases.</p>
<p>The <tt class="docutils literal"><span class="pre">Sema::getTypeName</span></tt> method mentioned earlier can be costly. It performs a lookup in a set of nested scopes, which may be expensive if the scopes are deeply nested and a name is <em>not</em> actually a type (which is probably most often the case). It's alright (and inevitable!) to do this lookup once, but Clang would like to avoid repeating it for the same token when it <em>backtracks</em> trying to parse a statement in a different way.</p>
<p>A word on what "backtracks" means in this context. <a class="reference external" href="https://eli.thegreenplace.net/2008/09/26/recursive-descent-ll-and-predictive-parsers/">Recursive descent parsers</a> are naturally (by their very structure) backtracking. That is, they may try a number of different ways to parse a single grammatical production (be that a statement, an expression, a declaration, or whatever), before finding an approach that succeeds. In this process, the same token may need to be queried more than once.</p>
<p>To avoid this, Clang has special "annotation tokens" it inserts into the token stream. The mechanism is used for other things as well, but in our case we're interested in the <tt class="docutils literal"><span class="pre">tok::annot_typename</span></tt> token. What happens is that the first time the parser encounters a <tt class="docutils literal"><span class="pre">tok::identifier</span></tt> and figures out it's a type, this token gets replaced by <tt class="docutils literal"><span class="pre">tok::annot_typename</span></tt>. The next time the parser encounters this token, it won't have to lookup whether it's a type once again, because it's no longer a generic <tt class="docutils literal"><span class="pre">tok::identifier</span></tt> <a class="footnote-reference" href="#id7" id="id3">[3]</a>.</p>
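<p>The caching effect of annotation tokens can be sketched in a few lines (an illustrative Python model, not Clang's implementation): replacing the token instance in the stream memoizes the lookup's result, so a backtracking re-parse never pays for it twice:</p>

```python
# Illustrative sketch of annotation tokens: once a lookup decides an
# identifier is a type, that token *instance* in the stream is replaced,
# so backtracking re-parses never repeat the (potentially costly) lookup.
lookup_count = 0

def expensive_is_type(name, type_names):
    global lookup_count
    lookup_count += 1          # stands in for a nested-scope lookup
    return name in type_names

def classify(tokens, i, type_names):
    kind, text = tokens[i]
    if kind == 'identifier':
        if expensive_is_type(text, type_names):
            tokens[i] = ('annot_typename', text)   # annotate in place
            return 'annot_typename'
        return 'identifier'
    return kind                # already annotated: no lookup needed

tokens = [('identifier', 'mytype'), ('identifier', 'bb')]
types = {'mytype'}
classify(tokens, 0, types)   # first query: performs a lookup
classify(tokens, 0, types)   # backtracked re-query: lookup is skipped
print(lookup_count)  # 1
```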
</div>
<div class="section" id="disclaimer-and-conclusion">
<h3>Disclaimer and conclusion</h3>
<p>It's important to keep in mind that the cases examined in this post do not represent the full complexity of the C++ grammar. In C++, constructs like qualified names (<tt class="docutils literal"><span class="pre">foo::bar::baz</span></tt>) and templates complicate matters considerably. However, I just wanted to focus on the cases I specifically discussed in previous posts, explaining how Clang addresses them.</p>
<p>To conclude, we've seen how Clang's recursive descent parser manages some of the ambiguities of the C/C++ grammar. For a task that complex, it's inevitable for the code to become non-trivial <a class="footnote-reference" href="#id8" id="id4">[4]</a>. That said, Clang has actually managed to keep its code-base relatively clean and logically structured, while at the same time sticking to its aggressive performance goals. Someone with a general understanding of how front-ends work shouldn't require more than a few hours of immersion in the Clang code-base to be able to answer questions about "how does it do <em>that</em>".</p>
<img class="align-center" src="https://eli.thegreenplace.net/images/hline.jpg" style="width: 320px; height: 5px;" />
<table class="docutils footnote" frame="void" id="id5" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id1">[1]</a></td><td>As a rule, all <tt class="docutils literal">Parser</tt> code lives in <tt class="docutils literal">lib/Parse</tt> in the Clang source tree. <tt class="docutils literal">Sema</tt> code lives in <tt class="docutils literal">lib/Sema</tt>.</td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id6" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id2">[2]</a></td><td>Here and later I'll skip a lot of details and variations, focusing only on the path I want to use in the example.</td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id7" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id3">[3]</a></td><td>It's very important to note that only <em>this instance</em> of the token in the token stream is replaced. The next instance may have already become a type (or we may have even changed the scope), so it wouldn't be semantically correct to reason about it.</td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id8" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id4">[4]</a></td><td>That Clang parses Objective-C and various extensions like CUDA or OpenCL in the same code-base doesn't help in this respect.</td></tr>
</tbody>
</table>
</div>
Top-Down operator precedence (Pratt) parsing2010-01-02T17:08:12-08:002023-07-08T13:12:59-07:00Eli Benderskytag:eli.thegreenplace.net,2010-01-02:/2010/01/02/top-down-operator-precedence-parsing
<div class="section" id="introduction">
<h3>Introduction</h3>
<p>Recursive-descent parsers have always interested me, and in the past year and a half I wrote a few articles on the topic. Here they are in chronological order:</p>
<ul class="simple">
<li><a class="reference external" href="https://eli.thegreenplace.net/2008/09/26/recursive-descent-ll-and-predictive-parsers/">Recursive descent, LL and predictive parsers</a></li>
<li><a class="reference external" href="https://eli.thegreenplace.net/2009/03/14/some-problems-of-recursive-descent-parsers/">Some problems of recursive descent parsers</a></li>
<li><a class="reference external" href="https://eli.thegreenplace.net/2009/03/20/a-recursive-descent-parser-with-an-infix-expression-evaluator/">A recursive descent parser with an infix expression evaluator …</a></li></ul></div>
<div class="section" id="introduction">
<h3>Introduction</h3>
<p>Recursive-descent parsers have always interested me, and in the past year and a half I wrote a few articles on the topic. Here they are in chronological order:</p>
<ul class="simple">
<li><a class="reference external" href="https://eli.thegreenplace.net/2008/09/26/recursive-descent-ll-and-predictive-parsers/">Recursive descent, LL and predictive parsers</a></li>
<li><a class="reference external" href="https://eli.thegreenplace.net/2009/03/14/some-problems-of-recursive-descent-parsers/">Some problems of recursive descent parsers</a></li>
<li><a class="reference external" href="https://eli.thegreenplace.net/2009/03/20/a-recursive-descent-parser-with-an-infix-expression-evaluator/">A recursive descent parser with an infix expression evaluator</a></li>
</ul>
<p>The third article describes a method that combines RD parsing with a different algorithm for parsing expressions to achieve better results. This method is actually used in the real world, for example in GCC and Parrot (<a class="reference external" href="http://en.wikipedia.org/wiki/Operator-precedence_parser">source</a>).</p>
<p>An alternative parsing algorithm was discovered by <a class="reference external" href="http://en.wikipedia.org/wiki/Vaughan_Pratt">Vaughan Pratt</a> in 1973. Called <em>Top Down Operator Precedence</em>, it shares some features with the modified RD parser, but promises to simplify the code, as well as provide better performance. Recently it was popularized again by Douglas Crockford in <a class="reference external" href="http://javascript.crockford.com/tdop/tdop.html">his article</a>, and employed by him in <a class="reference external" href="http://www.jslint.com/">JSLint</a> to parse Javascript.</p>
<p>I encountered Crockford's article in the <a class="reference external" href="https://eli.thegreenplace.net/2007/09/28/book-review-beautiful-code-edited-by-andy-oram-greg-wilson/">Beautiful Code</a> book, but found it hard to understand. I could follow the code, but had a hard time grasping <em>why</em> the thing works. Recently I became interested in the topic again, tried to read the article once more, and again was stumped. Finally, by reading Pratt's original paper and Fredrik Lundh's excellent <a class="reference external" href="http://effbot.org/zone/simple-top-down-parsing.htm">Python-based piece</a> <a class="footnote-reference" href="#id7" id="id1">[1]</a>, I understood the algorithm.</p>
<p>So this article is my usual attempt to explain the topic to myself, making sure that when I forget how it works in a couple of months, I will have a simple way of remembering.</p>
</div>
<div class="section" id="the-fundamentals">
<h3>The fundamentals</h3>
<p>Top down operator precedence parsing (TDOP from now on) is based on a few fundamental principles:</p>
<ul class="simple">
<li>A "binding power" mechanism to handle precedence levels</li>
<li>A means of implementing different functionality of tokens depending on their position relative to their neighbors - prefix or infix.</li>
<li>As opposed to classic RD, where semantic actions are associated with grammar rules (BNF), TDOP associates them with tokens.</li>
</ul>
<div class="section" id="binding-power">
<h4>Binding power</h4>
<p>Operator precedence and associativity are fundamental issues that any parsing technique must handle. TDOP handles them by assigning a "binding power" to each token it parses.</p>
<p>Consider a substring AEB where A takes a right argument, B a left, and E is an expression. Does E associate with A or with B? We define a numeric <strong>binding power</strong> for each operator. <strong>The operator with the higher binding power "wins" - gets E associated with it</strong>. Let's examine the expression:</p>
<div class="highlight"><pre>1 + 2 * 4
</pre></div>
<p>Here it is once again with A, E, B identified:</p>
<div class="highlight"><pre>1 + 2 * 4
  ^ ^ ^
  A E B
</pre></div>
<p>To express the convention that multiplication has higher precedence than addition, let's define the binding power (<tt class="docutils literal"><span class="pre">bp</span></tt>) of * to be 20 and that of + to be 10 (the numbers are arbitrary, what's important is that <tt class="docutils literal"><span class="pre">bp(*)</span> <span class="pre">></span> <span class="pre">bp(+)</span></tt>). Thus, by the definition we've made above, the 2 will be associated with <tt class="docutils literal"><span class="pre">*</span></tt>, since its binding power is higher than that of <tt class="docutils literal"><span class="pre">+</span></tt>.</p>
</div>
<div class="section" id="prefix-and-infix-operators">
<h4>Prefix and infix operators</h4>
<p>To parse the traditional <a class="reference external" href="http://en.wikipedia.org/wiki/Infix_notation">infix-notation</a> expression languages <a class="footnote-reference" href="#id8" id="id2">[2]</a>, we have to differentiate between the prefix form and infix form of tokens. The best example is the minus operator (<tt class="docutils literal"><span class="pre">-</span></tt>). In its infix form it is subtraction:</p>
<div class="highlight"><pre>a = b - c <span style="color: #007f00"># a is b minus c</span>
</pre></div>
<p>In its prefix form, it is negation:</p>
<div class="highlight"><pre>a = -b <span style="color: #007f00"># b has a's magnitude but an opposite sign</span>
</pre></div>
<p>To accommodate this difference, TDOP allows for different treatment of tokens in prefix and infix contexts. In TDOP terminology the handler of a token as prefix is called <strong>nud</strong> (for "null denotation") and the handler of a token as infix is called <strong>led</strong> (for "left denotation").</p>
</div>
</div>
<div class="section" id="the-tdop-algorithm">
<h3>The TDOP algorithm</h3>
<p>Here's a basic TDOP parser:</p>
<div class="highlight"><pre><span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">expression</span>(rbp=<span style="color: #007f7f">0</span>):
<span style="color: #00007f; font-weight: bold">global</span> token
t = token
token = next()
left = t.nud()
<span style="color: #00007f; font-weight: bold">while</span> rbp < token.lbp:
t = token
token = next()
left = t.led(left)
<span style="color: #00007f; font-weight: bold">return</span> left
<span style="color: #00007f; font-weight: bold">class</span> <span style="color: #00007f">literal_token</span>(<span style="color: #00007f">object</span>):
<span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">__init__</span>(<span style="color: #00007f">self</span>, value):
<span style="color: #00007f">self</span>.value = <span style="color: #00007f">int</span>(value)
<span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">nud</span>(<span style="color: #00007f">self</span>):
<span style="color: #00007f; font-weight: bold">return</span> <span style="color: #00007f">self</span>.value
<span style="color: #00007f; font-weight: bold">class</span> <span style="color: #00007f">operator_add_token</span>(<span style="color: #00007f">object</span>):
lbp = <span style="color: #007f7f">10</span>
<span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">led</span>(<span style="color: #00007f">self</span>, left):
right = expression(<span style="color: #007f7f">10</span>)
<span style="color: #00007f; font-weight: bold">return</span> left + right
<span style="color: #00007f; font-weight: bold">class</span> <span style="color: #00007f">operator_mul_token</span>(<span style="color: #00007f">object</span>):
lbp = <span style="color: #007f7f">20</span>
<span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">led</span>(<span style="color: #00007f">self</span>, left):
<span style="color: #00007f; font-weight: bold">return</span> left * expression(<span style="color: #007f7f">20</span>)
<span style="color: #00007f; font-weight: bold">class</span> <span style="color: #00007f">end_token</span>(<span style="color: #00007f">object</span>):
lbp = <span style="color: #007f7f">0</span>
</pre></div>
<p>We only have to augment it with some support code consisting of a simple tokenizer <a class="footnote-reference" href="#id9" id="id3">[3]</a> and the parser driver:</p>
<div class="highlight"><pre><span style="color: #00007f; font-weight: bold">import</span> <span style="color: #00007f">re</span>
token_pat = re.compile(<span style="color: #7f007f">"\s*(?:(\d+)|(.))"</span>)
<span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">tokenize</span>(program):
<span style="color: #00007f; font-weight: bold">for</span> number, operator <span style="color: #0000aa">in</span> token_pat.findall(program):
<span style="color: #00007f; font-weight: bold">if</span> number:
<span style="color: #00007f; font-weight: bold">yield</span> literal_token(number)
<span style="color: #00007f; font-weight: bold">elif</span> operator == <span style="color: #7f007f">"+"</span>:
<span style="color: #00007f; font-weight: bold">yield</span> operator_add_token()
<span style="color: #00007f; font-weight: bold">elif</span> operator == <span style="color: #7f007f">"*"</span>:
<span style="color: #00007f; font-weight: bold">yield</span> operator_mul_token()
<span style="color: #00007f; font-weight: bold">else</span>:
<span style="color: #00007f; font-weight: bold">raise</span> SyntaxError(<span style="color: #7f007f">'unknown operator: %s'</span>, operator)
<span style="color: #00007f; font-weight: bold">yield</span> end_token()
<span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">parse</span>(program):
<span style="color: #00007f; font-weight: bold">global</span> token, next
next = tokenize(program).next
token = next()
<span style="color: #00007f; font-weight: bold">return</span> expression()
</pre></div>
<p>And we have a complete parser and evaluator for arithmetic expressions involving addition and multiplication.</p>
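<p>The code above targets Python 2 (note <tt class="docutils literal">tokenize(program).next</tt>). If you want to experiment with it today, here is the same parser ported to Python 3 (a straightforward port; only the generator protocol and a couple of idioms change):</p>

```python
# The same TDOP parser ported to Python 3: generators expose __next__
# instead of .next, so we keep the generator itself in a global and call
# the next() builtin on it.
import re

class literal_token:
    lbp = 0
    def __init__(self, value):
        self.value = int(value)
    def nud(self):
        return self.value

class operator_add_token:
    lbp = 10
    def led(self, left):
        return left + expression(10)

class operator_mul_token:
    lbp = 20
    def led(self, left):
        return left * expression(20)

class end_token:
    lbp = 0

def tokenize(program):
    for number, operator in re.findall(r"\s*(?:(\d+)|(.))", program):
        if number:
            yield literal_token(number)
        elif operator == "+":
            yield operator_add_token()
        elif operator == "*":
            yield operator_mul_token()
        else:
            raise SyntaxError('unknown operator: %s' % operator)
    yield end_token()

def expression(rbp=0):
    global token
    t = token
    token = next(token_stream)
    left = t.nud()
    while rbp < token.lbp:
        t = token
        token = next(token_stream)
        left = t.led(left)
    return left

def parse(program):
    global token, token_stream
    token_stream = tokenize(program)
    token = next(token_stream)
    return expression()

print(parse('3 + 1 * 2 * 4 + 5'))  # 16
```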
<p>Now let's figure out how it actually works. Note that the token classes have several attributes (not all classes have all kinds of attributes):</p>
<ul class="simple">
<li><tt class="docutils literal"><span class="pre">lbp</span></tt> - the left binding power of the operator. For an infix operator, it tells us how strongly the operator binds to the argument at its left.</li>
<li><tt class="docutils literal"><span class="pre">nud</span></tt> - this is the prefix handler we talked about. In this simple parser it exists only for the literals (the numbers)</li>
<li><tt class="docutils literal"><span class="pre">led</span></tt> - the infix handler.</li>
</ul>
<p>The key to enlightenment here is to notice how the <tt class="docutils literal"><span class="pre">expression</span></tt> function works, and how the operator handlers call it, passing in a binding power.</p>
<p>When <tt class="docutils literal"><span class="pre">expression</span></tt> is called, it is provided the <tt class="docutils literal"><span class="pre">rbp</span></tt> - right binding power of the operator that called it. It consumes tokens until it meets a token whose left binding power is equal or lower than <tt class="docutils literal"><span class="pre">rbp</span></tt>. Specifically, it means that it collects all tokens that bind together before returning to the operator that called it.</p>
<p>Handlers of operators call <tt class="docutils literal"><span class="pre">expression</span></tt> to process their arguments, providing it with their binding power to make sure it gets just the right tokens from the input.</p>
<p>Let's see, for example, how this parser handles the expression:</p>
<div class="highlight"><pre>3 + 1 * 2 * 4 + 5
</pre></div>
<p>Here's the call trace of the parser's functions when parsing this expression:</p>
<div class="highlight"><pre><<expression with rbp 0>>
<<literal nud = 3>>
<<led of "+">>
<<expression with rbp 10>>
<<literal nud = 1>>
<<led of "*">>
<<expression with rbp 20>>
<<literal nud = 2>>
<<led of "*">>
<<expression with rbp 20>>
<<literal nud = 4>>
<<led of "+">>
<<expression with rbp 10>>
<<literal nud = 5>>
</pre></div>
<p>The following diagram shows the calls made to <tt class="docutils literal"><span class="pre">expression</span></tt> on various recursion levels:</p>
<img src="https://eli.thegreenplace.net/images/2010/01/tdop_expr1.png" />
<p>The arrows show the tokens on which each execution of <tt class="docutils literal"><span class="pre">expression</span></tt> works, and the number above them is the <tt class="docutils literal"><span class="pre">rbp</span></tt> given to <tt class="docutils literal"><span class="pre">expression</span></tt> for this call.</p>
<p>Apart from the initial call (with <tt class="docutils literal"><span class="pre">rbp=0</span></tt>) which spans the whole input, <tt class="docutils literal"><span class="pre">expression</span></tt> is called after each operator (by its <tt class="docutils literal"><span class="pre">led</span></tt> handler) to collect the right-side argument. As the diagram clearly shows, the binding power mechanism makes sure <tt class="docutils literal"><span class="pre">expression</span></tt> doesn't go "too far" - only as far as the precedence of the invoking operator allows. The best place to see it is the long arrow after the first <tt class="docutils literal"><span class="pre">+</span></tt>, that collects all the multiplication terms (they must be grouped together due to the higher precedence of <tt class="docutils literal"><span class="pre">*</span></tt>) and returns before the second <tt class="docutils literal"><span class="pre">+</span></tt> is encountered (when the precedence ceases being higher than its <tt class="docutils literal"><span class="pre">rbp</span></tt>).</p>
<p>Another way to look at it: at any point in the execution of the parser, there's an instance of <tt class="docutils literal"><span class="pre">expression</span></tt> for each precedence level that is active at that moment. This instance awaits the results of the higher-precedence instance and keeps going, until it has to stop itself and return its result to its caller.</p>
<p>If you understand this example, you understand TDOP parsing. All the rest is really just more of the same.</p>
<div class="section" id="enhancing-the-parser">
<h4>Enhancing the parser</h4>
<p>The parser presented so far is very rudimentary, so let's enhance it to be more realistic. First of all, what about unary operators?</p>
<p>As I've mentioned in the section on prefix and infix operators, TDOP makes an explicit distinction between the two, encoding it in the difference between the <tt class="docutils literal"><span class="pre">nud</span></tt> and <tt class="docutils literal"><span class="pre">led</span></tt> methods. Adding the subtraction operator handler <a class="footnote-reference" href="#id10" id="id4">[4]</a>:</p>
<div class="highlight"><pre><span style="color: #00007f; font-weight: bold">class</span> <span style="color: #00007f">operator_sub_token</span>(<span style="color: #00007f">object</span>):
    lbp = <span style="color: #007f7f">10</span>
    <span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">nud</span>(<span style="color: #00007f">self</span>):
        <span style="color: #00007f; font-weight: bold">return</span> -expression(<span style="color: #007f7f">100</span>)
    <span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">led</span>(<span style="color: #00007f">self</span>, left):
        <span style="color: #00007f; font-weight: bold">return</span> left - expression(<span style="color: #007f7f">10</span>)
</pre></div>
<p><tt class="docutils literal"><span class="pre">nud</span></tt> handles the unary (prefix) form of minus. It has no <tt class="docutils literal"><span class="pre">left</span></tt> argument (since it's a prefix), and it negates its right argument. The binding power passed into <tt class="docutils literal"><span class="pre">expression</span></tt> is high, since unary minus has a high precedence (higher than multiplication). <tt class="docutils literal"><span class="pre">led</span></tt> handles the infix case similarly to the handlers of <tt class="docutils literal"><span class="pre">+</span></tt> and <tt class="docutils literal"><span class="pre">*</span></tt>.</p>
<p>Now we can handle expressions like:</p>
<div class="highlight"><pre>3 - 2 + 4 * -5
</pre></div>
<p>And get a correct result (-19).</p>
<p>How about right-associative operators? Let's implement exponentiation (using the caret sign <tt class="docutils literal"><span class="pre">^</span></tt>). To make the operation right-associative, we want the parser to treat subsequent exponentiation operators as sub-expressions of the first one. We can do that by calling <tt class="docutils literal"><span class="pre">expression</span></tt> in the handler of exponentiation with a <tt class="docutils literal"><span class="pre">rbp</span></tt> lower than the <tt class="docutils literal"><span class="pre">lbp</span></tt> of exponentiation:</p>
<div class="highlight"><pre><span style="color: #00007f; font-weight: bold">class</span> <span style="color: #00007f">operator_pow_token</span>(<span style="color: #00007f">object</span>):
    lbp = <span style="color: #007f7f">30</span>
    <span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">led</span>(<span style="color: #00007f">self</span>, left):
        <span style="color: #00007f; font-weight: bold">return</span> left ** expression(<span style="color: #007f7f">30</span> - <span style="color: #007f7f">1</span>)
</pre></div>
<p>When <tt class="docutils literal"><span class="pre">expression</span></tt> gets to the next <tt class="docutils literal"><span class="pre">^</span></tt> in its loop, it finds that <tt class="docutils literal"><span class="pre">rbp</span> <span class="pre"><</span> <span class="pre">token.lbp</span></tt> still holds, so instead of returning the result right away, it first collects the value of the sub-expression.</p>
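<p>To see the <tt class="docutils literal"><span class="pre">lbp</span> <span class="pre">-</span> <span class="pre">1</span></tt> trick in isolation, here's a small sketch (my illustration, with hypothetical names) that evaluates a chain of numbers implicitly joined by <tt class="docutils literal"><span class="pre">^</span></tt> and can toggle between right- and left-associativity:</p>

```python
def parse_pow(nums, right_assoc=True):
    """Evaluate nums[0] ^ nums[1] ^ ... where '^' has lbp = 30."""
    pos = [0]

    def expression(rbp):
        left = nums[pos[0]]          # nud of a literal: its value
        pos[0] += 1
        # Each implicit '^' has lbp = 30; keep looping while rbp < lbp,
        # exactly like the article's expression().
        while pos[0] < len(nums) and rbp < 30:
            # Right-associative: call with lbp - 1, so a following '^'
            # still satisfies rbp < lbp and joins the right sub-expression.
            rbp_right = 30 - 1 if right_assoc else 30
            left = left ** expression(rbp_right)
        return left

    return expression(0)
```

<p>For <tt class="docutils literal"><span class="pre">2</span> <span class="pre">^</span> <span class="pre">3</span> <span class="pre">^</span> <span class="pre">2</span></tt> the right-associative reading gives 2 ** (3 ** 2) = 512, while the left-associative one gives (2 ** 3) ** 2 = 64.</p>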
<p>And how about grouping with parentheses? Since each token can execute actions in TDOP, this can be easily handled by adding actions to the <tt class="docutils literal"><span class="pre">(</span></tt> token.</p>
<div class="highlight"><pre><span style="color: #00007f; font-weight: bold">class</span> <span style="color: #00007f">operator_lparen_token</span>(<span style="color: #00007f">object</span>):
    lbp = <span style="color: #007f7f">0</span>
    <span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">nud</span>(<span style="color: #00007f">self</span>):
        expr = expression()
        match(operator_rparen_token)
        <span style="color: #00007f; font-weight: bold">return</span> expr

<span style="color: #00007f; font-weight: bold">class</span> <span style="color: #00007f">operator_rparen_token</span>(<span style="color: #00007f">object</span>):
    lbp = <span style="color: #007f7f">0</span>
</pre></div>
<p>Where <tt class="docutils literal"><span class="pre">match</span></tt> is the usual RD primitive:</p>
<div class="highlight"><pre><span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">match</span>(tok=<span style="color: #00007f">None</span>):
    <span style="color: #00007f; font-weight: bold">global</span> token
    <span style="color: #00007f; font-weight: bold">if</span> tok <span style="color: #0000aa">and</span> tok != <span style="color: #00007f">type</span>(token):
        <span style="color: #00007f; font-weight: bold">raise</span> SyntaxError(<span style="color: #7f007f">'Expected %s'</span> % tok)
    token = next()
</pre></div>
<p>Note that <tt class="docutils literal"><span class="pre">(</span></tt> has <tt class="docutils literal"><span class="pre">lbp=0</span></tt>, meaning that it doesn't bind to any token on its left. It is treated as a prefix, and its <tt class="docutils literal"><span class="pre">nud</span></tt> collects an expression after the <tt class="docutils literal"><span class="pre">(</span></tt>, right until <tt class="docutils literal"><span class="pre">)</span></tt> is encountered (which stops the expression parser since it also has <tt class="docutils literal"><span class="pre">lbp=0</span></tt>). Then it consumes the <tt class="docutils literal"><span class="pre">)</span></tt> itself and returns the result of the expression <a class="footnote-reference" href="#id11" id="id5">[5]</a>.</p>
<p>Here's the code for the complete parser, handling addition, subtraction, multiplication, division, exponentiation and grouping by parentheses:</p>
<div class="highlight"><pre><span style="color: #00007f; font-weight: bold">import</span> <span style="color: #00007f">re</span>

token_pat = re.compile(<span style="color: #7f007f">r"\s*(?:(\d+)|(.))"</span>)

<span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">tokenize</span>(program):
    <span style="color: #00007f; font-weight: bold">for</span> number, operator <span style="color: #0000aa">in</span> token_pat.findall(program):
        <span style="color: #00007f; font-weight: bold">if</span> number:
            <span style="color: #00007f; font-weight: bold">yield</span> literal_token(number)
        <span style="color: #00007f; font-weight: bold">elif</span> operator == <span style="color: #7f007f">"+"</span>:
            <span style="color: #00007f; font-weight: bold">yield</span> operator_add_token()
        <span style="color: #00007f; font-weight: bold">elif</span> operator == <span style="color: #7f007f">"-"</span>:
            <span style="color: #00007f; font-weight: bold">yield</span> operator_sub_token()
        <span style="color: #00007f; font-weight: bold">elif</span> operator == <span style="color: #7f007f">"*"</span>:
            <span style="color: #00007f; font-weight: bold">yield</span> operator_mul_token()
        <span style="color: #00007f; font-weight: bold">elif</span> operator == <span style="color: #7f007f">"/"</span>:
            <span style="color: #00007f; font-weight: bold">yield</span> operator_div_token()
        <span style="color: #00007f; font-weight: bold">elif</span> operator == <span style="color: #7f007f">"^"</span>:
            <span style="color: #00007f; font-weight: bold">yield</span> operator_pow_token()
        <span style="color: #00007f; font-weight: bold">elif</span> operator == <span style="color: #7f007f">'('</span>:
            <span style="color: #00007f; font-weight: bold">yield</span> operator_lparen_token()
        <span style="color: #00007f; font-weight: bold">elif</span> operator == <span style="color: #7f007f">')'</span>:
            <span style="color: #00007f; font-weight: bold">yield</span> operator_rparen_token()
        <span style="color: #00007f; font-weight: bold">else</span>:
            <span style="color: #00007f; font-weight: bold">raise</span> SyntaxError(<span style="color: #7f007f">'unknown operator: %s'</span> % operator)
    <span style="color: #00007f; font-weight: bold">yield</span> end_token()

<span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">match</span>(tok=<span style="color: #00007f">None</span>):
    <span style="color: #00007f; font-weight: bold">global</span> token
    <span style="color: #00007f; font-weight: bold">if</span> tok <span style="color: #0000aa">and</span> tok != <span style="color: #00007f">type</span>(token):
        <span style="color: #00007f; font-weight: bold">raise</span> SyntaxError(<span style="color: #7f007f">'Expected %s'</span> % tok)
    token = next()

<span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">parse</span>(program):
    <span style="color: #00007f; font-weight: bold">global</span> token, next
    next = tokenize(program).next
    token = next()
    <span style="color: #00007f; font-weight: bold">return</span> expression()

<span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">expression</span>(rbp=<span style="color: #007f7f">0</span>):
    <span style="color: #00007f; font-weight: bold">global</span> token
    t = token
    token = next()
    left = t.nud()
    <span style="color: #00007f; font-weight: bold">while</span> rbp < token.lbp:
        t = token
        token = next()
        left = t.led(left)
    <span style="color: #00007f; font-weight: bold">return</span> left

<span style="color: #00007f; font-weight: bold">class</span> <span style="color: #00007f">literal_token</span>(<span style="color: #00007f">object</span>):
    <span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">__init__</span>(<span style="color: #00007f">self</span>, value):
        <span style="color: #00007f">self</span>.value = <span style="color: #00007f">int</span>(value)
    <span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">nud</span>(<span style="color: #00007f">self</span>):
        <span style="color: #00007f; font-weight: bold">return</span> <span style="color: #00007f">self</span>.value

<span style="color: #00007f; font-weight: bold">class</span> <span style="color: #00007f">operator_add_token</span>(<span style="color: #00007f">object</span>):
    lbp = <span style="color: #007f7f">10</span>
    <span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">nud</span>(<span style="color: #00007f">self</span>):
        <span style="color: #00007f; font-weight: bold">return</span> expression(<span style="color: #007f7f">100</span>)
    <span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">led</span>(<span style="color: #00007f">self</span>, left):
        right = expression(<span style="color: #007f7f">10</span>)
        <span style="color: #00007f; font-weight: bold">return</span> left + right

<span style="color: #00007f; font-weight: bold">class</span> <span style="color: #00007f">operator_sub_token</span>(<span style="color: #00007f">object</span>):
    lbp = <span style="color: #007f7f">10</span>
    <span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">nud</span>(<span style="color: #00007f">self</span>):
        <span style="color: #00007f; font-weight: bold">return</span> -expression(<span style="color: #007f7f">100</span>)
    <span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">led</span>(<span style="color: #00007f">self</span>, left):
        <span style="color: #00007f; font-weight: bold">return</span> left - expression(<span style="color: #007f7f">10</span>)

<span style="color: #00007f; font-weight: bold">class</span> <span style="color: #00007f">operator_mul_token</span>(<span style="color: #00007f">object</span>):
    lbp = <span style="color: #007f7f">20</span>
    <span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">led</span>(<span style="color: #00007f">self</span>, left):
        <span style="color: #00007f; font-weight: bold">return</span> left * expression(<span style="color: #007f7f">20</span>)

<span style="color: #00007f; font-weight: bold">class</span> <span style="color: #00007f">operator_div_token</span>(<span style="color: #00007f">object</span>):
    lbp = <span style="color: #007f7f">20</span>
    <span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">led</span>(<span style="color: #00007f">self</span>, left):
        <span style="color: #00007f; font-weight: bold">return</span> left / expression(<span style="color: #007f7f">20</span>)

<span style="color: #00007f; font-weight: bold">class</span> <span style="color: #00007f">operator_pow_token</span>(<span style="color: #00007f">object</span>):
    lbp = <span style="color: #007f7f">30</span>
    <span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">led</span>(<span style="color: #00007f">self</span>, left):
        <span style="color: #00007f; font-weight: bold">return</span> left ** expression(<span style="color: #007f7f">30</span> - <span style="color: #007f7f">1</span>)

<span style="color: #00007f; font-weight: bold">class</span> <span style="color: #00007f">operator_lparen_token</span>(<span style="color: #00007f">object</span>):
    lbp = <span style="color: #007f7f">0</span>
    <span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">nud</span>(<span style="color: #00007f">self</span>):
        expr = expression()
        match(operator_rparen_token)
        <span style="color: #00007f; font-weight: bold">return</span> expr

<span style="color: #00007f; font-weight: bold">class</span> <span style="color: #00007f">operator_rparen_token</span>(<span style="color: #00007f">object</span>):
    lbp = <span style="color: #007f7f">0</span>

<span style="color: #00007f; font-weight: bold">class</span> <span style="color: #00007f">end_token</span>(<span style="color: #00007f">object</span>):
    lbp = <span style="color: #007f7f">0</span>
</pre></div>
<p>Sample usage:</p>
<div class="highlight"><pre>>>> parse(<span style="color: #7f007f">'3 * (2 + -4) ^ 4'</span>)
<span style="color: #007f7f">48</span>
</pre></div>
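<p>The listing above is Python 2 (note <tt class="docutils literal"><span class="pre">tokenize(program).next</span></tt>, which became <tt class="docutils literal"><span class="pre">__next__</span></tt> / the builtin <tt class="docutils literal"><span class="pre">next()</span></tt> in Python 3). As a sketch of mine, not the article's code, here's the same design re-cast compactly in Python 3, with the class-per-token dispatch folded into tables:</p>

```python
import re

# Left binding powers; '(' , ')' and 'end' never bind as infix operators.
LBP = {'+': 10, '-': 10, '*': 20, '/': 20, '^': 30, '(': 0, ')': 0, 'end': 0}

def tokenize(program):
    for number, op in re.findall(r'\s*(?:(\d+)|(.))', program):
        yield ('num', int(number)) if number else ('op', op)
    yield ('end', None)

def parse(program):
    tokens = tokenize(program)
    cur = next(tokens)               # builtin next() replaces gen.next

    def advance():
        nonlocal cur
        tok, cur = cur, next(tokens)
        return tok

    def nud(tok):
        kind, val = tok
        if kind == 'num':
            return val
        if val == '-':               # prefix minus: high binding power
            return -expression(100)
        if val == '(':               # grouping: parse until ')'
            e = expression(0)
            if advance() != ('op', ')'):
                raise SyntaxError('expected )')
            return e
        raise SyntaxError('unexpected token %r' % (tok,))

    def led(op, left):
        if op == '^':                # rbp = lbp - 1 => right-associative
            return left ** expression(30 - 1)
        right = expression(LBP[op])
        if op == '+':
            return left + right
        if op == '-':
            return left - right
        if op == '*':
            return left * right
        return left / right          # note: true division in Python 3

    def expression(rbp):
        left = nud(advance())
        while cur[0] == 'op' and rbp < LBP[cur[1]]:
            left = led(advance()[1], left)
        return left

    return expression(0)
```

<p>With it, <tt class="docutils literal"><span class="pre">parse('3 * (2 + -4) ^ 4')</span></tt> returns 48, matching the Python 2 version above. One semantic caveat: <tt class="docutils literal"><span class="pre">/</span></tt> is true division in Python 3, where the Python 2 original performed floor division on integers.</p>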
</div>
</div>
<div class="section" id="closing-words">
<h3>Closing words</h3>
<p>When people consider parsing methods to implement, the debate usually goes between hand-coded RD parsers, auto-generated <tt class="docutils literal"><span class="pre">LL(k)</span></tt> parsers, or auto-generated <tt class="docutils literal"><span class="pre">LR</span></tt> parsers. TDOP is another alternative <a class="footnote-reference" href="#id12" id="id6">[6]</a>. It's an original and unusual parsing method that can handle complex grammars (not limited to expressions), is relatively easy to code, and is quite fast.</p>
<p>What makes TDOP fast is that it doesn't need deep recursive descents to parse expressions - only a couple of calls per token are required, no matter what the grammar looks like. If you trace the token actions in the example parser I presented in this article, you'll notice that, on average, <tt class="docutils literal"><span class="pre">expression</span></tt> and one <tt class="docutils literal"><span class="pre">nud</span></tt> or <tt class="docutils literal"><span class="pre">led</span></tt> method are called per token, and that's about it. Fredrik Lundh compares the performance of TDOP with several other methods in his article, and gets very favorable results.</p>
<div align="center" class="align-center"><img class="align-center" src="https://eli.thegreenplace.net/images/hline.jpg" style="width: 320px; height: 5px;" /></div>
<table class="docutils footnote" frame="void" id="id7" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id1">[1]</a></td><td>Which is also the source for most of the code in this article - so the copyright is Fredrik Lundh's</td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id8" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id2">[2]</a></td><td>Like C, Java, Python. An example of a language that doesn't have infix notation is Lisp, which has prefix notation for expressions.</td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id9" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id3">[3]</a></td><td>This tokenizer just recognizes numbers and single-character operators.</td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id10" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id4">[4]</a></td><td>Note that to allow our parser to actually recognize <tt class="docutils literal"><span class="pre">-</span></tt>, an appropriate dispatcher should be added to the <tt class="docutils literal"><span class="pre">tokenize</span></tt> function - this is left as an exercise for the reader.</td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id11" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id5">[5]</a></td><td>Quiz: is it useful having a <tt class="docutils literal"><span class="pre">led</span></tt> handler for a left paren as well? Hint: how would you implement function calls?</td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id12" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id6">[6]</a></td><td>By the way, I have no idea where to place it on the LL/LR scale. Any ideas?</td></tr>
</tbody>
</table>
</div>
A recursive descent parser with an infix expression evaluator2009-03-20T18:01:09-07:002023-06-30T23:16:27-07:00Eli Benderskytag:eli.thegreenplace.net,2009-03-20:/2009/03/20/a-recursive-descent-parser-with-an-infix-expression-evaluator
<p><a class="reference external" href="https://eli.thegreenplace.net/2009/03/14/some-problems-of-recursive-descent-parsers/">Last week</a> I wrote about some of the inherent problems of recursive-descent parsers. An elegant solution to the operator associativity problem was shown, but another problem remained - and that is of the unwieldy handling of expressions, mainly performance-wise.</p>
<p>Here I want to present one alternative to the pure-RD approach, and that is intermixing RD with another parsing method.</p>
<div class="section" id="the-code">
<h3>The code</h3>
<p>I'll begin by pointing to <a class="reference external" href="https://github.com/eliben/code-for-blog/tree/master/2009/py_rd_parser_example">the code for this article</a>. It contains several Python files and a <tt class="docutils literal"><span class="pre">readme.txt</span></tt> explaining what is what. Throughout the article I'll present short snippets from the code, but you're encouraged to run it yourself. The code is self-contained and only requires Python (version 2.5) to run.</p>
</div>
<div class="section" id="extending-the-grammar">
<h3>Extending the grammar</h3>
<p>To illuminate some of the points I'm presenting better, I've greatly extended the EBNF grammar we'll be parsing. Here's the new grammar (taken from the top of <tt class="docutils literal"><span class="pre">rd_parser_ebnf.py</span></tt> in the linked code):</p>
<div class="highlight"><pre># EBNF:
#
# <stmt>        : <assign_stmt>
#               | <if_stmt>
#               | <cmp_expr>
#
# <assign_stmt> : set <id> = <cmp_expr>
#
# Note: 'else' binds to the innermost 'if', like in C
#
# <if_stmt>     : if <cmp_expr> then <stmt> [else <stmt>]
#
# <cmp_expr>    : <bitor_expr> [== <bitor_expr>]
#               | <bitor_expr> [!= <bitor_expr>]
#               | <bitor_expr> [> <bitor_expr>]
#               | <bitor_expr> [< <bitor_expr>]
#               | <bitor_expr> [>= <bitor_expr>]
#               | <bitor_expr> [<= <bitor_expr>]
#
# <bitor_expr>  : <bitxor_expr> {| <bitxor_expr>}
#
# <bitxor_expr> : <bitand_expr> {^ <bitand_expr>}
#
# <bitand_expr> : <shift_expr> {& <shift_expr>}
#
# <shift_expr>  : <arith_expr> {<< <arith_expr>}
#               | <arith_expr> {>> <arith_expr>}
#
# <arith_expr>  : <term> {+ <term>}
#               | <term> {- <term>}
#
# <term>        : <power> {* <power>}
#               | <power> {/ <power>}
#
# <power>       : <factor> ** <power>
#               | <factor>
#
# <factor>      : <id>
#               | <number>
#               | - <factor>
#               | ( <cmp_expr> )
#
# <id>          : [a-zA-Z_]\w*
# <number>      : \d+
</pre></div>
<p>As you can see, this simple calculator is starting to approach a real programming language, as it supports a plethora of mathematical and logical expressions, as well as conditional statements (<tt class="docutils literal"><span class="pre">if</span> <span class="pre">...</span> <span class="pre">then</span> <span class="pre">...</span> <span class="pre">else</span></tt>) and assignments. I've added a simplistic "prompt" so you can experiment with the calculator from the command line:</p>
<div class="highlight"><pre>D:\zzz\rd_parser_calc>rd_parser_ebnf.py -p
Welcome to the calculator. Press Ctrl+C to exit.
--> set x = 2 + 2 * 3
8
--> set y = (x - 1) * (x - 2)
42
--> if y > x then set y = x else set x = y
8
--> x
8
--> y
8
--> x ** ((y - 10) * -3)
262144
--> ... Thanks for using the calculator.
</pre></div>
<p>Note that since a separate expression "level" is required for each precedence, the resulting code is somewhat repetitive. I'll get back to this point later on.</p>
</div>
<div class="section" id="evaluating-infix-expressions">
<h3>Evaluating infix expressions</h3>
<p>An alternative method of evaluating expressions is required, then. Luckily, such a need arose early enough (in the 1950s and '60s, when the first compilers and interpreters were constructed) and some luminaries examined this problem in detail. In particular, <a class="reference external" href="http://en.wikipedia.org/wiki/Edsger_Dijkstra">Edsger W. Dijkstra</a> proposed an efficient and intuitive algorithm for converting from <a class="reference external" href="http://en.wikipedia.org/wiki/Infix_notation">infix notation</a> to <a class="reference external" href="http://en.wikipedia.org/wiki/Reverse_Polish_notation">RPN</a>, called the <a class="reference external" href="http://en.wikipedia.org/wiki/Shunting_yard_algorithm">Shunting Yard algorithm</a>.</p>
<p>I will not describe the algorithm here, as it's been done several times already. If the Wikipedia article is not enough, <a class="reference external" href="http://www.engr.mun.ca/~theo/Misc/exp_parsing.htm">here's another good source</a> (which I've actually used as the basis for my implementation).</p>
<p>The algorithm employs two stacks to resolve the precedence dilemmas of infix notation. One stack stores operators of relatively low precedence that await the results of higher-precedence computations. The other keeps the results accumulated so far. The result can be an RPN expression, an AST, or simply the computed value (a number) of the expression.</p>
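<p>As a concrete illustration of the two-stack scheme (a sketch of mine, not the code from the article's hybrid parser), here's a minimal Shunting Yard evaluator for already-tokenized, parenthesis-free input:</p>

```python
import operator

PREC = {'+': 1, '-': 1, '*': 2, '/': 2}
FUNC = {'+': operator.add, '-': operator.sub,
        '*': operator.mul, '/': operator.truediv}

def shunting_yard_eval(tokens):
    op_stack, res_stack = [], []

    def pop_op():
        # Apply the operator on top of op_stack to the top two results.
        right, left = res_stack.pop(), res_stack.pop()
        res_stack.append(FUNC[op_stack.pop()](left, right))

    for tok in tokens:
        if isinstance(tok, int):
            res_stack.append(tok)
        else:
            # Left-associative: pop while the stack top binds at least
            # as tightly as the incoming operator.
            while op_stack and PREC[op_stack[-1]] >= PREC[tok]:
                pop_op()
            op_stack.append(tok)
    while op_stack:
        pop_op()
    return res_stack[-1]
```

<p>Evaluating the token list for <tt class="docutils literal"><span class="pre">3 + 1 * 2 * 4 + 5</span></tt> this way gives 16; the higher-precedence multiplications get popped and computed before the pending additions.</p>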
<p>In my code, the file <tt class="docutils literal"><span class="pre">rd_parser_infix_exper.py</span></tt> implements a hybrid parser, using Shunting Yard to evaluate expressions and a top-level RD parser for statements and combining everything together. It's instructive to examine the implementation and see how things fit together.</p>
<p>The grammar this parser accepts is exactly the same as the pure RD EBNF parser presented earlier. The statements (<tt class="docutils literal"><span class="pre">assign_stmt</span></tt>, <tt class="docutils literal"><span class="pre">if_stmt</span></tt>, and <tt class="docutils literal"><span class="pre">stmt</span></tt>) are evaluated by traditional RD, but getting deeper into expressions is done with an "infix evaluator", the gateway to which is the <tt class="docutils literal"><span class="pre">_infix_eval</span></tt> method <a class="footnote-reference" href="#id4" id="id1">[1]</a>:</p>
<div class="highlight"><pre><span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">_infix_eval</span>(<span style="color: #00007f">self</span>):
    <span style="color: #7f007f">""" Run the infix evaluator and return the result.</span>
<span style="color: #7f007f">    """</span>
    <span style="color: #00007f">self</span>.op_stack = []
    <span style="color: #00007f">self</span>.res_stack = []
    <span style="color: #00007f">self</span>.op_stack.append(<span style="color: #00007f">self</span>._sentinel)
    <span style="color: #00007f">self</span>._infix_eval_expr()
    <span style="color: #00007f; font-weight: bold">return</span> <span style="color: #00007f">self</span>.res_stack[-<span style="color: #007f7f">1</span>]
</pre></div>
<p>This method prepares the Shunting Yard stacks and begins evaluating the expression, terminating with returning its results.</p>
<p>Note that the connection to the RD parser is seamless. When <tt class="docutils literal"><span class="pre">_infix_eval</span></tt> is called, it assumes that the current token is the beginning of an expression (just like any RD rule), and consumes as many tokens as required to parse the full expression before returning the result.</p>
<p>The rest of the implementation (the <tt class="docutils literal"><span class="pre">_infix_eval_expr</span></tt>, <tt class="docutils literal"><span class="pre">_infix_eval_atom</span></tt>, <tt class="docutils literal"><span class="pre">_push_op</span></tt> and <tt class="docutils literal"><span class="pre">_pop_op</span></tt> methods) is pretty much a word by word translation of the algorithm described in <a class="reference external" href="http://www.engr.mun.ca/~theo/Misc/exp_parsing.htm">this article</a> into Python.</p>
</div>
<div class="section" id="adding-expressions">
<h3>Adding expressions</h3>
<p>Here's a big advantage of this hybrid parser: adding new expressions and/or changing precedence levels is much simpler and requires far less code. In the pure RD parser, the operators and their precedences are determined by the structure of recursive calls between methods. Adding a new operator requires a new method, as well as modifying some of the other methods <a class="footnote-reference" href="#id5" id="id2">[2]</a>. Changing the precedence of some operator is also troublesome and requires moving around lots of code.</p>
<p>Not so in the infix expression parser. Once the Shunting Yard machinery is in place, all we have to do to add new operators or modify existing ones is update the <tt class="docutils literal"><span class="pre">_ops</span></tt> table:</p>
<div class="highlight"><pre>_ops = {
    <span style="color: #7f007f">'u-'</span>: Op(<span style="color: #7f007f">'unary -'</span>, operator.neg, <span style="color: #007f7f">90</span>, unary=<span style="color: #00007f">True</span>),
    <span style="color: #7f007f">'**'</span>: Op(<span style="color: #7f007f">'**'</span>, operator.pow, <span style="color: #007f7f">70</span>, right_assoc=<span style="color: #00007f">True</span>),
    <span style="color: #7f007f">'*'</span>: Op(<span style="color: #7f007f">'*'</span>, operator.mul, <span style="color: #007f7f">50</span>),
    <span style="color: #7f007f">'/'</span>: Op(<span style="color: #7f007f">'/'</span>, operator.div, <span style="color: #007f7f">50</span>),
    <span style="color: #7f007f">'+'</span>: Op(<span style="color: #7f007f">'+'</span>, operator.add, <span style="color: #007f7f">40</span>),
    <span style="color: #7f007f">'-'</span>: Op(<span style="color: #7f007f">'-'</span>, operator.sub, <span style="color: #007f7f">40</span>),
    <span style="color: #7f007f">'<<'</span>: Op(<span style="color: #7f007f">'<<'</span>, operator.lshift, <span style="color: #007f7f">35</span>),
    <span style="color: #7f007f">'>>'</span>: Op(<span style="color: #7f007f">'>>'</span>, operator.rshift, <span style="color: #007f7f">35</span>),
    <span style="color: #7f007f">'&'</span>: Op(<span style="color: #7f007f">'&'</span>, operator.and_, <span style="color: #007f7f">30</span>),
    <span style="color: #7f007f">'^'</span>: Op(<span style="color: #7f007f">'^'</span>, operator.xor, <span style="color: #007f7f">29</span>),
    <span style="color: #7f007f">'|'</span>: Op(<span style="color: #7f007f">'|'</span>, operator.or_, <span style="color: #007f7f">28</span>),
    <span style="color: #7f007f">'>'</span>: Op(<span style="color: #7f007f">'>'</span>, operator.gt, <span style="color: #007f7f">20</span>),
    <span style="color: #7f007f">'>='</span>: Op(<span style="color: #7f007f">'>='</span>, operator.ge, <span style="color: #007f7f">20</span>),
    <span style="color: #7f007f">'<'</span>: Op(<span style="color: #7f007f">'<'</span>, operator.lt, <span style="color: #007f7f">20</span>),
    <span style="color: #7f007f">'<='</span>: Op(<span style="color: #7f007f">'<='</span>, operator.le, <span style="color: #007f7f">20</span>),
    <span style="color: #7f007f">'=='</span>: Op(<span style="color: #7f007f">'=='</span>, operator.eq, <span style="color: #007f7f">15</span>),
    <span style="color: #7f007f">'!='</span>: Op(<span style="color: #7f007f">'!='</span>, operator.ne, <span style="color: #007f7f">15</span>),
}
</pre></div>
<p>I also find this table much more descriptive, in terms of how the operators relate to one another, than the nine parallel methods required to implement them in the pure RD version (<tt class="docutils literal"><span class="pre">rd_parser_ebnf.py</span></tt>).</p>
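<p>The <tt class="docutils literal"><span class="pre">Op</span></tt> record itself isn't shown in the excerpt above. A plausible definition (an assumption of mine - the article's actual <tt class="docutils literal"><span class="pre">Op</span></tt> may differ) is a named tuple whose two flags default to <tt class="docutils literal"><span class="pre">False</span></tt>, matching how most entries in the table call it:</p>

```python
from collections import namedtuple

# name: display name; function: the callable applied by pop_op;
# prec: precedence; unary and right_assoc are flags consulted by the
# Shunting Yard machinery when deciding whether to pop.
Op = namedtuple('Op', ['name', 'function', 'prec', 'unary', 'right_assoc'],
                defaults=(False, False))
```

<p>With this definition, <tt class="docutils literal"><span class="pre">Op('+', operator.add, 40)</span></tt> builds a binary, left-associative entry, and keyword arguments override the flags where needed (note that <tt class="docutils literal"><span class="pre">operator.div</span></tt> in the table is Python 2; Python 3 has <tt class="docutils literal"><span class="pre">operator.truediv</span></tt>).</p>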
</div>
<div class="section" id="performance">
<h3>Performance</h3>
<p>Now here is the funny thing. My initial motivation for examining the infix expression hybrid was the allegedly poor performance of the RD parser for parsing expressions (as described in the <a class="reference external" href="https://eli.thegreenplace.net/2009/03/14/some-problems-of-recursive-descent-parsers/">previous article</a>). But the performance hasn't improved! In fact, the new hybrid parser is a bit slower than the pure RD parser!</p>
<p>And the annoying thing is that it's entirely unclear to me how to optimize it, since profiling shows that the runtime divides rather evenly between the various methods of the algorithm. Yes, the pure RD parser requires the full precedence-chain of methods called for each single terminal, but the infix version has more method calls in total.</p>
<p>If anything, this has been a lesson in optimization, as profiling initially showed that the vast majority of the time is spent in the lexer <a class="footnote-reference" href="#id6" id="id3">[3]</a>. So I've managed to optimize my lexer (by precompiling all its regexes into a single large one using alternation), which greatly reduced the runtime.</p>
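<p>The lexer optimization is roughly the following (a sketch with a hypothetical subset of token rules, not the article's actual lexer): instead of trying each token regex in turn, all the patterns are combined into one precompiled regex with named alternation groups, so each token costs a single match call:</p>

```python
import re

# Hypothetical subset of token rules; the real lexer has more.
TOKEN_RULES = [
    ('NUMBER', r'\d+'),
    ('ID',     r'[a-zA-Z_]\w*'),
    ('OP',     r'\*\*|[+\-*/^]'),
    ('SKIP',   r'\s+'),
]

# Precompile everything into one large alternation of named groups.
MASTER = re.compile('|'.join('(?P<%s>%s)' % (name, pattern)
                             for name, pattern in TOKEN_RULES))

def tokenize(text):
    for m in MASTER.finditer(text):
        if m.lastgroup != 'SKIP':
            yield (m.lastgroup, m.group())

print(list(tokenize('5 - 1 - 2')))
# [('NUMBER', '5'), ('OP', '-'), ('NUMBER', '1'), ('OP', '-'), ('NUMBER', '2')]
```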
</div>
<div class="section" id="conclusion">
<h3>Conclusion</h3>
<p>This article has presented an alternative to the pure recursive-descent parser. The hybrid parser developed here combines RD with infix expression evaluation using the Shunting Yard algorithm.</p>
<p>We've seen that the new code is more manageable for operator-rich grammars. If even more operators are to be added (such as the full set of operators supported by C), they are much simpler to implement, and the operator table remains a single place summarizing the operators, their associativities and their precedences, making the parser more readable.</p>
<p>However, this has not made the parser any faster. The pure-RD implementation is lean enough to be efficient even when the grammar consists of many precedence levels. This is an important lesson in optimization - it's difficult to assess the relative runtimes of complex chunks of code in advance, without actually trying them out and profiling them.</p>
<div align="center" class="align-center"><img class="align-center" src="https://eli.thegreenplace.net/images/hline.jpg" style="width: 320px; height: 5px;" /></div>
<table class="docutils footnote" frame="void" id="id4" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id1">[1]</a></td><td>It would be a swell idea to read the description of the algorithm and have an intuitive understanding of it from this point and on in the article.</td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id5" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id2">[2]</a></td><td>Suppose we had no multiplication and division and had to add the <tt class="docutils literal"><span class="pre">term</span></tt> rule. In addition to writing the code for the new rule, we must modify the <tt class="docutils literal"><span class="pre">arith_expr</span></tt> rule to now call <tt class="docutils literal"><span class="pre">term</span></tt> instead of <tt class="docutils literal"><span class="pre">power</span></tt>.</td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id6" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id3">[3]</a></td><td>Which makes lots of sense, as it's well known that lexing/tokenization is usually the most time consuming stage of parsing. This is because the lexer has to examine every single character of the input, while the parser above it works on the level of whole tokens.</td></tr>
</tbody>
</table>
</div>
Some problems of recursive descent parsers2009-03-14T11:24:39-07:002023-06-30T23:16:27-07:00Eli Benderskytag:eli.thegreenplace.net,2009-03-14:/2009/03/14/some-problems-of-recursive-descent-parsers
<div class="section" id="reminder-recursive-descent-rd-parsers">
<h3>Reminder - recursive descent (RD) parsers</h3>
<p><a class="reference external" href="https://eli.thegreenplace.net/2008/09/26/recursive-descent-ll-and-predictive-parsers/">Here's an article</a> I wrote on the subject a few months ago. It provides a good introduction on how RD parsers are constructed and what grammars they can parse.</p>
<p>Here I want to focus on a couple of problems with the RD parser developed in …</p></div>
<div class="section" id="reminder-recursive-descent-rd-parsers">
<h3>Reminder - recursive descent (RD) parsers</h3>
<p><a class="reference external" href="https://eli.thegreenplace.net/2008/09/26/recursive-descent-ll-and-predictive-parsers/">Here's an article</a> I wrote on the subject a few months ago. It provides a good introduction on how RD parsers are constructed and what grammars they can parse.</p>
<p>Here I want to focus on a couple of problems with the RD parser developed in that article, and propose solutions.</p>
</div>
<div class="section" id="problem-1-operator-associativity">
<h3>Problem #1: operator associativity</h3>
<p>If you recall from the <a class="reference external" href="https://eli.thegreenplace.net/2008/09/26/recursive-descent-ll-and-predictive-parsers/">previous article</a>, the <tt class="docutils literal"><span class="pre">expr</span></tt> rule of the parser looks like this (BNF notation):</p>
<div class="highlight"><pre><expr> : <term> + <expr>
| <term> - <expr>
| <term>
</pre></div>
<p>It's built this way (<tt class="docutils literal"><span class="pre">expr</span></tt> on the right-hand side of the production, <tt class="docutils literal"><span class="pre">term</span></tt> on the left-hand side), to avoid <em>left-recursion</em> in the grammar, which can crash an RD parser by sending it wheeling into an infinite loop.</p>
<p>But as I hinted in the footnotes (and some readers caught on in the comments), this injects an associativity problem into the grammar. Let's see why.</p>
<p>Wikipedia is much better than me at explaining what <a class="reference external" href="http://en.wikipedia.org/wiki/Operator_associativity">operator associativity</a> is, so I'll assume you've read and understood it.</p>
<p>In short, however, left associativity of the minus operator means that <tt class="docutils literal"><span class="pre">5</span> <span class="pre">-</span> <span class="pre">1</span> <span class="pre">-</span> <span class="pre">2</span> <span class="pre">=</span> <span class="pre">(5</span> <span class="pre">-</span> <span class="pre">1)</span> <span class="pre">-</span> <span class="pre">2</span></tt> and not <tt class="docutils literal"><span class="pre">5</span> <span class="pre">-</span> <span class="pre">(1</span> <span class="pre">-</span> <span class="pre">2)</span></tt> (which returns a different result).</p>
<p>But if you run <tt class="docutils literal"><span class="pre">5</span> <span class="pre">-</span> <span class="pre">1</span> <span class="pre">-</span> <span class="pre">2</span></tt> in the parser with the above BNF for <tt class="docutils literal"><span class="pre">expr</span></tt>, you'll get 6 instead of 2. So what went wrong?</p>
<p>The problem is in the grammar definition (BNF) itself. The way the <tt class="docutils literal"><span class="pre">expr</span></tt> rule is defined makes it inherently right-associative instead of left-associative. The hierarchy of the rules implicitly defines their associativity, because it defines what will be grouped together. To understand it better, perhaps the code implementing the <tt class="docutils literal"><span class="pre">expr</span></tt> rule will help:</p>
<div class="highlight"><pre><span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">_expr</span>(<span style="color: #00007f">self</span>):
lval = <span style="color: #00007f">self</span>._term()
<span style="color: #00007f; font-weight: bold">if</span> <span style="color: #00007f">self</span>.cur_token.type == <span style="color: #7f007f">'+'</span>:
<span style="color: #00007f">self</span>._match(<span style="color: #7f007f">'+'</span>)
op = <span style="color: #00007f; font-weight: bold">lambda</span> a, b: a + b
<span style="color: #00007f; font-weight: bold">elif</span> <span style="color: #00007f">self</span>.cur_token.type == <span style="color: #7f007f">'-'</span>:
<span style="color: #00007f">self</span>._match(<span style="color: #7f007f">'-'</span>)
op = <span style="color: #00007f; font-weight: bold">lambda</span> a, b: a - b
<span style="color: #00007f; font-weight: bold">else</span>:
<span style="color: #00007f; font-weight: bold">print</span> <span style="color: #7f007f">'returning lval = %s'</span> % lval
<span style="color: #00007f; font-weight: bold">return</span> lval
rval = <span style="color: #00007f">self</span>._expr()
<span style="color: #00007f; font-weight: bold">print</span> <span style="color: #7f007f">'lval = %s, rval = %s, res = %s'</span> % (
lval, rval, op(lval, rval))
<span style="color: #00007f; font-weight: bold">return</span> op(lval, rval)
</pre></div>
<p>Note that the first <tt class="docutils literal"><span class="pre">term</span></tt> is parsed, and then the rule recursively calls itself for the next one. So the expression is being built from right to left, and this causes its right-associativity.</p>
<p>As you can see, I've added a couple of printouts to better show what's going on. When run on the expression <tt class="docutils literal"><span class="pre">5</span> <span class="pre">-</span> <span class="pre">1</span> <span class="pre">-</span> <span class="pre">2</span></tt>, this prints:</p>
<div class="highlight"><pre>returning lval = 2
lval = 1, rval = 2, res = -1
lval = 5, rval = -1, res = 6
</pre></div>
<p>We clearly see the problem here. The actual returns are done from right to left because of the recursion.</p>
<p>Note that this grammar evaluates addition, multiplication, subtraction and division in a right-associative way. This causes problems for subtraction and division, but not for addition and multiplication, because those operations compute the same result whether evaluated right-to-left or left-to-right <a class="footnote-reference" href="#id3" id="id1">[1]</a>.</p>
</div>
<div class="section" id="a-solution-for-the-associativity-problem">
<h3>A solution for the associativity problem</h3>
<p>I suppose the problem can be solved by rewriting the BNF rules in some sophisticated way that makes them both left-associative and not left-recursive <a class="footnote-reference" href="#id4" id="id2">[2]</a>, but I'll pick another way.</p>
<p>BNF is somewhat limiting, since it doesn't really allow many options when defining rules. All the rules must have a very strict structure, and if you want to customize something you must resort to defining sub-rules and referencing them recursively.</p>
<p>Enter <a class="reference external" href="http://en.wikipedia.org/wiki/Ebnf">EBNF</a>. It was developed to fix some of the deficiencies of plain BNF. One of those is the addition of repetition of sub-rules. For instance, we can write the <tt class="docutils literal"><span class="pre">expr</span></tt> rule in EBNF as follows:</p>
<div class="highlight"><pre><expr> : <term> {+ <term>}
| <term> {- <term>}
</pre></div>
<p>Note the braces <tt class="docutils literal"><span class="pre">{</span> <span class="pre">...</span> <span class="pre">}</span></tt>. In EBNF, these mean "repeated 0 or more times". This is still an <tt class="docutils literal"><span class="pre">LL(1)</span></tt> grammar, but now it's expressed a bit more comfortably. Such a representation is very suitable for coding, because the repetition can be expressed naturally with a loop.</p>
<p>Here's a re-implementation of the <tt class="docutils literal"><span class="pre">expr</span></tt> rule using this idiom:</p>
<div class="highlight"><pre><span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">_expr</span>(<span style="color: #00007f">self</span>):
lval = <span style="color: #00007f">self</span>._term()
<span style="color: #00007f; font-weight: bold">while</span> ( <span style="color: #00007f">self</span>.cur_token.type == <span style="color: #7f007f">'+'</span> <span style="color: #0000aa">or</span>
<span style="color: #00007f">self</span>.cur_token.type == <span style="color: #7f007f">'-'</span>):
<span style="color: #00007f; font-weight: bold">if</span> <span style="color: #00007f">self</span>.cur_token.type == <span style="color: #7f007f">'+'</span>:
<span style="color: #00007f">self</span>._match(<span style="color: #7f007f">'+'</span>)
lval += <span style="color: #00007f">self</span>._term()
<span style="color: #00007f; font-weight: bold">elif</span> <span style="color: #00007f">self</span>.cur_token.type == <span style="color: #7f007f">'-'</span>:
<span style="color: #00007f">self</span>._match(<span style="color: #7f007f">'-'</span>)
lval -= <span style="color: #00007f">self</span>._term()
<span style="color: #00007f; font-weight: bold">return</span> lval
</pre></div>
<p>Note the <tt class="docutils literal"><span class="pre">while</span></tt> loop "eating up" all successive terms in the expression and accumulating the result in the expected left-to-right manner. Now the computation <tt class="docutils literal"><span class="pre">5</span> <span class="pre">-</span> <span class="pre">1</span> <span class="pre">-</span> <span class="pre">2</span></tt> will correctly produce <tt class="docutils literal"><span class="pre">2</span></tt>.</p>
</div>
<div class="section" id="the-code">
<h3>The code</h3>
<p>This is a good place to refer to the code. <a class="reference external" href="https://github.com/eliben/code-for-blog/tree/master/2009/py_rd_parser_example">Here</a> you will find the source of both the old (BNF-based) parser and the new (EBNF-based) one, along with the <tt class="docutils literal"><span class="pre">lexer</span></tt> module that implements the tokenizer. Each of the parsers is self-contained and can be used separately. Note that they were developed and tested with Python 2.5.</p>
</div>
<div class="section" id="right-associative-operators">
<h3>Right-associative operators</h3>
<p>Some operators are inherently right-associative. Exponentiation, for example. <tt class="docutils literal"><span class="pre">2^3^2</span> <span class="pre">=</span> <span class="pre">2^(3^2)</span> <span class="pre">=</span> <span class="pre">512</span></tt>, and not <tt class="docutils literal"><span class="pre">(2^3)^2</span></tt> (which equals 64).</p>
<p>We can leave these operators defined as before, using a recursive rule that naturally results in right-associativity. Here's the code of the <tt class="docutils literal"><span class="pre">power</span></tt> rule that was added to the EBNF-based parser to support exponentiation:</p>
<div class="highlight"><pre><span style="color: #007f00"># <power> : <factor> ** <power></span>
<span style="color: #007f00"># | <factor></span>
<span style="color: #007f00">#</span>
<span style="color: #00007f; font-weight: bold">def</span> <span style="color: #00007f">_power</span>(<span style="color: #00007f">self</span>):
lval = <span style="color: #00007f">self</span>._factor()
<span style="color: #00007f; font-weight: bold">if</span> <span style="color: #00007f">self</span>.cur_token.type == <span style="color: #7f007f">'**'</span>:
<span style="color: #00007f">self</span>._match(<span style="color: #7f007f">'**'</span>)
lval **= <span style="color: #00007f">self</span>._power()
<span style="color: #00007f; font-weight: bold">return</span> lval
</pre></div>
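<p>The recursion folds the operands from the right. A hypothetical standalone equivalent over a list of operands (not the article's code) makes this easy to check:</p>

```python
def power(operands):
    """Right-fold exponentiation: [2, 3, 2] -> 2 ** (3 ** 2)."""
    lval = operands[0]
    if len(operands) > 1:
        lval **= power(operands[1:])   # recurse on the rest first
    return lval

print(power([2, 3, 2]))  # 512, not (2 ** 3) ** 2 == 64
```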
</div>
<div class="section" id="intermission">
<h3>Intermission</h3>
<p>We now have a correct recursive descent parser that uses EBNF-based rules to parse expressions with the desired associativity for each operator. This parser can be readily employed to parse simple languages - it is ready for production use. The next "problem" I present only has to do with the parser's efficiency, so it is probably of no concern unless performance is crucial.</p>
</div>
<div class="section" id="problem-2-efficiency">
<h3>Problem #2: efficiency</h3>
<p>There's an inherent performance problem with recursive-descent parsers when dealing with expressions. This problem stems from the need to define operator precedence, and in RD parsers the only way to define this precedence is by using recursive sub-rules. For example (from the EBNF-based code):</p>
<div class="highlight"><pre><expr> : <term> {+ <term>}
| <term> {- <term>}
<term> : <power> {* <power>}
| <power> {/ <power>}
</pre></div>
<p>The nesting of these rules defines the relative precedence of addition and multiplication. It tells the parser: between plus signs, dive into the expression and collect all sub-terms connected by multiply signs. In other words, it tells it to group the expression: <tt class="docutils literal"><span class="pre">5</span> <span class="pre">+</span> <span class="pre">2</span> <span class="pre">*</span> <span class="pre">2</span></tt> as <tt class="docutils literal"><span class="pre">5</span> <span class="pre">+</span> <span class="pre">(2</span> <span class="pre">*</span> <span class="pre">2)</span></tt> and not as <tt class="docutils literal"><span class="pre">(5</span> <span class="pre">+</span> <span class="pre">2)</span> <span class="pre">*</span> <span class="pre">2</span></tt>.</p>
<p>To see the problem this nesting causes, I've inserted simple printouts into each of the <tt class="docutils literal"><span class="pre">expr</span></tt>, <tt class="docutils literal"><span class="pre">term</span></tt>, <tt class="docutils literal"><span class="pre">power</span></tt> and <tt class="docutils literal"><span class="pre">factor</span></tt> rules to show which functions get called while parsing. Let's see what happens when the trivial expression <tt class="docutils literal"><span class="pre">42</span></tt> is parsed:</p>
<div class="highlight"><pre>expr called with NUMBER(42) at 0
term called with NUMBER(42) at 0
power called with NUMBER(42) at 0
factor called with NUMBER(42) at 0
</pre></div>
<p><strong>Yikes!!!</strong> 4 function calls just to parse the single-token input 42! Unfortunately, while this problem may look simple on the surface, it is not. There's simply no other way to express precedence in RD parsers - you have to use nested rules, and this nesting turns out to be inefficient for parsing expressions.</p>
<p>The solution to this problem is to use a hybrid parser instead of a pure RD one. Some algorithms were developed to efficiently parse <a class="reference external" href="http://en.wikipedia.org/wiki/Infix_notation">infix expressions</a>. <a class="reference external" href="http://www.engr.mun.ca/~theo/Misc/exp_parsing.htm">This article</a> provides a good survey. One such algorithm can be combined with RD to provide a general-purpose parser for both expressions and higher programming language constructs.</p>
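<p>To give the flavor of such algorithms, here is a sketch of one of them - precedence climbing - operating on a pre-tokenized flat list (hypothetical code, not the hybrid parser discussed here). A single recursive function parameterized by a minimal precedence replaces the whole chain of per-level rules:</p>

```python
# Precedence table: higher binds tighter.
PREC = {'+': 1, '-': 1, '*': 2, '/': 2}

def parse_expr(tokens, pos=0, min_prec=1):
    """Evaluate tokens like [5, '+', 2, '*', 2]; returns (value, next_pos)."""
    lval, pos = tokens[pos], pos + 1
    while pos < len(tokens) and PREC.get(tokens[pos], 0) >= min_prec:
        op, pos = tokens[pos], pos + 1
        # Left-associative: parse the right-hand side one level tighter.
        rval, pos = parse_expr(tokens, pos, PREC[op] + 1)
        if op == '+':
            lval += rval
        elif op == '-':
            lval -= rval
        elif op == '*':
            lval *= rval
        elif op == '/':
            lval //= rval
    return lval, pos

print(parse_expr([5, '+', 2, '*', 2])[0])  # 9
```

<p>Note that parsing the single-token input <tt class="docutils literal"><span class="pre">42</span></tt> now costs a single call rather than one per precedence level.</p>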
<p>In a future article I will discuss an implementation of such a parser.</p>
<div align="center" class="align-center"><img class="align-center" src="https://eli.thegreenplace.net/images/hline.jpg" style="width: 320px; height: 5px;" /></div>
<table class="docutils footnote" frame="void" id="id3" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id1">[1]</a></td><td>To be more precise, addition and multiplication are <a class="reference external" href="http://en.wikipedia.org/wiki/Associativity">associative binary operators</a> in the mathematical sense.</td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id4" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id2">[2]</a></td><td>But I'm too lazy to look for such a way at the moment. Let me know if you find it.</td></tr>
</tbody>
</table>
</div>
Recursive descent, LL and predictive parsers2008-09-26T12:29:10-07:002023-06-30T23:16:27-07:00Eli Benderskytag:eli.thegreenplace.net,2008-09-26:/2008/09/26/recursive-descent-ll-and-predictive-parsers
<div class="section" id="introduction">
<h3>Introduction</h3>
<p>Although I've written some recursive-descent (RD) parsers by hand, the theory behind them eluded me for some time. I had a good understanding of the theory behind bottom-up <tt class="docutils literal"><span class="pre">LR</span></tt> parsers, and have used tools (like <a class="reference external" href="http://dinosaur.compilertools.net/">Yacc</a> and <a class="reference external" href="http://www.dabeaz.com/ply/">PLY</a>) to generate <tt class="docutils literal"><span class="pre">LALR</span></tt> parsers for languages, but I didn't really dig …</p></div>
<div class="section" id="introduction">
<h3>Introduction</h3>
<p>Although I've written some recursive-descent (RD) parsers by hand, the theory behind them eluded me for some time. I had a good understanding of the theory behind bottom-up <tt class="docutils literal"><span class="pre">LR</span></tt> parsers, and have used tools (like <a class="reference external" href="http://dinosaur.compilertools.net/">Yacc</a> and <a class="reference external" href="http://www.dabeaz.com/ply/">PLY</a>) to generate <tt class="docutils literal"><span class="pre">LALR</span></tt> parsers for languages, but I didn't really dig into the books about <tt class="docutils literal"><span class="pre">LL</span></tt>.</p>
<p>This week I've finally decided to understand what's going on. I tried to write a simple RD parser in Python (previously I've written RD parsers in C++ and Lisp), and ran into a problem which got me thinking hard about <tt class="docutils literal"><span class="pre">LL</span></tt> parsers. So, I've opened the <a class="reference external" href="http://en.wikipedia.org/wiki/Compilers:_Principles,_Techniques,_and_Tools">Dragon Book</a>, and now I know much more about <tt class="docutils literal"><span class="pre">LL(1)</span></tt>, <tt class="docutils literal"><span class="pre">LL(k)</span></tt>, predictive, recursive-descent parsers with and without backtracking, and what's between them.</p>
<p>This article is a summary of my findings, written for myself to read in a few months when I forget it :-)</p>
</div>
<div class="section" id="recursive-descent-parsers">
<h3>Recursive descent parsers</h3>
<p>From <a class="reference external" href="http://en.wikipedia.org/wiki/Recursive_descent">Wikipedia</a>:</p>
<blockquote>
A recursive descent parser is a top-down parser built from a set of mutually-recursive procedures (or a non-recursive equivalent) where each such procedure usually implements one of the production rules of the grammar. Thus the structure of the resulting program closely mirrors that of the grammar it recognizes.</blockquote>
<p>RD parsers are the most general form of top-down parsing, and the most popular type of parsers to write by hand. However, being so general, they have several problems, like requiring backtracking (which is difficult to code correctly and efficiently).</p>
<p>Usually, it is enough to use less general and powerful parsers for all practical needs, like parsing programming languages (and domain specific languages). This is where <tt class="docutils literal"><span class="pre">LL</span></tt> parsers come in.</p>
</div>
<div class="section" id="ll-parsers">
<h3>LL parsers</h3>
<blockquote>
An LL parser is a top-down parser for a subset of the context-free grammars. It parses the input from Left to right, and constructs a Leftmost derivation of the sentence (hence LL, compared with LR parser). The class of grammars which are parsable in this way is known as the LL grammars.</blockquote>
<p><tt class="docutils literal"><span class="pre">LL</span></tt> parsers are further classified by the amount of lookup they need. <tt class="docutils literal"><span class="pre">LL(1)</span></tt> parsers require 1 character of lookup, <tt class="docutils literal"><span class="pre">LL(k)</span></tt> require <tt class="docutils literal"><span class="pre">k</span></tt>, and so on. Usually, <tt class="docutils literal"><span class="pre">LL(1)</span></tt> is enough for most practical needs.</p>
<p><tt class="docutils literal"><span class="pre">LL</span></tt> parsers are also called <em>predictive</em>, because it's possible predict the exact path to take by a certain amount of lookup symbols, without backtracking.</p>
</div>
<div class="section" id="the-example">
<h3>The example</h3>
<p>This week I tried to construct a RD parser for this simple calculator grammar:</p>
<div class="highlight"><pre><expr> := <term> + <expr>
| <term> - <expr>
| <term>
<term> := <factor> * <term>
        | <factor> / <term>
        | <factor>
<factor> := <number>
| <id>
| ( <expr> )
<number> := \d+
<id> := [a-zA-Z_]\w+
</pre></div>
<p>This grammar is <tt class="docutils literal"><span class="pre">LL(1)</span></tt> and hence parseable by a simple predictive parser with a single token lookahead. However, I then tried to add the following rule to allow input of commands into an interactive calculator prompt:</p>
<div class="highlight"><pre><command> := <expr>
| <id> = <expr>
</pre></div>
<p>With this rule added, the grammar is no longer <tt class="docutils literal"><span class="pre">LL(1)</span></tt>, because looking at the first token I can't say which one of the two options of <tt class="docutils literal"><span class="pre"><command></span></tt> it is. In order to be able to differentiate between an assignment and a single expression, I must see the <tt class="docutils literal"><span class="pre">=</span></tt> token, and for this I need to see 2 tokens forward, and not just one. So, this grammar turns into an <tt class="docutils literal"><span class="pre">LL(2)</span></tt> grammar.</p>
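<p>In code, the extra lookahead amounts to peeking at the second token before committing to either alternative. A hypothetical sketch (assumed token shape, not the article's code):</p>

```python
from collections import namedtuple

# Assumed token representation, similar in spirit to a regex-based lexer.
Token = namedtuple('Token', ['type', 'val'])

def classify_command(tokens):
    """Choose between the <command> alternatives with 2 tokens of lookahead."""
    if (len(tokens) >= 2
            and tokens[0].type == 'ID'
            and tokens[1].type == '='):
        return 'assignment'   # <id> = <expr>
    return 'expression'       # a bare <expr>

print(classify_command([Token('ID', 'x'), Token('=', '='),
                        Token('NUMBER', '5')]))  # assignment
```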
<p><tt class="docutils literal"><span class="pre">LL(2)</span></tt> grammars are much more difficult to code by hand than <tt class="docutils literal"><span class="pre">LL(1)</span></tt> grammars, and they are also much more difficult to turn into code automatically by parser generators. This is probably why for most languages <tt class="docutils literal"><span class="pre">LL(1)</span></tt> suffices.</p>
</div>
<div class="section" id="ll-parser-generators">
<h3>LL parser generators</h3>
<p>Unlike <tt class="docutils literal"><span class="pre">LR</span></tt> parsers, for which everyone uses parser generators <a class="footnote-reference" href="#id3" id="id1">[1]</a>, <tt class="docutils literal"><span class="pre">LL</span></tt> parsers are commonly written by hand. It even appears that some of the most popular compilers (such as GCC) use hand-written RD parsers to parse whole languages like C. As with anything, you get maximal flexibility and efficiency when you hand-code something, as you're not constrained by the limitations of the tools and libraries you're using.</p>
<p>Indeed, writing a simple predictive parser as a set of mutually recursive routines is straightforward, and can also be very educational. If you have a very small parsing task to perform, perhaps you'll be better off hand-coding an RD parser.</p>
<p>However, automatic tools for generating <tt class="docutils literal"><span class="pre">LL</span></tt> parsers exist. The most popular are probably <a class="reference external" href="http://www.antlr.org/">ANTLR</a> and <a class="reference external" href="http://spirit.sourceforge.net/">Boost.Spirit</a>. I haven't tried them, but both are widely used to write complex parsers. Both have a clear advantage over hand-written parsers - they can generate parsers with any lookahead length, inferring the required length from the grammar. Hand-written parsers, as I mentioned earlier, get much more complex for any <tt class="docutils literal"><span class="pre">k</span> <span class="pre">></span> <span class="pre">1</span></tt>.</p>
</div>
<div class="section" id="left-recursion">
<h3>Left recursion</h3>
<p>Had my <tt class="docutils literal"><span class="pre">expr</span></tt> rule been written like this:</p>
<div class="highlight"><pre><expr> := <expr> + <term>
| <expr> - <term>
| <term>
</pre></div>
<p>It would have been <em>left recursive</em>, because the non-terminal <tt class="docutils literal"><span class="pre">expr</span></tt> appears as the first (leftmost) symbol in its own production. Since RD parsers work top-down, to recognize <tt class="docutils literal"><span class="pre"><expr></span></tt> the parser first has to recognize <tt class="docutils literal"><span class="pre"><expr></span></tt>, but for that it again has to recognize <tt class="docutils literal"><span class="pre"><expr></span></tt> and so on, ad infinitum. This infinite recursion is the reason why RD parsers can't handle left recursion.</p>
<p>Left recursion can also be indirect:</p>
<div class="highlight"><pre><a> := <b> <x>
| <c>
<b> := <a> <y>
| <d>
</pre></div>
<p>Here we can have the infinite derivation: <tt class="docutils literal"><span class="pre"><a></span> <span class="pre">-></span> <span class="pre"><b></span> <span class="pre"><x></span> <span class="pre">-></span> <span class="pre"><a></span> <span class="pre"><y></span> <span class="pre"><x></span></tt> and so on.</p>
<p>Techniques exist to remove left recursion from some grammars. For more information see <a class="reference external" href="http://en.wikipedia.org/wiki/Left_recursion">this</a>. The grammar shown in the example above had left-recursion removed from it <a class="footnote-reference" href="#id4" id="id2">[2]</a>.</p>
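<p>For completeness, here is the standard textbook transformation (not spelled out in the article), applied to the left-recursive <tt class="docutils literal"><span class="pre">expr</span></tt> rule above, written in the same BNF notation used here:</p>

```
<expr> := <term> <expr-tail>
<expr-tail> := + <term> <expr-tail>
             | - <term> <expr-tail>
             | <empty>
```

<p>Note that evaluating naively along this structure groups operands to the right - which is precisely the associativity quirk that footnote [2] alludes to.</p>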
</div>
<div class="section" id="code">
<h3>Code</h3>
<p>A simple recursive descent parser for a calculator, written in Python, can be downloaded <a class="reference external" href="https://github.com/eliben/code-for-blog/tree/master/2009/py_rd_parser_example">here</a>. It also includes a fairly generic Lexer class that implements regex-based tokenization of a string.</p>
<div align="center" class="align-center"><img class="align-center" src="https://eli.thegreenplace.net/images/hline.jpg" style="width: 320px; height: 5px;" /></div>
<table class="docutils footnote" frame="void" id="id3" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id1">[1]</a></td><td>Since <tt class="docutils literal"><span class="pre">LR</span></tt> parsers are table-based are too tedious and unwieldy to write by hand.</td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id4" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id2">[2]</a></td><td>Which, however, has left it with a slight operator associativity problem. Finding it is left as an exercise for the reader).</td></tr>
</tbody>
</table>
</div>
Parse::RecDescent vs. YACC2004-01-29T15:08:00-08:002022-10-04T14:08:24-07:00Eli Benderskytag:eli.thegreenplace.net,2004-01-29:/2004/01/29/parserecdescent-vs-yacc
Parse::RecDescent (RD) looks like the best parsing option in Perl for me, for two reasons. First, it is very lightweight - only one .pm file to carry around. Second, I like recursive descent parsing :-) RD parsing is, IMHO, easier to visualize and understand. Looking at the grammar (BNF) it is immediately obvious how each rule will be parsed given the input. This is very nice for grammar debugging.<p>
Yesterday was my first serious experience with the RD module (historically, I did a lot of Yacc (in C), and coded some simple recursive descent parsers by hand). The module works nicely, and is easy to learn and understand. Some notable differences from Yacc:</p><p><ol>
<li>Integrated lexing. Very nice! It looks much more natural this way, and there's no need for extra headache with Lex linkage. Tokens are defined as simple regex rules in the grammar itself.</li>
<li>Some little things that make life easier and more pleasant. For example, the rule quantifiers (s), (s?) etc.</li>
<li>Left recursion problem. It hits blatantly when arithmetic expressions must be parsed. A different mindset must be employed compared with Yacc.</li>
</ol>
Additionally, RD has a very useful trace option that traces parsing and allows you to see where things went wrong with the grammar.