<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Eli Bendersky's website - Recursive descent parsing</title><link href="https://eli.thegreenplace.net/" rel="alternate"></link><link href="https://eli.thegreenplace.net/feeds/recursive-descent-parsing.atom.xml" rel="self"></link><id>https://eli.thegreenplace.net/</id><updated>2026-02-05T03:38:39-08:00</updated><entry><title>Rewriting pycparser with the help of an LLM</title><link href="https://eli.thegreenplace.net/2026/rewriting-pycparser-with-the-help-of-an-llm/" rel="alternate"></link><published>2026-02-04T19:35:00-08:00</published><updated>2026-02-05T03:38:39-08:00</updated><author><name>Eli Bendersky</name></author><id>tag:eli.thegreenplace.net,2026-02-04:/2026/rewriting-pycparser-with-the-help-of-an-llm/</id><summary type="html">&lt;p&gt;&lt;a class="reference external" href="https://github.com/eliben/pycparser"&gt;pycparser&lt;/a&gt; is my most widely used open
source project (with ~20M daily downloads from PyPI &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;). It's a pure-Python
parser for the C programming language, producing ASTs inspired by &lt;a class="reference external" href="https://docs.python.org/3/library/ast.html"&gt;Python's
own&lt;/a&gt;. Until very recently, it's
been using &lt;a class="reference external" href="https://www.dabeaz.com/ply/ply.html"&gt;PLY: Python Lex-Yacc&lt;/a&gt; for
the core parsing.&lt;/p&gt;
&lt;p&gt;In this post, I'll describe how …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;a class="reference external" href="https://github.com/eliben/pycparser"&gt;pycparser&lt;/a&gt; is my most widely used open
source project (with ~20M daily downloads from PyPI &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;). It's a pure-Python
parser for the C programming language, producing ASTs inspired by &lt;a class="reference external" href="https://docs.python.org/3/library/ast.html"&gt;Python's
own&lt;/a&gt;. Until very recently, it's
been using &lt;a class="reference external" href="https://www.dabeaz.com/ply/ply.html"&gt;PLY: Python Lex-Yacc&lt;/a&gt; for
the core parsing.&lt;/p&gt;
&lt;p&gt;In this post, I'll describe how I collaborated with an LLM coding agent (Codex)
to help me rewrite pycparser to use a hand-written recursive-descent parser and
remove the dependency on PLY. This has been an interesting experience and the
post contains lots of information and is therefore quite long; if you're just
interested in the final result, check out the latest code of pycparser - the
&lt;tt class="docutils literal"&gt;main&lt;/tt&gt; branch already has the new implementation.&lt;/p&gt;
&lt;img alt="meme picture saying &amp;quot;can't come to bed because my AI agent produced something slightly wrong&amp;quot;" class="align-center" src="https://eli.thegreenplace.net/images/2026/cantcometobed.png" /&gt;
&lt;div class="section" id="the-issues-with-the-existing-parser-implementation"&gt;
&lt;h2&gt;The issues with the existing parser implementation&lt;/h2&gt;
&lt;p&gt;While pycparser has been working well overall, there were a number of nagging
issues that persisted over years.&lt;/p&gt;
&lt;div class="section" id="parsing-strategy-yacc-vs-hand-written-recursive-descent"&gt;
&lt;h3&gt;Parsing strategy: YACC vs. hand-written recursive descent&lt;/h3&gt;
&lt;p&gt;I began working on pycparser in 2008, and back then using a YACC-based approach
for parsing a whole language like C seemed like a no-brainer to me. Isn't this
what everyone does when writing a serious parser? Besides, the K&amp;amp;R2 book
famously carries the entire grammar of ANSI C (C89) in an appendix - so it
seemed like a simple matter of translating that to PLY-yacc syntax.&lt;/p&gt;
&lt;p&gt;And indeed, it wasn't &lt;em&gt;too&lt;/em&gt; hard, though there definitely were some complications
in building the ASTs for declarations (C's &lt;a class="reference external" href="https://eli.thegreenplace.net/2008/10/18/implementing-cdecl-with-pycparser"&gt;gnarliest part&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;Shortly after completing pycparser, I got more and more interested in compilation
and started learning about the different kinds of parsers more seriously. Over
time, I grew convinced that &lt;a class="reference external" href="https://eli.thegreenplace.net/tag/recursive-descent-parsing"&gt;recursive descent&lt;/a&gt; is the way to
go - producing parsers that are easier to understand and maintain (and are often
faster!).&lt;/p&gt;
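&lt;p&gt;To make the readability argument concrete, here is a minimal sketch of the
technique for a toy grammar of integers, &lt;tt class="docutils literal"&gt;+&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;*&lt;/tt&gt; and parentheses.
It is not pycparser's code; the &lt;tt class="docutils literal"&gt;ToyParser&lt;/tt&gt; class and its method names are
made up for illustration. Each grammar rule maps to one method, so the control
flow can be followed with an ordinary debugger:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;import re


class ToyParser:
    # Hypothetical parser for this grammar:
    #   expr   : term ('+' term)*
    #   term   : factor ('*' factor)*
    #   factor : NUMBER | '(' expr ')'

    def __init__(self, text):
        # Tokens are integers or single-character operators/parentheses;
        # any other character is silently skipped in this sketch.
        self.tokens = re.findall(r'\d+|[+*()]', text)
        self.pos = 0

    def peek(self):
        # Current token, or None once the input is exhausted.
        if self.pos == len(self.tokens):
            return None
        return self.tokens[self.pos]

    def expect(self, tok):
        if self.peek() != tok:
            raise SyntaxError(f'expected {tok!r}, got {self.peek()!r}')
        self.pos += 1

    def parse(self):
        tree = self.expr()
        if self.peek() is not None:
            raise SyntaxError(f'unexpected token {self.peek()!r}')
        return tree

    def expr(self):
        node = self.term()
        while self.peek() == '+':
            self.pos += 1
            node = ('+', node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() == '*':
            self.pos += 1
            node = ('*', node, self.factor())
        return node

    def factor(self):
        tok = self.peek()
        if tok == '(':
            self.pos += 1
            node = self.expr()
            self.expect(')')
            return node
        if tok is not None and tok.isdigit():
            self.pos += 1
            return int(tok)
        raise SyntaxError(f'unexpected token {tok!r}')


# Prints ('+', 1, ('*', 2, ('+', 3, 4)))
print(ToyParser('1 + 2 * (3 + 4)').parse())
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Operator precedence falls out of which method calls which, with no separate
precedence table to keep in sync.&lt;/p&gt;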
&lt;p&gt;It all ties in to the &lt;a class="reference external" href="https://eli.thegreenplace.net/2017/benefits-of-dependencies-in-software-projects-as-a-function-of-effort/"&gt;benefits of dependencies in software projects as a
function of effort&lt;/a&gt;.
Using parser generators is a heavy &lt;em&gt;conceptual&lt;/em&gt; dependency: it's really nice
when you have to churn out many parsers for small languages. But when you have
to maintain a single, very complex parser, as part of a large project - the
benefits quickly dissipate and you're left with a substantial dependency that
you constantly grapple with.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="the-other-issue-with-dependencies"&gt;
&lt;h3&gt;The other issue with dependencies&lt;/h3&gt;
&lt;p&gt;And then there are the usual problems with dependencies; dependencies get
abandoned, and they may also develop security issues. Sometimes, both of these
become true.&lt;/p&gt;
&lt;p&gt;Many years ago, pycparser forked and started vendoring its own version of PLY.
This was part of transitioning pycparser to a dual Python 2/3 code base when PLY
was slower to adapt. I believe this was the right decision, since PLY &amp;quot;just
worked&amp;quot; and I didn't have to deal with active (and very tedious in the Python
ecosystem, where packaging tools are replaced faster than dirty socks)
dependency management.&lt;/p&gt;
&lt;p&gt;A couple of weeks ago &lt;a class="reference external" href="https://github.com/eliben/pycparser/issues/588"&gt;this issue&lt;/a&gt;
was opened for pycparser. It turns out that some old PLY code triggers security
checks used by some Linux distributions; while this code was fixed in a later
commit of PLY, PLY itself was apparently abandoned and archived in late 2025.
And guess what? That happened in the middle of a large rewrite of the package,
so re-vendoring the pre-archiving commit seemed like a risky proposition.&lt;/p&gt;
&lt;p&gt;On the issue it was suggested that &amp;quot;hopefully the dependent packages move on to
a non-abandoned parser or implement their own&amp;quot;; I originally laughed this idea
off, but then it got me thinking... which is what this post is all about.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="growing-complexity-of-parsing-a-messy-language"&gt;
&lt;h3&gt;Growing complexity of parsing a messy language&lt;/h3&gt;
&lt;p&gt;The original K&amp;amp;R2 grammar for ANSI C had - famously - a single shift-reduce
conflict having to do with dangling &lt;tt class="docutils literal"&gt;else&lt;/tt&gt;s belonging to the most recent
&lt;tt class="docutils literal"&gt;if&lt;/tt&gt; statement. And indeed, other than the famous &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Lexer_hack"&gt;lexer hack&lt;/a&gt;
used to deal with &lt;a class="reference external" href="https://eli.thegreenplace.net/2011/05/02/the-context-sensitivity-of-cs-grammar-revisited"&gt;C's type name / ID ambiguity&lt;/a&gt;,
pycparser only had this single shift-reduce conflict.&lt;/p&gt;
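&lt;p&gt;For comparison, a hand-written parser doesn't have to treat the dangling
&lt;tt class="docutils literal"&gt;else&lt;/tt&gt; as a conflict at all. The following is a minimal sketch over a
made-up token stream (not pycparser's code; the &lt;tt class="docutils literal"&gt;parse_stmt&lt;/tt&gt; function and
the one-letter tokens are assumptions for illustration), in which whichever
&lt;tt class="docutils literal"&gt;if&lt;/tt&gt; is currently being parsed consumes an &lt;tt class="docutils literal"&gt;else&lt;/tt&gt; as soon as it sees one:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;# Hypothetical token-level grammar, for illustration only:
#   stmt : 'if' 'c' stmt ['else' stmt] | 's'
# where 'c' stands for any condition and 's' for any simple statement.

def parse_stmt(tokens, pos=0):
    # Returns (tree, next_pos).
    if tokens[pos:pos + 1] == ['s']:
        return 's', pos + 1
    if tokens[pos:pos + 2] != ['if', 'c']:
        raise SyntaxError(f'unexpected input at position {pos}')
    then_branch, pos = parse_stmt(tokens, pos + 2)
    else_branch = None
    # Greedy choice: the 'if' being parsed when an 'else' shows up claims it.
    # Inner calls get the first look, so the innermost 'if' wins.
    if tokens[pos:pos + 1] == ['else']:
        else_branch, pos = parse_stmt(tokens, pos + 1)
    return ('if', 'c', then_branch, else_branch), pos


tree, _ = parse_stmt('if c if c s else s'.split())
# Prints ('if', 'c', ('if', 'c', 's', 's'), None)
print(tree)
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The printed tree attaches the &lt;tt class="docutils literal"&gt;else&lt;/tt&gt; to the inner &lt;tt class="docutils literal"&gt;if&lt;/tt&gt; and leaves
the outer one without an else branch, which matches the rule C uses.&lt;/p&gt;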
&lt;p&gt;But things got more complicated. Over the years, features were added that
weren't strictly in the standard but were supported by all the industrial
compilers. The more advanced C11 and C23 standards weren't beholden to the
promises of conflict-free YACC parsing (since almost no industrial-strength
compilers use YACC at this point), so all caution went out of the window.&lt;/p&gt;
&lt;p&gt;The latest (PLY-based) release of pycparser has many reduce-reduce conflicts
&lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt;; these are a severe maintenance hazard because it means the parsing rules
essentially have to be tie-broken by order of appearance in the code. This is
very brittle; pycparser has only managed to maintain its stability and quality
through its comprehensive test suite. Over time, it became harder and harder to
extend, because YACC parsing rules have all kinds of spooky-action-at-a-distance
effects. The straw that broke the camel's back was &lt;a class="reference external" href="https://github.com/eliben/pycparser/pull/590"&gt;this PR&lt;/a&gt; which again proposed to
increase the number of reduce-reduce conflicts &lt;a class="footnote-reference" href="#footnote-3" id="footnote-reference-3"&gt;[3]&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This - again - prompted me to think &amp;quot;what if I just dump YACC and switch to
a hand-written recursive descent parser&amp;quot;, and here we are.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="the-mental-roadblock"&gt;
&lt;h2&gt;The mental roadblock&lt;/h2&gt;
&lt;p&gt;None of the challenges described above are new; I've been pondering them for
many years now, and yet biting the bullet and rewriting the parser didn't feel
like something I'd like to get into. By my private estimates it'd take at least
a week of deep heads-down work to port the gritty 2000 lines of YACC grammar
rules to a recursive descent parser &lt;a class="footnote-reference" href="#footnote-4" id="footnote-reference-4"&gt;[4]&lt;/a&gt;. Moreover, it wouldn't be a
particularly &lt;em&gt;fun&lt;/em&gt; project either - I didn't feel like I'd learn much new and
my interests have shifted away from this project. In short, the &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Potential_well"&gt;Potential well&lt;/a&gt; was just too deep.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="why-would-this-even-work-tests"&gt;
&lt;h2&gt;Why would this even work? Tests&lt;/h2&gt;
&lt;p&gt;I've definitely noticed the improvement in capabilities of LLM coding
agents in the past few months, and many reputable people online rave about using
them for increasingly larger projects. That said, would an LLM agent really be
able to accomplish such a complex project on its own? This isn't just a toy,
it's thousands of lines of dense parsing code.&lt;/p&gt;
&lt;p&gt;What gave me hope is the concept of &lt;a class="reference external" href="https://simonwillison.net/2025/Dec/31/the-year-in-llms/#the-year-of-conformance-suites"&gt;conformance suites mentioned by
Simon Willison&lt;/a&gt;.
Agents seem to do well when there's a very clear and rigid
goal function - such as a large, high-coverage conformance test suite.&lt;/p&gt;
&lt;p&gt;And pycparser has a &lt;a class="reference external" href="https://github.com/eliben/pycparser/blob/main/tests/test_c_parser.py"&gt;very extensive one&lt;/a&gt;.
Over 2500 lines of test code parsing various C snippets to ASTs with expected
results, grown over a decade and a half of real issues and bugs reported by
users.&lt;/p&gt;
&lt;p&gt;I figured the LLM can either succeed or fail and throw its hands up in despair,
but it's quite unlikely to produce a &lt;em&gt;wrong&lt;/em&gt; port that would still pass all
the tests. So I set it to run.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="the-initial-port"&gt;
&lt;h2&gt;The initial port&lt;/h2&gt;
&lt;p&gt;I fired up Codex in pycparser's repository, and wrote this prompt just to
make sure it understands me and can run the tests:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;This repository implements a Python parser producing an AST. It has a very
extensive test suite; can you first verify you successfully run all the tests?
(with python3 -m unittest discover, from the root dir)
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Codex figured it out (I gave it the exact command, after all!); my next
prompt was the real thing &lt;a class="footnote-reference" href="#footnote-5" id="footnote-reference-5"&gt;[5]&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;pycparser uses the PLY library for its lexer and parser, and its own AST
package for the ASTs created by the parser. I want to explore replacing the
Parser component (leaving the Lexer alone for now) with a hand written
recursive descent parser that will pass all the same tests. Can you try that?
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here Codex went to work and churned &lt;em&gt;for over an hour&lt;/em&gt;. Having never
observed an agent work for nearly this long, I kind of assumed it had gone off
the rails and would fail sooner or later. So I was rather surprised and skeptical
when it eventually came back with:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;Replaced the PLY parse path with a handwritten recursive‑descent parser while
keeping the existing lexer and AST helpers intact; CParser now points to the
new RD implementation, with the old PLY version preserved as PLYCParser in
pycparser/c_parser.py. The RD parser includes a token stream, declarator/type
parsing, expression precedence handling, pragmas, and statement parsing to
match the previous AST output and coordinate behavior.
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It took me a while to poke around the code and run it until I was convinced - it
had actually done it! It wrote a new recursive descent parser with only
ancillary dependencies on PLY, and that parser passed the test suite. After a
few more prompts, we removed the ancillary dependencies and made the
structure clearer. I hadn't looked too deeply into code quality at this point,
but at least on the functional level - it succeeded. This was very impressive!&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="a-quick-note-on-reviews-and-branches"&gt;
&lt;h2&gt;A quick note on reviews and branches&lt;/h2&gt;
&lt;p&gt;A change like the one described above is impossible to code-review as one PR in
any meaningful way; so I used a different strategy. Before embarking on this
path, I created a new branch and once Codex finished the initial rewrite, I
committed this change, knowing that I will review it in detail, piece-by-piece
later on.&lt;/p&gt;
&lt;p&gt;Even though coding agents have their own notion of history and can &amp;quot;revert&amp;quot;
certain changes, I felt much safer relying on Git. In the worst case if all of
this goes south, I can nuke the branch and it's as if nothing ever happened.
I was determined to only merge this branch onto &lt;tt class="docutils literal"&gt;main&lt;/tt&gt; once I was fully
satisfied with the code. In what follows, I had to &lt;tt class="docutils literal"&gt;git reset&lt;/tt&gt; several times
when I didn't like the direction in which Codex was going. In hindsight, doing
this work in a branch was absolutely the right choice.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="the-long-tail-of-goofs"&gt;
&lt;h2&gt;The long tail of goofs&lt;/h2&gt;
&lt;p&gt;Once I had sufficiently convinced myself that the new parser was actually working,
I used Codex to similarly rewrite the lexer and get rid of the PLY dependency
entirely, deleting it from the repository. Then, I started looking more deeply
into code quality - reading the code created by Codex and trying to wrap my head
around it.&lt;/p&gt;
&lt;p&gt;And - oh my - this was quite the journey. Much has been written about the code
produced by agents, and much of it seems to be true. Maybe it's a setting I'm
missing (I'm not using my own custom &lt;tt class="docutils literal"&gt;AGENTS.md&lt;/tt&gt; yet, for instance), but
Codex seems to be that eager programmer that wants to get from A to B whatever
the cost. Readability, minimalism and code clarity are very much secondary
goals.&lt;/p&gt;
&lt;p&gt;Using &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;raise...except&lt;/span&gt;&lt;/tt&gt; for control flow? Yep. Abusing Python's dynamic typing
(like having &lt;tt class="docutils literal"&gt;None&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;False&lt;/tt&gt; and other values all mean different things
for a given variable)? For sure. Spreading the logic of a complex function
all over the place instead of putting all the key parts in a single switch
statement? You bet.&lt;/p&gt;
&lt;p&gt;Moreover, the agent is hilariously &lt;em&gt;lazy&lt;/em&gt;. More than once I had to convince it
to do something it initially said was impossible, a claim it kept repeating in
follow-up messages. The anthropomorphization here is mildly concerning, to be
honest. I could never imagine I would be writing something like the following to
a computer, and yet - here we are: &amp;quot;Remember how we moved X to Y before? You
can do it again for Z, definitely. Just try&amp;quot;.&lt;/p&gt;
&lt;p&gt;My process was to see how I can instruct Codex to fix things, and intervene
myself (by rewriting code) as little as possible. I've &lt;em&gt;mostly&lt;/em&gt; succeeded in
this, and did maybe 20% of the work myself.&lt;/p&gt;
&lt;p&gt;My branch grew &lt;em&gt;dozens&lt;/em&gt; of commits, falling into roughly these categories:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;The code in X is too complex; why can't we do Y instead?&lt;/li&gt;
&lt;li&gt;The use of X is needlessly convoluted; change Y to Z, and T to V in all
instances.&lt;/li&gt;
&lt;li&gt;The code in X is unclear; please add a detailed comment - with examples - to
explain what it does.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Interestingly, after doing (3), the agent was often more effective in giving
the code a &amp;quot;fresh look&amp;quot; and succeeding in either (1) or (2).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="the-end-result"&gt;
&lt;h2&gt;The end result&lt;/h2&gt;
&lt;p&gt;Eventually, after many hours spent in this process, I was reasonably pleased
with the code. It's far from perfect, of course, but taking the essential
complexities into account, it's something I could see myself maintaining (with
or without the help of an agent). I'm sure I'll find more ways to improve it
in the future, but I have a reasonable degree of confidence that this will be
doable.&lt;/p&gt;
&lt;p&gt;It passes all the tests, so I've been able to release a new version (3.00)
without major issues so far. The only issue I've discovered is that some of
CFFI's tests are overly precise about the phrasing of errors reported by
pycparser; this was &lt;a class="reference external" href="https://github.com/python-cffi/cffi/pull/224"&gt;an easy fix&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The new parser is also faster, by about 30% based on my benchmarks! This is
typical of recursive descent when compared with YACC-generated parsers, in my
experience. After reviewing the initial rewrite of the lexer, I've spent a while
instructing Codex on how to make it faster, and it worked reasonably well.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="followup-static-typing"&gt;
&lt;h2&gt;Followup - static typing&lt;/h2&gt;
&lt;p&gt;While working on this, it became quite obvious that static typing would make the
process easier. LLM coding agents really benefit from closed loops with strict
guardrails (e.g. a test suite to pass), and type-annotations act as such.
For example, had pycparser already been type annotated, Codex would probably not
have overloaded values to multiple types (like &lt;tt class="docutils literal"&gt;None&lt;/tt&gt; vs. &lt;tt class="docutils literal"&gt;False&lt;/tt&gt; vs.
others).&lt;/p&gt;
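&lt;p&gt;As a rough sketch of the idea (the names below are hypothetical and not taken
from pycparser), distinct outcomes can be given an explicit enum and an
annotated record, so that a type checker can flag a stray &lt;tt class="docutils literal"&gt;False&lt;/tt&gt; or
&lt;tt class="docutils literal"&gt;None&lt;/tt&gt; wherever a specific kind of result is expected:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Lookup(Enum):
    # Each outcome gets its own name instead of borrowing None or False.
    NOT_FOUND = 'not found'
    TYPEDEF = 'typedef name'
    IDENTIFIER = 'ordinary identifier'


@dataclass(frozen=True)
class LookupResult:
    kind: Lookup
    # Only meaningful for TYPEDEF; None means there is nothing to report.
    aliased_type: Optional[str] = None


def classify(name: str, typedefs: dict[str, str], variables: set[str]):
    # A checker would reject 'LookupResult(kind=False)' here, since False
    # is not a Lookup member.
    if name in typedefs:
        return LookupResult(Lookup.TYPEDEF, typedefs[name])
    if name in variables:
        return LookupResult(Lookup.IDENTIFIER)
    return LookupResult(Lookup.NOT_FOUND)


result = classify('size_t', {'size_t': 'unsigned long'}, {'x'})
# Prints: TYPEDEF unsigned long
print(result.kind.name, result.aliased_type)
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The sketch assumes Python 3.9 or later for the built-in generic annotations.&lt;/p&gt;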
&lt;p&gt;In a followup, I asked Codex to type-annotate pycparser (running checks using
&lt;tt class="docutils literal"&gt;ty&lt;/tt&gt;), and this was also a back-and-forth because the process exposed some
issues that needed to be refactored. Time will tell, but hopefully it will make
further changes in the project simpler for the agent.&lt;/p&gt;
&lt;p&gt;Based on this experience, I'd bet that coding agents will be somewhat more
effective in statically typed languages like Go, TypeScript and especially Rust.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="conclusions"&gt;
&lt;h2&gt;Conclusions&lt;/h2&gt;
&lt;p&gt;Overall, this project has been a really good experience, and I'm impressed with
what modern LLM coding agents can do! While there's no reason to expect that
progress in this domain will stop, even if it does - these are already very
useful tools that can significantly improve programmer productivity.&lt;/p&gt;
&lt;p&gt;Could I have done this myself, without an agent's help? Sure. But it would have
taken me &lt;em&gt;much&lt;/em&gt; longer, assuming that I could even muster the will and
concentration to engage in this project. I estimate it would take me at least
a week of full-time work (so 30-40 hours) spread over who knows how long to
accomplish. With Codex, I put in an order of magnitude less work into this
(around 4-5 hours, I'd estimate) and I'm happy with the result.&lt;/p&gt;
&lt;p&gt;It was also &lt;em&gt;fun&lt;/em&gt;. At least in one sense, my professional life can be described
as the pursuit of focus, deep work and &lt;em&gt;flow&lt;/em&gt;. It's not easy for me to get into
this state, but when I do I'm highly productive and find it very enjoyable.
Agents really help me here. When I know I need to write some code and it's
hard to get started, asking an agent to write a prototype is a great catalyst
for my motivation. Hence the meme at the beginning of the post.&lt;/p&gt;
&lt;div class="section" id="does-code-quality-even-matter"&gt;
&lt;h3&gt;Does code quality even matter?&lt;/h3&gt;
&lt;p&gt;One can't avoid a nagging question - does the quality of the code produced
by agents even matter? Clearly, the agents themselves can understand it (if not
today's agent, then at least next year's). Why worry about future
maintainability if the agent can maintain it? In other words, does it make sense
to just go full vibe-coding?&lt;/p&gt;
&lt;p&gt;This is a fair question, and one I don't have an answer to. Right now, for
projects I maintain and &lt;em&gt;stand behind&lt;/em&gt;, it seems obvious to me that the code
should be fully understandable and accepted by me, and the agent is just a tool
helping me get to that state more efficiently. It's hard to say what the future
holds here; it's going to be interesting, for sure.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;pycparser has a fair number of &lt;a class="reference external" href="https://deps.dev/pypi/pycparser/3.0.0/dependents"&gt;direct dependents&lt;/a&gt;,
but the majority of downloads comes through &lt;a class="reference external" href="https://github.com/python-cffi/cffi"&gt;CFFI&lt;/a&gt;,
which itself is a major building block for much of the Python ecosystem.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;The table-building report says 177, but that's certainly an
over-dramatization because it's common for a single conflict to
manifest in several ways.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-3" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-3"&gt;[3]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;It didn't help the PR's case that it was almost certainly vibe coded.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-4" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-4"&gt;[4]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&lt;p class="first"&gt;There was also the lexer to consider, but this seemed like a much
simpler job. My impression is that in the early days of computing,
&lt;tt class="docutils literal"&gt;lex&lt;/tt&gt; gained prominence because of strong regexp support which wasn't
very common yet. These days, with excellent regexp libraries
existing for pretty much every language, the added value of &lt;tt class="docutils literal"&gt;lex&lt;/tt&gt; over
a &lt;a class="reference external" href="https://eli.thegreenplace.net/2013/06/25/regex-based-lexical-analysis-in-python-and-javascript"&gt;custom regexp-based lexer&lt;/a&gt;
isn't very high.&lt;/p&gt;
&lt;p class="last"&gt;That said, it wouldn't make much sense to embark on a journey to rewrite
&lt;em&gt;just&lt;/em&gt; the lexer; the dependency on PLY would still remain, and besides,
PLY's lexer and parser are designed to work well together. So it wouldn't
help me much without tackling the parser beast.&lt;/p&gt;
&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-5" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-5"&gt;[5]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;I decided to ask it to port the parser first, leaving the lexer
alone. This was to split the work into reasonable chunks. Besides, I
figured that the parser is the hard job anyway - if it succeeds in that,
the lexer should be easy. That assumption turned out to be correct.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><category term="misc"></category><category term="Python"></category><category term="Machine Learning"></category><category term="Compilation"></category><category term="Recursive descent parsing"></category></entry><entry><title>Revisiting "Let's Build a Compiler"</title><link href="https://eli.thegreenplace.net/2025/revisiting-lets-build-a-compiler/" rel="alternate"></link><published>2025-12-09T20:40:00-08:00</published><updated>2026-01-17T22:40:40-08:00</updated><author><name>Eli Bendersky</name></author><id>tag:eli.thegreenplace.net,2025-12-09:/2025/revisiting-lets-build-a-compiler/</id><summary type="html">&lt;p&gt;There's an old compiler-building tutorial that has become part of the field's
lore: the &lt;a class="reference external" href="https://compilers.iecc.com/crenshaw/"&gt;Let's Build a Compiler&lt;/a&gt;
series by Jack Crenshaw (published between 1988 and 1995).&lt;/p&gt;
&lt;p&gt;I &lt;a class="reference external" href="https://eli.thegreenplace.net/2003/07/29/great-compilers-tutorial"&gt;ran into it in 2003&lt;/a&gt;
and was very impressed, but it's now 2025 and this tutorial is still being mentioned quite
often …&lt;/p&gt;</summary><content type="html">&lt;p&gt;There's an old compiler-building tutorial that has become part of the field's
lore: the &lt;a class="reference external" href="https://compilers.iecc.com/crenshaw/"&gt;Let's Build a Compiler&lt;/a&gt;
series by Jack Crenshaw (published between 1988 and 1995).&lt;/p&gt;
&lt;p&gt;I &lt;a class="reference external" href="https://eli.thegreenplace.net/2003/07/29/great-compilers-tutorial"&gt;ran into it in 2003&lt;/a&gt;
and was very impressed, but it's now 2025 and this tutorial is still being mentioned quite
often &lt;a class="reference external" href="https://hn.algolia.com/?dateRange=pastYear&amp;amp;page=0&amp;amp;prefix=true&amp;amp;query=crenshaw&amp;amp;sort=byDate&amp;amp;type=all"&gt;in Hacker News threads&lt;/a&gt;.
Why is that? Why does a tutorial from 35
years ago, built in Pascal and emitting Motorola 68000 assembly - technologies that
are virtually unknown to the new generation of programmers - hold sway over
compiler enthusiasts? I've decided to find out.&lt;/p&gt;
&lt;p&gt;The tutorial is &lt;a class="reference external" href="https://compilers.iecc.com/crenshaw/"&gt;easily available and readable online&lt;/a&gt;, but
just re-reading it seemed insufficient. So I decided to meticulously
translate the compilers built in it to Python and emit a more modern target -
WebAssembly. It was an enjoyable process and I want to share the outcome and
some insights gained along the way.&lt;/p&gt;
&lt;p&gt;The result is &lt;a class="reference external" href="https://github.com/eliben/letsbuildacompiler"&gt;this code repository&lt;/a&gt;.
Of particular interest is the &lt;a class="reference external" href="https://github.com/eliben/letsbuildacompiler/blob/main/TUTORIAL.md"&gt;TUTORIAL.md file&lt;/a&gt;,
which describes how each part in the original tutorial is mapped to my code. So
if you want to read the original tutorial but play with code you can actually
easily try on your own, feel free to follow my path.&lt;/p&gt;
&lt;div class="section" id="a-sample"&gt;
&lt;h2&gt;A sample&lt;/h2&gt;
&lt;p&gt;To get a taste of the input language being compiled and the output my compiler
generates, here's a sample program in the KISS language designed by Jack
Crenshaw:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;var X=0

 { sum from 0 to n-1 inclusive, and add to result }
 procedure addseq(n, ref result)
     var i, sum  { 0 initialized }
     while i &amp;lt; n
         sum = sum + i
         i = i + 1
     end
     result = result + sum
 end

 program testprog
 begin
     addseq(11, X)
 end
 .
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It's from part 13 of the tutorial, so it showcases procedures along with control
constructs like the &lt;tt class="docutils literal"&gt;while&lt;/tt&gt; loop, and passing parameters both by value and by
reference. Here's the WASM text generated by my compiler for part 13:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;module&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;memory&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="c1"&gt;;; Linear stack pointer. Used to pass parameters by ref.&lt;/span&gt;
  &lt;span class="c1"&gt;;; Grows downwards (towards lower addresses).&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="nv"&gt;$__sp&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="kt"&gt;i32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;i32.const&lt;/span&gt; &lt;span class="mf"&gt;65536&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="nv"&gt;$X&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="kt"&gt;i32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;i32.const&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="nv"&gt;$ADDSEQ&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;param&lt;/span&gt; &lt;span class="nv"&gt;$N&lt;/span&gt; &lt;span class="kt"&gt;i32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;param&lt;/span&gt; &lt;span class="nv"&gt;$RESULT&lt;/span&gt; &lt;span class="kt"&gt;i32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;local&lt;/span&gt; &lt;span class="nv"&gt;$I&lt;/span&gt; &lt;span class="kt"&gt;i32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;local&lt;/span&gt; &lt;span class="nv"&gt;$SUM&lt;/span&gt; &lt;span class="kt"&gt;i32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;loop&lt;/span&gt; &lt;span class="nv"&gt;$loop1&lt;/span&gt;
      &lt;span class="k"&gt;block&lt;/span&gt; &lt;span class="nv"&gt;$breakloop1&lt;/span&gt;
        &lt;span class="nb"&gt;local.get&lt;/span&gt; &lt;span class="nv"&gt;$I&lt;/span&gt;
        &lt;span class="nb"&gt;local.get&lt;/span&gt; &lt;span class="nv"&gt;$N&lt;/span&gt;
        &lt;span class="nb"&gt;i32.lt_s&lt;/span&gt;
        &lt;span class="nb"&gt;i32.eqz&lt;/span&gt;
        &lt;span class="nb"&gt;br_if&lt;/span&gt; &lt;span class="nv"&gt;$breakloop1&lt;/span&gt;
        &lt;span class="nb"&gt;local.get&lt;/span&gt; &lt;span class="nv"&gt;$SUM&lt;/span&gt;
        &lt;span class="nb"&gt;local.get&lt;/span&gt; &lt;span class="nv"&gt;$I&lt;/span&gt;
        &lt;span class="nb"&gt;i32.add&lt;/span&gt;
        &lt;span class="nb"&gt;local.set&lt;/span&gt; &lt;span class="nv"&gt;$SUM&lt;/span&gt;
        &lt;span class="nb"&gt;local.get&lt;/span&gt; &lt;span class="nv"&gt;$I&lt;/span&gt;
        &lt;span class="nb"&gt;i32.const&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="nb"&gt;i32.add&lt;/span&gt;
        &lt;span class="nb"&gt;local.set&lt;/span&gt; &lt;span class="nv"&gt;$I&lt;/span&gt;
        &lt;span class="nb"&gt;br&lt;/span&gt; &lt;span class="nv"&gt;$loop1&lt;/span&gt;
      &lt;span class="k"&gt;end&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
    &lt;span class="nb"&gt;local.get&lt;/span&gt; &lt;span class="nv"&gt;$RESULT&lt;/span&gt;
    &lt;span class="nb"&gt;local.get&lt;/span&gt; &lt;span class="nv"&gt;$RESULT&lt;/span&gt;
    &lt;span class="nb"&gt;i32.load&lt;/span&gt;
    &lt;span class="nb"&gt;local.get&lt;/span&gt; &lt;span class="nv"&gt;$SUM&lt;/span&gt;
    &lt;span class="nb"&gt;i32.add&lt;/span&gt;
    &lt;span class="nb"&gt;i32.store&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="nv"&gt;$main&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;main&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;result&lt;/span&gt; &lt;span class="kt"&gt;i32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;i32.const&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;
    &lt;span class="nb"&gt;global.get&lt;/span&gt; &lt;span class="nv"&gt;$__sp&lt;/span&gt;      &lt;span class="c1"&gt;;; make space on stack&lt;/span&gt;
    &lt;span class="nb"&gt;i32.const&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
    &lt;span class="nb"&gt;i32.sub&lt;/span&gt;
    &lt;span class="nb"&gt;global.set&lt;/span&gt; &lt;span class="nv"&gt;$__sp&lt;/span&gt;
    &lt;span class="nb"&gt;global.get&lt;/span&gt; &lt;span class="nv"&gt;$__sp&lt;/span&gt;
    &lt;span class="nb"&gt;global.get&lt;/span&gt; &lt;span class="nv"&gt;$X&lt;/span&gt;
    &lt;span class="nb"&gt;i32.store&lt;/span&gt;
    &lt;span class="nb"&gt;global.get&lt;/span&gt; &lt;span class="nv"&gt;$__sp&lt;/span&gt;    &lt;span class="c1"&gt;;; push address as parameter&lt;/span&gt;
    &lt;span class="nb"&gt;call&lt;/span&gt; &lt;span class="nv"&gt;$ADDSEQ&lt;/span&gt;
    &lt;span class="c1"&gt;;; restore parameter X by ref&lt;/span&gt;
    &lt;span class="nb"&gt;global.get&lt;/span&gt; &lt;span class="nv"&gt;$__sp&lt;/span&gt;
    &lt;span class="nb"&gt;i32.load&lt;/span&gt; &lt;span class="k"&gt;offset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="nb"&gt;global.set&lt;/span&gt; &lt;span class="nv"&gt;$X&lt;/span&gt;
    &lt;span class="c1"&gt;;; clean up stack for ref parameters&lt;/span&gt;
    &lt;span class="nb"&gt;global.get&lt;/span&gt; &lt;span class="nv"&gt;$__sp&lt;/span&gt;
    &lt;span class="nb"&gt;i32.const&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
    &lt;span class="nb"&gt;i32.add&lt;/span&gt;
    &lt;span class="nb"&gt;global.set&lt;/span&gt; &lt;span class="nv"&gt;$__sp&lt;/span&gt;
    &lt;span class="nb"&gt;global.get&lt;/span&gt; &lt;span class="nv"&gt;$X&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You'll notice that there is some trickiness in the emitted code w.r.t. handling
the by-reference parameter (my &lt;a class="reference external" href="https://eli.thegreenplace.net/2025/notes-on-the-wasm-basic-c-abi/"&gt;previous post&lt;/a&gt;
deals with this issue in more detail). In general, though, the emitted code is
inefficient - there is close to 0 optimization applied.&lt;/p&gt;
&lt;p&gt;Also, if you're very diligent you'll notice something odd about the global
variable &lt;tt class="docutils literal"&gt;X&lt;/tt&gt; - it seems to be implicitly returned by the generated &lt;tt class="docutils literal"&gt;main&lt;/tt&gt;
function. This is just a testing facility that makes my compiler easy to test.
All the compilers are extensively tested - usually by running the
generated WASM code &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt; and verifying expected results.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="insights-what-makes-this-tutorial-so-special"&gt;
&lt;h2&gt;Insights - what makes this tutorial so special?&lt;/h2&gt;
&lt;p&gt;While reading the original tutorial again, I had an opportunity to reminisce on
what makes it so effective. Other than the very fluent and conversational
writing style of Jack Crenshaw, I think it's a combination of two key
factors:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;The tutorial builds a recursive-descent parser step by step, rather than
giving a long preface on automata and table-based parser generators. When
I first encountered it (in 2003), it was taken for granted that if you want
to write a parser then lex + yacc are the way to go &lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt;. Following the
development of a simple and clean hand-written
parser was a revelation that wholly changed my approach to the subject;
subsequently, hand-written recursive-descent parsers have been my go-to approach
&lt;a class="reference external" href="https://eli.thegreenplace.net/tag/recursive-descent-parsing"&gt;for almost 20 years now&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Rather than getting stuck in front-end minutiae, the tutorial goes straight
to generating working assembly code, from very early on. This was also a
breath of fresh air for engineers who grew up with more traditional courses
where you spend 90% of the time on parsing, type checking and other semantic
analysis and often run entirely out of steam by the time code generation
is taught.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;To be honest, I don't think either of these is a big problem with modern
resources, but back in the day the tutorial clearly hit the right nerve with
many people.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="what-else-does-it-teach-us"&gt;
&lt;h2&gt;What else does it teach us?&lt;/h2&gt;
&lt;p&gt;Jack Crenshaw's tutorial takes the &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Syntax-directed_translation"&gt;syntax-directed translation&lt;/a&gt;
approach, where code is emitted &lt;em&gt;while parsing&lt;/em&gt;, without having to divide the
compiler into explicit phases with IRs. As I said above, this is a fantastic
approach for getting started, but in the latter parts of the tutorial it starts
showing its limitations. Especially once we get to types, it becomes painfully
obvious that it would be very nice if we knew the types of expressions &lt;em&gt;before&lt;/em&gt;
we generate code for them.&lt;/p&gt;
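&lt;p&gt;To make the idea concrete, here's a minimal Python sketch of syntax-directed
translation for additive expressions. It's illustrative only: it isn't code from
the tutorial or from the repository, the &lt;tt class="docutils literal"&gt;Emitter&lt;/tt&gt; class and
&lt;tt class="docutils literal"&gt;compile_expr&lt;/tt&gt; function are made-up names, and it assumes tokens are
separated by spaces.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# Illustrative sketch only: names are hypothetical, not from the tutorial.
class Emitter:
    """Parses 'term (+|-) term ...' and emits WASM text while parsing."""

    def __init__(self, text):
        self.tokens = text.split()  # assumes space-separated tokens
        self.pos = 0
        self.out = []

    def peek(self):
        if self.pos == len(self.tokens):
            return None
        return self.tokens[self.pos]

    def term(self):
        tok = self.peek()
        if tok is None:
            raise SyntaxError("expected a term")
        self.pos += 1
        if tok.isdigit():
            self.out.append(f"i32.const {tok}")
        else:
            self.out.append(f"local.get ${tok}")

    def expression(self):
        self.term()
        while self.peek() in ("+", "-"):
            op = self.peek()
            self.pos += 1
            self.term()
            # Both operands are already emitted; no tree is ever built.
            self.out.append("i32.add" if op == "+" else "i32.sub")


def compile_expr(text):
    emitter = Emitter(text)
    emitter.expression()
    return emitter.out
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Calling &lt;tt class="docutils literal"&gt;compile_expr("SUM + I")&lt;/tt&gt; produces &lt;tt class="docutils literal"&gt;local.get $SUM&lt;/tt&gt;,
&lt;tt class="docutils literal"&gt;local.get $I&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;i32.add&lt;/tt&gt;, the same shape as the loop body in
the listing above. Note that by the time the operator is seen, the code for the
left operand is already in the output; that's exactly where the approach gets
awkward once operand types matter.&lt;/p&gt;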
&lt;p&gt;I don't know if this is implicated in Jack Crenshaw's abandoning the tutorial
at some point after part 14, but it may very well be. He keeps writing how
the emitted code is clearly sub-optimal &lt;a class="footnote-reference" href="#footnote-3" id="footnote-reference-3"&gt;[3]&lt;/a&gt; and can be improved, but IMHO it's
just not that easy to improve using the syntax-directed translation strategy.
With perfect hindsight vision, I would probably use Part 14 (types) as a turning
point - emitting some kind of AST from the parser and then doing simple type
checking and analysis on that AST prior to generating code from it.&lt;/p&gt;
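&lt;p&gt;As a rough sketch of what that turning point could look like (again in Python,
and again hypothetical: the &lt;tt class="docutils literal"&gt;Var&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;Add&lt;/tt&gt; node classes, the
&lt;tt class="docutils literal"&gt;type_of&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;gen&lt;/tt&gt; helpers and the choice of just two types,
&lt;tt class="docutils literal"&gt;i32&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;i64&lt;/tt&gt;, are all assumptions made for illustration), the
types are computed on the tree first and code generation consults them:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;from dataclasses import dataclass


# Hypothetical AST nodes; only variables and addition are modeled.
@dataclass
class Var:
    name: str
    ty: str  # "i32" or "i64"


@dataclass
class Add:
    lhs: object
    rhs: object


def type_of(node):
    if isinstance(node, Var):
        return node.ty
    # An addition takes the wider of its two operand types.
    types = (type_of(node.lhs), type_of(node.rhs))
    return "i64" if "i64" in types else "i32"


def gen(node, want):
    """Emit code that leaves a value of type `want` on the stack.

    Only widening from i32 to i64 is handled; narrowing is omitted.
    """
    if isinstance(node, Var):
        code = [f"local.get ${node.name}"]
        have = node.ty
    else:
        have = type_of(node)
        code = gen(node.lhs, have) + gen(node.rhs, have) + [f"{have}.add"]
    if have == "i32" and want == "i64":
        code.append("i64.extend_i32_s")
    return code
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Because the type of the whole addition is known before any code is emitted,
the narrower operand can be widened right where it's loaded, instead of patching
things up after the fact.&lt;/p&gt;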
&lt;/div&gt;
&lt;div class="section" id="conclusion"&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;All in all, the original tutorial remains a wonderfully readable introduction
to building compilers. This post and the &lt;a class="reference external" href="https://github.com/eliben/letsbuildacompiler"&gt;GitHub repository&lt;/a&gt;
it describes are a modest
contribution that aims to improve the experience of folks reading the original
tutorial today and not willing to use obsolete technologies. As always, let
me know if you run into any issues or have questions!&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;This is done using the &lt;a class="reference external" href="https://pypi.org/project/wasmtime/"&gt;Python bindings to wasmtime&lt;/a&gt;.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;By the way, gcc switched from YACC to hand-written recursive-descent
parsing in the 2004-2006 timeframe, and Clang has been implemented with
a recursive-descent parser from the start (2007).&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-3" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-3"&gt;[3]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&lt;p class="first"&gt;Concretely: when we compile &lt;tt class="docutils literal"&gt;subexpr1 + subexpr2&lt;/tt&gt; and the two sides have different
types, it would be mighty nice to know that &lt;em&gt;before&lt;/em&gt; we actually generate
the code for both sub-expressions. But the syntax-directed translation
approach just doesn't work that way.&lt;/p&gt;
&lt;p class="last"&gt;To be clear: it's easy to generate &lt;em&gt;working&lt;/em&gt; code; it's just not easy
to generate optimal code without some sort of type analysis that's
done before code is actually generated.&lt;/p&gt;
&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
</content><category term="misc"></category><category term="Compilation"></category><category term="WebAssembly"></category><category term="Python"></category><category term="Recursive descent parsing"></category></entry><entry><title>Ungrammar in Go and resilient parsing</title><link href="https://eli.thegreenplace.net/2023/ungrammar-in-go-and-resilient-parsing/" rel="alternate"></link><published>2023-07-08T06:12:00-07:00</published><updated>2023-07-08T13:12:59-07:00</updated><author><name>Eli Bendersky</name></author><id>tag:eli.thegreenplace.net,2023-07-08:/2023/ungrammar-in-go-and-resilient-parsing/</id><summary type="html">&lt;p&gt;It won't be news to the readers of this blog that I have &lt;a class="reference external" href="https://github.com/eliben/pycparser"&gt;some interest&lt;/a&gt; in
&lt;a class="reference external" href="https://eli.thegreenplace.net/tag/compilation"&gt;compiler&lt;/a&gt;
&lt;a class="reference external" href="https://eli.thegreenplace.net/tag/recursive-descent-parsing"&gt;front-ends&lt;/a&gt;.
So when I heard about a new(-ish) DSL for
&lt;a class="reference external" href="https://en.wikipedia.org/wiki/Parse_tree"&gt;concrete syntax trees&lt;/a&gt; (CST), I
couldn't resist playing with it a bit.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://github.com/rust-analyzer/ungrammar/tree/master"&gt;Ungrammar&lt;/a&gt; is used
in &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;rust-analyzer&lt;/span&gt;&lt;/tt&gt; to define and access a …&lt;/p&gt;</summary><content type="html">&lt;p&gt;It won't be news to the readers of this blog that I have &lt;a class="reference external" href="https://github.com/eliben/pycparser"&gt;some interest&lt;/a&gt; in
&lt;a class="reference external" href="https://eli.thegreenplace.net/tag/compilation"&gt;compiler&lt;/a&gt;
&lt;a class="reference external" href="https://eli.thegreenplace.net/tag/recursive-descent-parsing"&gt;front-ends&lt;/a&gt;.
So when I heard about a new(-ish) DSL for
&lt;a class="reference external" href="https://en.wikipedia.org/wiki/Parse_tree"&gt;concrete syntax trees&lt;/a&gt; (CST), I
couldn't resist playing with it a bit.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://github.com/rust-analyzer/ungrammar/tree/master"&gt;Ungrammar&lt;/a&gt; is used
in &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;rust-analyzer&lt;/span&gt;&lt;/tt&gt; to define and access a CST for Rust.
&lt;a class="reference external" href="https://rust-analyzer.github.io/blog/2020/10/24/introducing-ungrammar.html"&gt;This blog post&lt;/a&gt;
by its creator provides much more detail. According to the author, Ungrammar
is &amp;quot;the ASDL for concrete syntax trees&amp;quot;. This sounded interesting,
since I've &lt;a class="reference external" href="https://eli.thegreenplace.net/2014/06/04/using-asdl-to-describe-asts-in-compilers"&gt;dabbled in ASDL in the past&lt;/a&gt;,
and also have experience with similar techniques for defining
&lt;a class="reference external" href="https://github.com/eliben/pycparser"&gt;pycparser ASTs&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The result is &lt;a class="reference external" href="https://github.com/eliben/go-ungrammar"&gt;go-ungrammar&lt;/a&gt;,
a re-implementation of Ungrammar in Go. The input is an Ungrammar file defining
some CST; for example, here's a simple calculator language:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;Program = Stmt*

Stmt = AssignStmt | Expr

AssignStmt = &amp;#39;set&amp;#39; &amp;#39;ident&amp;#39; &amp;#39;=&amp;#39; Expr

Expr =
    Literal
  | UnaryExpr
  | ParenExpr
  | BinExpr

UnaryExpr = op:(&amp;#39;+&amp;#39; | &amp;#39;-&amp;#39;) Expr

ParenExpr = &amp;#39;(&amp;#39; Expr &amp;#39;)&amp;#39;

BinExpr = lhs:Expr op:(&amp;#39;+&amp;#39; | &amp;#39;-&amp;#39; | &amp;#39;*&amp;#39; | &amp;#39;/&amp;#39; | &amp;#39;%&amp;#39;) rhs:Expr

Literal = &amp;#39;int_literal&amp;#39; | &amp;#39;ident&amp;#39;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Ungrammar looks a bit like EBNF, but not &lt;em&gt;quite&lt;/em&gt; (hence the name &amp;quot;ungrammar&amp;quot;).
It's much simpler because it doesn't need to concern itself with precedence,
ambiguities and so on, also leaving all the (often complex) lexical rules to the
lexer. It simply defines a &lt;em&gt;tree&lt;/em&gt; that can be used to represent a parsed language.
It's also different from ASTs in that it preserves all tokens, including
delimiters and other syntax elements. This is useful for tools like language
servers that need a full-fidelity representation of the source code.&lt;/p&gt;
&lt;div class="section" id="implementation-notes"&gt;
&lt;h2&gt;Implementation notes&lt;/h2&gt;
&lt;p&gt;&lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;go-ungrammar&lt;/span&gt;&lt;/tt&gt; uses a classical &lt;a class="reference external" href="https://github.com/eliben/go-ungrammar/blob/main/lexer.go"&gt;hand-written lexical analyzer&lt;/a&gt;
and a &lt;a class="reference external" href="https://github.com/eliben/go-ungrammar/blob/main/parser.go"&gt;recursive
descent parser&lt;/a&gt;.
Just for fun, I spent more time on error recovery than strictly necessary for
such a simple input language. The lexer &lt;a class="reference external" href="https://www.youtube.com/watch?v=dQw4w9WgXcQ"&gt;never gives up&lt;/a&gt; when encountering nonsensical
input; it simply emits an &lt;tt class="docutils literal"&gt;ERROR&lt;/tt&gt; token and keeps going. The parser doesn't
quit on the first error either; instead, it collects all the errors it
encounters and tries to recover from each one (the &lt;tt class="docutils literal"&gt;synchronize()&lt;/tt&gt; method in
the parser code). As an example of this in action, consider this faulty
Ungrammar input:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;foo = @
bar = ( joe
x = y
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;At first glance, there are at least a couple of issues here:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;tt class="docutils literal"&gt;&amp;#64;&lt;/tt&gt; is not a valid Ungrammar token&lt;/li&gt;
&lt;li&gt;The &lt;tt class="docutils literal"&gt;(&lt;/tt&gt; in the second rule is unterminated; as all programmers know,
unterminated grouping elements spell trouble because the compiler can get
easily confused until it finds a valid terminator&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;go-ungrammar&lt;/span&gt;&lt;/tt&gt; runs it will report an error that looks like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;1:7: unknown token starting with &amp;#39;@&amp;#39; (and 2 more errors)
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The &lt;a class="reference external" href="https://github.com/eliben/go-ungrammar/blob/main/errorlist.go"&gt;concrete error type&lt;/a&gt; returned by
the parser collects all the errors, so we can iterate over them and display them
all:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;1:7: unknown token starting with &amp;#39;@&amp;#39;
2:1: expected rule, got bar
3:1: expected &amp;#39;)&amp;#39;, got x
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The parser recovers after the first error expecting to see the RHS
(right-hand-side) for the &lt;tt class="docutils literal"&gt;foo&lt;/tt&gt; rule, but doesn't find any. This is a good
place to discuss parser recovery. The Ungrammar language has a significant
ambiguity:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;foo = bar baz = barn
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Are &lt;tt class="docutils literal"&gt;bar baz&lt;/tt&gt; the RHS sequence for rule &lt;tt class="docutils literal"&gt;foo&lt;/tt&gt;, or is &lt;tt class="docutils literal"&gt;baz =&lt;/tt&gt; the beginning
of a new rule? Note that the language is whitespace-insensitive, so this really
does come up; just look at the example calculator Ungrammar above - this is
encountered on pretty much any new rule.&lt;/p&gt;
&lt;p&gt;The way &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;go-ungrammar&lt;/span&gt;&lt;/tt&gt; resolves the ambiguity is by using a &lt;tt class="docutils literal"&gt;NODE =&lt;/tt&gt;
lookahead, deciding it's the beginning of a new rule (&lt;tt class="docutils literal"&gt;NODE&lt;/tt&gt; is an Ungrammar
term for &amp;quot;plain identifier&amp;quot;).&lt;/p&gt;
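&lt;p&gt;The lookahead rule can be sketched in a few lines of Python. This is not how
the Go code is organized; it's a simplified illustration that assumes the input
has already been split into a flat list of token strings, and the
&lt;tt class="docutils literal"&gt;starts_rule&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;split_rules&lt;/tt&gt; names are made up:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# Hypothetical helpers, for illustration only.
def starts_rule(tokens, pos):
    """True if tokens[pos] is a plain identifier followed by '='."""
    return tokens[pos].isidentifier() and tokens[pos + 1 : pos + 2] == ["="]


def split_rules(tokens):
    """Group a flat token list into (name, rhs_tokens) pairs."""
    rules = []
    pos = 0
    while pos != len(tokens):
        if not starts_rule(tokens, pos):
            raise SyntaxError(f"expected rule, got {tokens[pos]}")
        name = tokens[pos]
        pos += 2  # skip the name and the '='
        rhs = []
        # Consume RHS tokens until the two-token lookahead shows
        # NODE '=' (the start of the next rule) or the input ends.
        while pos != len(tokens) and not starts_rule(tokens, pos):
            rhs.append(tokens[pos])
            pos += 1
        rules.append((name, rhs))
    return rules
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;With this rule, &lt;tt class="docutils literal"&gt;split_rules("foo = bar baz = barn".split())&lt;/tt&gt; yields two
rules: &lt;tt class="docutils literal"&gt;foo&lt;/tt&gt; with the RHS &lt;tt class="docutils literal"&gt;bar&lt;/tt&gt;, and &lt;tt class="docutils literal"&gt;baz&lt;/tt&gt; with the RHS
&lt;tt class="docutils literal"&gt;barn&lt;/tt&gt;.&lt;/p&gt;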
&lt;p&gt;Back to our recovery example: the second error is the parser complaining that
it expected some rule after &lt;tt class="docutils literal"&gt;foo =&lt;/tt&gt; but found none; an empty RHS is invalid
in Ungrammar and the &lt;tt class="docutils literal"&gt;&amp;#64;&lt;/tt&gt; was reported and skipped. So the parser complains
that it found a new rule definition instead of the RHS for an existing rule.
At this point it re-synchronizes and parses the &lt;tt class="docutils literal"&gt;bar =&lt;/tt&gt; rule. Then it runs into
the third error - the &lt;tt class="docutils literal"&gt;(&lt;/tt&gt; is unterminated. Still, the parser recovers and
keeps going.&lt;/p&gt;
&lt;p&gt;Even with all these errors, the parser will produce a partial result - a tree
equivalent to this input:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;bar = joe
x = y
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;For &lt;tt class="docutils literal"&gt;foo&lt;/tt&gt; there was simply nothing to parse. For &lt;tt class="docutils literal"&gt;bar&lt;/tt&gt;, the parser reported
the missing &lt;tt class="docutils literal"&gt;)&lt;/tt&gt; but parsed the contents anyway. It then fully recovered and
was able to parse &lt;tt class="docutils literal"&gt;x = y&lt;/tt&gt; properly. Being able to parse incomplete input and
produce partial trees is very important for error recovery, and especially for
tools like language servers that need to be resilient in the presence of partial
input the user is busy typing in.&lt;/p&gt;
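&lt;p&gt;The lexer half of this strategy is simple enough to sketch in Python. The
following is a simplified stand-in rather than a port of the Go lexer: it only
knows identifiers and the punctuation that appears in the calculator example
(quoted tokens are left out), it collects error messages itself, and the
&lt;tt class="docutils literal"&gt;lex&lt;/tt&gt; function name and token tuples are assumptions made for
illustration.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;# Hypothetical, simplified lexer; not a port of go-ungrammar's lexer.
def lex(text):
    """Return (tokens, errors); bad characters become ERROR tokens."""
    tokens, errors = [], []
    line, col, pos = 1, 1, 0
    while pos != len(text):
        ch = text[pos]
        if ch == "\n":
            line, col, pos = line + 1, 1, pos + 1
        elif ch.isspace():
            col, pos = col + 1, pos + 1
        elif ch.isalpha() or ch == "_":
            end = pos
            while end != len(text) and (text[end].isalnum() or text[end] == "_"):
                end += 1
            tokens.append(("NODE", text[pos:end]))
            col, pos = col + (end - pos), end
        elif ch in "=()|*:":
            tokens.append((ch, ch))
            col, pos = col + 1, pos + 1
        else:
            # Never give up: emit an ERROR token, note it, keep scanning.
            tokens.append(("ERROR", ch))
            errors.append(f"{line}:{col}: unknown token starting with '{ch}'")
            col, pos = col + 1, pos + 1
    return tokens, errors
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Run on the faulty input from above, this sketch reports a single lexical error
at &lt;tt class="docutils literal"&gt;1:7&lt;/tt&gt; and still returns every token that follows the &lt;tt class="docutils literal"&gt;&amp;#64;&lt;/tt&gt;, so a
parser downstream has something to recover with.&lt;/p&gt;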
&lt;p&gt;I enjoyed coding this resilient parser; while it's probably overkill for
a language as simple as Ungrammar, it's a good kata for frontend construction.&lt;/p&gt;
&lt;/div&gt;
</content><category term="misc"></category><category term="Go"></category><category term="Compilation"></category><category term="Recursive descent parsing"></category></entry><entry><title>Deciphering Haskell's applicative and monadic parsers</title><link href="https://eli.thegreenplace.net/2017/deciphering-haskells-applicative-and-monadic-parsers/" rel="alternate"></link><published>2017-11-27T05:28:00-08:00</published><updated>2024-05-04T19:46:23-07:00</updated><author><name>Eli Bendersky</name></author><id>tag:eli.thegreenplace.net,2017-11-27:/2017/deciphering-haskells-applicative-and-monadic-parsers/</id><summary type="html">&lt;p&gt;This post follows the construction of parsers described in &lt;a class="reference external" href="http://www.cs.nott.ac.uk/~pszgmh/pih.html"&gt;Graham Hutton's
&amp;quot;Programming in Haskell&amp;quot; (2nd edition)&lt;/a&gt;. It's my attempt to work through
chapter 13 in this book and understand the details of applicative and monadic
combination of parsers presented therein.&lt;/p&gt;
&lt;div class="section" id="basic-definitions-for-the-parser-type"&gt;
&lt;h2&gt;Basic definitions for the Parser type&lt;/h2&gt;
&lt;p&gt;A parser parameterized on …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;p&gt;This post follows the construction of parsers described in &lt;a class="reference external" href="http://www.cs.nott.ac.uk/~pszgmh/pih.html"&gt;Graham Hutton's
&amp;quot;Programming in Haskell&amp;quot; (2nd edition)&lt;/a&gt;. It's my attempt to work through
chapter 13 in this book and understand the details of applicative and monadic
combination of parsers presented therein.&lt;/p&gt;
&lt;div class="section" id="basic-definitions-for-the-parser-type"&gt;
&lt;h2&gt;Basic definitions for the Parser type&lt;/h2&gt;
&lt;p&gt;A parser parameterized on some type &lt;tt class="docutils literal"&gt;a&lt;/tt&gt; is:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kr"&gt;newtype&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;Parser&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;P&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It's a function taking a &lt;tt class="docutils literal"&gt;String&lt;/tt&gt; and returning a list of &lt;tt class="docutils literal"&gt;(a,String)&lt;/tt&gt;
pairs, where &lt;tt class="docutils literal"&gt;a&lt;/tt&gt; is a value of the parameterized type and &lt;tt class="docutils literal"&gt;String&lt;/tt&gt; is (by
convention) the unparsed remainder of the input. The returned list is
potentially empty, which signals a failure in parsing &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;. It might have made
more sense to define &lt;tt class="docutils literal"&gt;Parser&lt;/tt&gt; as a &lt;tt class="docutils literal"&gt;type&lt;/tt&gt; alias for the function, but
&lt;tt class="docutils literal"&gt;type&lt;/tt&gt;s can't be made into instances of typeclasses; therefore, we use
&lt;tt class="docutils literal"&gt;newtype&lt;/tt&gt; with a dummy constructor named &lt;tt class="docutils literal"&gt;P&lt;/tt&gt;.&lt;/p&gt;
&lt;p&gt;With this &lt;tt class="docutils literal"&gt;Parser&lt;/tt&gt; type, the act of actually parsing a string is expressed
with the following helper function. It's not strictly necessary, but it helps
make code cleaner by hiding &lt;tt class="docutils literal"&gt;P&lt;/tt&gt; from users of the parser.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;::&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;Parser&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;P&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;inp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;inp&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The most basic parsing primitive plucks off the first character from a given
string:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;::&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;Parser&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;Char&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;P&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;\&lt;/span&gt;&lt;span class="n"&gt;inp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;inp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;of&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                    &lt;/span&gt;&lt;span class="kt"&gt;[]&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="ow"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;[]&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                    &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="kt"&gt;:&lt;/span&gt;&lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="ow"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here's how it works in practice:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&amp;gt; parse item &amp;quot;foo&amp;quot;
[(&amp;#39;f&amp;#39;,&amp;quot;oo&amp;quot;)]
&amp;gt; parse item &amp;quot;f&amp;quot;
[(&amp;#39;f&amp;#39;,&amp;quot;&amp;quot;)]
&amp;gt; parse item &amp;quot;&amp;quot;
[]
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="parser-as-a-functor"&gt;
&lt;h2&gt;Parser as a Functor&lt;/h2&gt;
&lt;p&gt;We'll start by making &lt;tt class="docutils literal"&gt;Parser&lt;/tt&gt; an instance of &lt;tt class="docutils literal"&gt;Functor&lt;/tt&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kr"&gt;instance&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;Functor&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;Parser&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;where&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;-- fmap :: (a -&amp;gt; b) -&amp;gt; Parser a -&amp;gt; Parser b&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;fmap&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;P&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;\&lt;/span&gt;&lt;span class="n"&gt;inp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;inp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;of&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                          &lt;/span&gt;&lt;span class="kt"&gt;[]&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="ow"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;[]&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                          &lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;With &lt;tt class="docutils literal"&gt;fmap&lt;/tt&gt; we can create a new parser from an existing parser, with a
function applied to the parser's output. For example:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&amp;gt; parse (fmap toUpper item) &amp;quot;foo&amp;quot;
[(&amp;#39;F&amp;#39;,&amp;quot;oo&amp;quot;)]
&amp;gt; parse (fmap toUpper item) &amp;quot;&amp;quot;
[]
&lt;/pre&gt;&lt;/div&gt;
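&lt;p&gt;The function passed to &lt;tt class="docutils literal"&gt;fmap&lt;/tt&gt; may also change the result type. As a small extra
sketch (assuming &lt;tt class="docutils literal"&gt;parse&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;item&lt;/tt&gt; as defined earlier in this post are loaded
in GHCi, and using &lt;tt class="docutils literal"&gt;digitToInt&lt;/tt&gt; from &lt;tt class="docutils literal"&gt;Data.Char&lt;/tt&gt;), a &lt;tt class="docutils literal"&gt;Parser Char&lt;/tt&gt; can be
turned into a &lt;tt class="docutils literal"&gt;Parser Int&lt;/tt&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&amp;gt; import Data.Char (digitToInt)
&amp;gt; parse (fmap digitToInt item) &amp;quot;7x&amp;quot;
[(7,&amp;quot;x&amp;quot;)]
&amp;gt; parse (fmap (+1) (fmap digitToInt item)) &amp;quot;7x&amp;quot;
[(8,&amp;quot;x&amp;quot;)]
&lt;/pre&gt;&lt;/div&gt;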
&lt;p&gt;Let's check that the functor laws work for this definition. The first law:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;fmap id = id
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Is fairly obvious when we substitute &lt;tt class="docutils literal"&gt;id&lt;/tt&gt; for &lt;tt class="docutils literal"&gt;g&lt;/tt&gt; in the definition of
&lt;tt class="docutils literal"&gt;fmap&lt;/tt&gt;. We get:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;fmap id p = P (\inp -&amp;gt; case parse p inp of
                        []        -&amp;gt; []
                        [(v,out)] -&amp;gt; [(id v,out)])
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Which takes the parse result of &lt;tt class="docutils literal"&gt;p&lt;/tt&gt; and passes it through without
modification. In other words, it's equivalent to &lt;tt class="docutils literal"&gt;p&lt;/tt&gt; itself, and hence the
first law holds.&lt;/p&gt;
&lt;p&gt;Verifying the second law:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;fmap (g . h) = fmap g . fmap h
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;... is similarly straightforward and is left as an exercise to the reader.&lt;/p&gt;
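&lt;p&gt;For readers who want to compare notes, here is one possible sketch of that
derivation. It unfolds the definition of &lt;tt class="docutils literal"&gt;fmap&lt;/tt&gt; twice; the variable &lt;tt class="docutils literal"&gt;w&lt;/tt&gt; is
introduced here only to keep the two levels apart:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;(fmap g . fmap h) p = fmap g (fmap h p)

fmap g (fmap h p) = P (\inp -&amp;gt; case parse (fmap h p) inp of
                                 []        -&amp;gt; []
                                 [(v,out)] -&amp;gt; [(g v,out)])

.. by the definition of fmap, (parse (fmap h p) inp) is [] when
   (parse p inp) is [], and [(h w,out)] when (parse p inp) is [(w,out)];
   so v is (h w):

fmap g (fmap h p) = P (\inp -&amp;gt; case parse p inp of
                                 []        -&amp;gt; []
                                 [(w,out)] -&amp;gt; [(g (h w),out)])

.. which is the definition of fmap (g . h) p
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;A quick spot check in GHCi (not a proof, just one data point):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&amp;gt; parse (fmap (toUpper . succ) item) &amp;quot;foo&amp;quot;
[(&amp;#39;G&amp;#39;,&amp;quot;oo&amp;quot;)]
&amp;gt; parse ((fmap toUpper . fmap succ) item) &amp;quot;foo&amp;quot;
[(&amp;#39;G&amp;#39;,&amp;quot;oo&amp;quot;)]
&lt;/pre&gt;&lt;/div&gt;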
&lt;p&gt;While it's not obvious why a &lt;tt class="docutils literal"&gt;Functor&lt;/tt&gt; instance for &lt;tt class="docutils literal"&gt;Parser&lt;/tt&gt; is useful in
its own right, it's required to make &lt;tt class="docutils literal"&gt;Parser&lt;/tt&gt; into an
&lt;tt class="docutils literal"&gt;Applicative&lt;/tt&gt;, and it also comes up when combining parsers in applicative style.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="parser-as-an-applicative"&gt;
&lt;h2&gt;Parser as an Applicative&lt;/h2&gt;
&lt;p&gt;Consider parsing conditional expressions in a fictional language:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;if &amp;lt;expr&amp;gt; then &amp;lt;expr&amp;gt; else &amp;lt;expr&amp;gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;To parse such expressions we'd like to say:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;Parse the token &lt;tt class="docutils literal"&gt;if&lt;/tt&gt;&lt;/li&gt;
&lt;li&gt;Parse an &amp;lt;expr&amp;gt;&lt;/li&gt;
&lt;li&gt;Parse the token &lt;tt class="docutils literal"&gt;then&lt;/tt&gt;&lt;/li&gt;
&lt;li&gt;Parse an &amp;lt;expr&amp;gt;&lt;/li&gt;
&lt;li&gt;Parse the token &lt;tt class="docutils literal"&gt;else&lt;/tt&gt;&lt;/li&gt;
&lt;li&gt;Parse an &amp;lt;expr&amp;gt;&lt;/li&gt;
&lt;li&gt;If all of this was successful, combine all the parsed expressions into some
sort of result, like an AST node.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Such sequences, along with alternation (an expression is either &lt;em&gt;this&lt;/em&gt; or
&lt;em&gt;that&lt;/em&gt;), are two of the basic building blocks of non-trivial
parsers. Let's see a popular way to accomplish this in Haskell (for a complete
example demonstrating how to construct a parser for this particular conditional
expression, see the last section in this post).&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://en.wikipedia.org/wiki/Parser_combinator"&gt;Parser combinators&lt;/a&gt; is a
popular technique for constructing complex parsers from simpler parsers, by
means of higher-order functions. In Haskell, one of the ways in which parsers
can be elegantly combined is using applicative style. Here's the &lt;tt class="docutils literal"&gt;Applicative&lt;/tt&gt;
instance for &lt;tt class="docutils literal"&gt;Parser&lt;/tt&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kr"&gt;instance&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;Applicative&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;Parser&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;where&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;-- pure :: a -&amp;gt; Parser a&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;pure&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;P&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;\&lt;/span&gt;&lt;span class="n"&gt;inp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;inp&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;-- &amp;lt;*&amp;gt; :: Parser (a -&amp;gt; b) -&amp;gt; Parser a -&amp;gt; Parser b&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;pg&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;*&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;px&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;P&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;\&lt;/span&gt;&lt;span class="n"&gt;inp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pg&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;inp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;of&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                            &lt;/span&gt;&lt;span class="kt"&gt;[]&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="ow"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;[]&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                            &lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fmap&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;px&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Recall how we created a parser that applied &lt;tt class="docutils literal"&gt;toUpper&lt;/tt&gt; to its result using
&lt;tt class="docutils literal"&gt;fmap&lt;/tt&gt;? We can now do the same in applicative style:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&amp;gt; parse (pure toUpper &amp;lt;*&amp;gt; item) &amp;quot;foo&amp;quot;
[(&amp;#39;F&amp;#39;,&amp;quot;oo&amp;quot;)]
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Let's see why this works. While not too exciting on its own, this application of
a single-argument function is a good segue to more complicated use cases.&lt;/p&gt;
&lt;p&gt;Looking at the &lt;tt class="docutils literal"&gt;Applicative&lt;/tt&gt; instance, &lt;tt class="docutils literal"&gt;pure toUpper&lt;/tt&gt; translates to
&lt;tt class="docutils literal"&gt;P (\inp &lt;span class="pre"&gt;-&amp;gt;&lt;/span&gt; [(toUpper,inp)])&lt;/tt&gt;, a parser that passes its input through
unchanged, returning &lt;tt class="docutils literal"&gt;toUpper&lt;/tt&gt; as a result. Now, substituting &lt;tt class="docutils literal"&gt;item&lt;/tt&gt; into
the definition of &lt;tt class="docutils literal"&gt;&amp;lt;*&amp;gt;&lt;/tt&gt; we get:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;pg &amp;lt;*&amp;gt; item = P (\inp -&amp;gt; case parse pg inp of
                            []        -&amp;gt; []
                            [(g,out)] -&amp;gt; parse (fmap g item) out)

... pg is (pure toUpper), the parsing of which always succeeds, returning
    [(toUpper,inp)]

pg &amp;lt;*&amp;gt; item = P (\inp -&amp;gt; parse (fmap toUpper item) inp)
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In other words, this is exactly the example we had for &lt;tt class="docutils literal"&gt;Functor&lt;/tt&gt; by
&lt;tt class="docutils literal"&gt;fmap&lt;/tt&gt;-ing &lt;tt class="docutils literal"&gt;toUpper&lt;/tt&gt; onto &lt;tt class="docutils literal"&gt;item&lt;/tt&gt;.&lt;/p&gt;
&lt;p&gt;The more interesting case is applying functions with multiple parameters. Here's
how we define a parser that parses three items from the input, dropping the
middle result:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nf"&gt;dropMiddle&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;::&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;Parser&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;Char&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="kt"&gt;Char&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="nf"&gt;dropMiddle&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;=&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;pure&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;selector&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;*&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;*&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;*&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kr"&gt;where&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;selector&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Following the application of nested &lt;tt class="docutils literal"&gt;&amp;lt;*&amp;gt;&lt;/tt&gt; operators is tricky because it
builds a run-time chain of functions referring to other functions. This chain
is only collapsed when the parser is used to actually &lt;tt class="docutils literal"&gt;parse&lt;/tt&gt; some input, so
it is necessary to keep a lot of context &amp;quot;on the fly&amp;quot;. To better understand how
this works, we can break the definition of &lt;tt class="docutils literal"&gt;dropMiddle&lt;/tt&gt; into parts as follows
(since &lt;tt class="docutils literal"&gt;&amp;lt;*&amp;gt;&lt;/tt&gt; is left-associative):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nf"&gt;dropMiddle&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;=&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;pure&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;selector&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;*&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;*&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;*&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kr"&gt;where&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;selector&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Applying the first &lt;tt class="docutils literal"&gt;&amp;lt;*&amp;gt;&lt;/tt&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;pg &amp;lt;*&amp;gt; item = P (\inp -&amp;gt; case parse pg inp of
                            []        -&amp;gt; []
                            [(g,out)] -&amp;gt; parse (fmap g item) out)

... pg is (pure selector), the parsing of which always succeeds, returning
    [(selector,inp)]

pg &amp;lt;*&amp;gt; item = P (\inp -&amp;gt; parse (fmap selector item) inp)  --= app1
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Let's call this parser &lt;tt class="docutils literal"&gt;app1&lt;/tt&gt; and apply the second &lt;tt class="docutils literal"&gt;&amp;lt;*&amp;gt;&lt;/tt&gt; in the sequence.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;app1 &amp;lt;*&amp;gt; item = P (\inp -&amp;gt; case parse app1 inp of
                            []        -&amp;gt; []
                            [(g,out)] -&amp;gt; parse (fmap g item) out)  --= app2
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We'll call this &lt;tt class="docutils literal"&gt;app2&lt;/tt&gt; and move on. Similarly, applying the third &lt;tt class="docutils literal"&gt;&amp;lt;*&amp;gt;&lt;/tt&gt; in
the sequence produces:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;app2 &amp;lt;*&amp;gt; item = P (\inp -&amp;gt; case parse app2 inp of
                            []        -&amp;gt; []
                            [(g,out)] -&amp;gt; parse (fmap g item) out)
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This is &lt;tt class="docutils literal"&gt;dropMiddle&lt;/tt&gt;. It's a chain of parsers expressed as a combination of
higher-order functions (closures, actually).&lt;/p&gt;
&lt;p&gt;To see how this combined parser actually parses input, let's trace through the
execution of:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&amp;gt; parse dropMiddle &amp;quot;pumpkin&amp;quot;
[((&amp;#39;p&amp;#39;,&amp;#39;m&amp;#39;),&amp;quot;pkin&amp;quot;)]
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;tt class="docutils literal"&gt;dropMiddle&lt;/tt&gt; is &lt;tt class="docutils literal"&gt;app2 &amp;lt;*&amp;gt; item&lt;/tt&gt;, so we have:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;-- parse dropMiddle

parse (P (\inp -&amp;gt; case parse app2 inp of
                    []         -&amp;gt; []
                    [(g,out)]  -&amp;gt; parse (fmap g item) out))
      &amp;quot;pumpkin&amp;quot;

.. substituting &amp;quot;pumpkin&amp;quot; into inp

case parse app2 &amp;quot;pumpkin&amp;quot; of
 []         -&amp;gt; []
 [(g,out)]  -&amp;gt; parse (fmap g item) out
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now &lt;tt class="docutils literal"&gt;parse app2 &amp;quot;pumpkin&amp;quot;&lt;/tt&gt; is going to be invoked; &lt;tt class="docutils literal"&gt;app2&lt;/tt&gt; is &lt;tt class="docutils literal"&gt;app1 &amp;lt;*&amp;gt;
item&lt;/tt&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;-- parse app2

case parse app1 &amp;quot;pumpkin&amp;quot; of
 []         -&amp;gt; []
 [(g,out)]  -&amp;gt; parse (fmap g item) out
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Similarly, we get to &lt;tt class="docutils literal"&gt;parse app1 &amp;quot;pumpkin&amp;quot;&lt;/tt&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;-- parse app1

parse (fmap selector item) &amp;quot;pumpkin&amp;quot;

.. following the definition of fmap

parse (P (\inp -&amp;gt; case parse item inp of
                   []        -&amp;gt; []
                   [(v,out)] -&amp;gt; [(selector v,out)]))
      &amp;quot;pumpkin&amp;quot;

.. Since (parse item &amp;quot;pumpkin&amp;quot;) returns [(&amp;#39;p&amp;#39;,&amp;quot;umpkin&amp;quot;)], we get:

[(selector &amp;#39;p&amp;#39;,&amp;quot;umpkin&amp;quot;)]
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Now going back to &lt;tt class="docutils literal"&gt;parse app2&lt;/tt&gt;, knowing what &lt;tt class="docutils literal"&gt;parse app1 &amp;quot;pumpkin&amp;quot;&lt;/tt&gt; returns:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;parse (fmap (selector &amp;#39;p&amp;#39;) item) &amp;quot;umpkin&amp;quot;

.. following the definition of fmap

parse (P (\inp -&amp;gt; case parse item inp of
                   []        -&amp;gt; []
                   [(v,out)] -&amp;gt; [(selector &amp;#39;p&amp;#39; v,out)]))
      &amp;quot;umpkin&amp;quot;

[(selector &amp;#39;p&amp;#39; &amp;#39;u&amp;#39;,&amp;quot;mpkin&amp;quot;)]
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Finally, &lt;tt class="docutils literal"&gt;dropMiddle&lt;/tt&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;app2 &amp;lt;*&amp;gt; item = P (\inp -&amp;gt; case parse app2 inp of
                            []        -&amp;gt; []
                            [(g,out)] -&amp;gt; parse (fmap g item) out)

.. Since (parse app2 &amp;quot;pumpkin&amp;quot;) returns [(selector &amp;#39;p&amp;#39; &amp;#39;u&amp;#39;,&amp;quot;mpkin&amp;quot;)]

parse (fmap (selector &amp;#39;p&amp;#39; &amp;#39;u&amp;#39;) item) &amp;quot;mpkin&amp;quot;

.. If we follow the definition of fmap again, we&amp;#39;ll get:

[(selector &amp;#39;p&amp;#39; &amp;#39;u&amp;#39; &amp;#39;m&amp;#39;,&amp;quot;pkin&amp;quot;)]
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This is the final result of applying &lt;tt class="docutils literal"&gt;dropMiddle&lt;/tt&gt; to &amp;quot;pumpkin&amp;quot;, and when
&lt;tt class="docutils literal"&gt;selector&lt;/tt&gt; is invoked we get &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;[(('p','m'),&amp;quot;pkin&amp;quot;)]&lt;/span&gt;&lt;/tt&gt;, as expected.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="parser-as-a-monad"&gt;
&lt;h2&gt;Parser as a Monad&lt;/h2&gt;
&lt;p&gt;Parsers can also be expressed and combined using monadic style. Here's the
&lt;tt class="docutils literal"&gt;Monad&lt;/tt&gt; instance for &lt;tt class="docutils literal"&gt;Parser&lt;/tt&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kr"&gt;instance&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;Monad&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;Parser&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;where&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;-- return :: a -&amp;gt; Parser a&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pure&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;-- (&amp;gt;&amp;gt;=) :: Parser a -&amp;gt; (a -&amp;gt; Parser b) -&amp;gt; Parser b&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;P&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;\&lt;/span&gt;&lt;span class="n"&gt;inp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;case&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;inp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;of&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                          &lt;/span&gt;&lt;span class="kt"&gt;[]&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="ow"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;[]&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                          &lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Let's take the simple example of applying &lt;tt class="docutils literal"&gt;toUpper&lt;/tt&gt; to &lt;tt class="docutils literal"&gt;item&lt;/tt&gt; again, this
time using monadic operators:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&amp;gt; parse (item &amp;gt;&amp;gt;= (\x -&amp;gt; return $ toUpper x)) &amp;quot;foo&amp;quot;
[(&amp;#39;F&amp;#39;,&amp;quot;oo&amp;quot;)]
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Substituting in the definition of &lt;tt class="docutils literal"&gt;&amp;gt;&amp;gt;=&lt;/tt&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;item &amp;gt;&amp;gt;= (\x -&amp;gt; return $ toUpper x) =
  P (\inp -&amp;gt; case parse item inp of
                []        -&amp;gt; []
                [(v,out)] -&amp;gt; parse (return $ toUpper v) out)

... if item succeeds, this is a parser that will always succeed with
    the upper-cased result of item
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;When writing in monadic style, however, we won't typically be using the &lt;tt class="docutils literal"&gt;&amp;gt;&amp;gt;=&lt;/tt&gt;
operator explicitly; instead, we'll use the &lt;tt class="docutils literal"&gt;do&lt;/tt&gt; notation. Recall that in the
general multi-parameter case, this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;m1 &amp;gt;&amp;gt;= \x1 -&amp;gt;
  m2 &amp;gt;&amp;gt;= \x2 -&amp;gt;
    ...
      mn &amp;gt;&amp;gt;= \xn -&amp;gt; f x1 x2 ... xn
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Is equivalent to this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;do x1 &amp;lt;- m1
   x2 &amp;lt;- m2
   ...
   xn &amp;lt;- mn
   f x1 x2 ... xn
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;So we can also rewrite our example as:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&amp;gt; parse (do x &amp;lt;- item; return $ toUpper x) &amp;quot;foo&amp;quot;
[(&amp;#39;F&amp;#39;,&amp;quot;oo&amp;quot;)]
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The &lt;tt class="docutils literal"&gt;do&lt;/tt&gt; notation starts looking much more attractive for multiple parameters,
however. Here's &lt;tt class="docutils literal"&gt;dropMiddle&lt;/tt&gt; in monadic style written directly &lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;dropMiddleM :: Parser (Char,Char)
dropMiddleM = item &amp;gt;&amp;gt;= \x -&amp;gt;
                item &amp;gt;&amp;gt;= \_ -&amp;gt;
                  item &amp;gt;&amp;gt;= \z -&amp;gt; return (x,z)
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And now rewritten using &lt;tt class="docutils literal"&gt;do&lt;/tt&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;dropMiddleM&amp;#39; :: Parser (Char,Char)
dropMiddleM&amp;#39; =
  do  x &amp;lt;- item
      item
      z &amp;lt;- item
      return (x,z)
&lt;/pre&gt;&lt;/div&gt;
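&lt;p&gt;A quick GHCi check (assuming both definitions above are loaded together with the
&lt;tt class="docutils literal"&gt;Monad&lt;/tt&gt; instance) suggests that the two spellings behave the same, including on
input that is too short:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&amp;gt; parse dropMiddleM &amp;quot;pumpkin&amp;quot;
[((&amp;#39;p&amp;#39;,&amp;#39;m&amp;#39;),&amp;quot;pkin&amp;quot;)]
&amp;gt; parse dropMiddleM&amp;#39; &amp;quot;pumpkin&amp;quot;
[((&amp;#39;p&amp;#39;,&amp;#39;m&amp;#39;),&amp;quot;pkin&amp;quot;)]
&amp;gt; parse dropMiddleM&amp;#39; &amp;quot;pu&amp;quot;
[]
&lt;/pre&gt;&lt;/div&gt;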
&lt;p&gt;Let's do a detailed breakdown of what's happening here to better understand the
monadic sequencing mechanics. I'll be using the direct style (&lt;tt class="docutils literal"&gt;dropMiddleM&lt;/tt&gt;)
to unravel the applications of &lt;tt class="docutils literal"&gt;&amp;gt;&amp;gt;=&lt;/tt&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;item &amp;gt;&amp;gt;= \x -&amp;gt;
  item &amp;gt;&amp;gt;= \_ -&amp;gt;
    item &amp;gt;&amp;gt;= \z -&amp;gt; return (x,z)

.. applying the first &amp;gt;&amp;gt;=, calling the right-hand side rhsX

P (\inp -&amp;gt; case parse item inp of
              []        -&amp;gt; []
              [(v,out)] -&amp;gt; parse (rhsX v) out)

.. the result of parsing the first item is passed in as the argument to rhsX,
   which then returns the next application of &amp;gt;&amp;gt;=; as usual, we acknowledge
   the error propagation and ignore it for simplicity.

P (\inp -&amp;gt; case parse item inp of
              []        -&amp;gt; []
              [(v,out)] -&amp;gt; parse (rhsY v) out)

... and similarly for rhsZ; the final result is invoking &amp;quot;parse (return (x,z))&amp;quot;
    where x is the result of parsing the first item and z the result of
    parsing the third.
&lt;/pre&gt;&lt;/div&gt;
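&lt;p&gt;Because the right-hand side of &lt;tt class="docutils literal"&gt;&amp;gt;&amp;gt;=&lt;/tt&gt; is a function of the previous result, a
monadic parser can decide what to do next based on what it has already parsed. Here's a
minimal standalone sketch of that idea. The &lt;tt class="docutils literal"&gt;Parser&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;parse&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;item&lt;/tt&gt;
definitions and the three instances are repeated so the file compiles on its own (they're
assumed to match the ones shown in this post); &lt;tt class="docutils literal"&gt;failP&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;twice&lt;/tt&gt; are
hypothetical names introduced only for this illustration:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;-- Definitions repeated from the post so this file is self-contained.
newtype Parser a = P (String -&amp;gt; [(a,String)])

parse :: Parser a -&amp;gt; String -&amp;gt; [(a,String)]
parse (P p) inp = p inp

item :: Parser Char
item = P (\inp -&amp;gt; case inp of
                    []     -&amp;gt; []
                    (x:xs) -&amp;gt; [(x,xs)])

instance Functor Parser where
  fmap g p = P (\inp -&amp;gt; case parse p inp of
                          []        -&amp;gt; []
                          [(v,out)] -&amp;gt; [(g v,out)])

instance Applicative Parser where
  pure v = P (\inp -&amp;gt; [(v,inp)])
  pg &amp;lt;*&amp;gt; px = P (\inp -&amp;gt; case parse pg inp of
                           []        -&amp;gt; []
                           [(g,out)] -&amp;gt; parse (fmap g px) out)

-- return is left at its default, which is pure.
instance Monad Parser where
  p &amp;gt;&amp;gt;= f = P (\inp -&amp;gt; case parse p inp of
                         []        -&amp;gt; []
                         [(v,out)] -&amp;gt; parse (f v) out)

-- Hypothetical helper (not from the post): a parser that always fails.
failP :: Parser a
failP = P (\_ -&amp;gt; [])

-- Hypothetical example: succeeds only if the first two characters are equal.
-- The choice between return and failP depends on the values x and y.
twice :: Parser Char
twice = do x &amp;lt;- item
           y &amp;lt;- item
           if x == y then return x else failP

main :: IO ()
main = do
  print (parse twice &amp;quot;aab&amp;quot;)  -- [(&amp;#39;a&amp;#39;,&amp;quot;b&amp;quot;)]
  print (parse twice &amp;quot;abc&amp;quot;)  -- []
  print (parse twice &amp;quot;a&amp;quot;)    -- []
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In the applicative &lt;tt class="docutils literal"&gt;dropMiddle&lt;/tt&gt;, by contrast, the sequence of parsers is fixed up
front and only the final &lt;tt class="docutils literal"&gt;selector&lt;/tt&gt; gets to look at the parsed values.&lt;/p&gt;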
&lt;/div&gt;
&lt;div class="section" id="a-complete-example"&gt;
&lt;h2&gt;A complete example&lt;/h2&gt;
&lt;p&gt;As a complete example, I've expanded the parser grammar found in the book to
support conditional expressions. The full example is &lt;a class="reference external" href="https://github.com/eliben/code-for-blog/blob/main/2017/haskell-parsers/exprparser.hs"&gt;available here&lt;/a&gt;.
Recall that we want to parse expressions of the form:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;if &amp;lt;expr&amp;gt; then &amp;lt;expr&amp;gt; else &amp;lt;expr&amp;gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This is the monadic parser &lt;a class="footnote-reference" href="#footnote-3" id="footnote-reference-3"&gt;[3]&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nf"&gt;ifexpr&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;::&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;Parser&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="nf"&gt;ifexpr&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;do&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;if&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;cond&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;expr&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;then&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;thenExpr&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;expr&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;else&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;elseExpr&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;expr&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kr"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cond&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;then&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;elseExpr&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;else&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;thenExpr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And this is the equivalent applicative version (&lt;tt class="docutils literal"&gt;&amp;lt;$&amp;gt;&lt;/tt&gt; is just an infix
synonym for &lt;tt class="docutils literal"&gt;fmap&lt;/tt&gt;):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nf"&gt;ifexpr&amp;#39;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;::&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;Parser&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="nf"&gt;ifexpr&amp;#39;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;=&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;selector&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;$&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;if&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;*&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;expr&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;*&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;then&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;*&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;expr&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;*&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;else&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;*&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;expr&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kr"&gt;where&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;selector&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;_&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cond&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;_&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;_&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cond&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;then&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kr"&gt;else&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Which one is better? It's really a matter of personal taste. Since both the
monadic and applicative styles deal in &lt;tt class="docutils literal"&gt;Parser&lt;/tt&gt;s, they can be freely mixed
and combined.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Failures could also be signaled by using &lt;tt class="docutils literal"&gt;Maybe&lt;/tt&gt;, but a list lets us
express multiple results (for example a string that can be parsed in
multiple ways). We're not going to be using multiple results in this
article, but it's good to keep this option open.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;We could also use the monadic operator &lt;tt class="docutils literal"&gt;&amp;gt;&amp;gt;&lt;/tt&gt; for statements that
don't create a new assignment, but using &lt;tt class="docutils literal"&gt;&amp;gt;&amp;gt;=&lt;/tt&gt; everywhere for
consistency makes it a bit easier to understand.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-3" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-3"&gt;[3]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;The return value of this parser is &lt;tt class="docutils literal"&gt;Int&lt;/tt&gt;, because it evaluates the
parsed expression on the fly - this technique is called &lt;em&gt;Syntax Directed
Translation&lt;/em&gt; in the Dragon book. Note also that the conditional clauses
are evaluated eagerly, which is valid only when no side effects are
present.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
</content><category term="misc"></category><category term="Haskell"></category><category term="Recursive descent parsing"></category></entry><entry><title>Parsing expressions by precedence climbing</title><link href="https://eli.thegreenplace.net/2012/08/02/parsing-expressions-by-precedence-climbing" rel="alternate"></link><published>2012-08-02T05:48:43-07:00</published><updated>2024-05-04T19:46:23-07:00</updated><author><name>Eli Bendersky</name></author><id>tag:eli.thegreenplace.net,2012-08-02:/2012/08/02/parsing-expressions-by-precedence-climbing</id><summary type="html">
        &lt;p&gt;I've written &lt;a class="reference external" href="https://eli.thegreenplace.net/2009/03/14/some-problems-of-recursive-descent-parsers/"&gt;previously&lt;/a&gt; about the problem recursive descent parsers have with expressions, especially when the language has multiple levels of operator precedence.&lt;/p&gt;
&lt;p&gt;There are several ways to attack this problem. The Wikipedia article on &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Operator-precedence_parser"&gt;operator-precedence parsers&lt;/a&gt; mentions three algorithms: Shunting Yard, top-down operator precedence (TDOP) and precedence climbing. I have …&lt;/p&gt;</summary><content type="html">
        &lt;p&gt;I've written &lt;a class="reference external" href="https://eli.thegreenplace.net/2009/03/14/some-problems-of-recursive-descent-parsers/"&gt;previously&lt;/a&gt; about the problem recursive descent parsers have with expressions, especially when the language has multiple levels of operator precedence.&lt;/p&gt;
&lt;p&gt;There are several ways to attack this problem. The Wikipedia article on &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Operator-precedence_parser"&gt;operator-precedence parsers&lt;/a&gt; mentions three algorithms: Shunting Yard, top-down operator precedence (TDOP) and precedence climbing. I have already covered &lt;a class="reference external" href="https://eli.thegreenplace.net/2009/03/20/a-recursive-descent-parser-with-an-infix-expression-evaluator/"&gt;Shunting Yard&lt;/a&gt; and &lt;a class="reference external" href="https://eli.thegreenplace.net/2010/01/02/top-down-operator-precedence-parsing/"&gt;TDOP&lt;/a&gt; in this blog. Here I aim to present the third method (and the one that actually ends up being used a lot in practice) - precedence climbing.&lt;/p&gt;
&lt;div class="section" id="precedence-climbing-what-it-aims-to-achieve"&gt;
&lt;h3&gt;Precedence climbing - what it aims to achieve&lt;/h3&gt;
&lt;p&gt;It's not necessary to be familiar with the other algorithms for expression parsing in order to understand precedence climbing. In fact, I think that precedence climbing is the simplest of them all. To explain it, I want to first present what the algorithm is trying to achieve. After this, I will explain how it does this, and finally will present a fully functional implementation in Python.&lt;/p&gt;
&lt;p&gt;So the basic goal of the algorithm is the following: treat an expression as a bunch of nested sub-expressions, where each sub-expression has in common the lowest precedence level of the operators it contains.&lt;/p&gt;
&lt;p&gt;Here's a simple example:&lt;/p&gt;
&lt;div class="highlight" style="background: #ffffff"&gt;&lt;pre style="line-height: 125%"&gt;2 + 3 * 4 * 5 - 6
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Assuming that the precedence of &lt;tt class="docutils literal"&gt;+&lt;/tt&gt; (and &lt;tt class="docutils literal"&gt;-&lt;/tt&gt;) is 1 and the precedence of &lt;tt class="docutils literal"&gt;*&lt;/tt&gt; (and &lt;tt class="docutils literal"&gt;/&lt;/tt&gt;) is 2, we have:&lt;/p&gt;
&lt;div class="highlight" style="background: #ffffff"&gt;&lt;pre style="line-height: 125%"&gt;2 + 3 * 4 * 5 - 6

|---------------|   : prec 1
    |-------|       : prec 2
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The sub-expression multiplying the three numbers has a minimal precedence of 2. The sub-expression spanning the whole original expression has a minimal precedence of 1.&lt;/p&gt;
&lt;p&gt;Here's a more complex example, adding a power operator &lt;tt class="docutils literal"&gt;^&lt;/tt&gt; with precedence 3:&lt;/p&gt;
&lt;div class="highlight" style="background: #ffffff"&gt;&lt;pre style="line-height: 125%"&gt;2 + 3 ^ 2 * 3 + 4

|---------------|   : prec 1
    |-------|       : prec 2
    |---|           : prec 3
&lt;/pre&gt;&lt;/div&gt;
&lt;div class="section" id="associativity"&gt;
&lt;h4&gt;Associativity&lt;/h4&gt;
&lt;p&gt;Binary operators, in addition to precedence, also have the concept of &lt;em&gt;associativity&lt;/em&gt;. Simply put, &lt;em&gt;left associative&lt;/em&gt; operators bind more tightly to the left than to the right; for &lt;em&gt;right associative&lt;/em&gt; operators it's the other way around.&lt;/p&gt;
&lt;p&gt;Some examples. Since addition is left associative, this:&lt;/p&gt;
&lt;div class="highlight" style="background: #ffffff"&gt;&lt;pre style="line-height: 125%"&gt;2 + 3 + 4
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Is equivalent to this:&lt;/p&gt;
&lt;div class="highlight" style="background: #ffffff"&gt;&lt;pre style="line-height: 125%"&gt;(2 + 3) + 4
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;On the other hand, power (exponentiation) is right associative. This:&lt;/p&gt;
&lt;div class="highlight" style="background: #ffffff"&gt;&lt;pre style="line-height: 125%"&gt;2 ^ 3 ^ 4
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Is equivalent to this:&lt;/p&gt;
&lt;div class="highlight" style="background: #ffffff"&gt;&lt;pre style="line-height: 125%"&gt;2 ^ (3 ^ 4)
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The precedence climbing algorithm also needs to handle associativity correctly.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="nested-parenthesized-sub-expressions"&gt;
&lt;h4&gt;Nested parenthesized sub-expressions&lt;/h4&gt;
&lt;p&gt;Finally, we all know that parentheses can be used to explicitly group sub-expressions, beating operator precedence. So the following expression computes the addition &lt;em&gt;before&lt;/em&gt; the multiplication:&lt;/p&gt;
&lt;div class="highlight" style="background: #ffffff"&gt;&lt;pre style="line-height: 125%"&gt;2 * (3 + 5) * 7
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;As we'll see, the algorithm has a special provision to cleverly handle nested sub-expressions.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="precedence-climbing-how-it-actually-works"&gt;
&lt;h3&gt;Precedence climbing - how it actually works&lt;/h3&gt;
&lt;p&gt;First let's define some terms. &lt;em&gt;Atoms&lt;/em&gt; are either numbers or parenthesized expressions. &lt;em&gt;Expressions&lt;/em&gt; consist of atoms connected by binary operators &lt;a class="footnote-reference" href="#id4" id="id1"&gt;[1]&lt;/a&gt;. Note how these two terms are mutually dependent. This is normal in the land of grammars and parsers.&lt;/p&gt;
&lt;p&gt;The algorithm is &lt;em&gt;operator-guided&lt;/em&gt;. Its fundamental step is to consume the next atom and look at the operator following it. If the operator has precedence lower than the lowest acceptable for the current step, the algorithm returns. Otherwise, it calls itself in a loop to handle the sub-expression. In pseudo-code, it looks like this &lt;a class="footnote-reference" href="#id5" id="id2"&gt;[2]&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight" style="background: #ffffff"&gt;&lt;pre style="line-height: 125%"&gt;compute_expr(min_prec):
  result = compute_atom()

  while cur token is a binary operator with precedence &amp;gt;= min_prec:
    prec, assoc = precedence and associativity of current token
    if assoc is left:
      next_min_prec = prec + 1
    else:
      next_min_prec = prec
    rhs = compute_expr(next_min_prec)
    result = compute operator(result, rhs)

  return result
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Each recursive call here handles a sequence of operator-connected atoms sharing the same minimal precedence.&lt;/p&gt;
&lt;div class="section" id="an-example"&gt;
&lt;h4&gt;An example&lt;/h4&gt;
&lt;p&gt;To get a feel for how the algorithm works, let's start with an example:&lt;/p&gt;
&lt;div class="highlight" style="background: #ffffff"&gt;&lt;pre style="line-height: 125%"&gt;2 + 3 ^ 2 * 3 + 4
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It's recommended to follow the execution of the algorithm through this expression on paper. The computation is kicked off by calling &lt;tt class="docutils literal"&gt;compute_expr(1)&lt;/tt&gt;, because 1 is the minimal operator precedence among all operators we've defined. Here is the &amp;quot;call tree&amp;quot; the algorithm produces for this expression:&lt;/p&gt;
&lt;div class="highlight" style="background: #ffffff"&gt;&lt;pre style="line-height: 125%"&gt;* compute_expr(1)                # Initial call on the whole expression
  * compute_atom() --&amp;gt; 2
  * compute_expr(2)              # Loop entered, operator &amp;#39;+&amp;#39;
    * compute_atom() --&amp;gt; 3
    * compute_expr(3)
      * compute_atom() --&amp;gt; 2
      * result --&amp;gt; 2             # Loop not entered for &amp;#39;*&amp;#39; (prec &amp;lt; &amp;#39;^&amp;#39;)
    * result = 3 ^ 2 --&amp;gt; 9
    * compute_expr(3)
      * compute_atom() --&amp;gt; 3
      * result --&amp;gt; 3             # Loop not entered for &amp;#39;+&amp;#39; (prec &amp;lt; &amp;#39;*&amp;#39;)
    * result = 9 * 3 --&amp;gt; 27
  * result = 2 + 27 --&amp;gt; 29
  * compute_expr(2)              # Loop entered, operator &amp;#39;+&amp;#39;
    * compute_atom() --&amp;gt; 4
    * result --&amp;gt; 4               # Loop not entered - end of expression
  * result = 29 + 4 --&amp;gt; 33
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="handling-precedence"&gt;
&lt;h4&gt;Handling precedence&lt;/h4&gt;
&lt;p&gt;Note that the algorithm makes one recursive call per binary operator. Some of these calls are short lived - they will only consume an atom and return it because the &lt;tt class="docutils literal"&gt;while&lt;/tt&gt; loop is not entered (this happens on the second 2, as well as on the second 3 in the example expression above). Some are longer lived. The initial call to &lt;tt class="docutils literal"&gt;compute_expr&lt;/tt&gt; will compute the whole expression.&lt;/p&gt;
&lt;p&gt;The &lt;tt class="docutils literal"&gt;while&lt;/tt&gt; loop is the essential ingredient here. It's the thing that makes sure that the current &lt;tt class="docutils literal"&gt;compute_expr&lt;/tt&gt; call handles all consecutive operators with the given minimal precedence before exiting.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="handling-associativity"&gt;
&lt;h4&gt;Handling associativity&lt;/h4&gt;
&lt;p&gt;In my opinion, one of the coolest aspects of this algorithm is the simple and elegant way it handles associativity. It's all in the condition that sets the minimal precedence for the next call either to the current operator's precedence or to that precedence plus one.&lt;/p&gt;
&lt;p&gt;Here's how this works. Assume we have this sub-expression somewhere:&lt;/p&gt;
&lt;div class="highlight" style="background: #ffffff"&gt;&lt;pre style="line-height: 125%"&gt;8 * 9 * 10

  ^
  |
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The arrow marks where the &lt;tt class="docutils literal"&gt;compute_expr&lt;/tt&gt; call is, having entered the &lt;tt class="docutils literal"&gt;while&lt;/tt&gt; loop. &lt;tt class="docutils literal"&gt;prec&lt;/tt&gt; is 2. Since the associativity of &lt;tt class="docutils literal"&gt;*&lt;/tt&gt; is left, &lt;tt class="docutils literal"&gt;next_min_prec&lt;/tt&gt; is set to 3. The recursive call to &lt;tt class="docutils literal"&gt;compute_expr(3)&lt;/tt&gt;, after consuming an atom, sees the next &lt;tt class="docutils literal"&gt;*&lt;/tt&gt; token:&lt;/p&gt;
&lt;div class="highlight" style="background: #ffffff"&gt;&lt;pre style="line-height: 125%"&gt;8 * 9 * 10

      ^
      |
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Since the precedence of &lt;tt class="docutils literal"&gt;*&lt;/tt&gt; is 2, while &lt;tt class="docutils literal"&gt;min_prec&lt;/tt&gt; is 3, the &lt;tt class="docutils literal"&gt;while&lt;/tt&gt; loop never runs and the call returns. So the original &lt;tt class="docutils literal"&gt;compute_expr&lt;/tt&gt; will get to handle the second multiplication, not the internal call. Essentially, this means that the expression is grouped as follows:&lt;/p&gt;
&lt;div class="highlight" style="background: #ffffff"&gt;&lt;pre style="line-height: 125%"&gt;(8 * 9) * 10
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Which is exactly what we want from left associativity.&lt;/p&gt;
&lt;p&gt;In contrast, for this expression:&lt;/p&gt;
&lt;div class="highlight" style="background: #ffffff"&gt;&lt;pre style="line-height: 125%"&gt;8 ^ 9 ^ 10
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The precedence of &lt;tt class="docutils literal"&gt;^&lt;/tt&gt; is 3, and since it's right associative, the &lt;tt class="docutils literal"&gt;min_prec&lt;/tt&gt; for the recursive call stays 3. This will mean that the recursive call &lt;em&gt;will&lt;/em&gt; consume the next &lt;tt class="docutils literal"&gt;^&lt;/tt&gt; operator before returning to the original &lt;tt class="docutils literal"&gt;compute_expr&lt;/tt&gt;, grouping the expression as follows:&lt;/p&gt;
&lt;div class="highlight" style="background: #ffffff"&gt;&lt;pre style="line-height: 125%"&gt;8 ^ (9 ^ 10)
&lt;/pre&gt;&lt;/div&gt;
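&lt;p&gt;To see both groupings side by side, here is a minimal sketch that builds nested tuples instead of computing values. The names &lt;tt class="docutils literal"&gt;group&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;OPS&lt;/tt&gt; are hypothetical, and the sketch assumes well-formed input made only of numbers and binary operators (no parentheses, no error handling):&lt;/p&gt;
&lt;div class="highlight" style="background: #ffffff"&gt;&lt;pre style="line-height: 125%"&gt;
```python
import re

# Hypothetical operator table mirroring the precedences used in this article.
OPS = {'+': (1, 'LEFT'), '-': (1, 'LEFT'),
       '*': (2, 'LEFT'), '/': (2, 'LEFT'),
       '^': (3, 'RIGHT')}


def group(source):
    """Return nested (op, lhs, rhs) tuples showing how source is grouped."""
    tokens = re.findall(r'\d+|\S', source)
    pos = 0

    def atom():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return int(tok)

    def expr(min_prec):
        nonlocal pos
        lhs = atom()
        # Keep going while the next token is an operator we may handle here.
        while pos != len(tokens) and OPS[tokens[pos]][0] >= min_prec:
            op = tokens[pos]
            prec, assoc = OPS[op]
            pos += 1  # consume the operator
            rhs = expr(prec + 1 if assoc == 'LEFT' else prec)
            lhs = (op, lhs, rhs)
        return lhs

    return expr(1)


print(group('8 * 9 * 10'))  # ('*', ('*', 8, 9), 10)
print(group('8 ^ 9 ^ 10'))  # ('^', 8, ('^', 9, 10))
```
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The only thing that differs between the two runs is the value passed to the recursive call: &lt;tt class="docutils literal"&gt;prec + 1&lt;/tt&gt; for the left associative &lt;tt class="docutils literal"&gt;*&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;prec&lt;/tt&gt; for the right associative &lt;tt class="docutils literal"&gt;^&lt;/tt&gt;.&lt;/p&gt;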
&lt;/div&gt;
&lt;div class="section" id="handling-sub-expressions"&gt;
&lt;h4&gt;Handling sub-expressions&lt;/h4&gt;
&lt;p&gt;The algorithm pseudo-code presented above doesn't explain how parenthesized sub-expressions are handled. Consider this expression:&lt;/p&gt;
&lt;div class="highlight" style="background: #ffffff"&gt;&lt;pre style="line-height: 125%"&gt;2000 * (4 - 3) / 100
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It's not clear how the &lt;tt class="docutils literal"&gt;while&lt;/tt&gt; loop can handle this. The answer is &lt;tt class="docutils literal"&gt;compute_atom&lt;/tt&gt;. When it sees a left paren, it knows that a sub-expression will follow, so it calls &lt;tt class="docutils literal"&gt;compute_expr&lt;/tt&gt; on the sub-expression (which lasts until the matching right paren), and returns its result as the result of the atom. So &lt;tt class="docutils literal"&gt;compute_expr&lt;/tt&gt; is oblivious to the existence of sub-expressions.&lt;/p&gt;
&lt;p&gt;Finally, in order to stay short the pseudo-code leaves some interesting details out. What follows is a full implementation of the algorithm that fills all the gaps.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="a-python-implementation"&gt;
&lt;h3&gt;A Python implementation&lt;/h3&gt;
&lt;p&gt;Here is a Python implementation of expression parsing by precedence climbing. It's kept short for simplicity, but can be easily expanded to cover a more real-world language of expressions. The following sections present the code in small chunks. The whole code is &lt;a class="reference external" href="https://github.com/eliben/code-for-blog/blob/main/2012/rd_infix_precedence.py"&gt;available here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I'll start with a small tokenizer class that breaks text into tokens and keeps a state. The grammar is very simple: numeric expressions, the basic arithmetic operators &lt;tt class="docutils literal"&gt;+, &lt;span class="pre"&gt;-,&lt;/span&gt; *, /, ^&lt;/tt&gt; and parens - &lt;tt class="docutils literal"&gt;(, )&lt;/tt&gt;.&lt;/p&gt;
&lt;div class="highlight" style="background: #ffffff"&gt;&lt;pre style="line-height: 125%"&gt;Tok = namedtuple(&lt;span style="color: #7f007f"&gt;&amp;#39;Tok&amp;#39;&lt;/span&gt;, &lt;span style="color: #7f007f"&gt;&amp;#39;name value&amp;#39;&lt;/span&gt;)


&lt;span style="color: #00007f; font-weight: bold"&gt;class&lt;/span&gt; &lt;span style="color: #00007f"&gt;Tokenizer&lt;/span&gt;(&lt;span style="color: #00007f"&gt;object&lt;/span&gt;):
    &lt;span style="color: #7f007f"&gt;&amp;quot;&amp;quot;&amp;quot; Simple tokenizer object. The cur_token attribute holds the current&lt;/span&gt;
&lt;span style="color: #7f007f"&gt;        token (Tok). Call get_next_token() to advance to the&lt;/span&gt;
&lt;span style="color: #7f007f"&gt;        next token. cur_token is None before the first token is&lt;/span&gt;
&lt;span style="color: #7f007f"&gt;        taken and after the source ends.&lt;/span&gt;
&lt;span style="color: #7f007f"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    TOKPATTERN = re.compile(&lt;span style="color: #7f007f"&gt;r&amp;quot;\s*(?:(\d+)|(.))&amp;quot;&lt;/span&gt;)

    &lt;span style="color: #00007f; font-weight: bold"&gt;def&lt;/span&gt; &lt;span style="color: #00007f"&gt;__init__&lt;/span&gt;(&lt;span style="color: #00007f"&gt;self&lt;/span&gt;, source):
        &lt;span style="color: #00007f"&gt;self&lt;/span&gt;._tokgen = &lt;span style="color: #00007f"&gt;self&lt;/span&gt;._gen_tokens(source)
        &lt;span style="color: #00007f"&gt;self&lt;/span&gt;.cur_token = &lt;span style="color: #00007f"&gt;None&lt;/span&gt;

    &lt;span style="color: #00007f; font-weight: bold"&gt;def&lt;/span&gt; &lt;span style="color: #00007f"&gt;get_next_token&lt;/span&gt;(&lt;span style="color: #00007f"&gt;self&lt;/span&gt;):
        &lt;span style="color: #7f007f"&gt;&amp;quot;&amp;quot;&amp;quot; Advance to the next token, and return it.&lt;/span&gt;
&lt;span style="color: #7f007f"&gt;        &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
        &lt;span style="color: #00007f; font-weight: bold"&gt;try&lt;/span&gt;:
            &lt;span style="color: #00007f"&gt;self&lt;/span&gt;.cur_token = &lt;span style="color: #00007f"&gt;self&lt;/span&gt;._tokgen.next()
        &lt;span style="color: #00007f; font-weight: bold"&gt;except&lt;/span&gt; StopIteration:
            &lt;span style="color: #00007f"&gt;self&lt;/span&gt;.cur_token = &lt;span style="color: #00007f"&gt;None&lt;/span&gt;
        &lt;span style="color: #00007f; font-weight: bold"&gt;return&lt;/span&gt; &lt;span style="color: #00007f"&gt;self&lt;/span&gt;.cur_token

    &lt;span style="color: #00007f; font-weight: bold"&gt;def&lt;/span&gt; &lt;span style="color: #00007f"&gt;_gen_tokens&lt;/span&gt;(&lt;span style="color: #00007f"&gt;self&lt;/span&gt;, source):
        &lt;span style="color: #00007f; font-weight: bold"&gt;for&lt;/span&gt; number, operator &lt;span style="color: #0000aa"&gt;in&lt;/span&gt; &lt;span style="color: #00007f"&gt;self&lt;/span&gt;.TOKPATTERN.findall(source):
            &lt;span style="color: #00007f; font-weight: bold"&gt;if&lt;/span&gt; number:
                &lt;span style="color: #00007f; font-weight: bold"&gt;yield&lt;/span&gt; Tok(&lt;span style="color: #7f007f"&gt;&amp;#39;NUMBER&amp;#39;&lt;/span&gt;, number)
            &lt;span style="color: #00007f; font-weight: bold"&gt;elif&lt;/span&gt; operator == &lt;span style="color: #7f007f"&gt;&amp;#39;(&amp;#39;&lt;/span&gt;:
                &lt;span style="color: #00007f; font-weight: bold"&gt;yield&lt;/span&gt; Tok(&lt;span style="color: #7f007f"&gt;&amp;#39;LEFTPAREN&amp;#39;&lt;/span&gt;, &lt;span style="color: #7f007f"&gt;&amp;#39;(&amp;#39;&lt;/span&gt;)
            &lt;span style="color: #00007f; font-weight: bold"&gt;elif&lt;/span&gt; operator == &lt;span style="color: #7f007f"&gt;&amp;#39;)&amp;#39;&lt;/span&gt;:
                &lt;span style="color: #00007f; font-weight: bold"&gt;yield&lt;/span&gt; Tok(&lt;span style="color: #7f007f"&gt;&amp;#39;RIGHTPAREN&amp;#39;&lt;/span&gt;, &lt;span style="color: #7f007f"&gt;&amp;#39;)&amp;#39;&lt;/span&gt;)
            &lt;span style="color: #00007f; font-weight: bold"&gt;else&lt;/span&gt;:
                &lt;span style="color: #00007f; font-weight: bold"&gt;yield&lt;/span&gt; Tok(&lt;span style="color: #7f007f"&gt;&amp;#39;BINOP&amp;#39;&lt;/span&gt;, operator)
&lt;/pre&gt;&lt;/div&gt;
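&lt;p&gt;To get a feel for what the regular expression produces, here is a small standalone sketch that applies the same pattern and labels each match the way the tokenizer does. The &lt;tt class="docutils literal"&gt;classify&lt;/tt&gt; helper is hypothetical and uses plain tuples instead of &lt;tt class="docutils literal"&gt;Tok&lt;/tt&gt;:&lt;/p&gt;
&lt;div class="highlight" style="background: #ffffff"&gt;&lt;pre style="line-height: 125%"&gt;
```python
import re

# The same regular expression as TOKPATTERN: optional whitespace, then either
# a run of digits (group 1) or any single character (group 2).
pattern = re.compile(r"\s*(?:(\d+)|(.))")


def classify(source):
    """Hypothetical helper: list (name, value) pairs like the tokenizer yields."""
    paren_names = {'(': 'LEFTPAREN', ')': 'RIGHTPAREN'}
    tokens = []
    for number, operator in pattern.findall(source):
        if number:
            tokens.append(('NUMBER', number))
        else:
            # Anything that is not a number or a paren is labeled BINOP.
            tokens.append((paren_names.get(operator, 'BINOP'), operator))
    return tokens


for tok in classify('2 * (3 + 5)'):
    print(tok)
```
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Note that any single character that isn't a digit or a paren ends up labeled &lt;tt class="docutils literal"&gt;BINOP&lt;/tt&gt;, so an unsupported character is only noticed later, when the parser looks it up.&lt;/p&gt;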
&lt;p&gt;Next, &lt;tt class="docutils literal"&gt;compute_atom&lt;/tt&gt;:&lt;/p&gt;
&lt;div class="highlight" style="background: #ffffff"&gt;&lt;pre style="line-height: 125%"&gt;&lt;span style="color: #00007f; font-weight: bold"&gt;def&lt;/span&gt; &lt;span style="color: #00007f"&gt;compute_atom&lt;/span&gt;(tokenizer):
    tok = tokenizer.cur_token
    &lt;span style="color: #00007f; font-weight: bold"&gt;if&lt;/span&gt; tok.name == &lt;span style="color: #7f007f"&gt;&amp;#39;LEFTPAREN&amp;#39;&lt;/span&gt;:
        tokenizer.get_next_token()
        val = compute_expr(tokenizer, &lt;span style="color: #007f7f"&gt;1&lt;/span&gt;)
        &lt;span style="color: #00007f; font-weight: bold"&gt;if&lt;/span&gt; tokenizer.cur_token.name != &lt;span style="color: #7f007f"&gt;&amp;#39;RIGHTPAREN&amp;#39;&lt;/span&gt;:
            parse_error(&lt;span style="color: #7f007f"&gt;&amp;#39;unmatched &amp;quot;(&amp;quot;&amp;#39;&lt;/span&gt;)
        tokenizer.get_next_token()
        &lt;span style="color: #00007f; font-weight: bold"&gt;return&lt;/span&gt; val
    &lt;span style="color: #00007f; font-weight: bold"&gt;elif&lt;/span&gt; tok &lt;span style="color: #0000aa"&gt;is&lt;/span&gt; &lt;span style="color: #00007f"&gt;None&lt;/span&gt;:
            parse_error(&lt;span style="color: #7f007f"&gt;&amp;#39;source ended unexpectedly&amp;#39;&lt;/span&gt;)
    &lt;span style="color: #00007f; font-weight: bold"&gt;elif&lt;/span&gt; tok.name == &lt;span style="color: #7f007f"&gt;&amp;#39;BINOP&amp;#39;&lt;/span&gt;:
        parse_error(&lt;span style="color: #7f007f"&gt;&amp;#39;expected an atom, not an operator &amp;quot;%s&amp;quot;&amp;#39;&lt;/span&gt; % tok.value)
    &lt;span style="color: #00007f; font-weight: bold"&gt;else&lt;/span&gt;:
        &lt;span style="color: #00007f; font-weight: bold"&gt;assert&lt;/span&gt; tok.name == &lt;span style="color: #7f007f"&gt;&amp;#39;NUMBER&amp;#39;&lt;/span&gt;
        tokenizer.get_next_token()
        &lt;span style="color: #00007f; font-weight: bold"&gt;return&lt;/span&gt; &lt;span style="color: #00007f"&gt;int&lt;/span&gt;(tok.value)
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;It handles true atoms (numbers in our case), as well as parenthesized sub-expressions.&lt;/p&gt;
&lt;p&gt;Here is &lt;tt class="docutils literal"&gt;compute_expr&lt;/tt&gt; itself, which is very close to the pseudo-code shown above:&lt;/p&gt;
&lt;div class="highlight" style="background: #ffffff"&gt;&lt;pre style="line-height: 125%"&gt;&lt;span style="color: #007f00"&gt;# For each operator, a (precedence, associativity) pair.&lt;/span&gt;
OpInfo = namedtuple(&lt;span style="color: #7f007f"&gt;&amp;#39;OpInfo&amp;#39;&lt;/span&gt;, &lt;span style="color: #7f007f"&gt;&amp;#39;prec assoc&amp;#39;&lt;/span&gt;)

OPINFO_MAP = {
    &lt;span style="color: #7f007f"&gt;&amp;#39;+&amp;#39;&lt;/span&gt;:    OpInfo(&lt;span style="color: #007f7f"&gt;1&lt;/span&gt;, &lt;span style="color: #7f007f"&gt;&amp;#39;LEFT&amp;#39;&lt;/span&gt;),
    &lt;span style="color: #7f007f"&gt;&amp;#39;-&amp;#39;&lt;/span&gt;:    OpInfo(&lt;span style="color: #007f7f"&gt;1&lt;/span&gt;, &lt;span style="color: #7f007f"&gt;&amp;#39;LEFT&amp;#39;&lt;/span&gt;),
    &lt;span style="color: #7f007f"&gt;&amp;#39;*&amp;#39;&lt;/span&gt;:    OpInfo(&lt;span style="color: #007f7f"&gt;2&lt;/span&gt;, &lt;span style="color: #7f007f"&gt;&amp;#39;LEFT&amp;#39;&lt;/span&gt;),
    &lt;span style="color: #7f007f"&gt;&amp;#39;/&amp;#39;&lt;/span&gt;:    OpInfo(&lt;span style="color: #007f7f"&gt;2&lt;/span&gt;, &lt;span style="color: #7f007f"&gt;&amp;#39;LEFT&amp;#39;&lt;/span&gt;),
    &lt;span style="color: #7f007f"&gt;&amp;#39;^&amp;#39;&lt;/span&gt;:    OpInfo(&lt;span style="color: #007f7f"&gt;3&lt;/span&gt;, &lt;span style="color: #7f007f"&gt;&amp;#39;RIGHT&amp;#39;&lt;/span&gt;),
}

&lt;span style="color: #00007f; font-weight: bold"&gt;def&lt;/span&gt; &lt;span style="color: #00007f"&gt;compute_expr&lt;/span&gt;(tokenizer, min_prec):
    atom_lhs = compute_atom(tokenizer)

    &lt;span style="color: #00007f; font-weight: bold"&gt;while&lt;/span&gt; &lt;span style="color: #00007f"&gt;True&lt;/span&gt;:
        cur = tokenizer.cur_token
        &lt;span style="color: #00007f; font-weight: bold"&gt;if&lt;/span&gt; (cur &lt;span style="color: #0000aa"&gt;is&lt;/span&gt; &lt;span style="color: #00007f"&gt;None&lt;/span&gt; &lt;span style="color: #0000aa"&gt;or&lt;/span&gt; cur.name != &lt;span style="color: #7f007f"&gt;&amp;#39;BINOP&amp;#39;&lt;/span&gt;
                        &lt;span style="color: #0000aa"&gt;or&lt;/span&gt; OPINFO_MAP[cur.value].prec &amp;lt; min_prec):
            &lt;span style="color: #00007f; font-weight: bold"&gt;break&lt;/span&gt;

        &lt;span style="color: #007f00"&gt;# Inside this loop the current token is a binary operator&lt;/span&gt;
        &lt;span style="color: #00007f; font-weight: bold"&gt;assert&lt;/span&gt; cur.name == &lt;span style="color: #7f007f"&gt;&amp;#39;BINOP&amp;#39;&lt;/span&gt;

        &lt;span style="color: #007f00"&gt;# Get the operator&amp;#39;s precedence and associativity, and compute a&lt;/span&gt;
        &lt;span style="color: #007f00"&gt;# minimal precedence for the recursive call&lt;/span&gt;
        op = cur.value
        prec, assoc = OPINFO_MAP[op]
        next_min_prec = prec + &lt;span style="color: #007f7f"&gt;1&lt;/span&gt; &lt;span style="color: #00007f; font-weight: bold"&gt;if&lt;/span&gt; assoc == &lt;span style="color: #7f007f"&gt;&amp;#39;LEFT&amp;#39;&lt;/span&gt; &lt;span style="color: #00007f; font-weight: bold"&gt;else&lt;/span&gt; prec

        &lt;span style="color: #007f00"&gt;# Consume the current token and prepare the next one for the&lt;/span&gt;
        &lt;span style="color: #007f00"&gt;# recursive call&lt;/span&gt;
        tokenizer.get_next_token()
        atom_rhs = compute_expr(tokenizer, next_min_prec)

        &lt;span style="color: #007f00"&gt;# Update lhs with the new value&lt;/span&gt;
        atom_lhs = compute_op(op, atom_lhs, atom_rhs)

    &lt;span style="color: #00007f; font-weight: bold"&gt;return&lt;/span&gt; atom_lhs
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The only difference is that this code makes token handling more explicit. It basically follows the usual &amp;quot;recursive-descent protocol&amp;quot;. Each recursive call has the current token available in &lt;tt class="docutils literal"&gt;tokenizer.cur_token&lt;/tt&gt;, and makes sure to consume all the tokens it has handled (by calling &lt;tt class="docutils literal"&gt;tokenizer.get_next_token()&lt;/tt&gt;).&lt;/p&gt;
&lt;p&gt;One additional small piece is missing. &lt;tt class="docutils literal"&gt;compute_op&lt;/tt&gt; simply performs the arithmetic computation for the supported binary operators:&lt;/p&gt;
&lt;div class="highlight" style="background: #ffffff"&gt;&lt;pre style="line-height: 125%"&gt;&lt;span style="color: #00007f; font-weight: bold"&gt;def&lt;/span&gt; &lt;span style="color: #00007f"&gt;compute_op&lt;/span&gt;(op, lhs, rhs):
    lhs = &lt;span style="color: #00007f"&gt;int&lt;/span&gt;(lhs); rhs = &lt;span style="color: #00007f"&gt;int&lt;/span&gt;(rhs)
    &lt;span style="color: #00007f; font-weight: bold"&gt;if&lt;/span&gt; op == &lt;span style="color: #7f007f"&gt;&amp;#39;+&amp;#39;&lt;/span&gt;:   &lt;span style="color: #00007f; font-weight: bold"&gt;return&lt;/span&gt; lhs + rhs
    &lt;span style="color: #00007f; font-weight: bold"&gt;elif&lt;/span&gt; op == &lt;span style="color: #7f007f"&gt;&amp;#39;-&amp;#39;&lt;/span&gt;: &lt;span style="color: #00007f; font-weight: bold"&gt;return&lt;/span&gt; lhs - rhs
    &lt;span style="color: #00007f; font-weight: bold"&gt;elif&lt;/span&gt; op == &lt;span style="color: #7f007f"&gt;&amp;#39;*&amp;#39;&lt;/span&gt;: &lt;span style="color: #00007f; font-weight: bold"&gt;return&lt;/span&gt; lhs * rhs
    &lt;span style="color: #00007f; font-weight: bold"&gt;elif&lt;/span&gt; op == &lt;span style="color: #7f007f"&gt;&amp;#39;/&amp;#39;&lt;/span&gt;: &lt;span style="color: #00007f; font-weight: bold"&gt;return&lt;/span&gt; lhs / rhs
    &lt;span style="color: #00007f; font-weight: bold"&gt;elif&lt;/span&gt; op == &lt;span style="color: #7f007f"&gt;&amp;#39;^&amp;#39;&lt;/span&gt;: &lt;span style="color: #00007f; font-weight: bold"&gt;return&lt;/span&gt; lhs ** rhs
    &lt;span style="color: #00007f; font-weight: bold"&gt;else&lt;/span&gt;:
        parse_error(&lt;span style="color: #7f007f"&gt;&amp;#39;unknown operator &amp;quot;%s&amp;quot;&amp;#39;&lt;/span&gt; % op)
&lt;/pre&gt;&lt;/div&gt;
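&lt;p&gt;&lt;tt class="docutils literal"&gt;compute_op&lt;/tt&gt; is also the place where the evaluator could be turned into a tree builder, the modification that footnote [2] calls trivial. The following is only a sketch under assumptions: the &lt;tt class="docutils literal"&gt;BinOp&lt;/tt&gt; node type and the &lt;tt class="docutils literal"&gt;build_op&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;evaluate&lt;/tt&gt; helpers are hypothetical names, not part of the code above. The idea is that &lt;tt class="docutils literal"&gt;compute_expr&lt;/tt&gt; would call &lt;tt class="docutils literal"&gt;build_op&lt;/tt&gt; in place of &lt;tt class="docutils literal"&gt;compute_op&lt;/tt&gt;, and a separate walk would compute the value afterwards.&lt;/p&gt;

```python
import operator
from collections import namedtuple

# Hypothetical tree node for a binary operation.
BinOp = namedtuple('BinOp', 'op lhs rhs')

# The same operators that compute_op supports.
OPERATORS = {
    '+': operator.add,
    '-': operator.sub,
    '*': operator.mul,
    '/': operator.truediv,
    '^': operator.pow,
}


def build_op(op, lhs, rhs):
    """Alternative to compute_op: returns a tree node instead of a number."""
    if op not in OPERATORS:
        raise ValueError('unknown operator "%s"' % op)
    return BinOp(op, lhs, rhs)


def evaluate(node):
    """Walk a tree produced by build_op and compute its value."""
    if isinstance(node, BinOp):
        return OPERATORS[node.op](evaluate(node.lhs), evaluate(node.rhs))
    return int(node)  # leaves are atoms, converted the way compute_op does


if __name__ == '__main__':
    # Tree for 1 + 2 * 3, shaped the way the precedence rules would group it.
    tree = build_op('+', 1, build_op('*', 2, 3))
    print(tree)
    print(evaluate(tree))
```

&lt;p&gt;Since the tree already encodes precedence and associativity, the evaluation step needs no knowledge of either.&lt;/p&gt;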
&lt;/div&gt;
&lt;div class="section" id="in-the-real-world-clang"&gt;
&lt;h3&gt;In the real world - Clang&lt;/h3&gt;
&lt;p&gt;Precedence climbing is used in real-world tools. One example is &lt;a class="reference external" href="http://clang.llvm.org/"&gt;Clang&lt;/a&gt;, the C/C++/ObjC front-end. Clang's parser is a hand-written recursive-descent parser, and it uses precedence climbing to parse expressions efficiently. If you're interested in seeing the code, it's &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;Parser::ParseExpression&lt;/span&gt;&lt;/tt&gt; in &lt;tt class="docutils literal"&gt;lib/Parse/ParseExpr.cpp&lt;/tt&gt; &lt;a class="footnote-reference" href="#id6" id="id3"&gt;[3]&lt;/a&gt;. This method plays the role of &lt;tt class="docutils literal"&gt;compute_expr&lt;/tt&gt;. The role of &lt;tt class="docutils literal"&gt;compute_atom&lt;/tt&gt; is played by &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;Parser::ParseCastExpression&lt;/span&gt;&lt;/tt&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="other-resources"&gt;
&lt;h3&gt;Other resources&lt;/h3&gt;
&lt;p&gt;Here are some resources I found useful while writing this article:&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;The Wikipedia page for &lt;a class="reference external" href="http://en.wikipedia.org/wiki/Operator-precedence_parser"&gt;Operator-precedence parsing&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;The &lt;a class="reference external" href="http://antlr.org/papers/Clarke-expr-parsing-1986.pdf"&gt;article by Keith Clarke&lt;/a&gt; (PDF), one of the early inventors of the technique.&lt;/li&gt;
&lt;li&gt;&lt;a class="reference external" href="http://www.engr.mun.ca/~theo/Misc/exp_parsing.htm"&gt;This page&lt;/a&gt; by Theodore Norvell, about parsing expressions by recursive descent.&lt;/li&gt;
&lt;li&gt;The Clang source code (exact locations given in the previous section).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;
&lt;i&gt;&lt;b&gt;Update (2016-11-02):&lt;/b&gt; Andy Chu &lt;a href="http://www.oilshell.org/blog/2016/11/01.html"&gt;notes&lt;/a&gt;
that precedence climbing and &lt;a href="https://eli.thegreenplace.net/2010/01/02/top-down-operator-precedence-parsing"&gt;TDOP&lt;/a&gt;
are pretty much the same algorithm, formulated a bit differently. I tend to agree,
and also note that &lt;a href="https://eli.thegreenplace.net/2009/03/20/a-recursive-descent-parser-with-an-infix-expression-evaluator"&gt;Shunting Yard&lt;/a&gt;
is again the same algorithm, except that the explicit recursion is replaced by
a stack.&lt;/i&gt;
&lt;/p&gt;
&lt;img class="align-center" src="https://eli.thegreenplace.net/images/hline.jpg" style="width: 320px; height: 5px;" /&gt;
&lt;table class="docutils footnote" frame="void" id="id4" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#id1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;There are a couple of simplifications made here on purpose. First, I assume only numeric expressions. Identifiers that represent variables can also be viewed as atoms. Second, I ignore unary operators. These are quite easy to incorporate into the algorithm by also treating them as atoms. I leave them out for succinctness.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="id5" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#id2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;In this article I present a parser that computes the result of a numeric expression on the fly. Modifying it to accumulate the result into some kind of parse tree is trivial.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="id6" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#id3"&gt;[3]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Clang's source code is constantly in flux. This information is correct at least as of the date the article was written.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;

    </content><category term="misc"></category><category term="Compilation"></category><category term="Recursive descent parsing"></category></entry></feed>