ASTs (Abstract Syntax Trees) are an important data structure in compiler front-ends. If you've written a few parsers, you almost definitely ran into the need to describe the result of the parsing in terms of an AST. While the kinds of nodes such ASTs have and their structure is very specific to the source language, many commonalities come up. In other words, coding "yet another AST" gets really old after you've done it a few times.
Worry not, as you'd expect from the programmer crowd, this problem was "solved" by adding another level of abstraction. Yes, an abstraction over Abstract Syntax Trees, oh my! The abstraction here is some textual format (let's call it a DSL to sound smart) that describes what the AST looks like, along with machinery to auto-generate the code that implements this AST.
Most solutions in this domain are ad-hoc, but one that I've seen used more than once is ASDL - Abstract Syntax Definition Language. The self-description from the website sounds about right:
The Zephyr Abstract Syntax Description Lanuguage (ASDL) is a language designed to describe the tree-like data structures in compilers. Its main goal is to provide a method for compiler components written in different languages to interoperate. ASDL makes it easier for applications written in a variety of programming languages to communicate complex recursive data structures.
To give an example, here's a short snippet from an ASDL definition of a simple programming language:
program = Program(class* classes)
class = Class(identifier name, identifier? parent, feature* features)
[...]
expression = Assign(identifier name, expression expr)
| StaticDispatch(expression expr, identifier type_name,
identifier name, expression* actual)
| Dispatch(expression expr, identifier name, expression* actual)
[...]
The way to read this is: a program node consists of one or more classes. Each class has these children: a name which is an identifier, an optional parent which is also an identifier, and a (potentially empty) list of features, each of which is a feature node. And so on.
The full details are available in the paper "The Zephyr Abstract Syntax Definition Language" by Wang et.al. Unfortunately, a link to this paper isn't always trivial to find, so I have a PDF copy in the docs directory of my asdl_parser project, which I'm going to discuss soon.
Type safety in ASDL
In addition to providing a concise description of nodes from which code (in many languages) can be generated automatically, I like ASDL for another reason. It provides some type safety when constructing the AST in the parser.
Take the snippet above, for example. A program has the classes attribute, which is a (potentially empty) sequence of class nodes. Each such class has to be a Class, which is precisely defined. It can be nothing else. The expression below that shows it differently - an expression can be either a Assign, StaticDispatch, etc.
The set of possibilities is statically defined. This makes it possible to insert some degree of static checking into the (auto-generated) AST construction code. So the constructed AST can't be completely bogus even before semantic analysis is applied. Even though I love Python, I do appreciate a bit of static type checking in the right places. Key data structures like ASTs are, I believe, one of the places when such type checking makes sense.
ASDL in upstream CPython
Starting with Python 2.5, the CPython compiler (the part responsible for emitting bytecode from Python source) uses an ASDL description to create an AST for Python source. The AST is created by the parser (from the parse tree - more details in PEP 339), and is then used to create the control-flow graph, from which bytecode is emitted.
The ASDL description lives in Parser/Python.asdl in the CPython source tree. Parser/asdl_c.py is a script that runs whenever someone modifies this ASDL description. It uses the Parser/asdl.py module to parse the ASDL file into an internal form and then emits C code that describes the ASTs. This C code lives in Include/Python-ast.h and Python/Python-ast.c [1].
This may be more details than you wanted to hear :-) The gist of it, however, is - CPython's ASTs are described in ASDL, and if you want a quick glance of how these ASTs look, Parser/Python.asdl is the file to look at.
My rewrite of the ASDL parser
Up until very recently, the ASDL description of the CPython AST was parsed by a tool that relied on the SPARK parsing toolkit. In fact, Parser/spark.py was carried around in the distribution just for this purpose.
A few months ago I was looking for something to conveniently implement the AST for a toy compiler I was hacking on. Being a CPython developer, ASDL immediately sprang to mind, but I was reluctant to carry the SPARK dependency and/or learn how to use it. The ASDL language seemed simple enough to not require such machinery. Surely a simple recursive-descent parser would do. So I implemented my own stand-alone parser for ASDL, using modern Python 3.x - and it's available in a public Github repository right here. Feel free to use it, and let me know how it goes!
Since my parser turned out to be much simpler and easier to grok than upstream CPython's SPARK-based parser, I proposed to replace it in Issue 19655. After some delays (caused mainly by waiting for 3.4 release and then getting distracted by other stuff), the change landed in the default branch (on its way to 3.5) about a month ago. The result is pleasing - the new parser is shorter, doesn't require the SPARK dependency (which was now dropped), has tests and is much more maintainable.
In the interest of not changing too much at once, I left the interface to the C code generator (Parser/asdl_c.py) the same, so there is absolutely no difference in the produced C code. Some time in the future it may make sense to revise this decision. The C generator is also fairly old code that could use some modernization and tests.
Historical note - AST description in pycparser
I first ran into this problem (high-level description of ASTs) when I was working on pycparser (which is quite an old project by now).
Back at the time, I looked at the compiler module of Python 2.x and liked its approach of simple textual description of the AST which is then parsed and from which the code for AST nodes is emitted. The compiler module was a maintenance headache (because it duplicated a lot of the AST logic from the actual compiler) and is gone in Python 3.x, replaced by the ast module which provides access to the same C-based AST generated by Parser/asdl_c.py as is used by the CPython compiler.
pycparser's AST description is a simple textual file that's very similar in spirit to ASDL. If I were to do this today, I'd probably also pick ASDL since it's more "standard", as well as for the extra type safety guarantees it provides.
[1] | Even though these files are auto-generated, they are also checked into the CPython Mercurial repository. This is because we don't want people building Python from source to depend on the tools required to generate such files. Only core CPython developers who want to play with the internals need them. |