Python internals: Working with Python ASTs

Starting with Python 2.5, the Python compiler (the part that takes your source code and translates it into bytecode for the Python VM to execute) works as follows [1]:

1. Parse source code into a parse tree (Parser/pgen.c)
2. Transform parse tree into an Abstract Syntax Tree (Python/ast.c)
3. Transform AST into a Control Flow Graph (Python/compile.c)
4. Emit bytecode based on the Control Flow Graph (Python/compile.c)
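
To make the steps above a bit more concrete, here is a minimal sketch of the parts of the pipeline that are visible from Python code. It is written in Python 3 syntax (an assumption on my part, so it runs on a current interpreter), and the filename '<sketch>' and the variable names are purely illustrative. The parse tree and the control flow graph stay internal to the compiler; only the AST and the final code object can be inspected this way.

```python
import ast
import dis

source = "x = 6 + 8"

# Steps 1-2: source text -> AST (the intermediate parse tree is not exposed here).
tree = ast.parse(source, filename="<sketch>", mode="exec")

# Steps 3-4: AST -> code object (the control flow graph is internal to the compiler).
code = compile(tree, "<sketch>", "exec")

# Show the bytecode that was emitted for the code object.
dis.dis(code)

# Execute the code object in a fresh namespace and read back the result.
namespace = {}
exec(code, namespace)
print(namespace["x"])  # prints 14
```

The exact bytecode listing differs between Python versions, so it is best read as an illustration rather than a stable reference.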

Previously, the only way to tap into the compilation process was to obtain the parse tree with the parser module. But parse trees are much less convenient than ASTs for code transformation and generation. This is why the addition of the _ast module in Python 2.5 was welcome: it became much simpler to play with ASTs created by Python, and even to modify them. In addition, Python's built-in compile function can now accept an AST object as well as source code.

Python 2.6 then took another step forward by including the higher-level ast module in its standard library. ast is a convenient toolbox, written in Python, that helps with using _ast [2]. All in all, we now have a very convenient framework for processing Python source code. A full Python-to-AST parser is included with the standard distribution - what more could we ask? This makes many kinds of source transformation tasks in Python much simpler.

What follows are a few examples of cool things that can be done with the new _ast and ast modules.

Manually building ASTs

import ast

node = ast.Expression(ast.BinOp(
                ast.Str('xy'),
                ast.Mult(),
                ast.Num(3)))

fixed = ast.fix_missing_locations(node)

codeobj = compile(fixed, '<string>', 'eval')
print eval(codeobj)

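Let's see what is going on here. First we manually create an AST node, using the AST node classes exported by ast [3]. Then the convenient fix_missing_locations function is called to patch the lineno and col_offset attributes of the node and its children, which the compiler requires. Finally, the node is compiled in 'eval' mode and the resulting code object is evaluated, printing xyxyxy.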

Another useful function that can help is ast.dump. Here's a formatted dump of the node we've created:

Expression(
  body=BinOp(
         left=Str(s='xy'),
         op=Mult(),
         right=Num(n=3)))
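
For readers on a current interpreter, here is a hedged sketch of the same node built and dumped in Python 3 syntax. The main assumption is the Python version: in Python 3.8 and later, ast.Constant stands in for both Str and Num, so the dump will show Constant nodes rather than the Str and Num nodes above. The filename '<sketch>' is just a placeholder.

```python
import ast

# Python 3 spelling of the hand-built node: Constant replaces Str and Num.
node = ast.Expression(
    body=ast.BinOp(
        left=ast.Constant(value="xy"),
        op=ast.Mult(),
        right=ast.Constant(value=3),
    )
)

# Fill in the lineno / col_offset attributes the compiler insists on.
node = ast.fix_missing_locations(node)

# ast.dump returns a string describing the tree; its exact layout varies by version.
dumped = ast.dump(node)
print(dumped)

# The node still compiles and evaluates as before.
result = eval(compile(node, "<sketch>", "eval"))
print(result)  # prints xyxyxy
```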

The most useful single-place reference for the various AST nodes and their structure is Parser/Python.asdl in the source distribution.

Breaking compilation into pieces

Given some source code, we first parse it into an AST, and then compile this AST into a code object that can be evaluated:

import ast

source = '6 + 8'
node = ast.parse(source, mode='eval')

print eval(compile(node, '<string>', mode='eval'))

Again, ast.dump can be helpful to show the AST that was created:

Expression(
  body=BinOp(
         left=Num(n=6),
         op=Add(),
         right=Num(n=8)))
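
The point of splitting compilation in two is that the AST can be inspected or changed between the steps. Below is a small sketch of that idea, in Python 3 syntax (my assumption, as is the '<sketch>' filename): it parses the same expression, swaps the Add operator for Mult, and only then compiles.

```python
import ast

# Step 1: parse the expression into an AST.
tree = ast.parse("6 + 8", mode="eval")

# tree.body is the BinOp node; replace its operator before compiling.
assert isinstance(tree.body, ast.BinOp)
tree.body.op = ast.Mult()

# Step 2: compile the modified AST and evaluate it.
code = compile(tree, "<sketch>", "eval")
product = eval(code)
print(product)  # prints 48, i.e. 6 * 8
```

Operator nodes such as Mult carry no position information, so no call to fix_missing_locations is needed for this particular change.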

Simple visiting and transformation of ASTs

import ast

class MyVisitor(ast.NodeVisitor):
    def visit_Str(self, node):
        print 'Found string "%s"' % node.s


class MyTransformer(ast.NodeTransformer):
    def visit_Str(self, node):
        return ast.Str('str: ' + node.s)


node = ast.parse('''
favs = ['berry', 'apple']
name = 'peter'

for item in favs:
    print '%s likes %s' % (name, item)
''')

MyTransformer().visit(node)
MyVisitor().visit(node)

This prints:

Found string "str: berry"
Found string "str: apple"
Found string "str: peter"
Found string "str: %s likes %s"

The visitor class implements methods that are called for relevant AST nodes (for example, visit_Str is called for Str nodes). The transformer is a bit more complex: it calls the relevant method for each AST node and then replaces the node with the method's return value. Because the transformer runs before the visitor here, the visitor already sees the prefixed strings.
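
Here is a sketch of the same visitor/transformer pair in Python 3 syntax, for anyone who wants to try it on a current interpreter. The assumptions: Python 3.8 or later, where string literals are Constant nodes and the hook is therefore visit_Constant, and the class names PrefixStrings and CollectStrings, which are made up for this illustration. It also uses ast.copy_location so that each replacement node keeps the position of the node it replaces.

```python
import ast


class PrefixStrings(ast.NodeTransformer):
    """Hypothetical transformer: prefixes every string literal with 'str: '."""

    def visit_Constant(self, node):
        if isinstance(node.value, str):
            new_node = ast.Constant(value="str: " + node.value)
            # Keep the original position so the tree stays compilable.
            return ast.copy_location(new_node, node)
        # Non-string constants are returned unchanged.
        return node


class CollectStrings(ast.NodeVisitor):
    """Hypothetical visitor: records every string literal it encounters."""

    def __init__(self):
        self.found = []

    def visit_Constant(self, node):
        if isinstance(node.value, str):
            self.found.append(node.value)


tree = ast.parse("favs = ['berry', 'apple']\nname = 'peter'\n")

# The transformer returns the (possibly replaced) root node.
tree = PrefixStrings().visit(tree)

collector = CollectStrings()
collector.visit(tree)
print(collector.found)  # prints ['str: berry', 'str: apple', 'str: peter']
```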

To prove that the transformed code is perfectly valid, we can just compile and execute it:

node = ast.fix_missing_locations(node)
exec compile(node, '<string>', 'exec')

As expected [4], this prints:

str: str: peter likes str: berry
str: str: peter likes str: apple

Reproducing Python source from AST nodes

Armin Ronacher [5] wrote a module named codegen that uses the facilities of ast to print back Python source from an AST. Here's how to show the source for the node we transformed in the previous example:

import codegen
print codegen.to_source(node)

And the result:

favs = ['str: berry', 'str: apple']
name = 'str: peter'
for item in favs:
    print 'str: %s likes %s' % (name, item)

Yep, looks right. codegen is very useful for debugging, and for tools that transform Python code and want to save the results [6]. Unfortunately, the version you get from Armin's website isn't compatible with the version of ast that made it into the standard library. A slightly patched version of codegen that works with the standard 2.6 library can be downloaded here.
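
If you are on Python 3.9 or later, the standard library's ast.unparse offers a similar AST-to-source round trip. To be clear, this is a swapped-in alternative and not the codegen module discussed above; the sketch below assumes Python 3.9+ and makes no claim about how codegen itself works.

```python
import ast

tree = ast.parse("name = 'peter'\nfavs = ['berry', 'apple']\n")

# Turn the AST back into source text (quoting and spacing may be normalized).
regenerated = ast.unparse(tree)
print(regenerated)

# Round trip: the regenerated source parses back to an equivalent tree.
assert ast.dump(ast.parse(regenerated)) == ast.dump(tree)
```

The regenerated text is not guaranteed to match the original character for character, since comments and formatting are not stored in the AST.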

So why is this useful?

Many tools require parsing the source code of the language they operate upon. With Python, this task has been trivialized by the built-in facilities for parsing Python source into convenient ASTs. Since there's very little (if any) type checking done in a Python compiler, in classical terms we can say that a complete Python front-end is provided. This can be used in:

IDEs for various "intellisense" needs
Static code checking tools like pylint and pychecker
Python code generators like pythoscope
Alternative Python interpreters
Compilers from Python to other languages
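
As a taste of the static-checking use case from the list above, here is a minimal sketch of a checker that reports bare except clauses. It is written in Python 3 syntax, and the class name BareExceptFinder and the sample source are invented for the example; it is not how pylint or pychecker are actually implemented.

```python
import ast

SOURCE = '''
def load(path):
    try:
        return open(path).read()
    except:
        return None
'''


class BareExceptFinder(ast.NodeVisitor):
    """Hypothetical checker: records line numbers of bare `except:` clauses."""

    def __init__(self):
        self.lines = []

    def visit_ExceptHandler(self, node):
        # A bare `except:` has no exception type attached to it.
        if node.type is None:
            self.lines.append(node.lineno)
        # Keep walking so nested handlers are found too.
        self.generic_visit(node)


finder = BareExceptFinder()
finder.visit(ast.parse(SOURCE))
print(finder.lines)  # prints [5]
```

The source is only parsed, never executed, which is what makes this kind of check safe to run on arbitrary code.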

There are surely other uses I'm missing. If you're aware of a library/tool that uses ast, let me know.

[1]	Taken from the excellent PEP 339. This PEP is well worth the read - it explains each of the 4 steps in detail, with useful pointers into the source code where more information can be obtained.

[2]	`_ast` is implemented in `Python/Python-ast.[ch]` which can be obtained from the source distribution.

[3]	Actually, they are exported by `_ast`, but `ast` does `from _ast import *`

[4]	Why so many `str:`? It's not a mistake! The transformer prefixed the format string as well as each of the strings substituted into it.

[5]	The author of the `ast` module.

[6]	For example, the pythoscope tool for auto-generating unit tests from code could probably benefit from `ast` and `codegen`. Currently it seems to work at the level of Python parse trees instead.