Python internals: Working with Python ASTs

November 28th, 2009 at 1:02 pm

Starting with Python 2.5, the Python compiler (the part that takes your source-code and translates it to Python VM code for the VM to execute) works as follows [1]:

  1. Parse source code into a parse tree (Parser/pgen.c)
  2. Transform parse tree into an Abstract Syntax Tree (Python/ast.c)
  3. Transform AST into a Control Flow Graph (Python/compile.c)
  4. Emit bytecode based on the Control Flow Graph (Python/compile.c)

Previously, the only place one could tap into the compilation process was to obtain the parse tree with the parser module. But parse trees are much less convenient to use than ASTs for code transformation and generation. This is why the addition of the _ast module in Python 2.5 was welcome – it became much simpler to play with ASTs created by Python and even modify them. Also, the python built-in compile function can now accept an AST object in addition to source code.

Python 2.6 then took another step forward, including the higher-level ast module in its standard library. ast is a convenient Python-written toolbox to aid working with _ast [2]. All in all we now have a very convenient framework for processing Python source code. A full Python-to-AST parser is included with the standard distribution – what more could we ask? This makes all kinds of language transformation tasks with Python very simple.

What follows are a few examples of cool things that can be done with the new _ast and ast modules.

Manually building ASTs

import ast

node = ast.Expression(ast.BinOp(
                ast.Str('xy'),
                ast.Mult(),
                ast.Num(3)))

fixed = ast.fix_missing_locations(node)

codeobj = compile(fixed, '<string>', 'eval')
print eval(codeobj)

Let’s see what is going on here. First we manually create an AST node, using the AST node classes exported by ast [3]. Then the convenient fix_missing_locations function is called to patch the lineno and col_offset attributes of the node and its children.

Another useful function that can help is ast.dump. Here’s a formatted dump of the node we’ve created:

Expression(
  body=BinOp(
         left=Str(s='xy'),
         op=Mult(),
         right=Num(n=3)))

The most useful single-place reference for the various AST nodes and their structure is Parser/Python.asdl in the source distribution.

Breaking compilation into pieces

Given some source code, we first parse it into an AST, and then compile this AST into a code object that can be evaluated:

import ast

source = '6 + 8'
node = ast.parse(source, mode='eval')

print eval(compile(node, '<string>', mode='eval'))

Again, ast.dump can be helpful to show the AST that was created:

Expression(
  body=BinOp(
         left=Num(n=6),
         op=Add(),
         right=Num(n=8)))

Simple visiting and transformation of ASTs

import ast

class MyVisitor(ast.NodeVisitor):
    def visit_Str(self, node):
        print 'Found string "%s"' % node.s


class MyTransformer(ast.NodeTransformer):
    def visit_Str(self, node):
        return ast.Str('str: ' + node.s)


node = ast.parse('''
favs = ['berry', 'apple']
name = 'peter'

for item in favs:
    print '%s likes %s' % (name, item)
''')

MyTransformer().visit(node)
MyVisitor().visit(node)

This prints:

Found string "str: berry"
Found string "str: apple"
Found string "str: peter"
Found string "str: %s likes %s"

The visitor class implements methods that are called for relevant AST nodes (for example visit_Str is called for Str nodes). The transformer is a bit more complex. It calls relevant methods for AST nodes and then replaces them with the returned value of the methods.

To prove that the transformed code is perfectly valid, we can just compile and execute it:

node = ast.fix_missing_locations(node)
exec compile(node, '<string>', 'exec')

As expected [4], this prints:

str: str: peter likes str: berry
str: str: peter likes str: apple

Reproducing Python source from AST nodes

Armin Ronacher [5] wrote a module named codegen that uses the facilities of ast to print back Python source from an AST. Here’s how to show the source for the node we transformed in the previous example:

import codegen
print codegen.to_source(node)

And the result:

favs = ['str: berry', 'str: apple']
name = 'str: peter'
for item in favs:
    print 'str: %s likes %s' % (name, item)

Yep, looks right. codegen is very useful for debugging or tools that transform Python code and want to save the results [6]. Unfortunately, the version you get from Armin’s website isn’t suitable for the ast that made it into the standard library. A slightly patched version of codegen that works with the standard 2.6 library can be downloaded here.

So why is this useful?

Many tools require parsing the source code of the language they operate upon. With Python, this task has been trivialized by the built-in methods to parse Python source into convenient ASTs. Since there’s very little (if any) type checking done in a Python compiler, in classical terms we can say that a complete Python front-end is provided. This can be utilized in:

  • IDEs for various "intellisense" needs
  • Static code checking tools like pylint and pychecker
  • Python code generators like pythoscope
  • Alternative Python interpreters
  • Compilers from Python to other languages

There are surely other uses I’m missing. If you’re aware of a library/tool that uses ast, let me know.

http://eli.thegreenplace.net/wp-content/uploads/hline.jpg
[1] Taken from the excellent PEP 339. This PEP is well worth the read – it explains each of the 4 steps in details with useful pointers into the source code where more information can be obtained.
[2] _ast is implemented in Python/Python-ast.[ch] which can be obtained from the source distribution.
[3] Actually, they are exported by _ast, but ast does from _ast import *
[4] Why so many str:? It’s not a mistake!
[5] The author of the ast module.
[6] For example, the pythoscope tool for auto generating unit-tests from code could probably benefit from ast and codegen. Currently it seems to be working on the level of Python parse trees instead.

Related posts:

  1. ASTs for analyzing C
  2. Python internals: adding a new statement to Python
  3. Python internals: Symbol tables, part 1
  4. Python internals: Symbol tables, part 2
  5. Abstract vs. Concrete Syntax Trees

13 Responses to “Python internals: Working with Python ASTs”

  1. Christopher LozinskiNo Gravatar Says:

    Thank you very much. I have long wondered why we are typing text to generate code. With this stuff, I can build an editor that only allows legal statements. That goes a long way to getting rid of lots of errors.

    I did not quite understand all the details, but the fact that the compiler can take AST graphs is all i need to know at this point.

    Thank You
    Christopher Lozinski

  2. alexmajyNo Gravatar Says:

    I like it and “Abstract vs. Concrete Syntax Trees”, thank you

  3. José HernándezNo Gravatar Says:

    This is interesting because it seems Python is moving more and more towards Lisp, albeit with a significantly different syntax. It adopted functional programming built-in functions (map, reduce, filter). Lisp programs have the unique ability to analyze and modify other Lisp programs. AST is a significant step towards bringing this ability to Python.

  4. elibenNo Gravatar Says:

    Jose,

    I wouldn’t say Python is currently moving towards Lisp. On the contrary – there was a great debate for the Python 3000 move about the functional tools – map, reduce, lambda. Guido wanted them removed from the language (because list and generator comprehensions are better), but at last was convinced to keep them – though I think reduce moved from a built-in to a module (functools) in the standard library.

  5. E.TNo Gravatar Says:

    One of the fascinating features that Lisp has is the ability to do meta-programming (macro), and I think that with ast module, Python (partially) obtains this ability in a somehow weird way.

  6. Lora IrelandNo Gravatar Says:

    This is great! Thanks for this post. I am starting python and this is a big help.

  7. grookNo Gravatar Says:

    really interesting :) thanks for the topic

  8. Anders HovmöllerNo Gravatar Says:

    I found some bugs in codegen.py. I’ve put the new version with my fixes (and an assert that will save your life if you find any new bugs!) on github: https://github.com/boxed/Misc/blob/master/codegen.py

    I’m using codegen.py to experiment with an automated translation of PyObjC code to Objective-C. It’s something like 90% complete and, I think, pretty cool. The amount of manual work if you want to switch from PyObjC to ObjC is cut quite severely with this tool. If you’re interested it’s at: https://github.com/boxed/Misc/blob/master/codegen_objc.py

  9. Chen-Ming YaoNo Gravatar Says:

    Hello!
    Thanks for the information here. It’s really helpful.
    I’m a master student from Taiwan. For my research, i intend to write a compiler to compile Python to CIL(C Intermediate Language), and then apply static analysis on it. First, I will parse Python to Python AST; Then translate Pyhton AST to C AST; Finally, use the existing library to convert C AST to CIL.
    There is a static analyzer for CIL written in Ocaml implemented by our group, so we want to reuse it. I have searched many related resources, but encountered some difficulties. The Python parse module is most convenient, but it may be hard to connect with our static analyzer.
    I wonder whether there is source code of Python-to-AST parser that i could refer to. That would be very helpful for writing my own compiler.
    Thank you.

  10. elibenNo Gravatar Says:

    Chen-Ming Yao,

    This article shows that ast.parse in the ast module can transform Python to an AST, and then gives an example of how it can be transformed and worked on. I don’t think you need anything else – just write the code that translates it to your target language [not that it's an easy task, but at least the front-end is quite complete].

  11. Chen-Ming YaoNo Gravatar Says:

    Hi! again.
    First, thanks for your advice before. Our group took your suggestion.
    The couple of months, I worked on the the translation from Python AST to our target language but one problem bothered me. The syntax ast.dump produced is quite “ugly” that makes my translator hard to write. I wonder maybe I implemented it in a wrong way. I use the parser generator (Ocamllex&Ocamlyacc) to translate Python AST to C AST, and then CIL currently.
    In your opinion, did I choose the wrong way to implement my translator? or I miss some information about the syntax ast.dump produced (I have refered to Python.asdl.).
    Could you give me some advice?
    Anyway, thanks a lot for your help!

  12. elibenNo Gravatar Says:

    Chen-Ming Yao,

    I’m not sure why you’d want to parse back the output of ast.dump – or did I misunderstand you? ast.dump is just printing the internal AST representation in a readable textual format. The internal AST representation is the one you should start from – it is structured data. Naturally you use Python to access it (because it’s Python data structures), but you could take it from a C extension and I guess anywhere you want from there.

  13. Chen-Ming YaoNo Gravatar Says:

    Sorry, maybe I didn’t describe my implementation and question clearly.
    My translation did start from the AST representation that ast.dump printed. Then I parse this AST representation to C AST. Currently, this parser is written by Ocamllex&Ocamlyacc. (in consideration of the C AST to CIL API and the static analysis afterward all written in Ocaml)
    As you pointed out, Python internal AST representation is in Python data structure. I couldn’t use Ocaml to acess it, so I thought about writing my own parser. That’s why I used ast.dump results as input and C AST as output.
    I hope I describe the situation clearly this time…
    I really not sure whether I choose a wrong approach? I thought about maybe I should use other programming languages before, but I didn’t know if it will affect the execution of C AST to CIL and static analysis later?
    Thanks a lot again!

Leave a Reply

To post code with preserved formatting, enclose it in `backticks` (even multiple lines)