As I wrote here, I've often found myself needing to analyze C source code programmatically. In that post, I also mentioned c2c, a nice open-source tool that analyzes C source code and can generate ASTs as an intermediate step. However, c2c is written in C, which makes it inconvenient to extend and hack on.

So I've decided to give my Python skills more practice and write an analyzer for C in Python, using PLY for the lexer and parser. The project is already online, with a functioning lexer and a set of tests for it (the focus for now is ANSI C90, assuming the code has already been preprocessed with some standard cpp).

When I sat down to implement the parser, the issue of the AST quickly came up. I want my parser to build an AST that can later be processed. But what kind of AST should it build? How detailed should it be? These are nontrivial questions.

I turned to Python itself for the answers. The standard compiler module has a built-in AST walker for traversing ASTs generated from Python code. The AST format itself is defined in a text file, and the corresponding Python module is cleverly generated automatically (ast.txt and astgen.py in Tools/compiler of Python's source distribution). I like this approach, because it allows for a very detailed AST (which is good for convenient recursive walking) and avoids writing tons of boilerplate code by employing code generation.
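To make the idea concrete, here is a minimal sketch of generating node classes from a text specification and then walking the result. The spec format (a trailing * for a child node, ** for a list of children), the node names, and the helpers parse_spec, generate_source and IDCollector are all made up for this illustration; this is not the format of ast.txt and it does not mirror how astgen.py works internally.

```python
# Hypothetical spec: "Name: field, field*, field**".
# No marker = plain attribute, "*" = child node, "**" = list of child nodes.
SPEC = """
Constant: value
ID: name
BinaryOp: op, left*, right*
Compound: stmts**
"""


def parse_spec(text):
    """Yield (class_name, [(field, kind), ...]) for each spec line."""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, _, rest = line.partition(":")
        fields = []
        for raw in rest.split(","):
            raw = raw.strip()
            if not raw:
                continue
            if raw.endswith("**"):
                fields.append((raw[:-2], "list"))
            elif raw.endswith("*"):
                fields.append((raw[:-1], "child"))
            else:
                fields.append((raw, "attr"))
        yield name.strip(), fields


def generate_source(text):
    """Return Python source that defines one Node subclass per spec entry."""
    out = ["class Node:", "    def children(self):", "        return []", ""]
    for name, fields in parse_spec(text):
        params = ", ".join(["self"] + [f for f, _ in fields])
        out.append(f"class {name}(Node):")
        out.append(f"    def __init__({params}):")
        if not fields:
            out.append("        pass")
        for f, _ in fields:
            out.append(f"        self.{f} = {f}")
        out.append("    def children(self):")
        out.append("        nodes = []")
        for f, kind in fields:
            if kind == "child":
                out.append(f"        if self.{f} is not None:")
                out.append(f"            nodes.append(self.{f})")
            elif kind == "list":
                out.append(f"        nodes.extend(self.{f} or [])")
        out.append("        return nodes")
        out.append("")
    return "\n".join(out)


class NodeVisitor:
    """Dispatch to visit_<ClassName> if defined, else recurse into children."""

    def visit(self, node):
        method = getattr(self, "visit_" + type(node).__name__, self.generic_visit)
        return method(node)

    def generic_visit(self, node):
        for child in node.children():
            self.visit(child)


class IDCollector(NodeVisitor):
    """Example walker: record the name of every ID node in the tree."""

    def __init__(self):
        self.names = []

    def visit_ID(self, node):
        self.names.append(node.name)


# Execute the generated source to obtain the node classes.
namespace = {}
exec(generate_source(SPEC), namespace)
BinaryOp, ID, Constant, Compound = (
    namespace[n] for n in ("BinaryOp", "ID", "Constant", "Compound")
)

# A hand-built tree standing in for the statements "a + 1;" and "b;".
tree = Compound([BinaryOp("+", ID("a"), Constant("1")), ID("b")])
collector = IDCollector()
collector.visit(tree)
print(collector.names)  # ['a', 'b']
```

The appeal is that adding a new node type means adding one line to the spec, while the constructors and the children() method that the walker relies on come for free. A real generator would more likely write the source to a module file than exec it on the fly.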

Curiously, the Python compiler itself (CPython) uses a different, though similar, technique. It defines Python's abstract syntax using ASDL (Abstract Syntax Description Language), and generates the C code for the compiler's AST data structures from it.

Anyway, now I'm in the process of deciding on the best AST approach for my C analyzer. I like the method of generating the AST code automatically from a readable specification quite a lot, so there's a good chance I'll borrow astgen.py for my needs.

I'll report on the progress of this project in the future.