I've been looking for a good open-source C parser for a long time. Many people recommend LCC, the "retargetable compiler". Indeed, it is open source and it knows how to parse C. However, what it builds as it parses the code is not an AST, but code in a simplified assembly language. While this makes lcc very convenient for retargeting code generation from C to some new platform's assembly, it doesn't help when one wants to do static analysis of C code.
Writing a parser of my own has crossed my mind more than once, but this is a more difficult job than it seems at first. C's grammar is far from trivial: it's context sensitive, and hence requires some bidirectional data passing between the lexer and the parser. As with many things, it's easy to build a partial, toy parser, but far more difficult to build a full-strength one that tackles all the quirks of the C grammar successfully.
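To make the "bidirectional data passing" concrete: a statement like `T * x;` declares a pointer if `T` was introduced by a typedef, and is a multiplication otherwise, so the lexer has to ask what the parser has already seen. Here is a minimal sketch of that feedback, assuming a flat table of typedef names and ignoring scopes entirely. All the names in it (`register_typedef`, `classify_identifier`, the token kinds) are made up for illustration and don't come from any particular parser.

```c
#include <string.h>

/* Token kinds the lexer can hand to the parser (hypothetical names). */
enum token_kind { TOK_IDENTIFIER, TOK_TYPE_NAME };

/* Flat table of names declared with typedef so far.
   Assumption: no scoping, which a real C parser would need. */
#define MAX_TYPEDEFS 64
static const char *typedef_names[MAX_TYPEDEFS];
static int typedef_count = 0;

/* Parser -> lexer: called when the parser finishes a typedef declaration.
   Returns 0 on success, -1 if the table is full. */
int register_typedef(const char *name)
{
    if (typedef_count >= MAX_TYPEDEFS)
        return -1;
    typedef_names[typedef_count++] = name;
    return 0;
}

/* Lexer -> parser: called for every identifier-shaped word, so that
   the parser receives a different token for known type names. */
enum token_kind classify_identifier(const char *name)
{
    for (int i = 0; i < typedef_count; i++)
        if (strcmp(typedef_names[i], name) == 0)
            return TOK_TYPE_NAME;
    return TOK_IDENTIFIER;
}
```

Before `register_typedef("T")` is called, `classify_identifier("T")` yields `TOK_IDENTIFIER`; afterwards it yields `TOK_TYPE_NAME`. A full-strength parser also has to handle block scopes, where an inner declaration can hide a typedef name, and that is one of the quirks that makes a toy parser fall over.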
However, as I've written here, there's another tool out there - c2c. It was written as part of Cilk - an extension language for C. c2c's aim is to be a "type checking preprocessor". In fact, it is a full C parser with a very advanced grammar, which creates ASTs and even knows how to unparse them back into C. For example, consider this code:

    #define PI 3.14

    #if 0
    #define TAR(x) x
    #else
    #define TAR(x) (2 + x)
    #endif

    int foo(int jar)
    {
        int koo = jar / PI;

        if (koo > 5)
            return TAR(6);
        else
            return 0;
    }

Here's the AST c2c creates from it:

    Proc:
      Decl: foo (0x003DAAB0) top_decl
        Fdcl:
          List: Args:
            Decl: jar (0x003DAB90) formal_decl
              Prim: int
              nil
              nil
          Returns:
            Prim: int
        nil
        nil
      Block:
        Type: (0x003D2648)
          Prim: void
        List: decl
          Decl: koo (0x003DBD38) block_decl
            Prim: int
            ImplicitCast:
              Type:
                Prim: int
              Binop: /
                Prim: double
                ImplicitCast:
                  Type:
                    Prim: double
                  Id: jar
                    Decl: jar (0x003DAB90) formal_decl
                Const: double 3.14
            nil
            Live: koo
        List: stmts
          IfElse:
            Binop: >
              Prim: int
              Id: koo
                Decl: koo (0x003DBD38) block_decl
                Live: koo
              Const: int 5
            Return:
              Binop: +
                Prim: int
                Value:
                  Const: int 8
                Const: int 2
                Const: int 6
            Return:
              Const: int 0

While it is a bit wordy (c2c also does type checking and adds type information into the AST), it is quite easy to follow and see that it indeed represents the C code.
One problem I had with c2c is that it assumes the presence of cpp - the C preprocessor. Luckily, the LCC project comes with an open source cpp which does its work quite well. It wasn't difficult to make the two tools work together, and now at last I have a workable C -> AST translator.
Naturally, the analysis of C I want to do is not in C itself. I much prefer doing it in Perl or Lisp. Therefore, I'll work on translating the AST into some more program-friendly format (one idea is an s-expr) and then read it into the higher-level language for analysis.
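To give a feel for the s-expr idea, here's a rough sketch of an emitter that walks a tree and writes it out as nested lists. It doesn't use c2c's real node types. The `struct node` below is a hypothetical stand-in (a label plus a few children), and `ast_to_sexpr` is a name I made up, so treat it as an illustration of the output format rather than the actual translator.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical generic tree node: a label plus a few children.
   c2c's own node structures are richer than this. */
#define MAX_CHILDREN 4

struct node {
    const char *label;
    const struct node *children[MAX_CHILDREN];
    int nchildren;
};

/* Append the s-expression for n to buf (capacity cap, current length *len).
   Returns 0 on success, -1 if the buffer is too small. */
static int emit(const struct node *n, char *buf, size_t cap, size_t *len)
{
    int written = snprintf(buf + *len, cap - *len, "(%s", n->label);
    if (written < 0 || (size_t)written >= cap - *len)
        return -1;
    *len += (size_t)written;

    for (int i = 0; i < n->nchildren; i++) {
        if (*len + 1 >= cap)
            return -1;
        buf[(*len)++] = ' ';      /* separator before each child */
        buf[*len] = '\0';
        if (emit(n->children[i], buf, cap, len) != 0)
            return -1;
    }

    if (*len + 1 >= cap)
        return -1;
    buf[(*len)++] = ')';
    buf[*len] = '\0';
    return 0;
}

/* Write the whole tree rooted at root into buf as one s-expression. */
int ast_to_sexpr(const struct node *root, char *buf, size_t cap)
{
    size_t len = 0;
    if (cap == 0)
        return -1;
    buf[0] = '\0';
    return emit(root, buf, cap, &len);
}
```

For a `Return` node whose only child is a leaf labelled `Const int 0`, this produces `(Return (Const int 0))`, which a Lisp reader can consume directly and which is also easy to pick apart in Perl.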