I've been looking for good open source code to parse C for a long time. Many people recommend the LCC "retargetable compiler". Indeed, it is open source and it knows how to parse C. However, what it builds from the code as it parses is not an AST, but code in a simplified assembly language. While this makes lcc very convenient for retargeting code generation from C to some new platform's assembly, it doesn't help when one wants to do static analysis of C code.
Writing a parser of my own has crossed my mind more than once, but this is a more difficult job than it seems at first. C's grammar is far from trivial: it is context sensitive, and hence requires some bidirectional data passing between the lexer and the parser. As with many things, it's easy to build a partial, toy parser, but far more difficult to build a full-strength one that tackles all the quirks of the C grammar successfully.
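The classic example of this context sensitivity is the statement "T * x;", which declares a pointer if T is a typedef name and multiplies two variables otherwise. Here is a minimal sketch of the feedback channel between parser and lexer, assuming a single flat table with no scoping. All names in it (register_typedef, classify_identifier, the token kinds) are made up for illustration and are not taken from lcc or any other tool mentioned here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds; a real C lexer has many more. */
enum token_kind { TOK_IDENTIFIER, TOK_TYPE_NAME };

#define MAX_TYPEDEFS 64

/* Typedef names seen so far. The parser writes this table and the lexer
 * reads it. Assumption: one flat table, so block scoping is ignored. */
static const char *typedef_names[MAX_TYPEDEFS];
static int typedef_count = 0;

/* Parser side: call after reducing a declaration such as "typedef int T;".
 * The string is not copied, so it must outlive the table.
 * Returns 0 on success, -1 if the table is full. */
int register_typedef(const char *name)
{
    assert(name != NULL);
    if (typedef_count >= MAX_TYPEDEFS)
        return -1;
    typedef_names[typedef_count++] = name;
    return 0;
}

/* Lexer side: call for every identifier-shaped word to decide which
 * token kind to hand to the parser. */
enum token_kind classify_identifier(const char *name)
{
    assert(name != NULL);
    for (int i = 0; i < typedef_count; i++) {
        if (strcmp(typedef_names[i], name) == 0)
            return TOK_TYPE_NAME;
    }
    return TOK_IDENTIFIER;
}
```

Before the parser has seen a typedef for T, the lexer reports T as a plain identifier and "T * x;" parses as an expression; after register_typedef("T"), the same text parses as a declaration. A full parser would also need to push and pop scopes, since an inner declaration can hide a typedef name.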
However, as I've written here, there's another tool out there - c2c. It was written as a part of Cilk - an extension language for C. c2c's aim is to be a "type checking preprocessor". In fact, it is a full C parser with a very advanced grammar that creates ASTs and even knows how to unparse them back into C. For example, consider this code:
    #define PI 3.14
    #if 0
    #define TAR(x) x
    #else
    #define TAR(x) (2 + x)
    #endif
    int foo(int jar)
    {
        int koo = jar / PI;
        if (koo > 5)
            return TAR(6);
        else
            return 0;
    }
Here's the AST c2c creates from it:
Proc:
Decl: foo (0x003DAAB0) top_decl
Fdcl:
List: Args:
Decl: jar (0x003DAB90) formal_decl
Prim: int
nil
nil
Returns:
Prim: int
nil
nil
Block:
Type: (0x003D2648)
Prim: void
List: decl
Decl: koo (0x003DBD38) block_decl
Prim: int
ImplicitCast:
Type:
Prim: int
Binop: /
Prim: double
ImplicitCast:
Type:
Prim: double
Id: jar
Decl: jar (0x003DAB90) formal_decl
Const: double 3.14
nil
Live: koo
List: stmts
IfElse:
Binop: >
Prim: int
Id: koo
Decl: koo (0x003DBD38) block_decl
Live: koo
Const: int 5
Return:
Binop: +
Prim: int
Value:
Const: int 8
Const: int 2
Const: int 6
Return:
Const: int 0
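While it is a bit wordy (c2c also does type checking and adds type information into the AST), it is quite easy to follow and see that it indeed represents the C code. Note that the macros are already expanded by the time c2c sees the code: TAR(6) shows up as the addition of 2 and 6, annotated with its computed value 8.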
One problem I had with c2c is that it assumes the presence of cpp - the C preprocessor. Luckily, the LCC project comes with an open source cpp which does its work quite well. It wasn't difficult to make the two tools work together, and now at last I have a workable C -> AST translator.
Naturally, I don't want to do the analysis of C in C itself. I much prefer doing it in Perl or Lisp. Therefore, I'll work on translating the AST into some more program-friendly format (one idea is an s-expression) and then read it into the higher-level language for analysis.
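To give an idea of what the s-expression output could look like, here is a rough sketch of a tree printer. It assumes a simplified, hypothetical node type (a label plus first-child and next-sibling links), not c2c's real AST structures, and it assumes labels contain no spaces or parentheses, so no quoting is done.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical AST node, not c2c's own: a label plus child/sibling links. */
struct node {
    const char *label;
    const struct node *first_child;
    const struct node *next_sibling;
};

/* Append text to buf at offset *len. Returns -1 if it does not fit. */
static int append(char *buf, size_t size, size_t *len, const char *text)
{
    size_t n = strlen(text);
    if (*len + n + 1 > size)
        return -1;
    memcpy(buf + *len, text, n + 1); /* copies the terminating NUL too */
    *len += n;
    return 0;
}

/* A leaf prints as its label; an inner node as "(label kid kid ...)". */
static int emit(const struct node *n, char *buf, size_t size, size_t *len)
{
    if (n->first_child == NULL)
        return append(buf, size, len, n->label);
    if (append(buf, size, len, "(") != 0 ||
        append(buf, size, len, n->label) != 0)
        return -1;
    for (const struct node *kid = n->first_child; kid != NULL;
         kid = kid->next_sibling) {
        if (append(buf, size, len, " ") != 0 ||
            emit(kid, buf, size, len) != 0)
            return -1;
    }
    return append(buf, size, len, ")");
}

/* Write the s-expression for root into buf.
 * Returns 0 on success, -1 if buf is too small. */
int node_to_sexpr(const struct node *root, char *buf, size_t size)
{
    size_t len = 0;
    assert(root != NULL && buf != NULL && size > 0);
    buf[0] = '\0';
    return emit(root, buf, size, &len);
}
```

With three nodes linked to represent the condition koo > 5 from the example above, node_to_sexpr fills the buffer with (> koo 5), which a Lisp reader can consume directly. Real labels such as "Const: int 5" would need to be split into atoms or quoted first.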