I've been looking for good open-source code to parse C for a long time. Many people recommend LCC, the "retargetable compiler". Indeed, it is open source and it knows how to parse C. However, what it builds from the code as it parses is not an AST, but code in a simplified assembly language. While this makes lcc very convenient for retargeting code generation from C to some new platform's assembly, it doesn't help when one wants to do static analysis of C code.

Writing a parser of my own has crossed my mind more than once, but this is a more difficult job than it seems at first. C's grammar is far from trivial: it's context sensitive, and hence requires some bidirectional data passing between the lexer and the parser. As with many things, while it's easy to build a partial, toy parser, it is far more difficult to build a full-strength one that tackles all the quirks of the C grammar successfully.
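
To make the context sensitivity concrete: a statement like "T * x;" declares a pointer named x if T was introduced by a typedef, but is a multiplication expression otherwise. A common workaround is to let the parser record typedef names in a table that the lexer consults when classifying identifiers. The following is only a minimal sketch of that idea; the names token_kind, register_typedef and classify_identifier are hypothetical and are not taken from lcc or c2c, and scoping is ignored entirely.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical token kinds for this sketch; not taken from c2c or lcc. */
enum token_kind { TOK_IDENTIFIER, TOK_TYPE_NAME };

#define MAX_TYPEDEFS 64

/* Names the parser has seen declared via typedef so far (no scoping). */
static const char *typedef_names[MAX_TYPEDEFS];
static int n_typedefs = 0;

/* Parser -> lexer: called when the parser finishes a typedef declaration. */
void register_typedef(const char *name)
{
    assert(n_typedefs < MAX_TYPEDEFS);
    typedef_names[n_typedefs++] = name;
}

/* Lexer: classify an identifier by consulting what the parser recorded. */
enum token_kind classify_identifier(const char *name)
{
    for (int i = 0; i < n_typedefs; i++)
        if (strcmp(typedef_names[i], name) == 0)
            return TOK_TYPE_NAME;
    return TOK_IDENTIFIER;
}
```

Before register_typedef("T") is called, classify_identifier("T") reports an ordinary identifier; afterwards it reports a type name, so the same characters produce different tokens depending on what the parser has already seen. A real parser would also have to handle nested scopes, where an inner declaration can hide a typedef name.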

However, as I've written here, there's another tool out there: c2c. It was written as a part of Cilk, an extension language for C. c2c's aim is to be a "type checking preprocessor". In fact, it is a full C parser with a very advanced grammar that creates ASTs and even knows how to unparse them back into C. For example, consider this code:

#define PI 3.14

#if 0
#define TAR(x) x
#else
#define TAR(x) (2 + x)
#endif

int foo(int jar)
{
    int koo = jar / PI;
    
    if (koo > 5)
        return TAR(6);
    else
        return 0;
}

Here's the AST c2c creates from it:

Proc:
  Decl: foo (0x003DAAB0) top_decl
    Fdcl:
      List: Args:
        Decl: jar (0x003DAB90) formal_decl
          Prim: int
          nil
          nil
      Returns:
        Prim: int
    nil
    nil
  Block:
    Type: (0x003D2648)
      Prim: void
    List: decl
      Decl: koo (0x003DBD38) block_decl
        Prim: int
        ImplicitCast:
          Type:
            Prim: int
          Binop: /
            Prim: double
            ImplicitCast:
              Type:
                Prim: double
              Id: jar
                Decl: jar (0x003DAB90) formal_decl

            Const: double 3.14
        nil
        Live: koo
    List: stmts
      IfElse:
        Binop: >
          Prim: int
          Id: koo
            Decl: koo (0x003DBD38) block_decl

              Live: koo
          Const: int 5
        Return:
          Binop: +
            Prim: int
            Value:
              Const: int 8
            Const: int 2
            Const: int 6
        Return:
          Const: int 0

While it is a bit wordy (c2c also does type checking and adds type information into the AST), it is quite easy to follow and see that it indeed represents the C code.

One problem I had with c2c is that it assumes the presence of cpp, the C preprocessor. Luckily, the LCC project comes with an open source cpp which does its work quite well. It wasn't difficult to make the two tools work together, and now at last I have a workable C -> AST translator.

Naturally, the analysis of C I want to do is not in C itself. I much prefer doing it in Perl or Lisp. Therefore, I'll work on translating the AST into some more program-friendly format (one idea is an s-expr) and then read it into the higher-level language for analysis.
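
As a rough illustration of the s-expr idea, here is a sketch in C of how a tree could be written out in that form. The struct node type and the to_sexpr function are invented for this example and do not reflect c2c's actual node representation; a real translator would walk c2c's own AST nodes instead.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_KIDS 4

/* Hypothetical AST node: a label plus up to MAX_KIDS children.
   Unused child slots are NULL. This is not c2c's node type. */
struct node {
    const char *label;
    const struct node *kids[MAX_KIDS];
};

/* Append s to the string in buf, truncating if buf (capacity cap) is full. */
static void append(char *buf, size_t cap, const char *s)
{
    size_t len = strlen(buf);
    snprintf(buf + len, cap - len, "%s", s);
}

/* Leaves are written as bare atoms, inner nodes as (label kid kid ...). */
static void emit(const struct node *n, char *buf, size_t cap)
{
    if (n->kids[0] == NULL) {
        append(buf, cap, n->label);
        return;
    }
    append(buf, cap, "(");
    append(buf, cap, n->label);
    for (int i = 0; i < MAX_KIDS && n->kids[i] != NULL; i++) {
        append(buf, cap, " ");
        emit(n->kids[i], buf, cap);
    }
    append(buf, cap, ")");
}

/* Write the s-expression for the tree rooted at n into buf. */
void to_sexpr(const struct node *n, char *buf, size_t cap)
{
    assert(cap > 0);
    buf[0] = '\0';
    emit(n, buf, cap);
}
```

For a tree whose root is labeled ">" with leaf children "koo" and "5" (the condition of the if statement above), to_sexpr produces (> koo 5), which a Lisp reader can consume directly and which is also easy to parse from Perl.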