the answer for parsing C ?

November 15th, 2007 at 6:18 am

I’ve been looking for a good, open source code to parse C for a long time. Many people recommend the LCC “regargetable compiler”. Indeed, it is open source and it knows how to parse C. However, what it builds from the code as it parses it is not an AST, but code in a simplified assembly language. While this makes lcc very comfortable for retargeting generation of code from C to some new platform’s assembly, it doesn’t help when one wants to do static analysis of C code.

Writing a parser of my own has crossed my mind more than once, but this is a more difficult job than seems at first. C’s grammar is far from trivial, it’s context sensitive and hence requires some bidirectional data passing between the lexer and the parser. Like with many things, while it’s easy to build a partial, toy parser, it is far more difficult to build a full-strength one, which will tackle all the quirks of the C grammar successfully.

However, as I’ve written here, there’s another tool out there – c2c. It was written as a part of Cilk – an extension language for C. c2c’s aim is to be a “type checking preprocessor”. In fact, it is a full C parser with a very advanced grammar, that creates ASTs and even knows how to unparse them back into C. For example, consider this code:

#define PI 3.14

#if 0
#define TAR(x) x
#define TAR(x) (2 + x)

int foo(int jar)
    int koo = jar / PI;
    if (koo > 5)
        return TAR(6);
        return 0;

Here’s the AST c2c creates from it:

  Decl: foo (0x003DAAB0) top_decl
      List: Args:
        Decl: jar (0x003DAB90) formal_decl
          Prim: int
        Prim: int
    Type: (0x003D2648)
      Prim: void
    List: decl
      Decl: koo (0x003DBD38) block_decl
        Prim: int
            Prim: int
          Binop: /
            Prim: double
                Prim: double
              Id: jar
                Decl: jar (0x003DAB90) formal_decl

            Const: double 3.14
        Live: koo
    List: stmts
        Binop: >
          Prim: int
          Id: koo
            Decl: koo (0x003DBD38) block_decl

              Live: koo
          Const: int 5
          Binop: +
            Prim: int
              Const: int 8
            Const: int 2
            Const: int 6
          Const: int 0

While it is a bit wordy (c2c also does type checking and adds type information into the AST), it is quite easy to follow and see that it indeed represents the C code.

One problem I had with c2c is that it assumes the presence of cpp – the C preprocessor. Luckily, the LCC project comes with an open source cpp which does its work quite well. It wasn’t difficult making the two tools work together, and not at last I have a workable C -> AST translator.

Naturally, the analysis of C I want to do is not in C itself. I much prefer doing it in Perl or Lisp. Therefore, I’ll work on translating the AST into some more program-friendly format (one idea is a s-expr) and then read it into the higher-level language for analysis.

Related posts:

  1. Parsing C: more on #line directives
  2. On parsing the C standard library headers
  3. Adventures in parsing C: ASTs for switch statements
  4. Parsing VHDL is [very] hard
  5. is “meta programming” the answer ?

2 Responses to “the answer for parsing C ?”

  1. Ken DyckNo Gravatar Says:

    I’d never heard of c2c before until you’d mentioned in your other post. Sounds like a useful little tool.

    I generally use ANTLR when I need to do any kind of parsing. They have a sample C grammar that can generate an AST, too, though I’ve never actually used it.

  2. EisbawNo Gravatar Says:

    FYI, c2c is now called cilk2c, which can be found at as subfolder cilk2c in the source tarball.

    -Mark Ruvald