I'm currently writing a parser for ANSI C in Python. It's almost done, but it isn't yet in a really usable form and changes frequently [1]. When it's finished, I'll write extensively about it. This post is just a rant about what one has to go through to write a complete parser for C. Since this is a complex task that is taking me a long time, I've decided it's a good idea to dump my thoughts on the subject once in a while, to keep my future self informed of the design decisions I made.

To parse C code, one inevitably has to deal with the C runtime library (libc). Each compiler comes with its own, configured for its own special needs. The problem is: when you're just writing a general parser for the language, which library should you use?

Not using any is, unfortunately, not an option. The code will probably use stdio.h and other headers, which contain macro and type declarations without which the code just can't be parsed.


I ended up using the headers of newlib - a generic GCC-compliant library for embedded systems. They alone were not enough, and I had to add a couple from MinGW (the Windows port of GCC) - stdarg.h and stddef.h - because newlib doesn't distribute these headers and instead relies on the GCC compiler to provide them.

To make everything compile I also had to define the symbol __extension__ to be empty (to disable various GCC extensions, as my parser doesn't support them anyway) and __i386__ to let newlib know which architecture I'm targeting.
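As a rough sketch of how these two definitions might be passed to the preprocessor, here is a small Python helper that builds a cpp command line. The function name, the "cpp" executable name, the include directory and the use of -nostdinc are my assumptions for illustration, not necessarily the setup described above.

```python
def build_cpp_command(source_path, include_dir, cpp_path="cpp"):
    """Return an argument list for running the C preprocessor.

    All names and paths here are hypothetical. -nostdinc is an assumption:
    it keeps cpp from picking up the host compiler's own headers.
    """
    return [
        cpp_path,
        "-nostdinc",
        "-I" + include_dir,      # directory holding the newlib headers
        "-D__extension__=",      # define __extension__ as empty
        "-D__i386__",            # tell newlib which architecture is targeted
        source_path,
    ]


if __name__ == "__main__":
    print(" ".join(build_cpp_command("hello.c", "libc_include")))
```

The resulting list can be handed to subprocess.run to obtain the preprocessed text that the parser then consumes.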

Extending #line

This wasn't all, unfortunately: I also had to modify my grammar to support unforeseen uses of the #line directive. Until now, it supported only the canonical definition from K&R:

#line constant "filename"
#line constant

For the "filename" part I used the standard "string literal" token in my lexer. Unfortunately, this didn't work, because the filename may be a Windows path such as d:\stuff\include\file.h, which isn't a valid string literal since \s and \i are invalid escape sequences. So, to support this I had to relax the definition of the char and string constants. Not too bad, because invalid escapes can easily be caught at a later stage.
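To illustrate the idea, here is a minimal sketch of a strict and a relaxed string-literal pattern, plus a later check that reports the bad escapes. The regular expressions and the names STRICT_STRING, LENIENT_STRING and invalid_escapes are made up for this example and are not the lexer's actual definitions; the escape check is also simplified (it doesn't validate the digits after \x, for instance).

```python
import re

# Characters that may follow a backslash in a simple C escape sequence.
SIMPLE_ESCAPE = r"""[abfnrtv'"?\\]"""

# Strict: only simple, octal and hex escapes are accepted.
STRICT_STRING = re.compile(
    r'"(?:[^"\\\n]|\\(?:' + SIMPLE_ESCAPE + r'|[0-7]{1,3}|x[0-9a-fA-F]+))*"'
)

# Lenient: a backslash may be followed by any character except a newline.
LENIENT_STRING = re.compile(r'"(?:[^"\\\n]|\\.)*"')

VALID_AFTER_BACKSLASH = set("abfnrtv'\"?\\01234567x")


def invalid_escapes(literal):
    """Return the escapes in a string literal that C does not define."""
    followers = re.findall(r"\\(.)", literal)
    return ["\\" + ch for ch in followers if ch not in VALID_AFTER_BACKSLASH]


if __name__ == "__main__":
    path = r'"d:\stuff\include\file.h"'
    print(bool(STRICT_STRING.fullmatch(path)))   # strict pattern rejects it
    print(bool(LENIENT_STRING.fullmatch(path)))  # lenient pattern accepts it
    print(invalid_escapes(path))
```

With this split, the lexer can accept the filename in a #line directive, while ordinary string literals can still be validated after tokenizing.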

The problems didn't end there, however. In one of the header files indirectly included from stdio.h, the following struct is defined (in the output of cpp, of course):

struct _on_exit_args {
    void *  _fnargs[32];
    void *  _dso_handle[32];

    long _fntypes;
    #line 77 "D:\eli\cpp_stuff\libc_include/sys/reent.h"

    long _is_cxa;

A #line directive inside a struct? Give me a break. This definitely isn't part of the formal C grammar, and I'm not sure about the validity of this construct, because #line directives belong to the subtle no-man's land between the preprocessor and the compiler. To make this work, I was forced to make #line a valid struct_declaration in my parser.
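For comparison, a different way to cope with directives that show up in odd places is to consume them before the grammar ever sees them. The sketch below is not what I did in the parser (there, #line became a struct_declaration); it's a hypothetical pre-pass, with made-up names, that blanks out #line lines and records where they were, so that physical line counts stay intact.

```python
import re

# Matches '#line constant' and '#line constant "filename"'.
LINE_DIRECTIVE = re.compile(r'^\s*#\s*line\s+(\d+)(?:\s+"([^"\n]*)")?\s*$')


def strip_line_directives(text):
    """Blank out #line directives and return (clean_text, markers).

    markers is a list of (physical_line, declared_line, filename_or_None).
    Each directive is replaced by an empty line, so the number of lines
    in clean_text equals the number of lines in text.
    """
    clean, markers = [], []
    for physical, line in enumerate(text.split("\n"), start=1):
        match = LINE_DIRECTIVE.match(line)
        if match:
            markers.append((physical, int(match.group(1)), match.group(2)))
            clean.append("")
        else:
            clean.append(line)
    return "\n".join(clean), markers
```

The trade-off is that the parser then needs the markers list to map its own line numbers back to the original file and line when reporting errors.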


After all these changes, the parser successfully parsed a simple C file that includes stdio.h. The resulting preprocessed C file is about 5000 lines long (most of which are blank lines left by cpp in place of macro definitions), and the parser took 0.25 seconds to parse it, which isn't too bad, I guess, but can surely be improved further.

Now I can finally manage complete C programs with the parser, and I can also see that it successfully grinds through stdio.h, which is encouraging.

More on this saga later.

[1] But if you insist, the code is here