On parsing the C standard library headers

October 10th, 2008 at 3:13 pm

Introduction

I’m now in the process of writing a parser for ANSI C in Python. It’s almost done, but it isn’t yet in a really usable form and it changes frequently [1]. When it’s finished, I’ll write extensively about it. This post is just a rant about what one has to go through to write a complete parser for C. Since this is a complex task that is taking me a long time to implement, I’ve decided it’s a good idea to dump my thoughts on the subject once in a while, to keep my future self informed about the design decisions I made.

To parse C code, one inevitably has to deal with the C runtime library (libc). Each compiler comes with one, configured for its own special needs. The problem is: when you’re just writing a general parser for the language, which library should you use?

Not using any is, unfortunately, not an option. The code will probably use stdio.h and other headers, which contain macro and type declarations without which the code just can’t be parsed.

newlib

I ended up using the headers of newlib, a generic GCC-compatible C library for embedded systems. They alone were not enough, and I had to add a couple from MinGW (the Windows port of GCC), namely stdarg.h and stddef.h, because newlib doesn’t distribute these headers itself and instead relies on finding them alongside the GCC compiler.

To make everything compile I also had to define the symbol __extension__ to be empty (to disable various GCC extensions, as my parser doesn’t support them anyway) and __i386__ to let newlib know which architecture I’m targeting.
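
As a rough illustration of the setup described above, here is a minimal sketch of how the preprocessor invocation could be assembled from Python. The function name, the include directory and the use of -nostdinc are my assumptions for the example, not necessarily what the parser's build actually does; only the two -D definitions come from the text.

```python
def build_cpp_command(source_file, include_dirs, cpp="cpp"):
    """Build an argument list for running the C preprocessor on source_file.

    Hypothetical helper: the name and argument layout are illustrative only.
    """
    # Assumption: ignore the host compiler's own headers so that only the
    # directories listed below (newlib plus the extra headers) are searched.
    cmd = [cpp, "-nostdinc"]
    for directory in include_dirs:
        cmd.append("-I" + directory)
    cmd.append("-D__extension__=")  # define __extension__ as empty
    cmd.append("-D__i386__")        # tell newlib which architecture is targeted
    cmd.append(source_file)
    return cmd


print(build_cpp_command("hello.c", ["libc_include"]))
```

Passing the resulting list to subprocess.run(cmd, capture_output=True, text=True) would then give the preprocessed text in the stdout attribute, assuming a cpp executable is on the PATH.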

Extending #line

This wasn’t all, unfortunately: I also had to modify my grammar to support unforeseen uses of the #line directive. Until now, it supported only the canonical forms from K&R:

#line constant "filename"
#line constant

For the "filename" part I used the standard "string literal" token in my lexer. Unfortunately, that didn’t work, because the filename may contain Windows paths that look like this: d:\stuff\include\file.h, and this isn’t a valid string literal since \i is an invalid escape sequence. So, to support this I had to relax the definition of the char and string constants. That’s not too bad, because invalid escape sequences in real string literals can easily be caught at a later stage.
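
To make the difference concrete, here is a small sketch with two regular expressions: one that accepts only the escape sequences C defines, and a relaxed one that lets a backslash be followed by any character. Both patterns and their names are illustrative assumptions, not the actual token definitions from the lexer.

```python
import re

# Strict C string literal: only the escape sequences the language defines.
STRICT_STRING = re.compile(
    r'"(?:[^"\\\n]'            # any ordinary character
    r'|\\[abfnrtv\'"?\\]'      # simple escapes such as \n or \\
    r'|\\[0-7]{1,3}'           # octal escapes
    r'|\\x[0-9a-fA-F]+)*"'     # hexadecimal escapes
)

# Relaxed variant: a backslash may be followed by any single character.
PERMISSIVE_STRING = re.compile(r'"(?:[^"\\\n]|\\.)*"')

filename = r'"d:\stuff\include\file.h"'

print(STRICT_STRING.fullmatch(filename) is not None)
print(PERMISSIVE_STRING.fullmatch(filename) is not None)
```

The strict pattern rejects the Windows path while the relaxed one accepts it. Note that \f in \file.h happens to be a valid escape (form feed), so the filename should be kept as raw text rather than run through escape processing.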

The problems didn’t end there, however. In one of the header files indirectly included from stdio.h, the following struct is defined (in the output of cpp, of course):

struct _on_exit_args {
    void *  _fnargs[32];
    void *  _dso_handle[32];

    long _fntypes;
    #line 77 "D:\eli\cpp_stuff\libc_include/sys/reent.h"

    long _is_cxa;
};

A #line directive inside a struct? Give me a break. This definitely isn’t part of the formal C grammar, and I’m not sure about the validity of this construct, because #line directives belong to the subtle no-man’s land between the preprocessor and the compiler. To make this work, I was forced to make #line a valid struct_declaration in my parser.
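
For anyone curious how often this happens in their own cpp output, here is a quick diagnostic sketch that reports #line directives appearing while a brace is still open. It is not part of the parser; the function name is made up for this example, and the brace counting is deliberately naive (it ignores braces inside strings and comments).

```python
import re

# Matches a '#line N' or '#line N "file"' directive on a line of its own.
LINE_DIRECTIVE = re.compile(r'^\s*#\s*line\s+(\d+)(?:\s+"(.*)")?\s*$')


def nested_line_directives(preprocessed_text):
    """Return (physical_line_number, brace_depth) for every #line directive
    found while at least one '{' is still open."""
    hits = []
    depth = 0
    for lineno, line in enumerate(preprocessed_text.splitlines(), start=1):
        if LINE_DIRECTIVE.match(line):
            if depth > 0:
                hits.append((lineno, depth))
            continue  # a directive line does not change the brace depth
        # Naive counting: braces in strings or comments are not excluded.
        depth += line.count("{") - line.count("}")
    return hits


SAMPLE = '''struct _on_exit_args {
    void *  _fnargs[32];
    void *  _dso_handle[32];

    long _fntypes;
    #line 77 "D:\\eli\\cpp_stuff\\libc_include/sys/reent.h"

    long _is_cxa;
};
'''

print(nested_line_directives(SAMPLE))  # prints [(6, 1)]
```

On the struct shown above, the directive is found on the sixth line at brace depth 1, which is exactly the position the K&R grammar does not account for.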

Finally

After all these changes, the parser successfully parsed a simple C file that includes stdio.h. The resulting preprocessed C file is about 5000 lines long (most of which are blank lines left by cpp where macro definitions used to be), and the parser took 0.25 seconds to parse it, which isn’t too bad, I guess, but can surely be improved.

Now I can finally manage complete C programs with the parser, and I can also see that it successfully grinds through stdio.h, which is encouraging.

More on this saga later.

[1] But if you insist, the code is here

6 Responses to “On parsing the C standard library headers”

  1. _Mark_ Says:

    Why do you think there’s anything unusual about that #line directive? You’ll see it if you have any interesting macro expansions, or any C generated by other language tools (look at yacc output, for example).

    newlib is a good “easy” choice, since it’s based on an old BSD libc so it started out fairly cross-platform and not especially gcc-centric like linux headers tend to be. Depending on what you plan to do with this parser (programmatic augmentation of C code with additional checks? automatic conversion of headers to pyrex or ctypes declarations?) you might consider treating that as merely a starting point and trying GNU libc as your next challenge…

  2. eliben Says:

    It’s not unusual, but it isn’t specified as valid in K&R2, I think. And I used the grammar provided there as a basis for the parser.

    I see no reason to use the #line directive unless a file was included, and why would a file be included in the middle of a struct spec.

  3. Yossi Kreinin Says:

    When I parse the output of cpp, I usually deal with #line at the lexer level, not the parser, exactly because #line is valid everywhere, including things like:

    int
    #line 5 "file"
    n = 9;

    Basically it’s independent of the language grammar (for example, I run cpp on source files which aren’t C); the idea to me is that you have a token stream, and each token remembers its definition location (file, line), and #line “escapes” the normal token stream and modifies the current file and line maintained by the lexer, but never gets past the lexer.

  4. eliben Says:

    @Yossi,

    I considered this possibility, but rejected it because handling #line at the level of the lexer is “hackish”. At the parser level, by contrast, I can simply have the rules (PLY syntax):

    """ pp_line_spec    : INT_CONST_DEC STRING_LITERAL
                        | INT_CONST_DEC
    """

    """ pp_line  : PPHASH PPLINE pp_line_spec PPEND
    """
    

    At the lexer level I will be forced to parse forward manually after I’ve detected #line. It’s pretty simple to pass the line number from the parser to the lexer (by holding an offset) and get all the line numbers correctly in this manner.

    But what you’re saying regarding the complete freedom of #line’s location is true, and perhaps I will be now forced to reconsider this design decision and do it at the lexer level.

  5. John D. Mitchell Says:

    The C programming language is a fair bit different than what’s described in K&R2. If you’re looking to make your parser support the language, you might want to get a copy of the C89 version of the language standard and if you want to be current you’ll need a version of the C99 standard as well.

    Also, beware that GCC plays games with a slew of extensions that are not part of the C language standards and that can play havoc with your grammars.

    In terms of things like #line, I suggest looking at Antlr’s (http://antlr.org/) notion of having multiple channels. I.e., things like #line go into a side-channel separate from the mainline code.

  6. _Mark_ Says:

    “I see no reason to use the #line directive unless a file was included, and why would a file be included in the middle of a struct spec.”

    That’s why I suggested the yacc output example; the #line directives in the generated C code lead to debugging information and error messages that point to lines in the higher level source, which don’t line up neatly with C data structures. include directives are just one special case of that…