On parsing C, type declarations and fake headers

pycparser has become fairly popular in the past couple of years (especially following its usage in cffi). This means I get more questions by email, which leads me to getting tired of answering the same questions :-)

So this blog post is a one-stop shop for the (by far) most frequently asked question about pycparser - how to handle headers that your code #includes.

I've certainly written about this before, and it's mentioned in the README, but I feel that additional details are needed to provide a more complete answer to the different variations of this question.

First, a disclaimer. This post assumes some level of familiarity with the C programming language and how it's compiled. You must know about the C preprocessor (the thing that handles directives like #include and #define), and have a general understanding of how multiple source files (most often a .c file and any number of .h files) get combined into a single translation unit for compilation. If you don't have a strong grasp of these concepts, I would hold off using pycparser until you learn more about them.

So what's the problem?

The problem arises when the code you want to analyze with pycparser #includes a header file:

#include <someheader.h>

int foo() {
    // my code
}

Since this is true of virtually all real-life code, it's a problem almost everyone faces.

How to handle headers with pycparser

In general, pycparser does not concern itself with headers, or C preprocessor directives in general. The CParser object expects preprocessed code in its parse method, period. So you have two choices:

Provide preprocessed code to pycparser. This means you first preprocess the code by invoking, say, gcc -E (or clang -E, or cpp, or whatever way you have to preprocess code [1]).
Use pycparser's parse_file convenience function; it will invoke the preprocessor for you. Here's an example.

Great, so now you can handle headers. However, this is unlikely to solve all your problems, because pycparser will have trouble parsing some library headers; first and foremost, it will probably have trouble parsing the standard library headers.

Why? Because while pycparser fully supports C99, many library headers are full of compiler extensions and other clever tricks for compatibility across multiple platforms. While it's entirely possible to parse them with pycparser [2], this requires work. Work that you may not have the skills or the time to do. Work that, fortunately, is almost certainly unnecessary.

Why isn't it necessary? Because, in all likeness, you don't really need pycparser to parse those headers at all.

What pycparser actually needs to parse headers for

To understand this bold claim, you must first understand why pycparser needs to parse headers. Let's start with a more basic question - why does the C compiler need to parse headers your file includes?

For a number of reasons; some of them syntactic, but most of them semantic. Syntactic issues are those that may prevent the compiler from parsing the code. #defines are one, types are another.

For example, the C code:

{
    T * x;
}

Cannot be properly parsed unless we know whether:

Either T or x are macros #defined to something.
T is a type that was previously created with a typedef.

For a thorough explanation of this issue, look at this article and other related postings on my website.

Semantic reasons are those that won't prevent the compiler from parsing the code, but will prevent it from properly understanding and verifying it. For example, declarations of functions being used. Full declarations of structs, and so on. These take up the vast majority of real-world header files. But as it turns out, since pycparser only cares about parsing the code into an AST, and doesn't do any semantic analysis or further processing, it doesn't care about these issues. In other words, given the code:

{
    foo(a.b);
}

pycparser can construct a proper AST (given that none of foo, a or b are type names). It doesn't care what the actual declaration of foo is, whether a is indeed a variable of struct type, or whether it has a field named b [3].

So pycparser requires very little from header files. This is how the idea of "fake headers" was born.

Fake headers

Let's get back to this simple code sample:

#include <someheader.h>

int foo() {
    // my code
}

So we've established two key ideas:

pycparser needs to know what someheader.h contains so it can properly parse the code.
pycparser needs only a very small subset of someheader.h to perform its task.

The idea of fake headers is simple. Instead of actually parsing someheader.h and all the other headers it transitively includes (this probably includes lots of system and standard library headers too), why not create a "fake" someheader.h that only contains the parts of the original that are necessary for parsing - the #defines and the typedefs.

The cool part about typedefs is that pycparser doesn't actually care what a type is defined to be. T may be a pointer to function accepting an array of struct types, but all pycparser needs to see is:

typedef int T;

So it knows that T is a type. It doesn't care what kind of type it is.

So what do you have to do to parse your program?

OK, so now you hopefully have a better understanding of what headers mean for pycparser, and how to work around having to parse tons of system headers. What does this actually mean for your program, though? Will you now have to scour through all your headers, "faking them out"? Unlikely. If your code is standards-compliant C, then most likely pycparser will have no issue parsing all your headers. But you probably don't want it to parse the system hedaders. In addition to being nonstandard, these headers are usually large, which means longer parsing time and larger ASTs.

So my suggestion would be: let pycparser parse your headers, but fake out the system headers, and possibly any other large library headers used by your code. As far as the standard headers, pycparser already provides you with nice fakes in its utils folder. All you need to do is provide this flag to the preprocessor [4]:

-I<PATH-TO-PYCPARSER>/utils/fake_libc_include

And it will be able to find header files like stdio.h and sys/types.h with the proper types defined.

I'll repeat: the flag shown above is almost certainly sufficient to parse a C99 program that only relies on the C runtime (i.e. has no other library dependencies).

Real-world example

OK, enough theory. Now I want to work through an example to help ground these suggestions in reality. I'll take some well-known open-source C project and use pycparser to parse one of its files, fully showing all the steps taken until a successful parse is done. I'll pick Redis.

Let's start at the beginning, by cloning the Redis git repo:

/tmp$ git clone git@github.com:antirez/redis.git

I'll be using the latest released pycparser (version 2.13 at the time of writing). I'll also clone its repository into /tmp so I can easily access the fake headers:

/tmp$ git clone git@github.com:eliben/pycparser.git

A word on methodology - when initially exploring how to parse a new project, I'm always preprocessing separately. Once I figure out the flags/settings/extra faking required to successfully parse the code, it's all very easy to put in a script.

Let's take the main Redis file (redis/src/redis.c) and attempt to preprocess it. The first preprocessor invocation simply adds the include paths for Redis's own headers (they live in redis/src) and pycparser's fake libc headers:

/tmp$ gcc -E -Iredis/src -Ipycparser/utils/fake_libc_include redis/src/redis.c > redis_pp.c
# 48 "redis/src/redis.h" 2
In file included from redis/src/redis.c:30:0:
redis/src/redis.h:48:17: fatal error: lua.h: No such file or directory
 #include <lua.h>
             ^
compilation terminated.

Oops, no good. Redis is looking for Lua headers. Let's see if it carries this dependency along:

/tmp$ find redis -name lua
redis/deps/lua

Indeed! We should be able to add the Lua headers to the preprocessor path too:

/tmp$ gcc -E -Iredis/src -Ipycparser/utils/fake_libc_include \
             -Iredis/deps/lua/src redis/src/redis.c > redis_pp.c

Great, no more errors. Now let's try to parse it with pycparser. I'll load pycparser in an interactive terminal, but any other technique (such as running one of the example scripts will work):

: import pycparser
: pycparser.parse_file('/tmp/redis_pp.c')
... backtrace
---> 55         raise ParseError("%s: %s" % (coord, msg))

ParseError: /usr/include/x86_64-linux-gnu/sys/types.h:194:20: before: __attribute__

This error is strange. Note where it occurs: in a system header included in the preprocessed file. But we should have no system headers there - we specified the fake headers path. What gives?

The reason this is happening is that gcc knows about some pre-set system header directories and will add them to its search path. We can block this, making sure it only looks in the directories we explicitly specify with -I, by providing it with the -nostdinc flag. Let's re-run the preprocessor:

/tmp$ gcc -nostdinc -E -Iredis/src -Ipycparser/utils/fake_libc_include \
                       -Iredis/deps/lua/src redis/src/redis.c > redis_pp.c

Now I'll try to parse the preprocessed code again:

: pycparser.parse_file('/tmp/redis_pp.c')
... backtrace
---> 55         raise ParseError("%s: %s" % (coord, msg))

ParseError: redis/src/sds.h:74:5: before: __attribute__

OK, progress! If we look in the code where this error occurs, we'll note a GNU-specific __attribute__ pycparser doesn't support. No problem, let's just #define it away:

$ gcc -nostdinc -E -D'__attribute__(x)=' -Iredis/src \
                   -Ipycparser/utils/fake_libc_include \
                   -Iredis/deps/lua/src redis/src/redis.c > redis_pp.c

If I try to parse again, it works:

: pycparser.parse_file('/tmp/redis_pp.c')
<pycparser.c_ast.FileAST at 0x7f15fc321cf8>

I can also run one of the example scripts now to see we can do something more interesting with the AST:

/tmp$ python pycparser/examples/func_defs.py redis_pp.c
sdslen at redis/src/sds.h:47
sdsavail at redis/src/sds.h:52
rioWrite at redis/src/rio.h:93
rioRead at redis/src/rio.h:106
rioTell at redis/src/rio.h:119
rioFlush at redis/src/rio.h:123
redisLogRaw at redis/src/redis.c:299
redisLog at redis/src/redis.c:343
redisLogFromHandler at redis/src/redis.c:362
ustime at redis/src/redis.c:385
mstime at redis/src/redis.c:396
exitFromChild at redis/src/redis.c:404
dictVanillaFree at redis/src/redis.c:418
... many more lines
main at redis/src/redis.c:3733

This lets us see all the functions defined in redis.c and the headers included in it using pycparser.

This was fairly straightforward - all I had to do is set the right preprocessor flags, really. In some cases, it may be a bit more difficult. The most obvious problem that you may encounter is a new header you'll need to fake away. Luckily, that's very easy - just take a look at the existing ones (say at stdio.h). These headers can be copied to other names/directories, to make sure the preprocessor will find them properly. If you think there's a standard header I forgot to include in the fake headers, please open an issue and I'll add it.

Note that we didn't have to fake out the headers of Redis (or Lua for that matter). pycparser handled them just fine. The same has a high chance of being true for your C project as well.

[1]	On Linux, at least `gcc` should be there on the command line. On OS X, you'll need to install "command-line developer tools" to get a command-line `clang`. If you're in Microsoft-land, I recommend downloading pre-built clang binaries for Windows.

[2]	And this has been done by many folks. pycparser was made to parse the standard C library, `windows.h`, parts of the Linux kernel headers, and so on.

[3]

Note that this describes the most common use of pycparser, which is to perform simple analyses on the source, or rewrite parts of existing source in some way. More complex uses may actually require full parsing of type definitions, structs and function declarations. In fact, you can certainly create a real C compiler using pycparser as the frontend. These uses will require full parsing of headers, so fake headers won't do. As I mentioned above, it's possible to make pycparser parse the actual headers of libraries and so on; it just takes more work.

[4]	Depending on the exact preprocessor you're using, you may need to provide it with another flag telling it to ignore the system headers whose paths are hard-coded in it. Read on to the example for more details.