From C to AST and back to C with pycparser
March 7th, 2011 at 8:02 am
Ever since I first released pycparser, people have been asking me if it’s possible to generate C code back from the ASTs it creates. My answer was always: "sure, it was done by other users and doesn’t sound very difficult".
But recently I thought, why not add an example to pycparser’s distribution showing how one could go about it? So this is exactly what I did, and such an example (examples/c-to-c.py) is part of pycparser version 2.03, which was released today.
Dumping C back from pycparser ASTs turned out to be not too difficult, but not as trivial as I initially imagined. Some particular points of interest I ran into:
- I couldn’t use the generic node visitor distributed with pycparser, because I needed to accumulate generated strings from a node’s children.
- C types were, as usual, a problem. This led to an interesting application of non-trivial recursive AST visiting. To properly print out types, I had to accumulate pointer, array and function modifiers (see the _generate_type method for more details) while traversing down the tree, using this information in the innermost nodes.
- C statements are also problematic, because some expressions can be both parts of other expressions and statements in their own right. This makes it a bit tricky to decide when to add semicolons after expressions.
- ASTs encode operator precedence implicitly in their structure (i.e. they have no need for parentheses). But how do I print it back into C? Just parenthesizing both sides of each operator quickly gets ugly. So the code uses some heuristics to not parenthesize some nodes that surely have precedence higher than all binary operators. a = b + (c * k) definitely looks better than a = (b) + ((c) * (k)), though both would parse back into the same AST. This applies not only to operators but also to things like structure references. *foo->bar and (*foo)->bar mean different things to a C compiler, and c-to-c.py knows to parenthesize the left side only when necessary.
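To make the points about string accumulation, semicolons and parentheses more concrete, here is a minimal sketch in Python. It is not the code from c-to-c.py: the node classes (ID, UnaryOp, BinaryOp, StructRef, Assignment, Compound) and the Generator class are hypothetical, stripped-down stand-ins invented for illustration, and the "simple node" heuristic is an assumption about how such a rule could look.

```python
from dataclasses import dataclass

# Hypothetical miniature AST nodes, for illustration only. pycparser's real
# c_ast nodes carry more fields than these.
@dataclass
class ID:
    name: str

@dataclass
class UnaryOp:
    op: str
    expr: object

@dataclass
class BinaryOp:
    op: str
    left: object
    right: object

@dataclass
class StructRef:
    base: object
    kind: str    # '->' or '.'
    field: str

@dataclass
class Assignment:
    lvalue: object
    rvalue: object

@dataclass
class Compound:
    items: list


class Generator:
    """Every visit_* method returns the generated string to its caller."""

    def visit(self, node):
        return getattr(self, 'visit_' + type(node).__name__)(node)

    def _parens_unless_simple(self, node):
        # Assumed heuristic: identifiers and struct references bind tighter
        # than any unary or binary operator, so they can stay bare.
        text = self.visit(node)
        if isinstance(node, (ID, StructRef)):
            return text
        return '(' + text + ')'

    def visit_ID(self, n):
        return n.name

    def visit_UnaryOp(self, n):
        return n.op + self._parens_unless_simple(n.expr)

    def visit_BinaryOp(self, n):
        left = self._parens_unless_simple(n.left)
        right = self._parens_unless_simple(n.right)
        return '%s %s %s' % (left, n.op, right)

    def visit_StructRef(self, n):
        return self._parens_unless_simple(n.base) + n.kind + n.field

    def visit_Assignment(self, n):
        return '%s = %s' % (self.visit(n.lvalue), self.visit(n.rvalue))

    def visit_Compound(self, n):
        body = []
        for item in n.items:
            text = self.visit(item)
            # An expression used as a statement needs a semicolon;
            # a nested block does not.
            if not isinstance(item, Compound):
                text += ';'
            body.extend('  ' + line for line in text.split('\n'))
        return '\n'.join(['{'] + body + ['}'])


gen = Generator()
expr = Assignment(ID('a'), BinaryOp('+', ID('b'), BinaryOp('*', ID('c'), ID('k'))))
print(gen.visit(expr))                                             # a = b + (c * k)
print(gen.visit(UnaryOp('*', StructRef(ID('foo'), '->', 'bar'))))  # *foo->bar
print(gen.visit(StructRef(UnaryOp('*', ID('foo')), '->', 'bar')))  # (*foo)->bar
print(gen.visit(Compound([expr])))
```

The same Assignment node is printed without a semicolon when visited as an expression and with one when it appears as an item of a Compound, which is the kind of context-dependent decision described in the list above.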
Here’s a sample function before being parsed into an AST:
const Entry* HashFind(const Hash* hash, const char* key)
{
    unsigned int index = hash_func(key, hash->table_size);
    Node* temp = hash->heads[index];
    while (temp != NULL)
    {
        if (!strcmp(key, temp->entry->key))
            return temp->entry;
        temp = temp->next;
    }
    return NULL;
}
And here it is when dumped back from a parsed AST by c-to-c.py:
const Entry *HashFind(const Hash *hash, const char *key)
{
  int unsigned index = hash_func(key, hash->table_size);
  Node *temp = hash->heads[index];
  while (temp != NULL)
  {
    if (!strcmp(key, temp->entry->key))
      return temp->entry;
    temp = temp->next;
  }
  return NULL;
}
Indentation and whitespace aside, it looks almost exactly the same. Note the curiosity on the declaration of index. In C you can specify several type names before a variable (such as unsigned int or long long int), but c-to-c.py has no idea in what order to print them back. The order itself doesn’t really matter to a C compiler – unsigned int and int unsigned are exactly the same in its eyes. unsigned int is just a convention used by most programmers.
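The declaration printing discussed here can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual _generate_type method: TypeDecl, PtrDecl and ArrayDecl below are hypothetical simplified classes, generate_type only handles pointers and arrays (no functions or qualifiers), and the conventional() ordering table is my own guess at what "the convention used by most programmers" might look like in code.

```python
from dataclasses import dataclass

# Hypothetical simplified type nodes, for illustration only.
@dataclass
class TypeDecl:
    declname: str
    names: list      # type names in whatever order the AST stores them

@dataclass
class PtrDecl:
    type: object

@dataclass
class ArrayDecl:
    type: object
    dim: str

# Assumed ranking: sign first, then size, then everything else.
_RANK = {'signed': 0, 'unsigned': 0, 'short': 1, 'long': 1}

def conventional(names):
    # sorted() is stable, so names with equal rank keep their relative order.
    return sorted(names, key=lambda name: _RANK.get(name, 2))

def generate_type(node, modifiers=(), reorder=False):
    if isinstance(node, (PtrDecl, ArrayDecl)):
        # Not at the innermost node yet: remember this modifier and descend.
        return generate_type(node.type, tuple(modifiers) + (node,), reorder)

    # Innermost TypeDecl: now the accumulated modifiers can be applied,
    # starting with the outermost one.
    names = conventional(node.names) if reorder else node.names
    decl = node.declname
    for i, mod in enumerate(modifiers):
        if isinstance(mod, ArrayDecl):
            # Parenthesize to tell "pointer to array" from "array of pointers".
            if i > 0 and isinstance(modifiers[i - 1], PtrDecl):
                decl = '(' + decl + ')'
            decl += '[' + mod.dim + ']'
        else:
            decl = '*' + decl
    return ' '.join(names) + ' ' + decl


index = TypeDecl('index', ['int', 'unsigned'])
print(generate_type(index))                 # int unsigned index
print(generate_type(index, reorder=True))   # unsigned int index
print(generate_type(PtrDecl(TypeDecl('temp', ['Node']))))                 # Node *temp
print(generate_type(ArrayDecl(PtrDecl(TypeDecl('a', ['int'])), '10')))    # int *a[10]
print(generate_type(PtrDecl(ArrayDecl(TypeDecl('a', ['int']), '10'))))    # int (*a)[10]
```

Without reordering, the type names come out in stored order, which reproduces the int unsigned curiosity; the reorder flag shows one way a generator could restore the more familiar spelling.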
A final word: since this is just an example, I didn’t invest too much into the validation of c-to-c.py – it’s considered "alpha" quality at best. If you find any bugs, please open an issue and I’ll have it fixed.
March 7th, 2011 at 10:11
Will this support C++ in the future?
March 7th, 2011 at 11:50
Samuel,
Almost certainly not. I recommend you take a look at clang for parsing and analyzing C++. It has Python bindings AFAIK.
March 8th, 2011 at 09:17
Do you think it would be feasible to generate Python from the AST?
Use case is a C library which would be useful to run on Google App Engine (supporting only Python and Java).
March 8th, 2011 at 10:12
Marcel,
You mean compile C to Python? Interesting idea, I didn’t consider it. While in theory this is probably feasible, C files rarely live in a vacuum and are usually linked with other C files and libraries.
March 8th, 2011 at 12:01
Eli,
Could it compile one C file to one Python file and reference the others, say:
#include "xyz.h" -> import xyz
What if the library only reads files – nothing else?
Maybe those standard libraries could be adapted in the resulting Python source manually. Automatically isn’t so easy – or is it?!
March 8th, 2011 at 17:21
Marcel,
As I said, theoretically I see no problem with this; I think it can be done with some effort. Surely simple snippets of C code can be translated into functionally equivalent Python. I’m just not sure it would be worth the effort, depending on the exact problem.
March 11th, 2012 at 21:23
In all the examples I see AST being generated from existing C code.
Is there a way to build the AST with Python code and only then generate the C code?
(I need to write a parser from some input to C code and I want to go through AST first so that my generated C code will be valid and indented).
March 12th, 2012 at 07:43
Kobi,
Yes. The c_ast module is fairly general and you can use it to construct an AST manually (or by some alternative procedure) and not necessarily through the C parser. The C generator doesn’t care about the source of the AST, just its contents. Note that the AST still has a lot of details that will need to be filled in correctly by your parser.