From C to AST and back to C with pycparser

Ever since I first released pycparser, people were asking me if it's possible to generate C code back from the ASTs it creates. My answer was always - "sure, it was done by other users and doesn't sound very difficult".

But recently I thought, why not add an example to pycparser's distribution showing how one could go about it. So this is exactly what I did, and such an example (examples/c-to-c.py) is part of pycparser version 2.03 which was released today.

Dumping C back from pycparser ASTs turned out to be not too difficult, but not as trivial as I initially imagined. Some particular points of interest I ran into:

I couldn't use the generic node visitor distributed with pycparser, because I needed to accumulate generated strings from a node's children.
C types were, as usual, a problem. This led to an interesting application of non-trivial recursive AST visiting. To properly print out types, I had to accumulate pointer, array and function modifiers (see the _generate_type method for more details) while traversing down the tree, using this information in the innermost nodes.
C statements are also problematic, because some expressions can be both parts of other expressions and statements on their own right. This makes it a bit tricky to decide when to add semicolons after expressions.
ASTs encode operator precedence implicitly (i.e. there's no need for it). But how do I print it back into C? Just parenthesizing both sides of each operator quickly gets ugly. So the code uses some heuristics to not parenthesize some nodes that surely have precedence higher than all binary operators. a = b + (c * k) definitely looks better than a = (b) + ((c) * (k)), though both would parse back into the same AST. This applies not only to operators but also to things like structure references. *foo->bar and (*foo)->bar mean different things to a C compiler, and c-to-c.py knows to parenthesize the left-side only when necessary.

Here's a sample function before being parsed into an AST:

const Entry* HashFind(const Hash* hash, const char* key)
{
    unsigned int index = hash_func(key, hash->table_size);
    Node* temp = hash->heads[index];

    while (temp != NULL)
    {
        if (!strcmp(key, temp->entry->key))
            return temp->entry;

        temp = temp->next;
    }

    return NULL;
}

And here it is when dumped back from a parsed AST by c-to-c.py:

const Entry *HashFind(const Hash *hash, const char *key)
{
  int unsigned index = hash_func(key, hash->table_size);
  Node *temp = hash->heads[index];
  while (temp != NULL)
  {
    if (!strcmp(key, temp->entry->key))
      return temp->entry;

    temp = temp->next;
  }

  return NULL;
}

Indentation and whitespace aside, it looks almost exactly the same. Note the curiosity on the declaration of index. In C you can specify several type names before a variable (such as unsigned int or long long int), but c-to-c.py has no idea in what order to print them back. The order itself doesn't really matter to a C compiler - unsigned int and int unsigned are exactly the same in its eyes. unsigned int is just a convention used by most programmers.

A final word: since this is just an example, I didn't invest too much into the validation of c-to-c.py - it's considered "alpha" quality at best. If you find any bugs, please open an issue and I'll have it fixed.