From C to AST and back to C with pycparser

March 7th, 2011 at 8:02 am

Ever since I first released pycparser, people have been asking me whether it’s possible to generate C code back from the ASTs it creates. My answer was always: "sure, it has been done by other users and doesn’t sound very difficult".

But recently I thought, why not add an example to pycparser’s distribution showing how one could go about it? So this is exactly what I did, and such an example (examples/c-to-c.py) is part of pycparser version 2.03, which was released today.

Dumping C back from pycparser ASTs turned out to be not too difficult, but not as trivial as I initially imagined. Some particular points of interest I ran into:

  • I couldn’t use the generic node visitor distributed with pycparser, because I needed to accumulate generated strings from a node’s children.
  • C types were, as usual, a problem. This led to an interesting application of non-trivial recursive AST visiting. To properly print out types, I had to accumulate pointer, array and function modifiers (see the _generate_type method for more details) while traversing down the tree, using this information in the innermost nodes.
  • C statements are also problematic, because some expressions can be both parts of other expressions and statements in their own right. This makes it a bit tricky to decide when to add semicolons after expressions.
  • ASTs encode operator precedence implicitly in the shape of the tree, so they need no parentheses. But how do I print it back into C? Just parenthesizing both sides of each operator quickly gets ugly. So the code uses some heuristics to avoid parenthesizing nodes that surely bind tighter than any binary operator. a = b + (c * k) definitely looks better than a = (b) + ((c) * (k)), though both would parse back into the same AST. This applies not only to operators but also to things like structure references. *foo->bar and (*foo)->bar mean different things to a C compiler, and c-to-c.py knows to parenthesize the left side only when necessary.
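
To make the first and last points concrete, here is a minimal, self-contained sketch of a visitor that returns strings from its children and parenthesizes only the operands that are not obviously "simple". It is not the code from c-to-c.py: the node classes (ID, UnaryOp, BinaryOp, StructRef) and the ToyGenerator class are hypothetical stand-ins, and the choice of which nodes count as simple is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Union


# Hypothetical toy node types, not pycparser's real c_ast classes.
@dataclass
class ID:
    name: str


@dataclass
class UnaryOp:
    op: str
    expr: "Expr"


@dataclass
class BinaryOp:
    op: str
    left: "Expr"
    right: "Expr"


@dataclass
class StructRef:
    base: "Expr"
    op: str  # "." or "->"
    field: str


Expr = Union[ID, UnaryOp, BinaryOp, StructRef]


class ToyGenerator:
    """Visitor whose visit methods return the generated text."""

    def visit(self, node):
        # Dispatch on the node's class name, e.g. visit_BinaryOp.
        return getattr(self, "visit_" + type(node).__name__)(node)

    def visit_ID(self, n):
        return n.name

    def visit_UnaryOp(self, n):
        # Postfix forms bind tighter than a prefix operator: leave them bare.
        return n.op + self._paren_unless(n.expr, (ID, StructRef))

    def visit_BinaryOp(self, n):
        # Anything other than a nested binary operator is treated as simple.
        simple = (ID, UnaryOp, StructRef)
        left = self._paren_unless(n.left, simple)
        right = self._paren_unless(n.right, simple)
        return f"{left} {n.op} {right}"

    def visit_StructRef(self, n):
        # Only the left side may need parentheses, e.g. (*foo)->bar.
        return self._paren_unless(n.base, (ID, StructRef)) + n.op + n.field

    def _paren_unless(self, node, simple_types):
        text = self.visit(node)
        return text if isinstance(node, simple_types) else f"({text})"
```

With these definitions, visiting BinaryOp("+", ID("b"), BinaryOp("*", ID("c"), ID("k"))) yields b + (c * k), while a StructRef whose base is UnaryOp("*", ID("foo")) comes out as (*foo)->bar and the reverse nesting as *foo->bar.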

Here’s a sample function before being parsed into an AST:

const Entry* HashFind(const Hash* hash, const char* key)
{
    unsigned int index = hash_func(key, hash->table_size);
    Node* temp = hash->heads[index];

    while (temp != NULL)
    {
        if (!strcmp(key, temp->entry->key))
            return temp->entry;

        temp = temp->next;
    }

    return NULL;
}

And here it is when dumped back from a parsed AST by c-to-c.py:

const Entry *HashFind(const Hash *hash, const char *key)
{
  int unsigned index = hash_func(key, hash->table_size);
  Node *temp = hash->heads[index];
  while (temp != NULL)
  {
    if (!strcmp(key, temp->entry->key))
      return temp->entry;

    temp = temp->next;
  }

  return NULL;
}

Indentation and whitespace aside, it looks almost exactly the same. Note the curiosity in the declaration of index. In C you can specify several type names before a variable (such as unsigned int or long long int), but c-to-c.py has no idea in what order to print them back. The order itself doesn’t really matter to a C compiler: unsigned int and int unsigned are exactly the same in its eyes. Writing unsigned int is just a convention used by most programmers.
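
If the printed order matters for readability, a generator could normalize the names before printing them. The sketch below shows one hypothetical way to do that; the conventional_order helper and its ranking table are illustrative assumptions and are not part of c-to-c.py.

```python
# Illustrative ranking of C type-specifier keywords (an assumption, not a standard):
# signedness first, then size modifiers, then the base type.
_SPECIFIER_RANK = {
    "signed": 0, "unsigned": 0,
    "short": 1, "long": 1,
    "char": 2, "int": 2, "float": 2, "double": 2,
}


def conventional_order(names):
    """Return the specifier names sorted into the commonly written order."""
    # sorted() is stable, so repeated keywords such as "long long" stay together,
    # and names missing from the table (e.g. typedef names) sort last.
    return sorted(names, key=lambda name: _SPECIFIER_RANK.get(name, 3))
```

For example, conventional_order(["int", "unsigned"]) returns ["unsigned", "int"], and ["int", "long", "long"] becomes ["long", "long", "int"].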

A final word: since this is just an example, I didn’t invest too much into the validation of c-to-c.py – it’s considered "alpha" quality at best. If you find any bugs, please open an issue and I’ll have it fixed.

8 Responses to “From C to AST and back to C with pycparser”

  1. Samuel Says:

    will this support c++ in the future? :D

  2. eliben Says:

    Samuel,

    Almost certainly not. I recommend you take a look at clang for parsing and analyzing C++. It has Python bindings AFAIK.

  3. Marcel Says:

    Do you think it would be feasible to generate Python from the AST?
    Use case is a C library which would be useful to run on Google App Engine (supporting only Python and Java).

  4. eliben Says:

    Marcel,

    You mean compile C to Python? Interesting idea, I didn’t consider it. While in theory this is probably feasible, C files rarely live in a vacuum and are usually linked with other C files and libraries.

  5. Marcel Says:

    Eli,

    Could it compile one C file to one Python file while referencing the others, say:
    #include "xyz.h" -> import xyz

    What if the library only reads files – nothing else?
    Maybe those standard libraries could be adapted in the resulting Python source manually. Automatically isn’t so easy – or is it?!

  6. eliben Says:

    Marcel,

    As I said, theoretically I see no problem with this; I think it can be done with some effort. Surely simple snippets of C code can be translated into functionally equivalent Python. Whether it’s worth the work depends on the exact problem, though.

  7. Kobi Says:

    In all the examples I see AST being generated from existing C code.
    Is there a way to build the AST with Python code and only then generate the C code?
    (I need to write a parser from some input to C code and I want to go through AST first so that my generated C code will be valid and indented).

  8. eliben Says:

    Kobi,

    Yes. The c_ast module is fairly general and you can use it to construct an AST manually (or by some alternative procedure) and not necessarily through the C parser. The C generator doesn’t care about the source of the AST, just its contents. Note that the AST still has a lot of details that will need to be filled in correctly by your parser.
