Adventures in parsing C: ASTs for switch statements

February 3rd, 2012 at 4:02 pm

Last week I received an email from a user of pycparser that mentioned the strange AST that results when pycparser parses a switch statement.

Let’s take the following snippet of C code for example. Don’t look for semantic sense in it – it’s just used to test the parser:

switch (myvar) {
    case 10:
        k = 10;
        p = k + 1;
        return 10;
    case 20:
    case 30:
        return 20;
    default:
        break;
}

And the AST pycparser was generating for this code:

Switch:
  ID: myvar
  Compound:
    Case:
      Constant: int, 10
      Assignment: =
        ID: k
        Constant: int, 10
    Assignment: =
      ID: p
      BinaryOp: +
        ID: k
        Constant: int, 1
    Return:
      Constant: int, 10
    Case:
      Constant: int, 20
      Case:
        Constant: int, 30
        Return:
          Constant: int, 20
    Default:
      Break:

There are two problems here:

  1. Only the first statement inside each case is made a child of that case – the other statements are siblings.
  2. Two consecutive case statements without any other statements in between (fall-through) cause the second case to become the child of the first one. If additional consecutive case statements follow, they nest even further.

Since the parser follows the C grammar pretty closely, I immediately went to look into the C99 standard, and indeed, this is exactly the parse tree that it mandates. Here’s the relevant portion of the language grammar (from section A.2.3):

(6.8) statement:
              labeled-statement
              compound-statement
              expression-statement
              selection-statement
              iteration-statement
              jump-statement
(6.8.1) labeled-statement:
              identifier : statement
              case constant-expression : statement
              default : statement

Note that a case (and default, which is equivalent to case in this whole discussion) must be followed by one, and only one other statement. This explains why pycparser parses the code above the way it does.

However, the goal of pycparser is not to generate a parse tree. It is to generate an abstract syntax tree (AST), which follows the language semantics rather than its grammar. Hey, I already wrote about this stuff!

So today I fixed this part of pycparser, by adding a dedicated AST transformation after parsing a switch statement. The transformation isn’t really complicated, and the AST pycparser generates now is much friendlier. Here it is, for the same code:

Switch:
  ID: myvar
  Compound:
    Case:
      Constant: int, 10
      Assignment: =
        ID: k
        Constant: int, 10
      Assignment: =
        ID: p
        BinaryOp: +
          ID: k
          Constant: int, 1
      Return:
        Constant: int, 10
    Case:
      Constant: int, 20
    Case:
      Constant: int, 30
      Return:
        Constant: int, 20
    Default:
      Break:

As you can see, the problems mentioned above were fixed. This fix is available in the pycparser Mercurial repository and will be part of the next release.

Related posts:

  1. Parsing C++ in Python with Clang
  2. pycparser now supports C99
  3. ASTs for analyzing C
  4. On parsing the C standard library headers
  5. the answer for parsing C ?

7 Responses to “Adventures in parsing C: ASTs for switch statements”

  1. Marius GedminasNo Gravatar Says:

    Can it parse Duff’s device correctly?

  2. elibenNo Gravatar Says:

    Marius,

    It appears so: https://gist.github.com/1734923

    I think that in terms of AST that’s correct. Semantically, case statements are just labels wherever they are under the switch.

  3. Kay SchluehrNo Gravatar Says:

    However, the goal of pycparser is not to generate a parse tree. It is to generate an abstract syntax tree (AST), which follows the language semantics rather than its grammar.

    But that’s somehow a degeneration because one corrects a formal model of syntax by an informal consideration about semantics instead of refactoring the grammar to represent the intended meaning in a formal way.

    It is not possible to refactor grammars in full generality because there is no function that can implement ‘=’ with respect to a given language L and two arbitrary CFGs but one can use the CFGs as generative grammars and approximate the set of all syntactically correct expressions, which yields a good test base for equality of the parsers generated from both languages. This allows for careful corrections as well.

  4. elibenNo Gravatar Says:

    Kay,

    I think it would be hard to refactor the C grammar to allow folding all statements following a case into it until the next case is encountered, without introducing significant ambiguity.

  5. Kay SchluehrNo Gravatar Says:

    I need to take your word here.

    Anyway, you brought to my attention an interesting aspect of the ANSI C grammar. Thanks for this!

  6. ugurNo Gravatar Says:

    “Note that a case (and default, which is equivalent to case in this whole discussion) must be followed by one, and only one other statement”

    Does it mean that C code in your example is incorrect? In fact most of the time there are more than one statement following case, now the quoted sentence confuses me.

  7. elibenNo Gravatar Says:

    ugur,

    No, it’s correct. The statement affects the parse tree generated by the grammar for the given code. Instead of all statements following the case being its children, only the first one is. The other ones are siblings (children of the parent compound statement).

Leave a Reply

To post code with preserved formatting, enclose it in `backticks` (even multiple lines)