Adventures in parsing C: ASTs for switch statements
February 3rd, 2012 at 4:02 pmLast week I received an email from a user of pycparser that mentioned the strange AST that results when pycparser parses a switch statement.
Let’s take the following snippet of C code for example. Don’t look for semantic sense in it – it’s just used to test the parser:
switch (myvar) {
case 10:
k = 10;
p = k + 1;
return 10;
case 20:
case 30:
return 20;
default:
break;
}
And the AST pycparser was generating for this code:
Switch:
ID: myvar
Compound:
Case:
Constant: int, 10
Assignment: =
ID: k
Constant: int, 10
Assignment: =
ID: p
BinaryOp: +
ID: k
Constant: int, 1
Return:
Constant: int, 10
Case:
Constant: int, 20
Case:
Constant: int, 30
Return:
Constant: int, 20
Default:
Break:
There are two problems here:
- Only the first statement inside each case is made a child of that case – the other statements are siblings.
- Two consecutive case statements without any other statements in between (fall-through) cause the second case to become the child of the first one. If additional consecutive case statements follow, they nest even further.
Since the parser follows the C grammar pretty closely, I immediately went to look into the C99 standard, and indeed, this is exactly the parse tree that it mandates. Here’s the relevant portion of the language grammar (from section A.2.3):
(6.8) statement:
labeled-statement
compound-statement
expression-statement
selection-statement
iteration-statement
jump-statement
(6.8.1) labeled-statement:
identifier : statement
case constant-expression : statement
default : statement
Note that a case (and default, which is equivalent to case in this whole discussion) must be followed by one, and only one other statement. This explains why pycparser parses the code above the way it does.
However, the goal of pycparser is not to generate a parse tree. It is to generate an abstract syntax tree (AST), which follows the language semantics rather than its grammar. Hey, I already wrote about this stuff!
So today I fixed this part of pycparser, by adding a dedicated AST transformation after parsing a switch statement. The transformation isn’t really complicated, and the AST pycparser generates now is much friendlier. Here it is, for the same code:
Switch:
ID: myvar
Compound:
Case:
Constant: int, 10
Assignment: =
ID: k
Constant: int, 10
Assignment: =
ID: p
BinaryOp: +
ID: k
Constant: int, 1
Return:
Constant: int, 10
Case:
Constant: int, 20
Case:
Constant: int, 30
Return:
Constant: int, 20
Default:
Break:
As you can see, the problems mentioned above were fixed. This fix is available in the pycparser Mercurial repository and will be part of the next release.
Related posts:

February 4th, 2012 at 04:11
Can it parse Duff’s device correctly?
February 4th, 2012 at 05:22
Marius,
It appears so: https://gist.github.com/1734923
I think that in terms of AST that’s correct. Semantically,
casestatements are just labels wherever they are under theswitch.February 4th, 2012 at 13:36
But that’s somehow a degeneration because one corrects a formal model of syntax by an informal consideration about semantics instead of refactoring the grammar to represent the intended meaning in a formal way.
It is not possible to refactor grammars in full generality because there is no function that can implement ‘=’ with respect to a given language L and two arbitrary CFGs but one can use the CFGs as generative grammars and approximate the set of all syntactically correct expressions, which yields a good test base for equality of the parsers generated from both languages. This allows for careful corrections as well.
February 4th, 2012 at 15:53
Kay,
I think it would be hard to refactor the C grammar to allow folding all statements following a
caseinto it until the nextcaseis encountered, without introducing significant ambiguity.February 4th, 2012 at 19:46
I need to take your word here.
Anyway, you brought to my attention an interesting aspect of the ANSI C grammar. Thanks for this!
February 8th, 2012 at 14:52
“Note that a case (and default, which is equivalent to case in this whole discussion) must be followed by one, and only one other statement”
Does it mean that C code in your example is incorrect? In fact most of the time there are more than one statement following case, now the quoted sentence confuses me.
February 8th, 2012 at 14:58
ugur,
No, it’s correct. The statement affects the parse tree generated by the grammar for the given code. Instead of all statements following the
casebeing its children, only the first one is. The other ones are siblings (children of the parent compound statement).