I've written here and in other places about the type/variable name ambiguity that arises when parsing C code. I've also hinted that in C++ it's much worse, without giving details. Well, today while reading an interesting report on GLR parsing, I came across a great example of this ambiguity in C++; one that should make every parser writer cringe. I've modified it a bit for simplicity.
Here's a snippet of C++ code:
int aa(int arg) {
    return arg;
}

class C {
    int foo(int bb) {
        return (aa)(bb);
    }
};
Nothing fancy. The odd part is (aa)(bb), which in this case calls the function aa with the argument bb. aa is taken as a name, and a name may be wrapped in parentheses; the C++ grammar allows it. I asked Clang to dump the AST that results from parsing this code. Here it is:
class C {
    class C;
    int foo(int bb) (CompoundStmt 0x3bac758 <a.cpp:6:21, line:8:5>
      (ReturnStmt 0x3bac738 <line:7:9, col:23>
        (CallExpr 0x3bac6f0 <col:16, col:23> 'int'
          (ImplicitCastExpr 0x3bac6d8 <col:16, col:19> 'int (*)(int)' <FunctionToPointerDecay>
            (ParenExpr 0x3bac668 <col:16, col:19> 'int (int)' lvalue
              (DeclRefExpr 0x3bac640 <col:17> 'int (int)' lvalue Function 0x3bac1d0 'aa' 'int (int)')))
          (ImplicitCastExpr 0x3bac720 <col:21> 'int' <LValueToRValue>
            (DeclRefExpr 0x3bac688 <col:21> 'int' lvalue ParmVar 0x3bac4f0 'bb' 'int')))))
As we can see, Clang parsed this as a function call, as expected.
Now let's modify the code a bit:
int aa(int arg) {
    return arg;
}

class C {
    int foo(int bb) {
        return (aa)(bb);
    }

    typedef int aa;
};
The only difference is the typedef added to the end of the class. Here's Clang's AST dump for the second snippet:
class C {
    class C;
    int foo(int bb) (CompoundStmt 0x2a79788 <a.cpp:6:21, line:8:5>
      (ReturnStmt 0x2a79768 <line:7:9, col:23>
        (CStyleCastExpr 0x2a79740 <col:16, col:23> 'aa':'int' <NoOp>
          (ImplicitCastExpr 0x2a79728 <col:20, col:23> 'int' <LValueToRValue>
            (ParenExpr 0x2a796f8 <col:20, col:23> 'int' lvalue
              (DeclRefExpr 0x2a796d0 <col:21> 'int' lvalue ParmVar 0x2a79500 'bb' 'int'))))))
    typedef int aa;
};
Clang now interprets (aa)(bb) as a cast from bb to type aa. Why?
Because in C++, names declared anywhere in a class are visible inside the bodies of all its member functions. Yes, that's right, even in methods defined before the declaration. The typedef defines aa as a type, which inside the class scope hides the external function aa. This affects parsing. The cruel thing here is that the parser only finds out that aa is a type after it has already gone past the foo method.
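To see the effect at run time rather than in an AST dump, here's a minimal sketch. It assumes one change from the snippets above: aa returns arg + 1 instead of arg, so that a call and a cast give different values. The struct names CallVersion and CastVersion are made up for this illustration.

```cpp
// Assumption for the demo: unlike the article's aa, this one returns
// arg + 1, so that calling it is distinguishable from a no-op cast.
int aa(int arg) {
    return arg + 1;
}

// No member named aa: inside foo, aa refers to the free function,
// so (aa)(bb) is a function call.
struct CallVersion {
    int foo(int bb) {
        return (aa)(bb);
    }
};

// The member typedef is declared after foo, yet it is still found when
// the body of foo is compiled, so (aa)(bb) is a C-style cast to int.
struct CastVersion {
    int foo(int bb) {
        return (aa)(bb);
    }

    typedef int aa;
};
```

With these definitions, CallVersion().foo(41) evaluates to 42 (the function was called), while CastVersion().foo(41) evaluates to 41 (bb was merely cast to int). The two foo bodies are textually identical.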
It's not unsolvable, of course, but it's another good example of what makes real-world programming languages hard to parse, and another case where a straightforward generated LALR(1) parser would completely bomb without significant "lexer hacking".
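As a rough illustration of one way around this (not necessarily what any particular compiler does), a parser can postpone interpreting method bodies until it has seen the whole class, and only then decide what each name means. The sketch below is a toy model of that idea. Member, Meaning and classify_class are hypothetical names invented for it, a method body is reduced to the single name N from an expression of the form (N)(arg), and enclosing scopes are ignored entirely.

```cpp
#include <set>
#include <string>
#include <vector>

// Toy model of a class member: either a typedef that introduces a type
// name, or a method whose body is the expression (name)(arg).
struct Member {
    bool is_typedef;
    std::string name;  // the typedef name, or the N in (N)(arg)
};

enum class Meaning { Call, Cast };

// Returns one Meaning per method, in declaration order.
std::vector<Meaning> classify_class(const std::vector<Member>& members) {
    // Pass 1: collect every type name declared anywhere in the class.
    std::set<std::string> type_names;
    for (const Member& m : members) {
        if (m.is_typedef) {
            type_names.insert(m.name);
        }
    }

    // Pass 2: only now interpret the deferred method bodies, using the
    // complete set of type names.
    std::vector<Meaning> result;
    for (const Member& m : members) {
        if (!m.is_typedef) {
            result.push_back(type_names.count(m.name) ? Meaning::Cast
                                                      : Meaning::Call);
        }
    }
    return result;
}
```

Feeding it a class with just the method yields Meaning::Call, and adding a trailing typedef of the same name flips the result to Meaning::Cast, mirroring the two AST dumps above. A real parser would have to buffer the body tokens and do proper scoped lookup, which is where the actual work lies.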