I've written here and in other places about the type/variable name ambiguity that arises when parsing C code. I've also hinted that in C++ it's much worse, without giving details. Well, today while reading an interesting report on GLR parsing, I came across a great example of this ambiguity in C++; one that should make every parser writer cringe. I've modified it a bit for simplicity.

Here's a snippet of C++ code:

int aa(int arg) {
    return arg;
}

class C {
    int foo(int bb) {
        return (aa)(bb);
    }
};

Nothing fancy. The weird thing here is (aa)(bb), which in this case calls the function aa with the argument bb. aa is taken as a name, and a name can be wrapped in parentheses: the C++ grammar treats a parenthesized expression as an ordinary primary expression. I asked Clang to dump the AST resulting from parsing this code. Here it is:

class C {
    class C;
    int foo(int bb) (CompoundStmt 0x3bac758 <a.cpp:6:21, line:8:5>
  (ReturnStmt 0x3bac738 <line:7:9, col:23>
    (CallExpr 0x3bac6f0 <col:16, col:23> 'int'
      (ImplicitCastExpr 0x3bac6d8 <col:16, col:19> 'int (*)(int)' <FunctionToPointerDecay>
        (ParenExpr 0x3bac668 <col:16, col:19> 'int (int)' lvalue
          (DeclRefExpr 0x3bac640 <col:17> 'int (int)' lvalue Function 0x3bac1d0 'aa' 'int (int)')))
      (ImplicitCastExpr 0x3bac720 <col:21> 'int' <LValueToRValue>
        (DeclRefExpr 0x3bac688 <col:21> 'int' lvalue ParmVar 0x3bac4f0 'bb' 'int')))))

As we can see, Clang parsed this as a function call (CallExpr), as expected.

Now let's modify the code a bit:

int aa(int arg) {
    return arg;
}

class C {
    int foo(int bb) {
        return (aa)(bb);
    }

    typedef int aa;
};

The only difference is the typedef added to the end of the class. Here's Clang's AST dump for the second snippet:

class C {
    class C;
    int foo(int bb) (CompoundStmt 0x2a79788 <a.cpp:6:21, line:8:5>
  (ReturnStmt 0x2a79768 <line:7:9, col:23>
    (CStyleCastExpr 0x2a79740 <col:16, col:23> 'aa':'int' <NoOp>
      (ImplicitCastExpr 0x2a79728 <col:20, col:23> 'int' <LValueToRValue>
        (ParenExpr 0x2a796f8 <col:20, col:23> 'int' lvalue
          (DeclRefExpr 0x2a796d0 <col:21> 'int' lvalue ParmVar 0x2a79500 'bb' 'int'))))))


    typedef int aa;
};

Clang now interprets (aa)(bb) as a C-style cast of bb to the type aa (CStyleCastExpr). Why?

Because in C++, names declared in a class are visible inside the bodies of all its member functions. Yes, that's right, even in methods defined before the declaration appears. The typedef defines aa as a type, which inside the class scope hides the function aa declared outside the class. This affects parsing. The cruel thing here is that the parser only finds out that aa is a type after it has already gone past the foo method.
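Here's a minimal sketch that makes the difference observable at runtime rather than only in the AST. It assumes a variant of aa that adds 100 to its argument, so that a call and a cast return different values; the struct names WithoutTypedef and WithTypedef and the demo function are hypothetical, picked just for this illustration.

```cpp
#include <cassert>

// Variant of the article's aa: adds 100 so that a call is distinguishable
// from a cast (an assumption made for this sketch only).
int aa(int arg) { return arg + 100; }

struct WithoutTypedef {
    // No member named aa, so lookup finds the function: this is a call.
    int foo(int bb) { return (aa)(bb); }
};

struct WithTypedef {
    // Member function bodies see the whole class, so aa names the
    // typedef declared below: this is a cast of bb to int.
    int foo(int bb) { return (aa)(bb); }

    typedef int aa;
};

// Hypothetical driver showing the two results side by side.
void demo() {
    assert(WithoutTypedef().foo(1) == 101);  // function call: 1 + 100
    assert(WithTypedef().foo(1) == 1);       // cast: value unchanged
}
```

The two foo bodies are token-for-token identical; only the declaration that follows them changes what they mean.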

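It's not unsolvable, of course, but it's another good example of what makes real-world programming languages hard to parse, and another case where a straightforward generated LALR(1) parser would completely bomb without significant "lexer hacking". Note that the usual form of the hack, where the parser records type names as it sees them and the lexer consults that table, is not enough on its own here: when (aa)(bb) is read, the typedef has not been seen yet.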
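One common way around this is to defer handling of member function bodies until the whole class has been seen. The sketch below is a toy model of that idea, not how any particular compiler implements it: a class is reduced to a list of members, each either a typedef or a method whose body is "(name)(arg)". The types Member and Kind and the functions classify, one_pass, deferred and demo are all hypothetical names invented for this illustration.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

enum class Kind { Call, Cast };

// Toy model of a class member: a typedef introducing `name`, or a method
// whose body consists of the expression "(name)(arg)".
struct Member {
    bool is_typedef;
    std::string name;
};

// The lexer-hack decision: "(name)(arg)" is a cast only if `name` is
// known to be a type at the moment we ask.
Kind classify(const std::string& name, const std::set<std::string>& types) {
    return types.count(name) ? Kind::Cast : Kind::Call;
}

// Single pass: classify each method body as soon as it is encountered.
std::vector<Kind> one_pass(const std::vector<Member>& members) {
    std::set<std::string> types;
    std::vector<Kind> out;
    for (const Member& m : members) {
        if (m.is_typedef) types.insert(m.name);
        else out.push_back(classify(m.name, types));
    }
    return out;
}

// Deferred: collect every member typedef first, then classify the bodies.
std::vector<Kind> deferred(const std::vector<Member>& members) {
    std::set<std::string> types;
    for (const Member& m : members)
        if (m.is_typedef) types.insert(m.name);
    std::vector<Kind> out;
    for (const Member& m : members)
        if (!m.is_typedef) out.push_back(classify(m.name, types));
    return out;
}

// The class from the second snippet: method first, typedef last.
void demo() {
    std::vector<Member> cls = {{false, "aa"}, {true, "aa"}};
    assert(one_pass(cls)[0] == Kind::Call);  // typedef not seen yet
    assert(deferred(cls)[0] == Kind::Cast);  // agrees with the C++ rules
}
```

The single-pass version gets the second snippet wrong precisely because the typedef comes after the method; the deferred version pays for correctness by having to hold on to method bodies until the closing brace of the class.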

