The type / variable name ambiguity in C++

June 28th, 2012 at 2:41 pm

I’ve written here and in other places about the type/variable name ambiguity that arises when parsing C code. I’ve also hinted that in C++ it’s much worse, without giving details. Well, today while reading an interesting report on GLR parsing, I came across a great example of this ambiguity in C++; one that should make every parser writer cringe. I’ve modified it a bit for simplicity.

Here’s a snippet of C++ code:

int aa(int arg) {
    return arg;
}

class C {
    int foo(int bb) {
        return (aa)(bb);
    }
};

Nothing fancy. The weird thing here is (aa)(bb), which in this case calls the function aa with the argument bb. aa is taken as a name, and names can be put inside parens – the C++ grammar allows it. I’ve asked Clang to dump the AST resulting from parsing this code. Here it is:

class C {
    class C;
    int foo(int bb) (CompoundStmt 0x3bac758 <a.cpp:6:21, line:8:5>
  (ReturnStmt 0x3bac738 <line:7:9, col:23>
    (CallExpr 0x3bac6f0 <col:16, col:23> 'int'
      (ImplicitCastExpr 0x3bac6d8 <col:16, col:19> 'int (*)(int)' <FunctionToPointerDecay>
        (ParenExpr 0x3bac668 <col:16, col:19> 'int (int)' lvalue
          (DeclRefExpr 0x3bac640 <col:17> 'int (int)' lvalue Function 0x3bac1d0 'aa' 'int (int)')))
      (ImplicitCastExpr 0x3bac720 <col:21> 'int' <LValueToRValue>
        (DeclRefExpr 0x3bac688 <col:21> 'int' lvalue ParmVar 0x3bac4f0 'bb' 'int')))))

As we can see, Clang parsed this as a function call, as expected: a CallExpr whose callee is the parenthesized name aa, decayed to a function pointer.
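To make the "names can be put inside parens" point concrete, here is a minimal sketch of several ways to write the same call. The helper names call_all_forms and check_call_forms and the pointer fp are mine, added for illustration only; they are not part of the original snippet.

```cpp
#include <cassert>

int aa(int arg) {
    return arg;
}

// Hypothetical helper: calls aa in several equivalent ways and sums the results.
int call_all_forms(int bb) {
    int (*fp)(int) = aa;   // function-to-pointer decay, as seen in the AST dump
    int sum = 0;
    sum += aa(bb);         // plain call
    sum += (aa)(bb);       // parenthesized name, the form used in the post
    sum += (*aa)(bb);      // dereferencing a function yields the function again
    sum += (fp)(bb);       // parenthesized function pointer
    sum += (*fp)(bb);      // explicit dereference of the pointer
    return sum;
}

// Usage check: five calls, each returning bb.
void check_call_forms() {
    assert(call_all_forms(3) == 15);
}
```

All five forms are accepted because the callee of a call is just an expression, and any expression may be wrapped in parentheses.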

Now let’s modify the code a bit:

int aa(int arg) {
    return arg;
}

class C {
    int foo(int bb) {
        return (aa)(bb);
    }

    typedef int aa;
};

The only difference is the typedef added to the end of the class. Here’s Clang’s AST dump for the second snippet:

class C {
    class C;
    int foo(int bb) (CompoundStmt 0x2a79788 <a.cpp:6:21, line:8:5>
  (ReturnStmt 0x2a79768 <line:7:9, col:23>
    (CStyleCastExpr 0x2a79740 <col:16, col:23> 'aa':'int' <NoOp>
      (ImplicitCastExpr 0x2a79728 <col:20, col:23> 'int' <LValueToRValue>
        (ParenExpr 0x2a796f8 <col:20, col:23> 'int' lvalue
          (DeclRefExpr 0x2a796d0 <col:21> 'int' lvalue ParmVar 0x2a79500 'bb' 'int'))))))


    typedef int aa;
};

Clang now interprets (aa)(bb) as a cast from bb to type aa. Why?

Because in C++, the members declared in a class are visible inside the bodies of all its member functions. Yes, that’s right, even in methods defined before them. The typedef defines aa as a type, which inside the class scope hides the aa function declared outside the class. This affects parsing. The cruel thing here is that the parser only finds out that aa is a type after it has already gone past the foo method.
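The difference is observable at run time, not only in the AST. The sketch below is a variation on the two snippets, under one assumption that is mine and not in the original code: aa returns arg + 1 rather than arg, so that a call and a cast give different results. The struct names and check_meanings are likewise made up for this illustration.

```cpp
#include <cassert>

int aa(int arg) {
    return arg + 1;   // changed from the post so a call is distinguishable from a cast
}

struct WithoutTypedef {
    int foo(int bb) {
        return (aa)(bb);   // no member named aa: this calls ::aa
    }
};

struct WithTypedef {
    int foo(int bb) {
        return (aa)(bb);   // aa names the member typedef below: this is a cast to int
    }

    typedef int aa;
};

// Usage check for both interpretations of the same token sequence.
void check_meanings() {
    WithoutTypedef call_case;
    WithTypedef cast_case;
    assert(call_case.foo(41) == 42);   // function call: 41 + 1
    assert(cast_case.foo(41) == 41);   // cast: value unchanged
}
```

The two foo bodies are identical token for token; only the declaration that follows one of them changes what the compiler does with it.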

It’s not unsolvable, of course, but it’s another good example of what makes real-world programming languages hard to parse, and another case where a straightforward generated LALR(1) parser would completely bomb without significant "lexer hacking".
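To illustrate why a single left-to-right pass gets this wrong, here is a toy model of the symbol-table feedback that "lexer hacking" relies on. Everything in it (NameKind, ScopeStack, the two classify_* functions) is hypothetical and invented for this sketch; it is not how Clang or any other particular compiler is implemented, only one way to picture the ordering problem and the idea of setting a method body aside until the class is complete.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// What the parser needs to know about an identifier to pick a grammar rule.
enum class NameKind { Unknown, Value, Type };

// Hypothetical scoped symbol table that a lexer or parser could consult.
class ScopeStack {
public:
    ScopeStack() { push(); }   // start with the global scope
    void push() { scopes_.emplace_back(); }
    void pop() { scopes_.pop_back(); }
    void declare(const std::string& name, NameKind kind) {
        scopes_.back()[name] = kind;
    }
    // Innermost declaration wins, so a class-level typedef hides a global function.
    NameKind classify(const std::string& name) const {
        for (auto scope = scopes_.rbegin(); scope != scopes_.rend(); ++scope) {
            auto found = scope->find(name);
            if (found != scope->end()) return found->second;
        }
        return NameKind::Unknown;
    }
private:
    std::vector<std::unordered_map<std::string, NameKind>> scopes_;
};

// Single pass: "(aa)(bb)" is classified the moment it is reached,
// before "typedef int aa;" has been seen.
NameKind classify_in_one_pass() {
    ScopeStack scopes;
    scopes.declare("aa", NameKind::Value);           // int aa(int arg) { ... }
    scopes.push();                                   // class C {
    NameKind seen_in_body = scopes.classify("aa");   //   body of foo
    scopes.declare("aa", NameKind::Type);            //   typedef int aa;
    scopes.pop();                                    // };
    return seen_in_body;
}

// Deferred: the body of foo is set aside and only classified
// once every member of the class has been declared.
NameKind classify_with_deferred_body() {
    ScopeStack scopes;
    scopes.declare("aa", NameKind::Value);           // int aa(int arg) { ... }
    scopes.push();                                   // class C {
    //   body of foo is skipped here and saved for later
    scopes.declare("aa", NameKind::Type);            //   typedef int aa;
    NameKind seen_in_body = scopes.classify("aa");   //   saved body of foo, processed now
    scopes.pop();                                    // };
    return seen_in_body;
}
```

In this model classify_in_one_pass() yields NameKind::Value, which is the wrong answer for the second snippet, while classify_with_deferred_body() yields NameKind::Type.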


5 Responses to “The type / variable name ambiguity in C++”

  1. Qb Says:

    An immediate question would be: why does the C++ grammar allow (aa) in the first case? What’s the point of allowing a function name inside parens?
    Well, OK, since everything is an expression, (aa) is an expression of function type, which is then used for a function call.
    The same would go for

    struct C { void operator () (int arg) {} };
    C aa;
    (aa)(bb);

  2. eliben Says:

    Qb,

    You can call functions through pointers, so everything evaluating to a function pointer should be allowed, and any expression can be parenthesized, so it makes sense.

  3. Elazar Leibovich Says:

    Are you implying that these complexities are necessary for a “real-world language”? I really think that these are historical accidents of a poorly designed programming language that became popular, and that these complexities could be avoided (see Pascal), but I’ll be glad to be proved wrong.

  4. Shoufu Says:

    I come from C and am not sure about C++, but I think that although the code is not incorrect, the ambiguity could be avoided with good code style:
    for a function call (variable): return aa (bb);
    for a type cast: (aa) bb;

    BTW, it should not be difficult to parse; the only issue is the one called ‘scope’.

  5. eliben Says:

    Shoufu,

    I think you’re missing the point. True, this isn’t good coding style. But it’s valid C/C++ nonetheless, which means the compiler has to be able to handle it.
