My previous articles on the context sensitivity and ambiguity of the C/C++ grammar (one, two, three) can probably make me sound pessimistic about the prospect of correctly parsing C/C++, which couldn't be farther from the truth. My gripe is not with the grammar itself (although I admit it's needlessly complex), it's with the inability of Yacc-generated LALR(1) parsers to parse it without considerable hacks. As I've mentioned numerous times before, industrial-strength compilers for C/C++ exist after all, so they do manage to somehow parse these languages.

One of the newest, and in my eyes the most exciting of C/C++ compilers is Clang. Originally developed by Apple as a front-end to LLVM, it's been a vibrant open-source project for the past couple of years with participation from many companies and individuals (although Apple remains the main driving force in the community). Clang, similarly to LLVM, features a modular library-based design and a very clean C++ code-base. Clang's parser is hand-written, based on a standard recursive-descent parsing algorithm.

In this post I want to explain how Clang manages to overcome the ambiguities I mentioned in the previous articles.

No lexer hack

There is no "lexer hack" in Clang. Information flows in a single direction - from the lexer to the parser, not back. How is this managed?

The thing is that the Clang lexer doesn't distinguish between user-defined types and other identifiers. All are marked with the identifier token.

For this code:

typedef int mytype;
mytype bb;

The Clang parser encounters the following tokens (-dump-tokens):

typedef 'typedef'   [StartOfLine]   Loc=<z.c:1:1>
int 'int'           [LeadingSpace]  Loc=<z.c:1:9>
identifier 'mytype' [LeadingSpace]  Loc=<z.c:1:13>
semi ';'                            Loc=<z.c:1:19>
identifier 'mytype' [StartOfLine]   Loc=<z.c:2:1>
identifier 'bb'     [LeadingSpace]  Loc=<z.c:2:8>
semi ';'                            Loc=<z.c:2:10>
eof ''                              Loc=<z.c:4:1>

Note how mytype is always reported as an identifier, both before and after Clang figures out it's actually a user-defined type.

Figuring out what's a type

So if the Clang lexer always reports mytype as an identifier, how does the parser figure out when it is actually a type? By keeping a symbol table.

Well, actually it's not the parser that keeps the symbol table, it's Sema. Sema is the Clang module responsible for semantic analysis and AST construction. It gets invoked from the parser through a generic "actions" interface, which in theory could serve a different client. Although conceptually the parser and Sema are coupled, the actions interface provides a clean separation in the code. The parser is responsible for driving the parsing process, and Sema is responsible for handling semantic information. In this particular case, the symbol table is semantic information, so it's handled by Sema.

To follow this process through, we'll start in Parser::ParseDeclarationSpecifiers [1]. In the C/C++ grammar, type names are part of the "specifiers" in a declaration (that also include things like extern or inline), and following the "recursive-descent protocol", Clang will usually feature a parsing method per grammar rule. When this method encounters an identifier (tok::identifier), it asks Sema whether it's actually a type by calling Actions.getTypeName [2].

Sema::getTypeName calls Sema::LookupName to do the actual name lookup. For C, name lookup rules are relatively simple - you just climb the lexical scope stack the code belongs to, trying to find a scope that defines the name as a type. I've mentioned before that all names in C (including type names) obey lexical scoping rules. With this mechanism, Clang implements the required nested symbol table. Note that this symbol table is queried by Clang in places where a type is actually expected and allowed, not only in declarations. For example, it's also done to disambiguate function calls from casts in some cases.

How does a type actually get into this table, though?

When the parser is done parsing a typedef (and any declarator, for that matter), it calls Sema::ActOnDeclarator. When the latter notices a new typedef and makes sure everything about it is kosher (e.g. it does not re-define a name in the same scope), it adds the new name to the symbol table at the current scope.

In Clang's code this whole process looks very clean and intuitive, but in a generated LALR(1) parser it would be utterly impossible, because leaving out the special token for type names and merging it with identifier would create a tons of unresolvable reduce-reduce conflicts in the grammar. This is why Yacc-based parsers require a lexer hack to handle this issue.

Class-wide declarations in C++

In the previous post I mentioned how C++ makes this type lookup problem much more difficult by forcing declarations inside a class to be visible throughout the class, even in code that appears before them. Here's a short reminder:

int aa(int arg) {
    return arg;
}

class C {
    int foo(int bb) {
        return (aa)(bb);
    }

    typedef int aa;
};

In this code, even though the typedef appears after foo, the parser must figure out that (aa)(bb) is a cast of bb to type aa, and not the function call aa(bb).

We've seen how Clang can manage to figure out that aa is a type. However, when it parses foo it hasn't even seen the typedef yet, so how does that work?

Delayed parsing of inline method bodies

To solve the problem described above, Clang employs a clever technique. When parsing an inline member function declaration/definition, it does full parsing and semantic analysis of the declaration, leaving the definition for later.

Specifically, the body of an inline method definition is lexed and the tokens are kept in a special buffer for later (this is done by Parser::ParseCXXInlineMethodDef). Once the parser has finished parsing the class, it calls Parser::ParseLexedMethodDefs that does the actual parsing and semantic analysis of the saved method bodies. At this point, all the types declared inside the class are available, so the parser can correctly disambiguate wherever required.

Annotation tokens

Although the above is enough to understand how Clang approaches the problem, I want to mention another trick it uses to make parsing more efficient in some cases.

The Sema::getTypeName method mentioned earlier can be costly. It performs a lookup in a set of nested scopes, which may be expensive if the scopes are deeply nested and a name is not actually a type (which is probably most often the case). It's alright (and inevitable!) to do this lookup once, but Clang would like to avoid repeating it for the same token when it backtracks trying to parse a statement in a different way.

A word on what "backtracks" means in this context. Recursive descent parsers are naturally (by their very structure) backtracking. That is, they may try a number of different ways to parse a single grammatical production (be that a statement, an expression, a declaration, or whatever), before finding an approach that succeeds. In this process, the same token may need to be queried more than once.

To avoid this, Clang has special "annotation tokens" it inserts into the token stream. The mechanism is used for other things as well, but in our case we're interested in the tok::annot_typename token. What happens is that the first time the parser encounters a tok::identifier and figures out it's a type, this token gets replaced by tok::annot_typename. The next time the parser encounters this token, it won't have to lookup whether it's a type once again, because it's no longer a generic tok::identifier [3].

Disclaimer and conclusion

It's important to keep in mind that the cases examined in this post do not represent the full complexity of the C++ grammar. In C++, constructs like qualified names (foo::bar::baz) and templates complicate matters considerably. However, I just wanted to focus on the cases I specifically discussed in previous posts, explaining how Clang addresses them.

To conclude, we've seen how Clang's recursive descent parser manages some of the ambiguities of the C/C++ grammar. For a task that complex, it's inevitable for the code to become non-trivial [4]. That said, Clang has actually managed to keep its code-base relatively clean and logically structured, while at the same time sticking to its aggressive performance goals. Someone with a general understanding of how front-ends work shouldn't require more than a few hours of immersion in the Clang code-base to be able to answer questions about "how does it do that".

[1]As a rule, all Parser code lives in lib/Parse in the Clang source tree. Sema code lives in lib/Sema.
[2]Here and later I'll skip a lot of details and variations, focusing only on the path I want to use in the example.
[3]It's very important to note that only this instance of the token in the token stream is replaced. The next instance may have already become a type (or we may have even changed the scope), so it wouldn't be semantically correct to reason about it.
[4]That Clang parses Objective-C and various extensions like CUDA or OpenCL in the same code-base doesn't help in this respect.

Comments

comments powered by Disqus