A few years ago I wrote about the context sensitivity of C's grammar - the nasty typedef-name problem that makes the grammar of C ambiguous and requires a hack in the lexer to enable YACC to parse the grammar correctly.

Since then, I have implemented this technique in pycparser, and it successfully parses real-world C99 code. However, it turns out that when mixed with the scope rules of C, the typedef-name problem rears its ugly head again, causing even more trouble.

The problem

The C standard states that names defined with typedef behave in a manner similar to other names in the language. In particular, they should obey the lexical scoping rules. The following is invalid:

typedef int AA;
int AA;

Since AA is first defined as a type and then redefined as a variable name in the same scope. This, however, is valid:

typedef int AA;

int main()
{
  int AA;           /* OK - redefining AA in internal scope */
  int BB = AA * 2;  /* OK - AA is an identifier in this scope! */
}

Because int AA redefines the name AA in the scope of the main function to be the name of an integer variable, not a type.

So this is a hole in the simple solution for the typedef-name problem. The parser now has to handle another context-sensitivity - taking scopes into account. Here's another example:

int main()
{
  typedef int AA;
}

void foo()
{
  AA aa;  /* should be an error - AA isn't a type in this scope */
}

Since AA is defined as a type in the internal scope of main, this definition is invisible in the internal scope of foo - so AA can't be used as a type there.

Complications

Unfortunately, simply keeping track of the scopes isn't enough. A careful examination reveals a more serious problem. Consider this code:

typedef int AA;

void foo()
{
  AA aa;       /* OK - define variable aa of type AA */
  float AA;    /* OK - define variable AA of type float */
}

Both lines are valid C, and yet how can the parser know that? Say our fancy symbol table is in place and the parser, when inside foo, knows it's in an internal scope and that the type AA is defined in the scope above it. Still, how does it distinguish between the two different uses of the name AA?

Here's another example. You're unlikely to see such code in real life, but the parser should still handle it:

typedef int AA;

void foo()
{
  AA AA;            /* OK - define variable AA of type AA */
  int BB = AA * 2;  /* OK - AA is just a variable name here */
}

The AA AA; line is pure evil, and yet it is valid. The lexer must somehow figure out that the first AA is a type, and the second AA is an identifier.

Just for kicks, here's another example:

typedef char AA;

void foo()
{
  int aa = sizeof(AA), AA, bb = sizeof(AA);
}

This is also perfectly valid, and on a typical 32-bit machine the value of aa is going to be 1, while the value of bb is going to be 4, since the declaration of AA as an int variable kicks in immediately, ready to be used in the same line after the comma.

Possible solutions

I do not intend to claim these are non-solvable problems. Obviously, C compilers exist and many parse these code samples correctly. One thing is clear though - this problem makes the C grammar nasty, and the pure-and-nice YACC grammar samples you find online for it are wrong [1].

After doing a lot of reading online, I found the following approaches to the "C parsing problem" most common:

Tweaking the lexer and YACC grammar

It is actually possible to correctly parse C with a YACC-generated parser, but it requires a considerable amount of tweaking both in the lexer and parser code. The exact changes required will take another article (or five) to describe, but in short, the recipe is:

  • The parser should keep track of scoping information as it parses. To make this possible, the rules for handling the scope-opening and closing chars ({, }, ( and )) have to be modified to maintain the scope level.
  • New types defined by typedef should be kept in a hierarchical symbol table and the parser & lexer should know their scope [2]. The lexer, in particular, now has to consult the hierarchical symbol table regarding a possible type (see the sketch after this list).
  • Many rules in the parser have to be modified to signal to the lexer with a special flag where usage of a name as a type is allowed.
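
To make this recipe a bit more concrete, here is a minimal sketch of the kind of scoped typedef table the lexer would consult. All the names in it (scope_push, classify_name and friends) are invented for illustration, and the "special flag" machinery from the last point is left out; in a real YACC/lex setup, scope_push and scope_pop would be called from the rules handling { and }, add_typedef_name from the rule that reduces a typedef declaration, and classify_name from the lexer's identifier rule.

/* A minimal sketch of a scoped typedef table - names are illustrative,
 * not taken from any real compiler. */
#include <stdio.h>
#include <string.h>

enum token_kind { IDENTIFIER, TYPE_NAME };

#define MAX_TYPEDEFS 256

struct typedef_entry {
    char name[64];
    int scope_level;              /* scope in which the typedef was declared */
};

static struct typedef_entry typedefs[MAX_TYPEDEFS];
static int num_typedefs = 0;
static int current_scope = 0;     /* 0 = file scope */

/* Called from the parser rules that handle '{' and '}'. */
static void scope_push(void) { current_scope++; }

static void scope_pop(void)
{
    /* Forget typedefs declared in the scope we're leaving. */
    while (num_typedefs > 0 &&
           typedefs[num_typedefs - 1].scope_level == current_scope)
        num_typedefs--;
    current_scope--;
}

/* Called from the parser rule that reduces a typedef declaration. */
static void add_typedef_name(const char *name)
{
    strncpy(typedefs[num_typedefs].name, name, sizeof typedefs[0].name - 1);
    typedefs[num_typedefs].scope_level = current_scope;
    num_typedefs++;
}

/* Called from the lexer's identifier rule: search from the most recent
 * entry backwards so that inner scopes win. A complete solution must also
 * record ordinary identifiers, so that "int AA;" in an inner scope hides
 * the typedef; that part is omitted here. */
static enum token_kind classify_name(const char *name)
{
    for (int i = num_typedefs - 1; i >= 0; i--)
        if (strcmp(typedefs[i].name, name) == 0)
            return TYPE_NAME;
    return IDENTIFIER;
}

int main(void)
{
    add_typedef_name("AA");                            /* typedef int AA; */
    printf("%d\n", classify_name("AA") == TYPE_NAME);  /* prints 1 */
    scope_push();
    printf("%d\n", classify_name("BB") == TYPE_NAME);  /* prints 0 */
    scope_pop();
    return 0;
}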

Don't underestimate the havoc these changes wreak in a YACC grammar. Rules have to be modified, split, duplicated and in general complicated, moving the grammar farther (in appearance) from the formal grammar of C.

GLR

Another way to handle ambiguity in YACC grammars is to use a GLR parser. When a GLR parser runs into a case where more than one parse is possible [3], it pursues all of the options. The result of such a parse is a DAG rather than a tree, and the subsequent steps of the compiler have to resolve the ambiguity.

This is a good thing, since the subsequent steps also have more information, and they are built on a much more flexible framework of tree processing. For example, while the parse tree (or DAG in case of a GLR parse) is being walked, a symbol table is usually being constructed anyway, so scope resolution is almost free. The other problems can also be resolved by applying heuristics while walking the tree.

Specifically, to approach the typedef-name problem, a GLR parser will simply use an identifier instead of a type name everywhere. In a regular YACC grammar, that would cause a lot of conflicts, but a GLR parser doesn't care about that. During the actual parse, in places where a conflict is detected, both paths in the tree will be recorded. After the parser has finished generating the ambiguous parse tree, another pass in the compiler will disambiguate it based on scope information and additional heuristics.
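
As an illustration of the "disambiguate later" step, here is a hedged sketch of how a later pass could collapse an ambiguous node for a statement like AA * x; - which a GLR parser records both as a declaration of a pointer x and as a multiplication expression - using the symbol table built during the tree walk. The node layout and names are invented, not taken from any real front end; Bison's %glr-parser mode, for instance, provides %merge hooks for this kind of resolution, though its mechanics differ from this sketch.

/* Hedged sketch of post-parse disambiguation - invented node layout. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

enum node_kind { NODE_DECLARATION, NODE_EXPRESSION, NODE_AMBIGUOUS };

struct ast_node {
    enum node_kind kind;
    const char *leading_name;        /* identifier that caused the ambiguity */
    struct ast_node *as_declaration; /* reading #1, recorded by the parser */
    struct ast_node *as_expression;  /* reading #2, recorded by the parser */
};

/* Stand-in for a lookup in the scoped symbol table that the tree walk
 * maintains anyway; here it "knows" only about one typedef. */
static bool is_typedef_in_scope(const char *name)
{
    return strcmp(name, "AA") == 0;
}

/* The disambiguation pass: collapse an ambiguous node to one reading. */
static struct ast_node *resolve(struct ast_node *node)
{
    if (node->kind != NODE_AMBIGUOUS)
        return node;
    return is_typedef_in_scope(node->leading_name)
               ? node->as_declaration
               : node->as_expression;
}

int main(void)
{
    struct ast_node decl = { NODE_DECLARATION, "AA", NULL, NULL };
    struct ast_node expr = { NODE_EXPRESSION,  "AA", NULL, NULL };
    struct ast_node amb  = { NODE_AMBIGUOUS,   "AA", &decl, &expr };

    /* With "typedef int AA;" visible, "AA * x;" resolves to a declaration. */
    printf("%s\n", resolve(&amb)->kind == NODE_DECLARATION
                       ? "declaration" : "expression");
    return 0;
}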

Hand-written parsers

The most popular option for parsing C, however, seems to be to leave LR and YACC behind and use a hand-written recursive-descent parser. This is the path now taken by GCC, as well as the new C/C++ compiler Clang. At least a few other compilers I've checked also go this route - LCC and tcc (the Tiny C Compiler), for example.

But why do this? Isn't YACC supposed to help us write parsers much quicker? Maybe it is, and for this reason it's probably the best approach to take when you need to quickly code a parser for some small language [4]. However, when you have a very complex parser to write, and this parser is at the core of your product, hand-coding it appears to be the preferred approach. I think Joel Spolsky put this well in his Defense of Not-Invented-Here Syndrome.

The biggest problem with YACC-based parsers, IMHO, is that you're tied to the LR(1) parsing power such grammars can provide, and are forced to live inside the YACC parsing framework. Frameworks are great as long as they give you what you need. But once your needs transcend the abilities of the framework, you often find yourself spending more time fighting with its idiosyncrasies than solving the problem itself.

A hand-written parser won't make the problems demonstrated in this article magically go away. Parsing of declarations will still be complex, and resolution between types and identifiers will still have to depend on a symbol table. But since the parser code is completely custom and isn't constrained to what YACC accepts, handling these issues is much less of a problem.
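
To illustrate why this is more natural in a hand-written parser, here is a small hedged sketch (invented names, not the code of GCC, Clang or any other compiler) of the decision such a parser typically makes at the start of each block item: peek at the first token, and if it's a type keyword or an identifier the symbol table knows as a typedef name in the current scope, parse a declaration; otherwise parse an expression statement.

/* Hedged sketch of the declaration-vs-expression decision in a
 * recursive-descent parser; the stubs stand in for the real token
 * stream and scoped symbol table. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

enum token_kind { TOK_IDENTIFIER, TOK_INT, TOK_FLOAT /* ... more kinds */ };

struct token {
    enum token_kind kind;
    const char *text;
};

/* Stub: in a real parser this queries the hierarchical symbol table. */
static bool name_is_typedef_in_scope(const char *name)
{
    return strcmp(name, "AA") == 0;   /* pretend "typedef int AA;" is visible */
}

/* A statement that starts with a type keyword, or with an identifier the
 * symbol table knows as a typedef name in the current scope, is parsed as
 * a declaration; anything else is parsed as an expression statement. */
static const char *classify_block_item(const struct token *first)
{
    bool is_declaration =
        first->kind == TOK_INT || first->kind == TOK_FLOAT ||
        (first->kind == TOK_IDENTIFIER &&
         name_is_typedef_in_scope(first->text));
    return is_declaration ? "declaration" : "expression statement";
}

int main(void)
{
    struct token t1 = { TOK_IDENTIFIER, "AA" };   /* e.g. "AA aa;"  */
    struct token t2 = { TOK_IDENTIFIER, "bb" };   /* e.g. "bb = 3;" */
    printf("%s\n", classify_block_item(&t1));     /* declaration */
    printf("%s\n", classify_block_item(&t2));     /* expression statement */
    return 0;
}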

What about C++?

The problems with C's grammar are magnified ten-fold in C++, which has even more ambiguous constructs. In fact, I'm not aware of a single industrial-strength compiler that uses YACC to fully parse modern C++ - please point me to one if I'm wrong. AFAIK most C++ parsers out there are hand-written recursive descent.

[P.S. I'd like to thank huku for the interesting email discussions that helped me better understand the possible approaches to solving the typedef problem within a YACC parser].

[1]Including the C syntax pages at the end of K&R2, which blissfully ignore this problem, assuming that the lexer somehow magically infers the distinctions correctly (which isn't possible unless the lexer does a lot of parsing on its own).
[2]Curiously, later stages of compilation definitely use a hierarchical symbol table, because of the same problems. To even know to which variable a name refers (during, say, type checking) the compiler has to know the current scope and all the scopes above it. The typedef-name problem pushes a similar symbol table into the parser.
[3]Such as a reduce-reduce conflict.
[4]Given, of course, that the grammar of this small language is reasonable, unlike C.