The context sensitivity of C’s grammar, revisited

May 2nd, 2011 at 5:55 am

A few years ago I wrote about the context sensitivity of the grammar of C – the nasty typedef-name problem that makes the grammar of C ambiguous and requires a hack in the lexer to allow YACC to parse it correctly.

Since then, I have implemented this technique in pycparser, and it successfully parses real-world C99 code. However, it turns out that when mixed with the scope rules of C, the typedef-name problem rears its ugly head again, causing even more trouble.

The problem

The C standard states that names defined with typedef behave in a manner similar to other names in the language. In particular, they should obey the lexical scoping rules. The following is invalid:

typedef int AA;
int AA;

This is invalid because AA is first defined as a type and then redefined as a variable name in the same scope. This, however, is valid:

typedef int AA;

int main()
{
  int AA;           /* OK - redefining AA in internal scope */
  int BB = AA * 2;  /* OK - AA is an identifier in this scope! */
}

This works because int AA redefines the name AA, in the scope of the main function, to be the name of an integer variable, not a type.

So this is a hole in the simple solution for the typedef-name problem. The parser now has to handle another context-sensitivity – taking scopes into account. Here’s another example:

int main()
{
  typedef int AA;
}

void foo()
{
  AA aa;  /* should be an error - AA isn't a type in this scope */
}

Since AA is defined as a type in the internal scope of main, this definition is invisible in the internal scope of foo – so AA can’t be used as a type there.
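The scope rules above suggest the shape of the machinery a parser needs: a symbol table in which each scope chains to its enclosing one. Here is a minimal Python sketch of the idea – names like define_type and is_type are made up for illustration, and this is not pycparser's actual code:

```python
class Scope:
    """One lexical scope, chained to its enclosing scope."""
    def __init__(self, parent=None):
        self.parent = parent
        self.typedef_names = set()

    def define_type(self, name):
        self.typedef_names.add(name)

    def is_type(self, name):
        # Walk outward through the enclosing scopes, innermost first.
        scope = self
        while scope is not None:
            if name in scope.typedef_names:
                return True
            scope = scope.parent
        return False

# Mirror the example above: the typedef lives only in main's scope.
file_scope = Scope()
main_scope = Scope(parent=file_scope)
main_scope.define_type('AA')
foo_scope = Scope(parent=file_scope)

print(main_scope.is_type('AA'))  # True
print(foo_scope.is_type('AA'))   # False - the typedef isn't visible here
```

This simple version only answers "is this name a visible typedef?"; as the next section shows, that alone is not enough.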

Complications

Unfortunately, simply keeping track of the scopes isn’t enough. A careful examination reveals a more serious problem. Consider this code:

typedef int AA;

void foo()
{
  AA aa;       /* OK - define variable aa of type AA */
  float AA;    /* OK - define variable AA of type float */
}

Both lines are valid C, and yet, how can the parser know it? Say our fancy symbol table is in place and the parser, when inside foo, knows it’s an internal scope and that the type AA is defined in the scope above it. Still, how does it distinguish between the two different references to AA?

Here’s another example. You’re unlikely to see such code in real life, but the parser should still handle it:

typedef int AA;

void foo()
{
  AA AA;            /* OK - define variable AA of type AA */
  int BB = AA * 2;  /* OK - AA is just a variable name here */
}

The AA AA; line is pure evil, and yet it is valid. The lexer must somehow figure out that the first AA is a type, and the second AA is an identifier.

Just for kicks, here’s another example:

typedef char AA;

void foo()
{
  int aa = sizeof(AA), AA, bb = sizeof(AA);
}

This is also perfectly valid, and on a typical 32-bit machine the value of aa is going to be 1, while the value of bb is going to be 4, since the declaration of AA as an int variable kicks in immediately, ready to be used later in the same declaration, after the comma.
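The "kicks in immediately" behavior can be simulated outside of C. This hypothetical Python sketch (run_declaration and its tuple encoding are invented for illustration) processes declarators left to right, installing each name into the scope before looking at the next initializer; sizes assume the 32-bit machine mentioned above:

```python
SIZEOF = {'char': 1, 'int': 4}  # typical 32-bit sizes

def run_declaration(decl_type, declarators, scope):
    """Process a declaration like `int aa = sizeof(AA), AA, bb = sizeof(AA);`
    left to right. `scope` maps names to C type names; each declared name
    takes effect immediately, before the next declarator is examined.
    `declarators` is a list of (name, sizeof_argument_or_None) pairs.
    Returns the computed initializer values."""
    values = {}
    for name, sizeof_arg in declarators:
        if sizeof_arg is not None:
            # Resolve the sizeof argument against the *current* scope.
            values[name] = SIZEOF[scope[sizeof_arg]]
        # Install the name right away - it shadows any typedef.
        scope[name] = decl_type
    return values

scope = {'AA': 'char'}  # typedef char AA;
vals = run_declaration('int', [('aa', 'AA'), ('AA', None), ('bb', 'AA')], scope)
print(vals)  # {'aa': 1, 'bb': 4}
```

The first sizeof(AA) still sees the char typedef; by the time the second one is evaluated, AA has already been rebound to an int variable.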

Possible solutions

I do not intend to claim these are non-solvable problems. Obviously, C compilers exist and many parse these code samples correctly. One thing is clear though – this problem makes the C grammar nasty, and the pure-and-nice YACC grammar samples you find online for it are wrong [1].

After doing a lot of reading online, I found the following approaches to the "C parsing problem" most common:

Tweaking the lexer and YACC grammar

It is actually possible to correctly parse C with a YACC-generated parser, but it requires a considerable amount of tweaking both in the lexer and parser code. The exact changes required will take another article (or five) to describe, but in short, the recipe is:

  • The parser should keep scoping information during the parse. To make this possible, the rules handling the scope-delimiting characters ({, }, ( and )) have to be modified to maintain a scope level.
  • New types defined by typedef should be kept in a hierarchical symbol table and the parser & lexer should know their scope [2]. The lexer, in particular, now has to consult the hierarchical symbol table regarding a possible type.
  • Many rules in the parser have to be modified to signal to the lexer with a special flag where usage of a name as a type is allowed.

Don’t underestimate the havoc these changes wreak in a YACC grammar. Rules have to be modified, split, duplicated and in general complicated, moving the grammar farther (in appearance) from the formal grammar of C.
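To make the recipe above concrete, here is a rough Python sketch of the lexer-side decision; the function and its arguments (symtab_stack, types_allowed) are illustrative names, not pycparser's actual API:

```python
def classify_identifier(name, symtab_stack, types_allowed):
    """Decide the token type for an identifier the lexer just matched.

    symtab_stack: list of per-scope dicts (innermost scope last), each
    mapping a name to 'type' (typedef) or 'ident' (variable/function).
    types_allowed: a flag the parser sets in rules where a typedef name
    may legally appear (e.g. inside declaration specifiers).
    """
    for scope in reversed(symtab_stack):
        if name in scope:
            # The innermost definition wins - this is what makes
            # `float AA;` shadow an outer `typedef int AA;`.
            if scope[name] == 'type' and types_allowed:
                return 'TYPEID'
            return 'ID'
    return 'ID'  # unknown names are plain identifiers

# typedef int AA; at file scope...
symtab = [{'AA': 'type'}]
print(classify_identifier('AA', symtab, types_allowed=True))   # TYPEID

# ...then, inside foo, `float AA;` shadows the typedef.
symtab.append({'AA': 'ident'})
print(classify_identifier('AA', symtab, types_allowed=True))   # ID
```

Note how both pieces of context from the recipe feed into one decision: the hierarchical symbol table handles shadowing, and the parser-controlled flag handles positions where a type name cannot occur.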

GLR

Another way to handle ambiguity in YACC grammars is by using a GLR parser. When a GLR parser runs into a case where more than one parse is possible [3], it explores all the options. The result of such a parse is a DAG rather than a tree, and the subsequent steps of the compiler have to resolve the ambiguity.

This is a good thing, since the subsequent steps also have more information, and they are built on a much more flexible framework of tree processing. For example, while the parse tree (or DAG in case of a GLR parse) is being walked, a symbol table is usually being constructed anyway, so scope resolution is almost free. The other problems can also be resolved by applying heuristics while walking the tree.

Specifically, to approach the typedef-name problem, a GLR parser will simply use an identifier instead of a type name everywhere. In a regular YACC grammar, that would cause a lot of conflicts, but a GLR parser doesn’t care about that. During the actual parse, in places where a conflict is detected, both paths in the tree will be recorded. After the parser has finished generating the ambiguous parse tree, another pass in the compiler will disambiguate it based on scope information and additional heuristics.
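Such a disambiguation pass could look roughly like the following Python sketch; the tuple-based node representation is made up purely for illustration:

```python
def resolve(node, typedef_names):
    """Resolve ambiguity nodes in a parse DAG. Each 'ambiguous' node
    carries two candidate subtrees: one reading the name as a type,
    one reading it as an identifier. Pick based on scope information."""
    if node[0] == 'ambiguous':
        _, name, as_type, as_ident = node
        chosen = as_type if name in typedef_names else as_ident
        return resolve(chosen, typedef_names)
    tag, *children = node
    return (tag, *[resolve(c, typedef_names) if isinstance(c, tuple) else c
                   for c in children])

# "AA * x;" - a declaration of pointer x if AA is a type,
# otherwise a multiplication expression statement.
dag = ('ambiguous', 'AA',
       ('decl', 'AA', ('ptr', 'x')),
       ('expr', ('mul', 'AA', 'x')))

print(resolve(dag, typedef_names={'AA'}))  # ('decl', 'AA', ('ptr', 'x'))
print(resolve(dag, typedef_names=set()))   # ('expr', ('mul', 'AA', 'x'))
```

The point is that by this stage the symbol table already exists, so the choice between the subtrees is an ordinary lookup rather than a grammar contortion.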

Hand-written parsers

The most popular option for parsing C, however, seems to be leaving LR and YACC behind and using a hand-written recursive descent parser. This is the path now taken by GCC, as well as the new C/C++ compiler Clang. At least a few other compilers I’ve checked also go this route – LCC and tcc (the Tiny C Compiler), for example.

But why do this? Isn’t YACC supposed to help us write parsers much quicker? Maybe it is, and for this reason it’s probably the best approach to take when you need to quickly code a parser for some small language [4]. However, when you have a very complex parser to write, and this parser is at the core of your product, hand-coding it appears to be the preferred approach. I think Joel Spolsky put this well in his Defense of Not-Invented-Here Syndrome.

The biggest problem with YACC-based parsers, IMHO, is that you’re tied to the LR(1) parsing power such grammars can provide, and are forced to live inside the YACC parsing framework. Frameworks are great as long as they give you what you need. But once your needs transcend the abilities of the framework, you often find yourself spending more time fighting with its idiosyncrasies than solving the problem itself.

A hand-written parser won’t make the problems demonstrated in this article magically go away. Parsing of declarations will still be complex, and resolution between types and identifiers will still have to depend on a symbol table. But since the parser code is completely custom and doesn’t have to be constrained to what YACC accepts, handling these issues is much less of an ordeal.
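For instance, the declaration-vs-expression decision that torments a YACC grammar becomes one ordinary lookup at the dispatch point of a hand-written parser. A hypothetical Python sketch (statement_kind and the token lists are invented for illustration, not any real compiler's code):

```python
KEYWORD_TYPES = {'int', 'char', 'float', 'void'}

def is_type_name(tok, symtab):
    """symtab: list of per-scope {name: kind} dicts, innermost last.
    The innermost definition decides, so shadowing works naturally."""
    for scope in reversed(symtab):
        if tok in scope:
            return scope[tok] == 'type'
    return False

def statement_kind(tokens, symtab):
    """The dispatch a hand-written statement parser would perform:
    a type name up front means a declaration follows (sketch only)."""
    first = tokens[0]
    if first in KEYWORD_TYPES or is_type_name(first, symtab):
        return 'declaration'
    return 'expression-statement'

symtab = [{'AA': 'type'}]            # typedef int AA;
print(statement_kind(['AA', 'aa', ';'], symtab))      # declaration

symtab.append({'AA': 'ident'})       # int AA; in an inner scope
print(statement_kind(['AA', '*', 'x', ';'], symtab))  # expression-statement
```

No grammar rules need to be split or duplicated to make this check happen; it is just a function call at the point where the parser needs the answer.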

What about C++?

The problems with C’s grammar are magnified ten-fold in C++, which has even more ambiguous constructs. In fact, I’m not aware of a single industrial-strength compiler that uses YACC to fully parse modern C++ – please point me to one if I’m wrong. AFAIK most C++ parsers out there are hand-written recursive descent.

[P.S. I'd like to thank huku for the interesting email discussions that helped me better understand the possible approaches to solving the typedef problem within a YACC parser].

[1] Including the C syntax pages at the end of K&R2, which blissfully ignore this problem, assuming that the lexer somehow magically infers the distinctions correctly (which isn’t possible unless the lexer does a lot of parsing on its own).
[2] Curiously, later stages of compilation definitely use a hierarchical symbol table, because of the same problems. To even know to which variable a name refers (during, say, type checking) the compiler has to know the current scope and all the scopes above it. The typedef-name problem pushes a similar symbol table into the parser.
[3] Such as a reduce-reduce conflict.
[4] Given, of course, that the grammar of this small language is reasonable, unlike C.

Related posts:

  1. The context sensitivity of C’s grammar
  2. Generating random sentences from a context free grammar
  3. How Clang handles the type / variable name ambiguity of C/C++
  4. The type / variable name ambiguity in C++
  5. a VHDL parser in Perl

22 Responses to “The context sensitivity of C’s grammar, revisited”

  1. Franklin Says:

    I don’t really see a problem with that. The context is pretty clear.
    You must not parse C code with a bunch of regex. YACC should know how to deal with scopes

    in python you can have str pointing to the number 5 and, at some other point, back to the string class

    This also can be an extension of the SFINAE rule… I guess

  2. Seo Sanghyeon Says:

    As far as I know Elsa is the only thing that even tries to parse C++ with a generated parser.
    http://scottmcpeak.com/elkhound/

  3. eliben Says:

    Franklin

    Who said anything about regex? Besides, your comment about Python shows you have a very basic misunderstanding of the various stages of compilation. The reference pointing to different things in Python has nothing to do with parsing. In fact, in Python a type doesn’t affect the parsing flow – it’s just an identifier like any other identifier.

    Seo

    Elkhound is a GLR parser generator, not simple LR.

  4. Seo Sanghyeon Says:

    Eli: I know: it is not YACC. But it is not hand-written recursive descent either.

  5. Sunnil Gupta Says:

    Why do you not put the downloads here for people to get your softs? It is very troublesome otherwise.

  6. Richard Moore Says:

    I think parsers generated with a more powerful tool such as ANTLR could handle this using a semantic predicate.

  7. eliben Says:

    Sunnil

    Sorry, but I have absolutely no idea what you’re talking about.

    Richard

    Are you familiar with industrial-strength parsers/compilers for C++ based on ANTLR?

  8. Jon Says:

    Eli, thanks for the write-up, very interesting, and I’m glad to see it explained so clearly. The troubles of parsing are definitely one important cause of why current software development tools are so weak.

    I am very familiar with C++’s difficulties, but not that much with C’s. Is that the only context-dependency in C’s grammar, or is there anything else?

  9. eliben Says:

    Jon

    Thanks for the feedback. AFAIK this is the only context-dependency in C’s grammar. C++ has more problems.

  10. Arseny Kapoulkine Says:

    Why should the lexer know about the distinction between typedefname and identifiers?

    I never used YACC, but if I was going to write a C parser, the lexemes would be sets of special characters (i.e. braces, operators), strings, numbers, keywords and identifiers – no distinction between different types of identifiers.

    The parser should handle the rest.

  11. hcb Says:

    Mercurium (Nanos group) uses a traditional bison grammar for C++ parsing, though in GLR mode and with a little twist:

    http://nanos.ac.upc.edu/content/why-are-we-using-custom-version-bison

    This approach is much cleaner than lexical tie ins IMHO.

  12. eliben Says:

    Arseny

    Make sure you read the previous article on this topic – http://eli.thegreenplace.net/2007/11/24/the-context-sensitivity-of-cs-grammar/ – it answers your question.

    hcb

    Interesting, thanks.

  13. Arseny Kapoulkine Says:

    I see; a simple lexer + recursive descent parser sounds easier than a bunch of lex/yacc hacks. Though maybe I’ve gotten too used to recursive descent parsers.

  14. Jerry Curry Says:

    Very enlightening article on how the edge cases of the scoping rules are implemented. Thanks. I enjoyed it.

    But I can’t pass on the opportunity to comment on coding style.

    If you’re in the habit of writing code where the ambiguities become interesting, you’re probably writing code that will be unnecessarily difficult for other people to support.

    E.g. I know that C/C++ follows the order of mathematical operations, so the parentheses aren’t necessary in the following:
    crossProd = (a1 * b2) - (a2 * b1);
    But if I’m writing code that needs to be supported by people who are not mathematically-minded, I include the parentheses. This is not for the compiler, but for my team members.

    If you’re writing code that in any way resembles some of the latter examples above, then even if *you* have it right when you check it in, you’re just asking some maintenance programmer to introduce bugs into the next version.

  15. sovande Says:

    AA AA;
    int BB = AA * 2;

    > The AA AA; line is pure evil, and yet it is valid. The lexer must somehow figure out that the first AA is a type, and the second AA is an identifier.

    I think you confuse the separation and domain between a lexer (regex) and a parser (context-free grammar). The lexer returns the AA token and does not care what AA _is_. For the lexer AA is a token. Yacc on the other hand will know what the token means based on the context it is used in, i.e. the grammar. Hint, the first is a declaration statement and the second an expression statement in a C grammar and hence is _parsed_ in different semantic rules. Also thinking about type and variables as in different namespaces (which is the case) should help.

  16. sovande Says:

    I’m sorry, I took that out of context and misunderstood the problem. You are absolutely right that at the lexer level it is complex.

  17. huku Says:

    Hello there,

    Your article is very nice and very comprehensible (as opposed to mine – I suck at explaining stuff in English :-) ). I also found your 3rd and 6th snippet very interesting! I’ll definitely have a look at them tomorrow (it’s 4AM here).

    What I can’t understand is why GLR is considered a solution to the typedef problem. Yeap, you end up having a parse forest instead of a tree, but what kind of heuristics is one supposed to use to get the correct tree out of the forest? If you ever try to use GLR, you’ll (imo) end up doing the same ugly hacks as LR(1). Is there a “well accepted” heuristic/technique for doing this with GLR?

    Btw, “scannerless parsing” is another technique which is said to solve the typedef problem (no offense to the authors, they have done a great work, but I doubt it solves 100% of the problem). Have a look at this paper:

    http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.37.7828

    Keep up the cool work!
    ./hk

  18. eliben Says:

    huku,

    It’s true that GLR doesn’t “magically” solve the problem. But when you’ve reached the tree/DAG walking stage you’re much better equipped to tackle the ambiguity. First, because you may already have the full symbol table built (partially up to the stage you’re currently examining), so no need to maintain it in the parser as well. Second, because it’s generally much easier to be flexible at this stage – unlike the parser where you’re tied in to the YACC structure, when tree/DAG walking you have more leeway.

    To be honest, my talking about GLR is theoretic, based on reading and thinking – I haven’t ever written a GLR parser. I probably should :-)

  19. Akainu Says:

    Good article, thanks.

  20. witek Says:

    You are confusing syntax and semantics. typedef-name does not influence parsing of code. (at least not according to your examples). Even AA AA; evil example is simple, I do not see how it could be parsed in different way than TYPE identifier; then in semantic pass check if there exists type AA in this scope, and if there does not yet exist identifier AA in this scope. Similarly for other examples.

    I also do not understand why anybody would like to change lexer in any of this examples. In all cases it should return token as Identifier(AA).

    PS. Ok. I see, there are other interesting cases presented in previous article: T (x), and F((T) * x), are most interesting and indeed show that C grammar is not CFG. :( Fortunately we could live without typedef and just use preprocessor macros. Would need to check how same problem looks in D language :)

  21. eliben Says:

    witek

    I’m not confusing anything… honest :-) I think you should brush up on parsing, and especially parsing context-sensitive constructs with an LR parser first. An LR parser (YACC) is very rigid, and very dependent on its grammar being context free.

    As an exercise, take a C YACC grammar (there are a few out there, or you can use the one from pycparser) and try to replace the TYPENAME (or TYPEDEFNAME, or whatever it’s called in that specific grammar) with ID where relevant. Count the conflicts.

  22. Matt Fenwick Says:

    Very interesting and informative article; thanks! Had been looking for good examples of problems that context-sensitive languages cause for context-free parsers — these are some really helpful examples.
