Parsing C: more on #line directives - Eli Bendersky's website

In a previous post I discussed a few issues that make parsing real-world C a tad more difficult than just translating the EBNF grammar into code. In particular, #line directives are a challenge because they are not directly specified by the grammar and require special handling.

After some consideration, I decided to heed the good advice given in this comment and handle #line directives at the lexer, and not the parser, level. As that comment rightly suggests, the following is a valid output of the C pre-processor:

int
#line 5 "file"
n = 9;

Handling this at the parser level is close to impossible, because one would have to allow #line directives in almost any parser rule. That is difficult to implement, and it also hurts the readability and simplicity of the grammar specification.

Anyway, moving this to the lexer wasn't very difficult, and eventually resulted in less code, which is a good sign. A fix that leaves less code but implements an extra feature is probably the best you can wish for.

To implement this, I've defined ppline as an exclusive state in the lexer (recall that I'm using PLY for this project). When the lexer sees a hash symbol (#), it looks ahead, and if it sees line, it moves into this state. If it sees anything else (like pragma), it doesn't move into the special state and keeps sending the tokens to the parser. In the ppline state, the lexer collects the line number and possibly file name until it sees the end of the line, updates its local location and doesn't send anything to the parser. Thus, #line directives are transparent for the parser - it doesn't see them at all, and only receives tokens with a different location after them.
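The two-state idea can be sketched in a few lines. This is not the PLY lexer from the project but a hand-rolled approximation using only the standard library, assuming a tiny subset of C tokens. The names Token, tokenize, PPLINE, PPHASH and the token categories are made up for illustration.

```python
import re
from typing import Iterator, NamedTuple, Optional


class Token(NamedTuple):
    type: str
    value: str
    filename: str
    lineno: int


# Patterns for the normal state (a deliberately tiny subset of C).
_INITIAL = re.compile(r"""
    (?P<NEWLINE>\n)
  | (?P<WS>[ \t]+)
  | (?P<PPLINE>\#[ \t]*line\b)   # hash followed by 'line': enter ppline
  | (?P<PPHASH>\#)               # any other hash (e.g. pragma): plain token
  | (?P<ID>[A-Za-z_]\w*)
  | (?P<INT_CONST>\d+)
  | (?P<STRING>"[^"\n]*")
  | (?P<PUNCT>[=;])
""", re.VERBOSE)

# Patterns that are valid only inside the exclusive ppline state.
_PPLINE = re.compile(r"""
    (?P<NEWLINE>\n)
  | (?P<WS>[ \t]+)
  | (?P<LINE_NUMBER>\d+)
  | (?P<FILENAME>"[^"\n]*")
""", re.VERBOSE)


def tokenize(text: str, filename: str = "<input>") -> Iterator[Token]:
    """Yield tokens; #line directives only change the reported location."""
    state = "INITIAL"
    lineno = 1
    pp_line: Optional[int] = None
    pp_file: Optional[str] = None
    pos = 0
    while pos < len(text):
        pattern = _PPLINE if state == "ppline" else _INITIAL
        m = pattern.match(text, pos)
        if m is None:
            raise SyntaxError(f"{filename}:{lineno}: unexpected {text[pos]!r}")
        kind, value = m.lastgroup, m.group()
        pos = m.end()
        if kind == "WS":
            continue
        if state == "ppline":
            if kind == "LINE_NUMBER":
                pp_line = int(value)
            elif kind == "FILENAME":
                pp_file = value[1:-1]          # strip the quotes
            else:                              # NEWLINE ends the directive
                if pp_line is None:
                    raise SyntaxError(f"{filename}:{lineno}: bad #line")
                lineno = pp_line               # the next line has this number
                if pp_file is not None:
                    filename = pp_file
                state = "INITIAL"
        elif kind == "NEWLINE":
            lineno += 1
        elif kind == "PPLINE":
            state = "ppline"                   # nothing is sent to the parser
            pp_line = pp_file = None
        else:
            yield Token(kind, value, filename, lineno)


if __name__ == "__main__":
    for tok in tokenize('int\n#line 5 "file"\nn = 9;\n'):
        print(tok)
```

Run on the snippet from the beginning of the post, this yields int at line 1 of the original input, followed by n, =, 9 and ; all located at line 5 of "file". The directive itself never shows up in the token stream. In PLY the same structure would be expressed with an exclusive entry in the lexer's states declaration and state-prefixed token rules, but the details of that code are not reproduced here.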

And now, since the location is kept in the lexer and not the parser, the code is somewhat simpler. Additionally, I no longer need special workaround rules in the parser to accept #line directives in weird places.