Ever since I started writing lots of VHDL code at work, I've been toying with the idea of writing a parser for the language. It would provide me with a simple means for writing useful small tools for organizing and analyzing large bodies of VHDL code.
A few years ago I even started looking into this seriously, but that project got flushed down the tubes once I realized that it's hard to find suitable libraries for this in Perl, which I was using at the time. In addition, VHDL turned out to be a very hairy language to parse.
Lately, after the success of the PLY-based pycparser, I came back to VHDL. PLY is powerful and fast, I thought; perhaps it's feasible to parse VHDL with it?
It turns out the task is much harder than I expected. Attempting to translate the VHDL BNF definition into PLY Yacc runs into problems very quickly. The BNF is not suitable for LALR, and is full of reduce/reduce conflicts. At first I rewrote the rules to make them more general (and hence accept a bit more invalid code, which wasn't too important for me), but more and more conflicts keep coming up. Yesterday I read a paper claiming that a full translation of the BNF into Yacc results in 576 reduce/reduce conflicts! Umph...
No problem, I can just rewrite it using a hand-tailored recursive-descent (RD) parser (which I suspect most commercial VHDL tools are using) that's more powerful than LALR and hence won't be troubled by conflicts in the BNF, right?
It's more difficult than that.
VHDL is context-sensitive in a mean way. Consider this statement inside a process:
jinx := foo(1);
Well, depending on the objects defined in the scope of the process (and its enclosing scopes), this can be either:
- A function call
- Indexing an array
- Indexing an array returned by a parameter-less function call
To parse this correctly, a parser has to carry a hierarchical symbol table (with enclosing scopes), and even the current file isn't enough: foo can be a function defined in a package. So the parser should first analyze the packages imported by the file it's parsing, and figure out the symbols defined in them.
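To make the idea concrete, here is a rough Python sketch of how a scoped symbol table could decide what foo(1) means. All the names in it (Kind, Scope, classify) are made up for illustration and are not taken from any existing tool. It also assumes each name maps to a single kind, which real VHDL doesn't guarantee, since functions can be overloaded.

```python
from enum import Enum, auto


class Kind(Enum):
    """Hypothetical categories a name like 'foo' could belong to."""
    FUNCTION = auto()          # function that takes arguments
    NULLARY_FUNCTION = auto()  # parameter-less function returning an array
    ARRAY = auto()             # array object (signal, variable, constant)


class Scope:
    """One level of a hierarchical symbol table."""

    def __init__(self, parent=None):
        self.parent = parent
        self.symbols = {}

    def define(self, name, kind):
        # Basic VHDL identifiers are case-insensitive, so normalize them.
        self.symbols[name.lower()] = kind

    def lookup(self, name):
        # Walk outwards through the enclosing scopes; the innermost wins.
        key = name.lower()
        scope = self
        while scope is not None:
            if key in scope.symbols:
                return scope.symbols[key]
            scope = scope.parent
        return None


def classify(name, scope):
    """Say what 'name(1)' means given the declarations visible in scope."""
    kind = scope.lookup(name)
    if kind is Kind.FUNCTION:
        return "function call"
    if kind is Kind.ARRAY:
        return "array indexing"
    if kind is Kind.NULLARY_FUNCTION:
        return "indexing the result of a parameter-less function call"
    return "unknown: declaration not visible"


# Symbols made visible by a package that the file uses.
package_scope = Scope()
package_scope.define("foo", Kind.FUNCTION)

architecture_scope = Scope(parent=package_scope)
process_scope = Scope(parent=architecture_scope)

# Only the package declaration is visible here.
print(classify("foo", process_scope))

# A local array named 'foo' hides the package function.
process_scope.define("foo", Kind.ARRAY)
print(classify("foo", process_scope))
```

The point of the sketch is that the same token sequence gets a different answer depending on what was declared earlier, possibly in another file, so the parser can't finish its job without that information.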
This is just an example. The VHDL type/subtype system is a similarly context-sensitive mess that's very difficult to parse.
After some Googling, today I came across an old thread on the comp.lang.vhdl newsgroup from 1993, in which a bunch of seemingly knowledgeable people discuss this issue. The verdict: yes, it's context-sensitive, and very hard to parse. But with (lots of) effort it's doable.
I'm kind of bummed by this at the moment. I'll either find something online to adapt, give the RD parser a shot and try to minimize the damage of context-sensitivity, or drop the idea altogether. We'll see...