Ungrammar in Go and resilient parsing - Eli Bendersky's website

It won't be news to the readers of this blog that I have some interest in compiler front-ends. So when I heard about a new(-ish) DSL for concrete syntax trees (CST), I couldn't resist playing with it a bit.

Ungrammar is used in rust-analyzer to define and access a CST for Rust. This blog post by its creator provides much more details. According to the author, Ungrammar is "the ASDL for concrete syntax trees". This sounded interesting, since I've been dabbling in ASDL in the past, and also have experience with similar techniques for defining pycparser ASTs.

The result is go-ungrammar, a re-implementation of Ungrammar in Go. The input is an Ungrammar file defining some CST; for example, here's a simple calculator language:

Program = Stmt*

Stmt = AssignStmt | Expr

AssignStmt = 'set' 'ident' '=' Expr

Expr =
    Literal
  | UnaryExpr
  | ParenExpr
  | BinExpr

UnaryExpr = op:('+' | '-') Expr

ParenExpr = '(' Expr ')'

BinExpr = lhs:Expr op:('+' | '-' | '*' | '/' | '%') rhs:Expr

Literal = 'int_literal' | 'ident'

Ungrammar looks a bit like EBNF, but not quite (hence the name "ungrammar"). It's much simpler because it doesn't need to concern itself with precedence, ambiguities and so on, also leaving all the (often complex) lexical rules to the lexer. It simply defines a tree that can be used to represent parsed language. It's also different from ASTs in that it preserves all tokens, including delimiters and other syntax elements. This is useful for tools like language servers that need a full-fidelity representation of the source code.

Implementation notes

go-ungrammar uses a classical hand-written lexical analyzer and a recursive descent parser. Just for fun, I spent more time on error recovery than strictly necessary for such a simple input language. The lexer never gives up when encountering non-sensical input; it simply emits an ERROR token and keeps going. The parser doesn't quit on the first error either; instead, it collects all the errors it encounters and tries to recover from each one (the synchronize() method in the parser code). As an example of this in action, consider this faulty Ungrammar input:

foo = @
bar = ( joe
x = y

At first glance, there are at least a couple of issues here:

@ is not a valid Ungrammar token
The ( in the second rule is unterminated; as all programmers know, unterminated grouping elements spell trouble because the compiler can get easily confused until it finds a valid terminator

When go-ungrammar runs it will report an error that looks like this:

1:7: unknown token starting with '@' (and 2 more errors)

The concrete error type returned by the parser collects all the errors, so we can iterate over them and display them all:

1:7: unknown token starting with '@'
2:1: expected rule, got bar
3:1: expected ')', got x

The parser recovers after the first error expecting to see the RHS (right-hand-side) for the foo rule, but doesn't find any. This is a good place to discuss parser recovery. The Ungrammar language has a significant ambiguity:

foo = bar baz = barn

Are bar baz the RHS sequence for rule foo, or is baz = the beginning of a new rule? Note that the language is whitespace-insensitive, so this really does come up; just look at the example calculator Ungrammar above - this is encountered on pretty much any new rule.

The way go-ungrammar resolves the ambiguity is by using an NODE = lookahead, deciding it's the beginning of a new rule (NODE is an Ungrammar term for "plain identifier").

Back to our recovery example: the second error is the parser complaining that it expected some rule after foo = but found none; an empty RHS is invalid in Ungrammar and the @ was reported and skipped. So the parser complains that it found a new rule definition instead of the RHS for an existing rule. At this point it re-synchronizes and parses the bar = rule. Then it runs into the third error - the ( is unterminated. Still, the parser recovers and keeps going.

Even with all these errors, the parser will produce a partial result - a tree equivalent to this input:

bar = joe
x = y

For foo there was simply nothing to parse. For bar, the parser reported the missing ) but parsed the contents anyway. It then fully recovered and was able to parse x = y properly. Being able to parse incomplete input and produce partial trees is very important for error recovery, and especially for tools like language servers that need to be resilient in the presence of partial input the user is busy typing in.

I enjoyed coding this resilient parser; while it's probably an overkill for a language as simple as Ungrammar, it's a good kata for frontend construction.