Using sub-generators for lexical scanning in Python

August 9th, 2012 at 2:45 pm

A few days ago I watched a very interesting talk by Rob Pike about writing a non-trivial lexer in Go. Rob discussed how the traditional switch-based state machine approach is cumbersome to write, because it’s not really compatible with the algorithm we want to express. The main problem is that when we return a new token, a traditional state-machine structure forces us to explicitly pack up the state of where we are and return to the caller. Especially in cases where we just want to stay in the same state, this makes code unnecessarily convoluted.

This struck a chord with me, because I’ve already written about simplifying state machine code in Python with coroutines. I couldn’t help but wonder what would be an elegant Pythonic way to implement Rob’s template lexer (watch the talk or take a look at his slides for the syntax).

What follows is my attempt, which uses the new yield from syntax from PEP 380, and hence requires Python 3.3 (which is currently in beta, but should be released soon). I’ll present the code in small chunks with explanations; the full source is available for download here. It’s heavily commented, so should be easy to grok.

First, some helper types and constants:

TOK_TEXT        = 'TOK_TEXT'
TOK_LEFT_META   = 'TOK_LEFT_META'
TOK_RIGHT_META  = 'TOK_RIGHT_META'
TOK_PIPE        = 'TOK_PIPE'
TOK_NUMBER      = 'TOK_NUMBER'
TOK_ID          = 'TOK_ID'


# A token has
#   type: one of the TOK_* constants
#   value: string value, as taken from input
#
Token = namedtuple('Token', 'type value')

It’s a shame Python still doesn’t have a functional enum type, isn’t it? Anyway, on to the lexer class. Here’s the complete public interface:

class TemplateLexer(object):
    """ A lexer for the template language. Initialize with the input
        string, and then call lex() which generates tokens. None is
        generated at EOF (and the generator expires).
    """
    def __init__(self, input):
        self.input = input
        # self.pos points at the current character in the input string
        self.pos = 0
        # self.tokstart points at the start of the currently processed
        # token
        self.tokstart = 0
        self.state = self._lex_text

    def lex(self):
        # self.state is one of the _lex_* state functions. Each such
        # function yields the tokens it finds and then returns the next
        # state function. When EOF is encountered, None is returned as
        # the new state.
        while self.state:
            self.state = yield from self.state()

The lex method is a generator. It yields the lexed tokens one after the other, and it uses an interesting technique to do that. A technique you won’t find in pre-3.3 Python, because of the lack of the yield from construct. This new statement means: the following expression evaluates to a generator. Run that generator and yield the values it provides back to my caller. When the generator function returns, its return value is returned from yield from.

So what happens is this: self.state is always a function. Each such function represents a state in the lexing process. It can yield one or more tokens and eventually when some input means the state should change, it returns the new state function. Let’s take a look at a couple of such state functions before I go on with the explanation:

def _lex_text(self):
    # Look for the beginning of LEFT_META
    meta_start = self.input.find(self._LEFT_META, self.pos)
    if meta_start > 0:
        # Found. Emit all text until then (if any) and move to the
        # lex_left_meta state.
        self.pos = meta_start
        if self.pos > self.tokstart:
            yield self._emit(TOK_TEXT)
        return self._lex_left_meta
    else:
        # Not found. This means we're done. There may be some text
        # left until EOF, so emit it if there is.
        self.pos = len(self.input)
        if self.pos > self.tokstart:
            yield self._emit(TOK_TEXT)
        # Yield None to mark "no more tokens --> EOF"
        # Return None to stop the main lexing loop since there is no
        # next state.
        yield None
        return None

def _lex_left_meta(self):
    self.pos += len(self._LEFT_META)
    yield self._emit(TOK_LEFT_META)
    return self._lex_inside_action

For completeness, here’s the _emit helper method:

def _emit(self, toktype):
    """ Emit the current token
    """
    tok = Token(toktype, self.input[self.tokstart:self.pos])
    self.tokstart = self.pos
    return tok

Note how there are two "outlets" out of each state function. One is by yield – execution is suspended while the new token is given back to the caller. Once it’s consumed, execution proceeds right after the yield. The other is a return which signals that the current state has ended, and provides the next state function to the main lexing loop. The else branch in _lex_text shows how two tokens can be yielded one after the other, before returning. There are more examples for this in the full source code.

There are two big ideas I picked up from Rob’s talk:

  1. When some state is finished, we know the next state, so dispatching again with a switch doesn’t feel right. Why not just directly say where we want to go next?
  2. When we emit a new token, do don’t want to be forced to explicitly save the state, return to the caller, and then resume from a dispatcher switch. We just want to say "suspend here, emit this token, and then continue right after". This would not be possible in a language that does not support generators or coroutines of some kind.

I believe the approach shown in my code addresses both ideas. Without the yield from construct it would be much harder to code. yield from not only allows us to call a function that acts as a sub-generator. More importantly, it allows us to both yield and return a value from the same function, with correct semantics (prior to Python 3.3, it’s impossible to do that).

It’s not perfect, however. One thing I lament is that we still have a loop in lex. It’s much more elegant than a switch dispatcher, but is it really needed? Why, instead of returning the next state function, we can’t just call it directly from the current state? The way it is currently in my code, this would eventually blow the stack up, because none of the state functions would return from these calls before EOF is reached. Therefore, I don’t think functions are the right vehicle for such a mechanism – perhaps continuations are needed here.

Also, it’s not coded as a real coroutine, although that would be possible to achieve. However, it would also impose a certain programming style on the rest of the program, which isn’t always desirable. As it stands now, this lexer is nicely self-contained. The lex method presents a simple, Pythonic generator interface that programmers are used to.

Finally, I didn’t bother researching the efficiency (speed-wise) of this approach against a classical state machine loop. Note that the code isn’t terribly speed-conscious, e.g. it copies parts of the string to tokens, which is wasteful. It’s definitely much more pleasant to write (and easier to modify) than a classical state machine, and this is very important.

P.S. I did not attempt this code to be parallel to Rob Pike’s Go code. Instead, I sought a Pythonic solution. For a more direct attempt to adapt it to Rob’s Go snippets, check out this set of gists by Piet Delport.

Related posts:

  1. Implementing a generator/yield in a Python C extension
  2. Parsing C: more on #line directives
  3. Top-Down operator precedence parsing
  4. Co-routines as an alternative to state machines
  5. Python internals: adding a new statement to Python

4 Responses to “Using sub-generators for lexical scanning in Python”

  1. PragmaNo Gravatar Says:

    Perhaps it’s not the most efficient parser design, but using ‘yield’ in this way has some distinct advantages:

    * Readable code – very important for a hand-coded parser
    * Doesn’t rely on a massive switch-based state machine, which is good since Python’s closest equivalent is a hash-table, so using a discrete number of logic steps based on the present state is preferable.
    * Conservative parsing – doesn’t digest the entire stream at once before providing tokens to a consumer
    * Lean – doesn’t wind up the stack very much at all, and doesn’t keep old data hanging around

    The only drawback I can think of is that this approach is that a consumer of this lexer will need some buffering in order to perform any look-ahead into the token stream, which is required for some grammars; generators are forward-only after all. That’s kind of a no-brainer, but worth mentioning since it’s likely to be the very next thing a parser author will go after once they attempt to build a parser on this example.

  2. Jakub Piotr CłapaNo Gravatar Says:

    “Why, instead of returning the next state function, we can’t just call it directly from the current state? (…) Therefore, I don’t think functions are the right vehicle for such a mechanism – perhaps continuations are needed here.”

    If I understand your idea correctly I think tail-call optimization is what you want. There is a nice paper (and talk) [1] by
    Shriram Krishnamurthi about using (tail-call optimized) function calls and macros to write state machines.

    [1]: http://cs.brown.edu/~sk/Publications/Papers/Published/sk-automata-macros/

  3. elibenNo Gravatar Says:

    @Jakub,

    No, TCO is not the solution here. TCO optimizes tail calls but ultimately control returns to the caller. What’s really needed is something in the vein of continuations – jumping to some piece of code without returning to the call site later.

  4. Jakub Piotr CłapaNo Gravatar Says:

    @eliben

    With TCO you effectively have only one caller and you don’t have to return to it. This is true regardless of the amount of calls (or state transitions). If you do it in a new thread of control (with an independent stack) you don’t even have to have a caller (you can tail-call immediately from the entry function).

    Delimited continuations are like threads (serializable threads, so you can do multiple resumes). Full-blown continuations are similar but when you capture the continuation you do something fork()-like instead of just “serializing” a thread with a well defined starting point.

Leave a Reply

To post code with preserved formatting, enclose it in `backticks` (even multiple lines)