Hand-written lexer in Javascript compared to the regex-based ones

July 16th, 2013 at 6:25 am

A few weeks ago I wrote about the comparison between regex-based lexers in Python and Javascript. Javascript running on Node.js (V8) ended up being much faster than Python, and in both languages a speed improvement could be gained by switching to a single regex and letting the regex engine do the hard work.

However, in the real world you’ll find that most lexers (particularly lexers for real programming languages) are not written that way. They are carefully hand-written to go over the input and dispatch to the appropriate token handling code depending on the first character encountered.

This technique makes more sense for complex languages because it’s much more flexible than regexes (for instance, representing nested comments with regexes is a big challenge). But I was curious also about its performance implications.

So I hacked together a hand-written lexer for exactly the same language used in the previous benchmark and also exercised it on the same large file to make sure the results are correct. Its runtime, however, surprised me. Here’s the runtime of lexing a large-ish document (smaller is better):

http://eli.thegreenplace.net/wp-content/uploads/2013/07/lexer_runtime_vs_handwritten.png

I was expecting the runtime to be much closer to the single-regex version; in fact I was expecting it to be a bit slower (because most of the regex engine work is done at a lower level). But it turned out to be much faster, more than 2.5x. Another case of the real world beating intuition.

I was lazy to look, but if V8′s regex implementation is anything like Python’s, then alternation of many options (foo|bar) is not as efficient as one might expect because the regex engine does not use real DFAs, but rather backtracking. So alternation essentially means iteration deep in the regex engine. On the other hand, the hand-written code seems like something quite optimizable by a JIT compiler like V8 (the types are simple and consistent) so there’s a good chance it got converted into tight machine code. Throw some inlining in and the speed is not very unlikely.

Anyway, here is the hand-written lexer. It’s significantly more code than the regex-based ones, but I can’t say it was particularly difficult to write – the main part took a bit more than an hour, or so. If you have any additional ideas about the speed difference, I’ll be happy to hear about them.

'use strict';

var Lexer = exports.Lexer = function() {
  this.pos = 0;
  this.buf = null;
  this.buflen = 0;

  // Operator table, mapping operator -> token name
  this.optable = {
    '+':  'PLUS',
    '-':  'MINUS',
    '*':  'MULTIPLY',
    '.':  'PERIOD',
    '\\': 'BACKSLASH',
    ':':  'COLON',
    '%':  'PERCENT',
    '|':  'PIPE',
    '!':  'EXCLAMATION',
    '?':  'QUESTION',
    '#':  'POUND',
    '&':  'AMPERSAND',
    ';':  'SEMI',
    ',':  'COMMA',
    '(':  'L_PAREN',
    ')':  'R_PAREN',
    '<':  'L_ANG',
    '>':  'R_ANG',
    '{':  'L_BRACE',
    '}':  'R_BRACE',
    '[':  'L_BRACKET',
    ']':  'R_BRACKET',
    '=':  'EQUALS'
  };
}

// Initialize the Lexer's buffer. This resets the lexer's internal
// state and subsequent tokens will be returned starting with the
// beginning of the new buffer.
Lexer.prototype.input = function(buf) {
  this.pos = 0;
  this.buf = buf;
  this.buflen = buf.length;
}

// Get the next token from the current buffer. A token is an object with
// the following properties:
// - name: name of the pattern that this token matched (taken from rules).
// - value: actual string value of the token.
// - pos: offset in the current buffer where the token starts.
//
// If there are no more tokens in the buffer, returns null. In case of
// an error throws Error.
Lexer.prototype.token = function() {
  this._skipnontokens();
  if (this.pos >= this.buflen) {
    return null;
  }

  // The char at this.pos is part of a real token. Figure out which.
  var c = this.buf.charAt(this.pos);

  // '/' is treated specially, because it starts a comment if followed by
  // another '/'. If not followed by another '/', it's the DIVIDE
  // operator.
  if (c === '/') {
    var next_c = this.buf.charAt(this.pos + 1);
    if (next_c === '/') {
      return this._process_comment();
    } else {
      return {name: 'DIVIDE', value: '/', pos: this.pos++};
    }
  } else {
    // Look it up in the table of operators
    var op = this.optable[c];
    if (op !== undefined) {
      return {name: op, value: c, pos: this.pos++};
    } else {
      // Not an operator - so it's the beginning of another token.
      if (Lexer._isalpha(c)) {
        return this._process_identifier();
      } else if (Lexer._isdigit(c)) {
        return this._process_number();
      } else if (c === '"') {
        return this._process_quote();
      } else {
        throw Error('Token error at ' + this.pos);
      }
    }
  }
}

Lexer._isnewline = function(c) {
  return c === '\r' || c === '\n';
}

Lexer._isdigit = function(c) {
  return c >= '0' && c <= '9';
}

Lexer._isalpha = function(c) {
  return (c >= 'a' && c <= 'z') ||
         (c >= 'A' && c <= 'Z') ||
         c === '_' || c === '$';
}

Lexer._isalphanum = function(c) {
  return (c >= 'a' && c <= 'z') ||
         (c >= 'A' && c <= 'Z') ||
         (c >= '0' && c <= '9') ||
         c === '_' || c === '$';
}

Lexer.prototype._process_number = function() {
  var endpos = this.pos + 1;
  while (endpos < this.buflen &&
         Lexer._isdigit(this.buf.charAt(endpos))) {
    endpos++;
  }

  var tok = {
    name: 'NUMBER',
    value: this.buf.substring(this.pos, endpos),
    pos: this.pos
  };
  this.pos = endpos;
  return tok;
}

Lexer.prototype._process_comment = function() {
  var endpos = this.pos + 2;
  // Skip until the end of the line
  var c = this.buf.charAt(this.pos + 2);
  while (endpos < this.buflen &&
         !Lexer._isnewline(this.buf.charAt(endpos))) {
    endpos++;
  }

  var tok = {
    name: 'COMMENT',
    value: this.buf.substring(this.pos, endpos),
    pos: this.pos
  };
  this.pos = endpos + 1;
  return tok;
}

Lexer.prototype._process_identifier = function() {
  var endpos = this.pos + 1;
  while (endpos < this.buflen &&
         Lexer._isalphanum(this.buf.charAt(endpos))) {
    endpos++;
  }

  var tok = {
    name: 'IDENTIFIER',
    value: this.buf.substring(this.pos, endpos),
    pos: this.pos
  };
  this.pos = endpos;
  return tok;
}

Lexer.prototype._process_quote = function() {
  // this.pos points at the opening quote. Find the ending quote.
  var end_index = this.buf.indexOf('"', this.pos + 1);

  if (end_index === -1) {
    throw Error('Unterminated quote at ' + this.pos);
  } else {
    var tok = {
      name: 'QUOTE',
      value: this.buf.substring(this.pos, end_index + 1),
      pos: this.pos
    };
    this.pos = end_index + 1;
    return tok;
  }
}

Lexer.prototype._skipnontokens = function() {
  while (this.pos < this.buflen) {
    var c = this.buf.charAt(this.pos);
    if (c == ' ' || c == '\t' || c == '\r' || c == '\n') {
      this.pos++;
    } else {
      break;
    }
  }
}

Related posts:

  1. Regex-based lexical analysis in Python and Javascript
  2. Detexify recognizes hand-written math symbols
  3. A simple canvas-based Javascript game, with a Django back-end
  4. regex gotcha in P::RD
  5. some regex “best practices”

14 Responses to “Hand-written lexer in Javascript compared to the regex-based ones”

  1. Caleb SpareNo Gravatar Says:

    Hey Eli — love your blog.

    I’m also slightly surprised by the result, but I think you’re right: V8 is just very good.

    I generally do all my lexers by hand. I share Rob Pike’s view about using regexes for lexing and parsing:

    http://commandcenter.blogspot.com/2011/08/regular-expressions-in-lexing-and.html

  2. elibenNo Gravatar Says:

    @Caleb,

    Thanks for the feedback and the link — it’s very relevant.

  3. Corentin SMithNo Gravatar Says:

    Hi,
    Thanks for the article, very interesting read.
    Is there any specific reason to why you wouldn’t write _isalphanum as
    Lexer._isalphanum = function(c) {
    return Lexer._isalpha(c) || Lexer._isdigit(c);
    }

    ?

  4. JavrayNo Gravatar Says:

    Can you share the tests, for example in jsperf.com?

    Thanks!!!

  5. elibenNo Gravatar Says:

    @Corentin: not really. Mostly quick-n-dirty mode of coding since this is just a prototype to get a feeling for things. In “real” code I’d fix that.

    @Javray: that’s a bit more trouble than it’s worth for me :-) First, the code is written for Node.js and will take some tweaking. Second, it reads its input file from the file system which is tricky with jsperf. It should be trivial to reproduce if you have Node.js though: just grab one of the fat TableGen (.td) files from https://github.com/llvm-mirror/llvm/tree/master/lib/Target/X86 (I concatenated several together to have a bit of a longer runtime) and run the lexer on it collecting all tokens into a list.

  6. AnonNo Gravatar Says:

    It would be interesting to see a comparison against a DFA based regular expression engine, like RE2. Although in this case I can understand not doing that as it doesn’t appear that there exists a node.js binding for RE2, so you’d have to write all the glue code. But there are Python bindings, so you should be able to do a comparison there.

  7. gabrielNo Gravatar Says:

    There’s one metric you forget. Code size. This matters for javascript in the real world.

    sometimes, the 0.2s saved in processing cost 0.6s in network delay.

  8. elibenNo Gravatar Says:

    @gabriel,

    I did mention code size. However, it’s unlikely to be a big problem in this case since it’s so small – not very likely to affect network delay.

  9. Bryan Ewbank (@bryane)No Gravatar Says:

    I think one issue – not addressed here – is whether the scanner is “fast enough”. If it doesn’t come up in any profiling work, then there’s really no point in the overhead of maintaining a hand written scanner.

    Yes, they are beautiful. Yes, they are faster. At the same time, they are often hard for anyone not the original owner to maintain or extend. Since code is written once, and read hundreds of times over its lifetime, it is often better (much better) to use a well-known idiom rather than “the fastest, cleverest code”.

    By the way, to make it faster still, use a lookup table from char to function that maps current character to a function that returns the token starting at that char. Not in the table means it’s whitespace to be ignored. The operators all call the same function that wraps them in an operator token; each letter references a function that eats an identifier; each digit, a function that eats numbers; etc.

    This replaces the control flow in your example with a single dispatch table, and should improve performance a bit more.

  10. Hubert KaukerNo Gravatar Says:

    When we are dealing with a given string character by character we might as well convert the string to an array of single characters before starting, because the charAt method is generally a little slower to access individual characters of a string than accessing array elements by index. This, of course, depends on the JavaScript implementation.
    So you might save a few more instruction cycles doing in Lexer.prototype.input:
    this.buf = buf.split(""); // convert given string into array of characters

    and a few lines later in Lexer.prototype.token:
    var c = this.buf[this.pos];
    Other statements when substring is used need to modified, as well.
    But I don’t know if that would have a great impact on your figures.

  11. matclabNo Gravatar Says:

    You note : “representing nested comments with regexes is a big challenge”…

    If i’m not wrong, it is simply not possible to represent nested structures with a regular expression (in the sense of expression which describes a regular language)… You at least need a parser which is context-free (i.e represented by an automata coupled with a stack).

    So, do not even try to use regexes for a language with nested structures !

  12. elibenNo Gravatar Says:

    Speaking strictly theoretically, regular expressions cannot match nested comments. However, some tools extend their regex engines with stronger capabilities. For example, Perl’s regexes have support for nesting; Lex also allows to mix arbitrary state with regexes which enables implementing matching non-regular constructs such as nested comments.

  13. salmanNo Gravatar Says:

    Thanks a lot . Its really healpful :) .

  14. escort bayanNo Gravatar Says:

    I think one issue – not addressed here – is whether the scanner is “fast enough”. If it doesn’t come up in any profiling work, then there’s really no point in the overhead of maintaining a hand written scanner.

    Yes, they are beautiful. Yes, they are faster. At the same time, they are often hard for anyone not the original owner to maintain or extend. Since code is written once, and read hundreds of times over its lifetime, it is often better (much better) to use a well-known idiom rather than “the fastest, cleverest code”.

    By the way, to make it faster still, use a lookup table from char to function that maps current character to a function that returns the token starting at that char. Not in the table means it’s whitespace to be ignored. The operators all call the same function that wraps them in an operator token; each letter references a function that eats an identifier; each digit, a function that eats numbers; etc.

    This replaces the control flow in your example with a single dispatch table, and should improve performance a bit more.

Leave a Reply

To post code with preserved formatting, enclose it in `backticks` (even multiple lines)