I've reimplemented my lexical analyzer (lexer) for the TableGen language in Python, JS and Go so far. The latest post in the series discussed how several years of new Go versions improved my lexer's performance by roughly 2x, and how several additional optimizations won another 37%; the final result is a lexer that churns through 1 MiB of source code in just 5.6 milliseconds.

Since I've also been playing with Rust recently, I thought it would be interesting to re-implement this lexer once again, this time in Rust. Rust is certainly the lowest-level language among those I've used so far for this task, so I expect to see some top-notch performance.

The full code for this post is available on GitHub.

Designing an API

I find that Rust's strict ownership rules make one think carefully about API design from very early on. If we want to create a lexer and pass it a string as input - who owns the string? In Rust, the answer to this question is not an implicit contract (like it always is in C and sometimes in C++), and it cannot be deferred to the runtime either (like one would do in Python, JS or Go). The answer has to be explicitly encoded into the types of the program.

Since one of the goals of this series of posts is performance, I decided to go with a zero-copy API at first, where the user of the lexer owns the input string; the lexer, in turn, returns tokens that contain references into this input string - so the user ends up owning these too. Rust's lifetime specifiers make this fairly natural; here's the type:

pub struct Lexer<'source> {
  input: &'source str,
  iter: Peekable<CharIndices<'source>>,

  // c is the last char taken from iter, and ci is its offset in the input.
  c: char,
  ci: usize,

  // error is true iff the lexer encountered an error.
  error: bool,
}

Ignore the fields for now and focus just on the struct definition. This is the constructor:

impl<'source> Lexer<'source> {
    pub fn new(input: &'source str) -> Self {
      // ...
    }
}

The 'source lifetime specifier is used to explicitly annotate the lifetime of the input string slice, since we want to store it in our Lexer and refer to it later. Once a Lexer is created, the user obtains new tokens by calling next_token:

pub fn next_token(&mut self) -> Token<'source> {
  // ...
}

Note that the returned Token also has the same lifetime annotation. Here's how the type is defined:

#[derive(Debug, PartialEq, Clone, Copy)]
pub struct Token<'source> {
    pub value: TokenValue<'source>,
    pub pos: usize,
}

#[derive(Debug, PartialEq, Clone, Copy)]
pub enum TokenValue<'source> {
    EOF,
    Error,

    Plus,
    Minus,
    Multiply,
    Divide,
    Period,
    Backslash,
    Colon,
    Percent,
    Pipe,
    Exclamation,
    Question,
    Pound,
    Ampersand,
    Semi,
    Comma,
    LeftParen,
    RightParen,
    LeftAng,
    RightAng,
    LeftBrace,
    RightBrace,
    LeftBracket,
    RightBracket,
    Equals,
    Comment(&'source str),
    Identifier(&'source str),
    Number(&'source str),
    Quote(&'source str),
}

Now it should be clear how these lifetimes are tied together. Some tokens hold slices into the user's input, and this is encoded in the explicit lifetimes. The signature of the next_token method says "we return tokens with a lifetime that's tied to the lifetime of the input passed into the constructor".
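
To make this concrete, here's a small hypothetical usage snippet (not part of the lexer itself): the token borrows from the input, so the borrow checker won't let the input be dropped or mutated while the token is still in use:

let input = String::from("x + 42");
let mut lex = Lexer::new(&input);
let tok = lex.next_token();
// `tok` may hold a slice of `input`; using `tok` here keeps the borrow of
// `input` alive, so dropping `input` at this point would not compile.
println!("{:?}", tok);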

We can also provide a more natural iteration API for the Lexer by implementing the Iterator trait:

// Lexer is an Iterator; it returns tokens until EOF is encountered, when it
// returns None (the EOF token itself is not returned). Note that errors are
// still returned as tokens with TokenValue::Error.
impl<'source> Iterator for Lexer<'source> {
    type Item = Token<'source>;
    fn next(&mut self) -> Option<Self::Item> {
        if self.error {
            // If an error has already been set before we invoke next_token,
            // it means we've already returned TokenValue::Error once and now
            // we should terminate the iteration.
            return None;
        }

        let tok = self.next_token();
        if tok.value == TokenValue::EOF {
            None
        } else {
            Some(tok)
        }
    }
}

The iterator implementation makes it possible to integrate the lexer with the rest of Rust very elegantly; for example, to obtain all the tokens in a given input as a vector:

pub fn tokenize_all_collect<'source>(data: &'source str) -> Vec<Token<'source>> {
    let lex = Lexer::new(data);
    lex.collect()
}
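
And since tokens come out of a standard Iterator, all the usual adapters apply. For example, here's a hypothetical helper that counts only the identifier tokens in some input:

fn count_identifiers(data: &str) -> usize {
    Lexer::new(data)
        .filter(|tok| matches!(tok.value, TokenValue::Identifier(_)))
        .count()
}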

Implementation

Generally, the implementation of the lexer in Rust follows the same approach used by my previous hand-written lexers in this series. I'd like to highlight a couple of Rust-specific aspects here.

Our lexer fully supports Unicode, so I decided to use Rust's string iterator support to obtain chars from &str. Rust provides a helpful iterator called CharIndices, which yields chars along with their position in the input - we need the position to report token locations. Furthermore, since our lexer requires a bit of look-ahead [1], the iterator is wrapped in Peekable, which provides the peek method. As we've seen above already, the final definition is:

iter: Peekable<CharIndices<'source>>,

With this in mind, the code of the lexer should be very readable, even for someone not too familiar with Rust.
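
For example, here's a sketch of the small helper that advances the lexer to the next character by pulling from this iterator; the '\x00' end-of-input sentinel is my assumption for illustration:

// Advance `c` and `ci` to the next char from the input. When the iterator
// is exhausted, set `ci` to the input length and `c` to a sentinel value.
fn scan_char(&mut self) {
    if let Some((index, chr)) = self.iter.next() {
        self.ci = index;
        self.c = chr;
    } else {
        self.ci = self.input.len();
        self.c = '\x00'; // assumed end-of-input sentinel
    }
}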

Another note is how sub-string extraction happens when tokens are returned. As an example, let's look at the scan_number method which is invoked when the lexer's current character is a digit:

fn scan_number(&mut self) -> Token<'source> {
    let startpos = self.ci;
    while self.c.is_digit(10) {
        self.scan_char();
    }
    Token {
        value: TokenValue::Number(&self.input[startpos..self.ci]),
        pos: startpos,
    }
}

Recall that the Number variant of the TokenValue enum is defined as Number(&'source str) - it contains a reference to the input source string. In the code for scan_number, we see how this is actually implemented, by sub-slicing the input slice. This creates a slice with the same lifetime as the input slice (which is encoded in the lifetimes of the types). As in Go, sub-slicing is a very cheap operation in Rust (no heap allocation).
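
A quick standalone illustration of just how cheap this is - the sub-slice is only a pointer and a length into the original buffer, which we can verify by comparing pointers:

let input = String::from("count 1234 done");
let number = &input[6..10];
assert_eq!(number, "1234");
// The slice points directly into `input`'s buffer: no bytes were copied
// and no heap allocation took place.
assert_eq!(number.as_ptr(), input[6..].as_ptr());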

Performance

The performance of this Rust-implemented lexer is very good! I ran it on the same large TableGen file I've used for all the benchmarking, and it finishes tokenizing it in just 3.7 ms; this is about 33% faster than my fastest Go version from the previous post.

Profiling Rust is quite a bit trickier than profiling Go, but I managed to cobble together enough magic flags and perf invocations to ascertain that next_token indeed takes the bulk of the time; this is good - the time is being spent where it's supposed to be spent.
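
The timing itself needs nothing fancy; a simple harness along these lines (the file name is a placeholder, and the binary is built with --release) is sufficient at the millisecond scale:

use std::time::Instant;

fn main() {
    // Placeholder path; substitute the large TableGen file used for benchmarking.
    let data = std::fs::read_to_string("input.td").unwrap();
    let start = Instant::now();
    // Note: this also measures collecting the tokens into a vector.
    let tokens: Vec<_> = Lexer::new(&data).collect();
    println!("{} tokens in {:?}", tokens.len(), start.elapsed());
}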

Trying a variant with allocations

As described above, the API for this Rust lexer was designed to be zero-copy. Since my Go lexers experimented with different approaches, I decided to see how a different API would look in Rust.

There are two aspects to ownership we can change: taking ownership of the input string, and/or returning owned Strings in the tokens.

For taking ownership of the input string, our constructor would have to look like:

pub fn new(input: String) -> Self

Does it make sense for a lexer to own its input? This isn't clear, and in reality it turned out to be tricky to implement due to lifetime issues. Rust is very unhappy when a struct field is a reference to another field of the same struct, because there is no safe way to move instances of such structs. In our case, the iterator (a struct field) needs a reference to the string (another field in the same struct), and this doesn't pass the Rust compiler's scrutiny. I wanted to be able to implement this without actively fooling the compiler by using opaque indices or unsafe.

When the language fights you this hard, it may be a good sign that the design is wrong - it's better to leave the ownership to the code that creates a Lexer.
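
For the curious, here's roughly the shape one would want, and why it can't work: the iterator field needs a lifetime to name, but the only thing it could refer to is the sibling input field of the same struct:

use std::iter::Peekable;
use std::str::CharIndices;

// The definition itself compiles, but there is no way to construct it safely:
// `'a` would have to be the lifetime of `input` - i.e. of the struct itself -
// and moving the struct would invalidate `iter`'s borrow.
pub struct OwningLexer<'a> {
    input: String,
    iter: Peekable<CharIndices<'a>>,
}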

Ownership of returned tokens was easier to set up. The code for this is available in the owning.rs file of the repository. In this variant, the constructor doesn't change, but tokens are defined differently:

#[derive(Debug, PartialEq, Clone)]
pub struct Token {
    pub value: TokenValue,
    pub pos: usize,
}

#[derive(Debug, PartialEq, Clone)]
pub enum TokenValue {
    // .. types
    Comment(String),
    Identifier(String),
    Number(String),
    Quote(String),
}

Note that variants like Identifier now hold an owning String instead of a string reference. Therefore, lifetime annotations are no longer necessary on Token.

The scanning code now allocates a new String and reads characters into it from the iterator. We've seen previously how scan_number looks; here it is again, for the owning token variant (with some helper methods):

fn scan_number(&mut self) -> Token {
    let startpos = self.ci;
    Token {
        value: TokenValue::Number(self.scan_while_true(|c| c.is_digit(10))),
        pos: startpos,
    }
}

// Helper to scan chars while `pred(c)` returns true, into the given `s`.
fn scan_while_true_into<F>(&mut self, s: &mut String, pred: F)
where
    F: Fn(char) -> bool,
{
    while pred(self.c) {
        s.push(self.c);
        self.scan_char();
    }
}

// Helper to scan chars while `pred(c)` returns true and return all scanned
// chars in a new String.
fn scan_while_true<F>(&mut self, pred: F) -> String
where
    F: Fn(char) -> bool,
{
    let mut s = String::with_capacity(8);
    self.scan_while_true_into(&mut s, pred);
    s
}

Performance of this variant

When I first tried this variant, its performance wasn't great at all! It was about 30% slower than the string-copying version in Go. I think this has to do with the slight difference in approach - in Go, my lexer figures out the token boundaries using numerical indices and then converts a []byte into string in one fell swoop. The Rust version fetches chars one by one from an iterator and writes them into a String.

In particular, since Rust's String is heap-allocated and grows dynamically, pushing chars into it one at a time may incur reallocations (depending on the initial capacity).

So my solution was to create these strings with with_capacity, as you can see in the previous code sample. This cut the execution time down to roughly that of the Go version where strings are copied.

Which API is better?

IMHO there's little doubt that the original "slice" API is better, for multiple reasons:

  1. Performance: the results speak for themselves - the slice API is zero-copy and incurs no additional heap allocations. Like in Go, it deals in slice headers, which are just pointer-and-length pairs referring to parts of the input string.
  2. The ownership story is very clear, symmetrical and explicitly documented with lifetime annotations. The 'source lifetime controls everything: it's the lifetime of the &str passed into the lexer's constructor, and it's the lifetime of the tokens. The symmetry feels important - the code that creates a lexer controls the lifetime of the input, as well as the output.
  3. In general, it's good API design practice not to force allocations on users when it can be avoided. This is well articulated in this blog post for Go, but it applies equally well in Rust. In our slice API, the user can always call to_owned on the returned slice, if they so wish (see the sketch after this list). But why do it for them? Why should we assume the users want us to return them an owned string? Returning a slice provides more flexibility in using the API.
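
Here's what point 3 looks like in practice; a user who does need owned strings can make the copies themselves, and only for the tokens they actually care about (a hypothetical helper):

fn identifier_names(data: &str) -> Vec<String> {
    Lexer::new(data)
        .filter_map(|tok| match tok.value {
            // Copy just the identifiers we want to keep around; every other
            // token stays zero-copy and is simply discarded.
            TokenValue::Identifier(name) => Some(name.to_owned()),
            _ => None,
        })
        .collect()
}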

Performance improvement: avoiding Peekable

After this post was published, the reader Utkarsh Kukreti mentioned that using Peekable is not optimal, since its next is slower than the underlying iterator's next, and this call is on the lexer's hot path. Instead of using Peekable, we could just clone the iterator when we see a / and invoke next on the clone; iterators are small, so cloning them is cheap.

Here's a patch with this change. Applying it to my lexer makes it ~11% faster on my machine, finishing the benchmark in about 3.4 ms!
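
Here's a sketch of the idea; in this variant the iter field is assumed to be a plain CharIndices<'source> rather than a Peekable, and one character of look-ahead is obtained by cloning it (the names mirror the earlier snippets, not necessarily the actual patch):

// Look at the char after the current one without consuming anything, by
// cloning the iterator (a small, cheap-to-copy struct) and advancing the clone.
fn peek_next(&self) -> Option<char> {
    let mut lookahead = self.iter.clone();
    lookahead.next().map(|(_, c)| c)
}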

To be honest, this is a little surprising and disappointing to me, since I was hoping that Peekable would fall under Rust's zero-cost abstraction promise. Perhaps this is something that can be fixed in the future.


[1] For properly tokenizing comments: when the lexer sees a /, it needs to peek ahead to see whether it's the beginning of a // comment or just the division operator. This is a common need in lexers, especially when tokenizing multi-character operators (like >=).