Tags Go

I have recently discovered - to my delight - that Go has an efficient implementation of suffix arrays right in the standard library - index/suffixarray.

A suffix array is an ingenious data structure that lets us take a large body of text (or any binary data, for that matter), preprocess it and then be able to find any substring in this text in logarithmic time. And the coolest thing is that a suffix array only requires O(n) space and can be constructed efficiently. For more details, turn to the Wikipedia page on Suffix Arrays - it's pretty good.

Like the rest of the Go standard library, infex/suffixarray is decently documented. However, I couldn't find good examples, so I'll provide some in this post. The full, runnable code of these examples can be found here.

Building the suffix array and simple lookups

Let's start with the basics. suffixarray.New builds a new Index, which can then be used to efficiently look up substrings. One catch with suffixarray.New is that it just takes a single byte slice as the data. This suits many use cases, but sometimes we need some extra preprocessing.

A common task is having a bunch of words (or sentences), and finding which word(s) a certain substring is part of. For example:

words := []string{
    "banana",
    "apple",
    "pear",
    "tangerine",
    "orange",
    "lemon",
    "peach",
    "persimmon",
}

We can combine these words into a single []byte, delimiting them by a byte that appears in no valid word; \x00 is a good candidate. Then, we can create the suffix array:

data := []byte("\x00" + strings.Join(words, "\x00") + "\x00")
sa := suffixarray.New(data)

Now we can use the Lookup method to find all locations of some substring:

indices := sa.Lookup([]byte("an"), -1)
// indices is now: [4 2 31 20]

Lookup returns a slice of indices into the data buffer the suffix array was created with. The substrings can be found at these indices. Given our \x00 delimiters, it's fairly easy to reconstruct a word from an index:

func getStringFromIndex(data []byte, index int) string {
  var start, end int
  for i := index - 1; i >= 0; i-- {
    if data[i] == 0 {
      start = i + 1
      break
    }
  }
  for i := index + 1; i < len(data); i++ {
    if data[i] == 0 {
      end = i
      break
    }
  }
  return string(data[start:end])
}

Now we can get back the words from indices:

// Reconstruct matches from indices found by Lookup.
for _, idx := range indices {
    fmt.Println(getStringFromIndex(data, idx))
}

// prints:
//   banana
//   banana
//   orange
//   tangerine

Note that banana appears twice - this is because the substring an appears twice in the word.

FindAllIndex - flexibility with a big caveat

Another lookup method provided by a suffixarray is FindAllIndex, taking a *regexp.Regexp instead of a []byte. To be honest, I was initially stumped. Nothing I knew about suffix arrays included being able to match artibrary regular expressions! So I looked into the source of the package and drew a breath of calm, and alarm. On one hand, there's no magic. On the other hand, there's no magic! Therefore, I would treat the FindAllIndex method with great suspicion, since it's potentially misleading.

What FindAllIndex does is first figure out whether the given regexp has a literal prefix. If it does, Lookup is used to find the prefix (efficiently) and then regexp matching is used to find the rest of the match starting from the found prefix. If there is no literal prefix, however, it just invokes the Regexp.FindAllIndex method, which scans the whole data linearly to find matches.

So while I can see how FindAllIndex can be convenient in some cases, I would advise a lot of caution when using it, lest you throw the whole premise of the suffix array out of the window. Why bother building the suffix array when you're going to search through it linearly anyway? Be careful to only use FindAllIndex when you know for sure that the regexp has a literal prefix.

Performance

So this is how the suffix array can be used. How well does it perform? I wrote a simple benchmark where I read the list of words from /usr/share/dict/words (on my Ubuntu 14.04 machine it has close to 100K words and total size of ~1MB), and then compare suffix array lookups with regular (linear) lookups.

For the puspose of comparison, I'm just looking for the first appearance of the substring in the whole data. When doing a linear search, it's very crucial where the match ends up being; if it's the very fist item in the list, the lookup is blazing fast; if it's the last, not so much.

With that in mind, I ran some measurements with different locations of the found substring (on a scale from 0.0 to 1.0, 0.0 being the very first word, 1.0 being the last, 0.5 the middle, etc.), as well as one substring that wasn't there at all (marked as n/a below). The times are averages produced by the Go benchmark tooling, per lookup.

Location Linear time Suffixarray time
0.01 12 us 550 ns
0.25 271 us 514 ns
0.5 604 us 506 ns
0.75 775 us 497 ns
n/a 1 ms 409 ns

Note how the linear lookup time grows about linearly with the location, as expected. The suffix array lookup time appears to be independent of location, which is also expected for a given size (there are small fluctuations having to do with the exact order binary search visits a sorted array). In the average case (found in the middle) for this size of input, the suffix array provides a 1000x speedup.

It's also interesting to measure how long the index took building, and how much memory it consumes. Building the suffix array took about 250 ms, and its size in memory was less than 4 MB, which is pretty good for 1 MB of data.

Given that the construction time is non-trivial, there's a tradeoff when using suffix arrays. For the data I'm using, constructing the suffix array took as long as ~300 average lookups, and the cost of a lookup then becomes negligible. So it all depends on your data - how big it is (the bigger it gets, the bigger the win of the suffix array's logarithmic lookup time), how many lookups you anticipate to perform, how are the matches distributed, and so on. In any case, a ready-made suffix array is an excellent tool to have in one's toolbox to tackle substring lookups. Suffix arrays are useful for all kinds of problems involving substrings, not just lookups. For example, they are great for efficiently computing longest-common-prefixes between collections of data. But this is a topic for another day.


Comments

comments powered by Disqus