The Expression Problem and its solutions



The craft of programming is almost universally concerned with different types of data and operations/algorithms that act on this data [1]. Therefore, it's hardly surprising that designing abstractions for data types and operations has been on the mind of software engineers and programming-language designers since... forever.

Yet I've only recently encountered a name for a software design problem I've run into multiple times in my career. It's a problem so fundamental that I was quite surprised I hadn't seen it named before. Here is a quick problem statement.

Imagine that we have a set of data types and a set of operations that act on these types. Sometimes we need to add more operations and make sure they work properly on all types; sometimes we need to add more types and make sure all operations work properly on them. Sometimes, however, we need to add both - and herein lies the problem. Most of the mainstream programming languages don't provide good tools to add both new types and new operations to an existing system without having to change existing code. This is called the "expression problem". Studying the problem and its possible solutions gives great insight into the fundamental differences between object-oriented and functional programming, as well as concepts like interfaces and multiple dispatch.

A motivating example

As is my wont, my example comes from the world of compilers and interpreters. In my defense, this is also the example used in some of the seminal historical sources on the expression problem, as the historical perspective section below details.

Imagine we're designing a simple expression evaluator. Following the standard interpreter design pattern, we have a tree structure consisting of expressions, with some operations we can do on such trees. In C++ we'd have an interface every node in the expression tree would have to implement:

class Expr {
public:
  virtual std::string ToString() const = 0;
  virtual double Eval() const = 0;
};

This interface shows that we currently have two operations we can do on expression trees - evaluate them and query for their string representations. A typical leaf node expression:

class Constant : public Expr {
public:
  Constant(double value) : value_(value) {}

  std::string ToString() const {
    std::ostringstream ss;
    ss << value_;
    return ss.str();
  }

  double Eval() const {
    return value_;
  }

private:
  double value_;
};

And a typical composite expression:

class BinaryPlus : public Expr {
public:
  BinaryPlus(const Expr& lhs, const Expr& rhs) : lhs_(lhs), rhs_(rhs) {}

  std::string ToString() const {
    return lhs_.ToString() + " + " + rhs_.ToString();
  }

  double Eval() const {
    return lhs_.Eval() + rhs_.Eval();
  }

private:
  const Expr& lhs_;
  const Expr& rhs_;
};

Until now, it's all fairly basic stuff. How extensible is this design? Let's see... if we want to add new expression types ("variable reference", "function call" etc.), that's pretty easy. We just define additional classes inheriting from Expr and implement the Expr interface (ToString and Eval).
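For instance, here's a quick sketch of a hypothetical unary-negation expression (it's not part of the original design, just an illustration); note that no existing code has to change:

class UnaryMinus : public Expr {
public:
  UnaryMinus(const Expr& operand) : operand_(operand) {}

  std::string ToString() const {
    return "-(" + operand_.ToString() + ")";
  }

  double Eval() const {
    return -operand_.Eval();
  }

private:
  const Expr& operand_;
};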

However, what happens if we want to add new operations that can be applied to expression trees? Right now we have Eval and ToString, but we may want additional operations like "type check" or "serialize" or "compile to machine code" or whatever.

It turns out that adding new operations isn't as easy as adding new types. We'd have to change the Expr interface, and consequently change every existing expression type to support the new method(s). If we don't control the original code or it's hard to change it for other reasons, we're in trouble.
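For instance, a hypothetical type-checking operation would force a change to the interface itself, and with it to every class deriving from Expr (TypeCheck is a made-up name here, just to show the shape of the change):

class Expr {
public:
  virtual std::string ToString() const = 0;
  virtual double Eval() const = 0;

  // New operation: every existing expression class now has to be edited
  // (and recompiled) to implement it.
  virtual bool TypeCheck() const = 0;
};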

In other words, we'd have to violate the venerable open-closed principle, one of the main principles of object-oriented design, defined as:

software entities (classes, modules, functions, etc.) should be open for extension, but closed for modification

The problem we're hitting here is called the expression problem, and the example above shows how it applies to object-oriented programming.

Interestingly, the expression problem bites functional programming languages as well. Let's see how.

The expression problem in functional programming

Object-oriented approaches tend to collect functionality in objects (types). Functional languages cut the cake from a different angle, usually preferring types as thin data containers, collecting most functionality in functions (operations) that act upon them. Functional languages don't escape the expression problem - it just manifests there in a different way.

To demonstrate this, let's see how the expression evaluator / stringifier looks in Haskell. Haskell is a good poster child for functional programming since its pattern matching on types makes such code especially succinct:

module Expressions where

data Expr = Constant Double
          | BinaryPlus Expr Expr

stringify :: Expr -> String
stringify (Constant c) = show c
stringify (BinaryPlus lhs rhs) = stringify lhs
                                ++ " + "
                                ++ stringify rhs

evaluate :: Expr -> Double
evaluate (Constant c) = c
evaluate (BinaryPlus lhs rhs) = evaluate lhs + evaluate rhs

Now let's say we want to add a new operation - type checking. We simply have to add a new function typecheck and define how it behaves for all known kinds of expressions. No need to modify existing code.
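A minimal sketch of what that could look like (the Type data type and its single TDouble constructor are made up for this toy language, where everything evaluates to a double):

data Type = TDouble deriving (Show, Eq)

typecheck :: Expr -> Type
typecheck (Constant _) = TDouble
typecheck (BinaryPlus lhs rhs)
  | typecheck lhs == TDouble && typecheck rhs == TDouble = TDouble
  | otherwise = error "type mismatch in BinaryPlus"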

On the other hand, if we want to add a new type (like "function call"), we get into trouble. We now have to modify all existing functions to handle this new type. So we hit exactly the same problem, albeit from a different angle.

The expression problem matrix

A visual representation of the expression problem can be helpful to appreciate how it applies to OOP and FP in different ways, and how a potential solution would look.

The following 2-D table (a "matrix") has types in its rows and operations in its columns. A matrix cell row, col is checked when the operation col is implemented for type row:

Basic expression problem matrix demonstrating the starting point

In object-oriented languages, it's easy to add new types but difficult to add new operations:

OOP expression problem matrix

Whereas in functional languages, it's easy to add new operations but difficult to add new types:

FP expression problem matrix

A historical perspective

The expression problem isn't new, and has likely been with us since the early days; it rears its head as soon as programs reach even a modest level of complexity.

It's fairly certain that the name expression problem comes from an email sent by Philip Wadler to a mailing list dealing with adding generics to Java (this was back in the 1990s).

In that email, Wadler points to the paper "Synthesizing Object-Oriented and Functional Design to Promote Re-Use" by Krishnamurthi, Felleisen and Friedman as an earlier work describing the problem and proposed solutions. This is a great paper and I highly recommend reading it. Krishnamurthi et al., in their references, point to papers from as early as 1975 describing variations of the problem in Algol.

Flipping the matrix with the visitor pattern

So far the article has focused on the expression problem, and I hope it's clear by now. However, the title also has the word solutions in it, so let's turn to that.

It's possible to kinda solve (read on to understand why I say "kinda") the expression problem in object-oriented languages; first, we have to look at how we can flip the problem on its side using the visitor pattern. The visitor pattern is very common for this kind of problem, and for a good reason. It lets us reformulate our code in a way that makes it easier to change in some dimensions (though harder in others).

For the C++ sample shown above, rewriting it using the visitor pattern means adding a new "visitor" interface:

class ExprVisitor {
public:
  virtual void VisitConstant(const Constant& c) = 0;
  virtual void VisitBinaryPlus(const BinaryPlus& bp) = 0;
};

And changing the Expr interface to be:

class Expr {
public:
  virtual void Accept(ExprVisitor* visitor) const = 0;
};

Now expression types defer the actual computation to the visitor, as follows:

class Constant : public Expr {
public:
  Constant(double value) : value_(value) {}

  void Accept(ExprVisitor* visitor) const {
    visitor->VisitConstant(*this);
  }

  double GetValue() const {
    return value_;
  }

private:
  double value_;
};

// ... similarly, BinaryPlus would have
//
//    void Accept(ExprVisitor* visitor) const {
//      visitor->VisitBinaryPlus(*this);
//    }
//
// ... etc.

A sample visitor for evaluation would be [2]:

class Evaluator : public ExprVisitor {
public:
  double GetValueForExpr(const Expr& e) {
    return value_map_[&e];
  }

  void VisitConstant(const Constant& c) {
    value_map_[&c] = c.GetValue();
  }

  void VisitBinaryPlus(const BinaryPlus& bp) {
    bp.GetLhs().Accept(this);
    bp.GetRhs().Accept(this);
    value_map_[&bp] = value_map_[&(bp.GetLhs())] + value_map_[&(bp.GetRhs())];
  }

private:
  std::map<const Expr*, double> value_map_;
};

It should be obvious that for a given set of data types, adding new visitors is easy and doesn't require modifying any other code. On the other hand, adding new types is problematic since it means we have to update the ExprVisitor interface with a new abstract method, and consequently update all the visitors to implement it.

So it seems that we've just turned the expression problem on its side: we're using an OOP language, but now it's hard to add types and easy to add ops, just like in the functional approach. I find it extremely interesting that we can do this. In my eyes this highlights the power of different abstractions and paradigms, and how they enable us to rethink a problem in a completely different light.

So we haven't solved anything yet; we've just changed the nature of the problem we're facing. Worry not - this is just a stepping stone to an actual solution.

Extending the visitor pattern

The following are code excerpts from a C++ solution that follows the extended visitor pattern proposed by Krishnamurthi et al. in their paper; I strongly suggest reading the paper (particularly section 3) if you want to understand this code on a deep level. A complete code sample in C++ that compiles and runs is available here.

Adding new visitors (ops) with the visitor pattern is easy. Our challenge is to add a new type without upending too much existing code. Let's see how it's done.

One small design change that we should make to the original visitor pattern is to use virtual inheritance for Evaluator, for reasons that will soon become obvious:

class Evaluator : virtual public ExprVisitor {
  // .. the rest is the same
};

Now we're going to add a new type - FunctionCall:

// This is the new ("extended") expression we're adding.
class FunctionCall : public Expr {
public:
  FunctionCall(const std::string& name, const Expr& argument)
      : name_(name), argument_(argument) {}

  void Accept(ExprVisitor* visitor) const {
    ExprVisitorWithFunctionCall* v =
        dynamic_cast<ExprVisitorWithFunctionCall*>(visitor);
    if (v == nullptr) {
      std::cerr << "Fatal: visitor is not ExprVisitorWithFunctionCall\n";
      exit(1);
    }
    v->VisitFunctionCall(*this);
  }

private:
  std::string name_;
  const Expr& argument_;
};

Since we don't want to modify the existing visitors, we create a new one, extending Evaluator for function calls. But first, we need to extend the ExprVisitor interface to support the new type:

class ExprVisitorWithFunctionCall : virtual public ExprVisitor {
public:
  virtual void VisitFunctionCall(const FunctionCall& fc) = 0;
};

Finally, we write the new evaluator, which extends Evaluator and supports the new type:

class EvaluatorWithFunctionCall : public ExprVisitorWithFunctionCall,
                                  public Evaluator {
public:
  void VisitFunctionCall(const FunctionCall& fc) {
    std::cout << "Visiting FunctionCall!!\n";
  }
};

Multiple inheritance, virtual inheritance, dynamic type checking... that's pretty hard-core C++ we have to use here, but there's no choice. Unfortunately, multiple inheritance is the only way C++ lets us express the idea that a class implements some interface while at the same time deriving functionality from another class. What we want to have here is an evaluator (EvaluatorWithFunctionCall) that inherits all functionality from Evaluator, and also implements the ExprVisitorWithFunctionCall interface. In Java, we could say something like:

class EvaluatorWithFunctionCall extends Evaluator implements ExprVisitorWithFunctionCall {
  // ...
}

But in C++ virtual multiple inheritance is the tool we have. The virtual part of the inheritance is essential here for the compiler to figure out that the ExprVisitor base underlying both Evaluator and ExprVisitorWithFunctionCall is the same and should only appear once in EvaluatorWithFunctionCall. Without virtual, the compiler would complain that EvaluatorWithFunctionCall doesn't implement the ExprVisitor interface.

This is a solution, alright. We kinda added a new type FunctionCall and can now visit it without changing existing code (assuming the virtual inheritance was built into the design from the start to anticipate this approach). Here I am using this "kinda" word again... it's time to explain why.

This approach has multiple flaws, in my opinion:

  1. Note the dynamic_cast in FunctionCall::Accept. It's fairly ugly that we're forced to mix in dynamic checks into this code, which should supposedly rely on static typing and the compiler. But it's just a sign of a larger problem.
  2. If we have an instance of an Evaluator, it will no longer work on the whole extended expression tree since it has no understanding of FunctionCall. It's easy to say that all new evaluators should rather be EvaluatorWithFunctionCall, but we don't always control this. What about code that was already written? What about Evaluators created in third-party or library code over which we have no control?
  3. The virtual inheritance is not the only provision we have to build into the design to support this pattern. Some visitors would need to create new, recursive visitors to process complex expressions. But we can't anticipate in advance which dynamic type of visitor needs to be created. Therefore, the visitor interface should also accept a "visitor factory" which extended visitors will supply. I know this sounds complicated, and I don't want to spend more time on this here - but the Krishnamurthi paper addresses this issue extensively in section 3.4.
  4. Finally, the solution is unwieldy for realistic applications. Adding one new type looks manageable; what about adding 15 new types, gradually over time? Imagine the horrible zoo of ExprVisitor extensions and dynamic checks this would lead to.

Yeah, programming is hard. I could go on and on about the limitations of classical OOP and how they surface in this example [3]. Instead, I'll just present how the expression problem can be solved in a language that supports multiple dispatch and separates the definition of methods from the bodies of the types they act upon.

Solving the expression problem in Clojure

There are a number of ways the expression problem as displayed in this article can be solved in Clojure using the language's built-in features. Let's start with the simplest one - multimethods.

First we'll define the types as records:

(defrecord Constant [value])
(defrecord BinaryPlus [lhs rhs])

Then, we'll define evaluate as a multimethod that dispatches upon the type of its argument, and add method implementations for Constant and BinaryPlus:

(defmulti evaluate class)

(defmethod evaluate Constant
  [c] (:value c))

(defmethod evaluate BinaryPlus
  [bp] (+ (evaluate (:lhs bp)) (evaluate (:rhs bp))))

Now we can already evaluate expressions:

user=> (use 'expression.multimethod)
nil
user=> (evaluate (->BinaryPlus (->Constant 1.1) (->Constant 2.2)))
3.3000000000000003

Adding a new operation is easy. Let's add stringify:

(defmulti stringify class)

(defmethod stringify Constant
  [c] (str (:value c)))

(defmethod stringify BinaryPlus
  [bp]
  (clojure.string/join " + " [(stringify (:lhs bp))
                              (stringify (:rhs bp))]))

Testing it:

user=> (stringify (->BinaryPlus (->Constant 1.1) (->Constant 2.2)))
"1.1 + 2.2"

How about adding new types? Suppose we want to add FunctionCall. First, we'll define the new type. For simplicity, the func field of FunctionCall is just a Clojure function. In real code it could be some sort of function object in the language we're interpreting:

(defrecord FunctionCall [func argument])

And define how evaluate and stringify work for FunctionCall:

(defmethod evaluate FunctionCall
  [fc] ((:func fc) (evaluate (:argument fc))))

(defmethod stringify FunctionCall
  [fc] (str (clojure.repl/demunge (str (:func fc)))
            "("
            (stringify (:argument fc))
            ")"))

Let's take it for a spin (the full code is here):

user=> (def callexpr (->FunctionCall twice (->BinaryPlus (->Constant 1.1)
                                                         (->Constant 2.2))))
#'user/callexpr
user=> (evaluate callexpr)
6.6000000000000005
user=> (stringify callexpr)
"expression.multimethod/twice@52e29c38(1.1 + 2.2)"

It should be evident that the expression problem matrix for Clojure is:

Expression problem matrix in Clojure

We can add new ops without touching any existing code. We can also add new types without touching any existing code. The code we're adding is only the new code to handle the ops/types in question. The existing ops and types could come from a third-party library to which we don't have source access. We could still extend them for our new ops and types, without ever having to touch (or even see) the original source code [4].

Is multiple dispatch necessary to cleanly solve the expression problem?

I've written about multiple dispatch in Clojure before, and in the previous section we saw another example of how to use the language's defmulti/defmethod constructs. But is it multiple dispatch at all? No! It's just single dispatch, really. Our ops (evaluate and stringify) dispatch on a single argument - the expression type [5].

If we're not really using multiple dispatch, what is the secret sauce that lets us solve the expression problem so elegantly in Clojure? The answer is - open methods. Note a crucial difference between how methods are defined in C++/Java and in Clojure. In C++/Java, methods have to be part of a class and defined (or at least declared) in its body. You cannot add a method to a class without changing the class's source code.

In Clojure, you can. In fact, since data types and multimethods are orthogonal entities, this is by design. Methods simply live outside types - they are first-class citizens, rather than properties of types. We don't add methods to a type, we add new methods that act upon the type. This doesn't require modifying the type's code in any way (or even having access to its code).

Some of the other popular programming languages take a middle way. In languages like Python, Ruby and JavaScript, methods belong to types, but we can dynamically add, remove and replace methods in a class even after it was created. This technique is lovingly called monkey patching. While initially enticing, it can lead to big maintainability headaches unless we're very careful. Therefore, if I had to face the expression problem in Python I'd prefer to roll out some sort of multiple-dispatch mechanism for my program rather than rely on monkey patching.

Another Clojure solution - using protocols

Clojure's multimethods are very general and powerful. So general, in fact, that their performance may not be optimal for the most common case - which is single dispatch based on the type of the sole method argument; note that this is exactly the kind of dispatch I'm using in this article. Therefore, starting with Clojure 1.2, user code gained the ability to define and use protocols - a language feature that was previously restricted only to built-in types.

Protocols leverage the host platform's ability (which in Clojure's case means mostly the JVM) to provide quick virtual dispatch, so using them is a very efficient way to implement runtime polymorphism. In addition, protocols retain enough of the flexibility of multimethods to elegantly solve the expression problem. Curiously, this was on the minds of Clojure's designers right from the start. The Clojure documentation page about protocols lists this as one of their capabilities:

[...] Avoid the 'expression problem' by allowing independent extension of the set of types, protocols, and implementations of protocols on types, by different parties. [...] do so without wrappers/adapters

Clojure protocols are an interesting topic, and while I'd like to spend some more time on them, this article is becoming too long as it is. So I'll leave a more thorough treatment for some later time and for now will just show how protocols can also be used to solve the expression problem we're discussing.

The type definitions remain the same:

(defrecord Constant [value])
(defrecord BinaryPlus [lhs rhs])

However, instead of defining a multimethod for each operation, we now define a protocol. A protocol can be thought of as an interface in a language like Java, C++ or Go - a type implements an interface when it defines the set of methods declared by the interface. In this respect, Clojure's protocols are more like Go's interfaces than Java's, since we don't have to declare a priori which interfaces a type implements when we define it.

Let's start with the Evaluatable protocol, which consists of a single method - evaluate:

(defprotocol Evaluatable
  (evaluate [this]))

Another protocol we'll define is Stringable:

(defprotocol Stringable
  (stringify [this]))

Now we can make sure our types implement these protocols:

(extend-type Constant
  Evaluatable
    (evaluate [this] (:value this))
  Stringable
    (stringify [this] (str (:value this))))

(extend-type BinaryPlus
  Evaluatable
    (evaluate [this] (+ (evaluate (:lhs this)) (evaluate (:rhs this))))
  Stringable
    (stringify [this]
      (clojure.string/join " + " [(stringify (:lhs this))
                                  (stringify (:rhs this))])))

The extend-type macro is a convenience wrapper around the more general extend - it lets us implement multiple protocols for a given type. A sibling macro named extend-protocol lets us implement the same protocol for multiple types in the same invocation [6].

It's fairly obvious that adding new data types is easy - just as we did above, we simply use extend-type for each new data type to implement our current protocols. But how do we add a new protocol and make sure all existing data types implement it? Once again, it's easy because we don't have to modify any existing code. Here's a new protocol:

(defprotocol Serializable
  (serialize [this]))

And this is its implementation for the currently supported data types:

(extend-protocol Serializable
  Constant
    (serialize [this] [(type this) (:value this)])
  BinaryPlus
    (serialize [this] [(type this)
                       (serialize (:lhs this))
                       (serialize (:rhs this))]))

This time, since we're extending a single protocol for multiple data types, extend-protocol is the more convenient macro to use.

Small interfaces are extensibility-friendly

You may have noted that the protocols (interfaces) defined in the Clojure solution are very small - consisting of a single method. Since adding methods to an existing protocol is much more problematic (I'm not aware of a way to do this in Clojure), keeping protocols small is a good idea. This guideline comes up in other contexts as well; for example, it's good practice to keep interfaces in Go very minimal.

In our C++ solution, splitting the Expr interface could also be a good idea, but it wouldn't help us with the expression problem, since we can't modify which interfaces a class implements after we've defined it; in Clojure we can.
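For illustration, a C++ split along the same lines might look like the following sketch (the interface names mirror the Clojure protocols and are my own); the interfaces are nicely small, but the set of interfaces Constant implements is still frozen at the point of its definition:

class Evaluatable {
public:
  virtual double Eval() const = 0;
};

class Stringable {
public:
  virtual std::string ToString() const = 0;
};

// The list of implemented interfaces is baked into the class definition;
// client code can't later make Constant implement some new Serializable
// interface without editing this declaration.
class Constant : public Evaluatable, public Stringable {
  // ... as before
};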


[1]"Types of data" and "operations" are two terms that should be fairly obvious to modern-day programmers. Philip Wadler, in his discussion of the expression problem (see the "historical perspective" section of the article) calls them "datatypes" and "functions". A famous quote from Fred Brooks's The Mythical Man Month (1975) is "Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious."
[2]

Note the peculiar way in which data is passed between Visit* methods in an Expr* -> Value map kept in the visitor. This is due to our inability to make Visit* methods return different types in different visitors. For example, in Evaluator we'd want them to return double, but in Stringifier they'd probably return std::string. Unfortunately C++ won't let us easily mix templates and virtual functions, so we have to resort to either returning void* the C way or the method I'm using here.

Curiously, in their paper Krishnamurthi et al. run into the same issue in the dialect of Java they're using, and propose some language extensions to solve it. Philip Wadler uses proposed Java generics in his approach.

[3]I can't resist, so just in brief: IMHO inheritance is only good for a very narrow spectrum of uses, but languages like C++ hail it as the main extension mechanism of types. But inheritance is deeply flawed for many other use cases, such as implementations of interfaces. Java is a bit better in this regard, but in the end the primacy of classes and their "closed-ness" make a lot of tasks - like the expression problem - very difficult to express in a clean way.
[4]In fact, there are plenty of examples in which the Clojure implementation and the standard library provide protocols that can be extended by the user for user-defined types. Extending user-written protocols and multimethods for built-in types is trivial. As an exercise, add an evaluate implementation for java.lang.Long, so that built-in integers could participate in our expression trees without requiring wrapping in a Constant.
[5]FWIW, we can formulate a multiple dispatch solution to the expression problem in Clojure. The key idea is to dispatch on two things: type and operation. Just for fun, I coded a prototype that does this which you can see here. I think the approach presented in the article - each operation being its own multimethod - is preferable, though.
[6]The sharp-eyed reader will notice a cool connection to the expression problem matrix. extend-type can add a whole new row to the matrix, while extend-protocol adds a column. extend adds just a single cell.

Suffix arrays in the Go standard library



I have recently discovered - to my delight - that Go has an efficient implementation of suffix arrays right in the standard library - index/suffixarray.

A suffix array is an ingenious data structure that lets us take a large body of text (or any binary data, for that matter), preprocess it and then be able to find any substring in this text in logarithmic time. And the coolest thing is that a suffix array only requires O(n) space and can be constructed efficiently. For more details, turn to the Wikipedia page on Suffix Arrays - it's pretty good.

Like the rest of the Go standard library, index/suffixarray is decently documented. However, I couldn't find good examples, so I'll provide some in this post. The full, runnable code of these examples can be found here.

Building the suffix array and simple lookups

Let's start with the basics. suffixarray.New builds a new Index, which can then be used to efficiently look up substrings. One catch with suffixarray.New is that it just takes a single byte slice as the data. This suits many use cases, but sometimes we need some extra preprocessing.

A common task is having a bunch of words (or sentences), and finding which word(s) a certain substring is part of. For example:

words := []string{
    "banana",
    "apple",
    "pear",
    "tangerine",
    "orange",
    "lemon",
    "peach",
    "persimmon",
}

We can combine these words into a single []byte, delimiting them by a byte that appears in no valid word; \x00 is a good candidate. Then, we can create the suffix array:

data := []byte("\x00" + strings.Join(words, "\x00") + "\x00")
sa := suffixarray.New(data)

Now we can use the Lookup method to find all locations of some substring:

indices := sa.Lookup([]byte("an"), -1)
// indices is now: [4 2 31 20]

Lookup returns a slice of indices into the data buffer the suffix array was created with. The substrings can be found at these indices. Given our \x00 delimiters, it's fairly easy to reconstruct a word from an index:

func getStringFromIndex(data []byte, index int) string {
  var start, end int
  for i := index - 1; i >= 0; i-- {
    if data[i] == 0 {
      start = i + 1
      break
    }
  }
  for i := index + 1; i < len(data); i++ {
    if data[i] == 0 {
      end = i
      break
    }
  }
  return string(data[start:end])
}

Now we can get back the words from indices:

// Reconstruct matches from indices found by Lookup.
for _, idx := range indices {
    fmt.Println(getStringFromIndex(data, idx))
}

// prints:
//   banana
//   banana
//   orange
//   tangerine

Note that banana appears twice - this is because the substring an appears twice in the word.

FindAllIndex - flexibility with a big caveat

Another lookup method provided by a suffixarray is FindAllIndex, taking a *regexp.Regexp instead of a []byte. To be honest, I was initially stumped. Nothing I knew about suffix arrays involved matching arbitrary regular expressions! So I looked into the source of the package, and found reason for both calm and alarm. On one hand, there's no magic. On the other hand, there's no magic! Therefore, I would treat the FindAllIndex method with great suspicion, since it's potentially misleading.

What FindAllIndex does is first figure out whether the given regexp has a literal prefix. If it does, Lookup is used to find the prefix (efficiently) and then regexp matching is used to find the rest of the match starting from the found prefix. If there is no literal prefix, however, it just invokes the Regexp.FindAllIndex method, which scans the whole data linearly to find matches.

So while I can see how FindAllIndex can be convenient in some cases, I would advise a lot of caution when using it, lest you throw the whole premise of the suffix array out of the window. Why bother building the suffix array when you're going to search through it linearly anyway? Be careful to only use FindAllIndex when you know for sure that the regexp has a literal prefix.
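Here's a small sketch of the distinction, reusing the sa index built above (the exact regexps are only for illustration, and the regexp package is assumed to be imported):

// "pea" is a literal prefix, so FindAllIndex can use the index's Lookup to
// narrow the search before running the full regexp.
rePrefix := regexp.MustCompile(`pea(r|ch)`)
fmt.Println(sa.FindAllIndex(rePrefix, -1))

// No literal prefix here, so FindAllIndex falls back to scanning the whole
// data linearly with the regexp engine.
reNoPrefix := regexp.MustCompile(`[a-z]each`)
fmt.Println(sa.FindAllIndex(reNoPrefix, -1))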

Performance

So this is how the suffix array can be used. How well does it perform? I wrote a simple benchmark where I read the list of words from /usr/share/dict/words (on my Ubuntu 14.04 machine it has close to 100K words and a total size of ~1MB), and then compared suffix array lookups with regular (linear) lookups.
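The core of the comparison looks roughly like the sketch below (assuming sa, data and word are package-level variables in the test file - the index, the raw word data and the substring being searched for - and that bytes and testing are imported; the benchmark names are mine):

func BenchmarkSuffixArrayLookup(b *testing.B) {
    for i := 0; i < b.N; i++ {
        // Ask the index for (at most) one occurrence of the substring.
        sa.Lookup(word, 1)
    }
}

func BenchmarkLinearLookup(b *testing.B) {
    for i := 0; i < b.N; i++ {
        // Scan the data from the beginning until the first occurrence.
        bytes.Index(data, word)
    }
}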

For the purpose of comparison, I'm just looking for the first appearance of the substring in the whole data. When doing a linear search, it matters a great deal where the match ends up being; if it's the very first item in the list, the lookup is blazing fast; if it's the last, not so much.

With that in mind, I ran some measurements with different locations of the found substring (on a scale from 0.0 to 1.0, 0.0 being the very first word, 1.0 being the last, 0.5 the middle, etc.), as well as one substring that wasn't there at all (marked as n/a below). The times are averages produced by the Go benchmark tooling, per lookup.

Location    Linear time    Suffix array time
0.01        12 us          550 ns
0.25        271 us         514 ns
0.5         604 us         506 ns
0.75        775 us         497 ns
n/a         1 ms           409 ns

Note how the linear lookup time grows roughly linearly with the location, as expected. The suffix array lookup time appears to be independent of location, which is also expected for a given input size (there are small fluctuations having to do with the exact order in which the binary search visits the sorted array). In the average case (match found in the middle) for this size of input, the suffix array provides a ~1000x speedup.

It's also interesting to measure how long the index took building, and how much memory it consumes. Building the suffix array took about 250 ms, and its size in memory was less than 4 MB, which is pretty good for 1 MB of data.

Given that the construction time is non-trivial, there's a tradeoff when using suffix arrays. For the data I'm using, constructing the suffix array took as long as ~300 average lookups, and the cost of a lookup then becomes negligible. So it all depends on your data - how big it is (the bigger it gets, the bigger the win from the suffix array's logarithmic lookup time), how many lookups you anticipate performing, how the matches are distributed, and so on. In any case, a ready-made suffix array is an excellent tool to have in one's toolbox for tackling substring lookups. Suffix arrays are useful for all kinds of problems involving substrings, not just lookups. For example, they are great for efficiently computing longest common prefixes between collections of data. But this is a topic for another day.


On the Composite and Interpreter design patterns



I often see references to the interpreter design pattern in papers related to programming language design. This short post is here to help me remember what this pattern reference usually means, as well as document its relation to the composite design pattern.

The short Wikipedia definition of the interpreter design pattern is:

Given a language, define a representation for its grammar along with an interpreter that uses the representation to interpret sentences in the language.

On the page dedicated to the pattern, it also says:

The syntax tree of a sentence in the language is an instance of the composite pattern and is used to evaluate (interpret) the sentence for a client.

As a compiler hacker, all of this sounds very familiar. Indeed, if you've ever written an interpreter or compiler for a programming language or a domain-specific language - even a simple one - you've almost certainly used both the interpreter and composite patterns.

Suppose we have a very simple language for evaluating mathematical expressions, and we want to write an interpreter for it. Using the classical compiler work-flow we'll tokenize the language, parse it to produce a syntax tree and then either interpret this tree directly or compile it down to some lower-level representation. For the purpose of this post, we'll assume:

  1. Direct evaluation (interpretation) on the tree is used. A compiler would use exactly the same pattern, except that it would emit some sort of code instead of direct results.
  2. We don't care about how the tree is constructed, i.e. the syntax of the language. This post's code sample starts with the constructed syntax tree in memory and focuses on how it's represented and interpreted.

With this in mind, here's a simple C++ program that represents expressions and evaluates them. I'll show the code piecemeal to explain what it does; the full code sample is available here.

We'll start with an abstract interface called Expr which all syntax elements have to implement:

// Maps symbol names to their values. An expression is evaluated in the context
// of a symbol table, in order to assign concrete values to variables referenced
// within it.
typedef std::map<std::string, double> SymbolTable;

// Abstract interface for expressions in the language.
class Expr {
public:
  // Evaluate the expression in the context of the given symbol table, which
  // is to be used to resolve (or update) variable references.
  virtual double Eval(SymbolTable* st) const = 0;

  // Virtual destructor, since expressions are deleted through Expr pointers
  // (via std::unique_ptr<Expr> in main below).
  virtual ~Expr() = default;
};

And some simple expression kinds:

class Constant : public Expr {
public:
  Constant(double value) : value_(value) {}

  double Eval(SymbolTable* st) const {
    return value_;
  }

private:
  double value_;
};

class VarRef : public Expr {
public:
  VarRef(const char* varname) : varname_(varname) {}

  double Eval(SymbolTable* st) const {
    // Ignore errors: assuming the symbol is defined.
    return (*st)[varname_];
  }

private:
  std::string varname_;
};

Expressions such as constants and variable references are often called terminal, or leaf expressions, since they don't contain other expressions within them. Let's add a more complex, non-leaf expression:

// A function type for computing the result of a binary operation.
typedef std::function<double(double, double)> BinaryFunction;

class BinaryOp : public Expr {
public:
  BinaryOp(BinaryFunction func, const Expr& lhs, const Expr& rhs)
      : func_(func), lhs_(lhs), rhs_(rhs) {}

  double Eval(SymbolTable* st) const {
    return func_(lhs_.Eval(st), rhs_.Eval(st));
  }

private:
  BinaryFunction func_;
  const Expr& lhs_;
  const Expr& rhs_;
};

Note how BinaryOp implements the same interface as the leaf expressions. Its Eval defers to the Eval method of its constituent left-hand-side and right-hand-side expressions. This is an embodiment of the Composite design pattern, defined as:

[...] describes that a group of objects is to be treated in the same way as a single instance of an object. The intent of a composite is to "compose" objects into tree structures to represent part-whole hierarchies. Implementing the composite pattern lets clients treat individual objects and compositions uniformly.

In the language of the Composite pattern, there are leaf and composite classes, both of which are components. In our example, a Constant is a leaf, and so is a VarRef. A BinaryOp is a composite. Both inherit from Expr, which is the component.

The core of the composite pattern manifests here in the uniform interface (Expr) implemented by both Constant ("individual object" in the definition quoted above) and BinaryOp ("composition").

I'm not a big fan of UML, but since this is design patterns we're talking about, I couldn't help myself ;-) Here's our class diagram described in UML. Note the close conceptual resemblance to the UML diagram on the Composite Pattern Wikipedia page.

UML for the pattern in Expr

Finally, let us see these classes in action. Here's a main function that hand-assembles a simple expression and evaluates it. This is a toy for demonstration purposes; in a real program, the syntax tree would be built automatically, most likely by a parser.

int main(int argc, const char** argv) {
  // Define a couple of constants and a reference to the variable 'A'.
  std::unique_ptr<Expr> c1(new Constant(2.0));
  std::unique_ptr<Expr> c2(new Constant(3.3));
  std::unique_ptr<Expr> v(new VarRef("A"));

  // Define a binary expression representing "2.0 * 3.3 + A"
  std::unique_ptr<Expr> e1(new BinaryOp(std::multiplies<double>(), *c1, *c2));
  std::unique_ptr<Expr> e2(new BinaryOp(std::plus<double>(), *e1, *v));

  // Evaluate in the context of a symbol table where A has the value 1.1
  SymbolTable st{{"A", 1.1}};
  std::cout << e2->Eval(&st) << "\n";

  return 0;
}

The expression tree created by this code is:

Expression tree for 2.0 * 3.3 + A

It is then evaluated with the context of A = 1.1, and the result is 7.7, as expected.

Finally, I'll mention that while this example is very typical of a scenario in which I usually encounter these two patterns, it's by no means the only one.

The Composite pattern has life outside interpreters, of course. It's useful whenever a group of objects can be handled in a uniform manner as a single object. For example, in the world of graphics we may have shape objects that can be moved, rotated, and so on; we may want to treat a "group of shapes" similarly (move all shapes within it equally, rotate the group, etc). This calls for the use of the composite pattern where all shapes, as well as a "shape group" derive from a common component interface.
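To make this concrete, here's a minimal sketch of such a shape hierarchy (all the names are illustrative, not taken from any particular library, and ownership concerns are elided):

#include <vector>

// The component interface: anything that can be moved.
class Shape {
public:
  virtual void Move(double dx, double dy) = 0;
  virtual ~Shape() = default;
};

// A leaf shape.
class Circle : public Shape {
public:
  void Move(double dx, double dy) override {
    x_ += dx;
    y_ += dy;
  }

private:
  double x_ = 0.0, y_ = 0.0;
};

// The composite: a group of shapes is itself a Shape, so clients can move
// a single circle or a whole group through the same interface.
class ShapeGroup : public Shape {
public:
  void Add(Shape* shape) { shapes_.push_back(shape); }

  void Move(double dx, double dy) override {
    for (Shape* shape : shapes_) {
      shape->Move(dx, dy);
    }
  }

private:
  std::vector<Shape*> shapes_;
};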

The Interpreter pattern is useful whenever a problem can be described by a language of any sort. Some examples are SQL and other query languages, regular expressions, many kinds of rule-based systems, and so on.