Some notes on POSIX regular expressions

Lately I've been digging a bit into POSIX regular expressions. My first experience with regular exressions was in year 2000 with Perl. Since then, Perl-compatible regular expressions (PCRE) is the norm in higher level programming languages and new libraries. My encounter with POSIX regular expressions was entirely accidental, while examining the LLVM Regex utilities. These turned out to be based on OpenBSD's implementation of POSIX regexes.

POSIX regexes are a representative example of standards in the Unix world and the programming domain in general. Although POSIX was devised as a set of standards aimed to resolve incompatibilities between earlier interfaces of programming libraries, even in the post-POSIX world actual interfaces and implementations have grown apart in some respects. Regular expressions seems to be one of them.

(xkcd tends to get it just right)

Some code to play with

Here's a simple C program that lets one experiment with POSIX regular expression matching:

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <regex.h>


void showargs(int argc, const char** argv) {
  printf("Program argc=%d\n", argc);
  for (int i = 0; i < argc; ++i)
    printf("[%d]: %s\n", i, argv[i]);
}


void report_regex_error(int errcode, const regex_t* preg) {
  const size_t BUFSIZE = 512;
  char buf[BUFSIZE];

  size_t errlen = regerror(errcode, preg, buf, BUFSIZE);
  assert(errlen > 0 && errlen < BUFSIZE);
  printf("regerror: %s\n", buf);
}


int main(int argc, const char* argv[])
{
  // Used here to make sure shell quotiong doesn't screw up with our
  // regexes.
  showargs(argc, argv);
  printf("--------\n");

  if (argc < 4) {
    printf("Usage: %s <regex_type> <regex> <string>\n", argv[0]);
    exit(1);
  }

  const char* arg_type = argv[1];
  const char* arg_regex = argv[2];
  const char* arg_string = argv[3];

  int extended = 0;
  if (!strcmp(arg_type, "extended"))
    extended = 1;
  else if (!strcmp(arg_type, "basic"))
    extended = 0;
  else {
    printf("Expected regex_type to be \"extended\" or \"basic\"\n");
    exit(1);
  }

  regex_t compiled_regex;
  const size_t max_groups = 9;
  size_t errcode;
  int regflags = 0;

  // Set up flags for regcomp
  if (extended)
    regflags |= REG_EXTENDED;

  // Compile the regex. Return code != 0 means an error.
  if ((errcode = regcomp(&compiled_regex, arg_regex, regflags))) {
    report_regex_error(errcode, &compiled_regex);
    exit(1);
  }

  regmatch_t match_groups[max_groups];
  if (regexec(&compiled_regex, arg_string,
              max_groups, match_groups, 0) == 0) {
    // Go over all matches. A match with rm_so = -1 signals the end
    for (size_t i = 0; i < max_groups; ++i) {
      if (match_groups[i].rm_so == -1)
        break;
      printf("Match group %zu: ", i);
      for (regoff_t p = match_groups[i].rm_so;
           p < match_groups[i].rm_eo; ++p) {
        putchar(arg_string[p]);
      }
      putchar('\n');
    }
  } else {
    printf("No match\n");
  }

  return 0;
}

And here's sample usage:

$ ./regex_sample extended "[a-z]+[0-9]{2,3}" abc123
Program argc=4
[0]: ./regex_sample
[1]: extended
[2]: [a-z]+[0-9]+
[3]: abc123
--------
Match group 0: abc123

The program prints back its command-line arguments because I wanted to be sure there's no clash with shell quoting and escaping rules. When dealing with regular expressions, quoting is always a problem.

You will notice that it accepts either "basic" or "extended" as its first argument. This is to specify the "flavor" of POSIX regular expressions. Basic Regular Expressions (BREs) and Extended Regular Expressions (EREs) are different in some important respects. I won't spell it all out here, there's plenty of material online (for example this Wikipedia page).

The above can also be expressed as a BRE:

$ ./regex_sample basic "[a-z][a-z]*[0-9]\{2,3\}" abc123
Program argc=4
[0]: ./regex_sample
[1]: basic
[2]: [a-z][a-z]*[0-9]\{2,3\}
[3]: abc123
--------
Match group 0: abc123

This example shows a couple of the differences between BREs and EREs. The "one or more" repetition specifier (+) is not supported in BREs, and some special characters like braces and grouping parens have to be escaped by backslashes. Note, however, that many libraries and tools added some "extensions" to either BRE or ERE to support features from the other one.

Backreferences

Backreferences are a controversial topic in regular expressions, possibly because they can not be matched efficiently in the general case. However, backreferences have been supported by tools like Perl for a long enough time to become commonly used and expected. There are some very common useful patterns made quite easy with backreferences and almost impossible without them.

And when it comes to backreferences, POSIX is a mess. Or at least the various POSIX implementations, if not the standard itself. The standard quite clearly dictates that while backreferences are supported in BREs, they are not supported in EREs.

But look at this output from my simple test program, compiled with gcc 4.6 on Ubuntu Linux 12.04:

$ ./regex_sample extended "([a-z]+)_\1" abc_abc
Program argc=4
[0]: ./regex_sample
[1]: extended
[2]: ([a-z]+)_\1
[3]: abc_abc
--------
Match group 0: abc_abc
Match group 1: abc

So EREs do support backreferences in Linux's implementation. Hmm, useful, although technically POSIX-incompliant.

Basic, extended, enhanced... oh my!

So, on Linux, the GNU regex implementation accepts backreferences in EREs. This is mentioned, for example, here:

POSIX ERE does not support backreferences. The GNU Extension adds them, using the same \1 through \9 syntax.

The various BSD systems take another view. According to the OpenBSD man page, only BREs support backrefs, not EREs. I wonder if this has to do with OpenBSD's insistence on security - after all the exponential explosion of matching time sometimes caused by backrefereces can be considered a security problem of a sort.

Other references, like this one, mention a third flavor of regexes - enhanced, which is like basic but with some features (including backreferences) added. I'm not entirely sure which exact implementation that reference talks about, though.

What's worse is that some sources wrongly state that it is EREs that support backreferences, and not BREs. I guess the sentiment is understandable - how can someone "enhance" something by removing, instead of adding, features. On the other hand, since backreferences may be so slow to match, some newer libraries (like re2) dropped support for them on purpose. And looking in the manuals of many tools and libraries, a warning is given with regards to using backrefs in regular expression. For example, this is what man grep has to say (in the "Known Bugs" section, no less...):

Back-references are very slow, and may require exponential time.

Any conclusion?

As it often happens, and as the xkcd drawing above nicely demonstrates, standards often start with good intentions but end up making the situation even more confusing. I will not say this is what happened with the POSIX regex standard. I'm too young to judge. Perhaps before POSIX did something in this respect, the situation was a total mess. Whatever the history was, the current situation still doesn't leave us with a single reliable "standard". Each implementation to its own, which of course makes a lot of code non-portable.

This is why, if portable regular expressions are important for your program or library, I would recommend to pick one implementation and carry it around, not relying on the system implementations which are supposed to follow a standard, but in practice don't. And while we're at it, don't pick a POSIX implementation. These tend to be old, crufty and poor in supported features.