Welcome


... to my place on the web - a personal website where I keep articles, freeware programs and code snippets I've written, and a weblog where I share my ideas and insights on programming, technology, books, and life in general.

To browse the website, use the site map on the right-hand side of the screen. Below, you will find the ten most recent posts.

Removing epsilon productions from context free grammars

February 8th, 2010 at 6:53 am

Background

epsilon productions are very useful to express many grammars in a compact way. For example, take these simple function call productions in some imaginary C-like language:

func_call:: identifier '(' arguments_opt ')'
arguments_opt:: arguments_list | eps
arguments_list:: argument | argument ',' arguments_list

When composing grammars by hand, simplicity matters. It’s very useful to be able to look at arguments_opt and know that it’s an optional list of arguments. The same non-terminal can be reused in several other productions.

However, epsilon productions pose a problem for several algorithms that act on grammars. Therefore, prior to running these algorithms, epsilon productions have to be removed. Fortunately, this can be done relatively effortlessly in an automatic way.

Here I want to present an algorithm and a simple implementation for epsilon production removal.

The algorithm

Intuitively, it’s quite simple to remove epsilon productions. Consider the grammar for function calls presented above. The argument_opt nonterminal in func_call is just a short way of saying that there either is an argument list inside those parens or nothing. In other words, it can be rewritten as follows:

func_call:: identifier '(' arguments_opt ')'
          | identifier '(' ')'
arguments_opt:: arguments_list
arguments_list:: argument | argument ',' arguments_list

This duplication of productions for func_call will have to be repeated for every other production that had arguments_opt in it. This grammar looks somewhat strange, as arguments_opt is now identical to arguments_list. It is correct, however.

A more interesting case occurs when the epsilon production is in a nonterminal that appears more than once in some other production [1]. Consider:

B:: A z A
A:: a | eps

When we remove the epsilon production from A, we have to duplicate the productions that have A in them, but the production for B has two A. Since either of the A instances in the production can be empty, the only proper way to do this is go over all the combinations:

B:: z | A z | z A | A z A
A:: a

In the general case, if A appears k times in some production, this production will be replicated 2^k times, each time with a different combination [2].

This leads us to the algorithm:

  1. Pick a nonterminal A with an epsilon production
  2. Remove that epsilon production
  3. For each production containing A: Replicate it 2^k times where k is the number of A instances in the production, such that all combinations of A being there or not will be represented.
  4. If there are still epsilon productions in the grammar, go back to step 1.

A couple of points to pay attention to:

  • It’s obvious that a step of the algorithm can create new epsilon productions [3]. This is handled correctly, as it works iteratively until all epsilon productions are removed.
  • The only place where an epsilon production cannot be removed is at the start symbol. If the grammar can generate an empty string, we can’t ruin that. A special case will have to handle this case.

Implementation

Here’s an implementation of this algorithm in Python:

from collections import defaultdict

class CFG(object):
    def __init__(self):
        self.prod = defaultdict(list)
        self.start = None

    def set_start_symbol(self, start):
        """ Set the start symbol of the grammar.
        """
        self.start = start

    def add_prod(self, lhs, rhs):
        """ Add production to the grammar. 'rhs' can
            be several productions separated by '|'.
            Each production is a sequence of symbols
            separated by whitespace.
            Empty strings are interpreted as an eps-production.

            Usage:
                grammar.add_prod('NT', 'VP PP')
                grammar.add_prod('Digit', '1|2|3|4')

                # Optional Digit: digit or eps
                grammar.add_prod('Digit_opt', Digit |')
        """
        # The internal data-structure representing productions.
        # maps a nonterminal name to a list of productions, each
        # a list of symbols. An empty list [] specifies an
        # eps-production.
        #
        prods = rhs.split('|')
        for prod in prods:
            self.prod[lhs].append(prod.split())

    def remove_eps_productions(self):
        """ Removes epsilon productions from the grammar.

            The algorithm:

            1. Pick a nonterminal p_eps with an epsilon production
            2. Remove that epsilon production
            3. For each production containing p_eps, replace it
               with several productions such that all the
               combinations of p_eps being there or not will be
               represented.
            4. If there are still epsilon productions in the
               grammar, go back to step 1

            The replication can be demonstrated with an example.
            Suppose that A contains an epsilon production, and
            we've found a production B:: [A, k, A]
            Then this production of B will be replaced with these:
            [A, k], [k], [k, A], [A, k, A]
        """
        while True:
            # Find an epsilon production
            #
            p_eps, index = self._find_eps_production()

            # No epsilon productions? Then we're done...
            #
            if p_eps is None:
                break

            # Remove the epsilon production
            #
            del self.prod[p_eps][index]

            # Now find all the productions that contain the
            # production that removed.
            # For each such production, replicate it with all
            # the combinations of the removed production.
            #
            for lhs in self.prod:
                prods = []

                for lhs_prod in self.prod[lhs]:
                    num_p_eps = lhs_prod.count(p_eps)
                    if num_p_eps == 0:
                        prods.append(lhs_prod)
                    else:
                        prods.extend(self._create_prod_combinations(
                            prod=lhs_prod,
                            nt=p_eps,
                            count=num_p_eps))

                # Remove duplicates
                #
                prods = sorted(prods)
                prods = [prods[i] for i in xrange(len(prods))
                                  if i == 0 or prods[i] != prods[i-1]]

                self.prod[lhs] = prods

    def _find_eps_production(self):
        """ Finds an epsilon production in the grammar. If such
            a production is found, returns the pair (lhs, index):
            the name of the non-terminal that has an epsilon
            production and its index in lhs's list of productions.
            If no epsilon productions were found, returns the
            pair (None, None).

            Note: eps productions in the start symbol will be
            ignored, because we don't want to remove them.
        """
        for lhs in self.prod:
            if not self.start is None and lhs == self.start:
                continue

            for i, p in enumerate(self.prod[lhs]):
                if len(p) == 0:
                    return lhs, i

        return None, None

    def _create_prod_combinations(self, prod, nt, count):
        """ prod:
                A production (list) that contains at least one
                instance of 'nt'
            nt:
                The non-terminal which should be replicated
            count:
                The amount of times 'nt' appears in 'lhs_prod'.
                Assumed to be >= 1

            Returns the generated list of productions.
        """
        # The combinations are a kind of a powerset. Membership
        # in a powerset can be checked by using the binary
        # representation of a number.
        # There are 2^count possibilities in total.
        #
        numset = 1 << count
        new_prods = []

        for i in xrange(numset):
            nth_nt = 0
            new_prod = []

            for s in prod:
                if s == nt:
                    if i & (1 << nth_nt):
                        new_prod.append(s)
                    nth_nt += 1
                else:
                    new_prod.append(s)

            new_prods.append(new_prod)

        return new_prods

And here are the results with some of the sample grammars presented earlier in the article:

cfg = CFG()
cfg.add_prod('identifier', '( arguments_opt )')
cfg.add_prod('arguments_opt', 'arguments_list | ')
cfg.add_prod('arguments_list', 'argument | argument , arguments_list')

cfg.remove_eps_productions()
for p in cfg.prod:
    print p, ':: ', [' '.join(pr) for pr in cfg.prod[p]]

Produces:

func_call ::  ['identifier ( )', 'identifier ( arguments_opt )']
arguments_list ::  ['argument', 'argument , arguments_list']
arguments_opt ::  ['arguments_list']

As expected. And:

cfg = CFG()
cfg.add_prod('B', 'A z A')
cfg.add_prod('A', 'a | ')

cfg.remove_eps_productions()
for p in cfg.prod:
    print p, ':: ', [' '.join(pr) for pr in cfg.prod[p]]

Produces:

A ::  ['a']
B ::  ['A z', 'A z A', 'z', 'z A']

The implementation isn’t tuned for efficiency, but for simplicity. Luckily, CFGs are usually small enough to make the runtime of this implementation manageable. Note that the preservation of epsilon productions in the start rule is implemented in the _find_eps_production method.

http://eli.thegreenplace.net/wp-content/uploads/hline.jpg
[1] From here on, lowercase letters early in the alphabet (a, b, c…) are terminals. Early uppercase letters (A, B, C…) are nonterminals, and letters late in the alphabet (z, y, x…) are arbitrary strings of terminals and nonterminals.
[2] If this sounds like generating a powerset, you’re right.
[3] Consider the productions:
A:: a | eps
B:: b | A

After removing the epsilon production from A we’ll have:

A:: a
B:: b | A | eps

Generating random sentences from a context free grammar

January 28th, 2010 at 3:13 pm

Sometimes it’s interesting to randomly generate a large amount of valid sentences from a context free grammar. The best use for this is automated stress-testing of parsers for those grammars [1].

So how would you generate sentences?

Simple recursive algorithm

The first approach that springs to mind is a simple recursive traversal of the grammar, choosing productions at random. Here’s the algorithm with some infrastructure:

class CFG(object):
    def __init__(self):
        self.prod = defaultdict(list)

    def add_prod(self, lhs, rhs):
        """ Add production to the grammar. 'rhs' can
            be several productions separated by '|'.
            Each production is a sequence of symbols
            separated by whitespace.

            Usage:
                grammar.add_prod('NT', 'VP PP')
                grammar.add_prod('Digit', '1|2|3|4')
        """
        prods = rhs.split('|')
        for prod in prods:
            self.prod[lhs].append(tuple(prod.split()))

    def gen_random(self, symbol):
        """ Generate a random sentence from the
            grammar, starting with the given
            symbol.
        """
        sentence = ''

        # select one production of this symbol randomly
        rand_prod = random.choice(self.prod[symbol])

        for sym in rand_prod:
            # for non-terminals, recurse
            if sym in self.prod:
                sentence += self.gen_random(sym)
            else:
                sentence += sym + ' '

        return sentence

CFG represents a context free grammar. It holds productions in the prod attribute, which is a dictionary mapping a symbol to a list of its possible productions. Each production is a tuple of symbols. A symbol can either be a terminal or a nonterminal. Those are distinguished as follows: nonterminals have entries in prod, terminals do not.

gen_random is a simple recursive algorithm for generating a sentence starting with the given grammar symbol. It selects one of the productions of symbols randomly and iterates through it, recursing into nonterminals and emitting terminals directly.

Here’s an example usage of the class with a very simple natural-language grammar:

cfg1 = CFG()
cfg1.add_prod('S', 'NP VP')
cfg1.add_prod('NP', 'Det N | Det N')
cfg1.add_prod('NP', 'I | he | she | Joe')
cfg1.add_prod('VP', 'V NP | VP')
cfg1.add_prod('Det', 'a | the | my | his')
cfg1.add_prod('N', 'elephant | cat | jeans | suit')
cfg1.add_prod('V', 'kicked | followed | shot')

for i in xrange(10):
    print cfg1.gen_random('S')

And here are some sample statements generated by it:

the suit kicked my suit
she followed she
she shot a jeans
he shot I
a elephant followed the suit
he followed he
he shot the jeans
his cat kicked his elephant
I followed Joe
a elephant shot Joe

The problem with the simple algorithm

Consider the following grammar:

cfgg = CFG()
cfgg.add_prod('S', 'S S S S | a')

It has a single nonterminal, S and a single terminal a. Trying to generate a random sentence from it sometimes results in a RuntimeError exception being thrown: maximum recursion depth exceeded while calling a Python object. Why is that?

Consider what happens when gen_random runs on this grammar. In the first call, it has a 50% chance of selecting the S S S S production and a 50% chance of selecting a. If S S S S is selected, the algorithm will now recurse 4 times into each S. The chance of all those calls resulting in a is just 0.0625, and there’s a 0.9375 chance that at least one will result in S and generate S S S S again. As this process continues, chances get slimmer and slimmer that the algorithm will ever terminate. This isn’t good.

You may now think that this is a contrived example and real-life grammars are more well-behaved. Unfortunately this isn’t the case. Consider this (rather ordinary) arithmetic expression grammar:

cfg2 = CFG()
cfg2.add_prod('EXPR', 'TERM + EXPR')
cfg2.add_prod('EXPR', 'TERM - EXPR')
cfg2.add_prod('EXPR', 'TERM')
cfg2.add_prod('TERM', 'FACTOR * TERM')
cfg2.add_prod('TERM', 'FACTOR / TERM')
cfg2.add_prod('TERM', 'FACTOR')
cfg2.add_prod('FACTOR', 'ID | NUM | ( EXPR )')
cfg2.add_prod('ID', 'x | y | z | w')
cfg2.add_prod('NUM', '0|1|2|3|4|5|6|7|8|9')

When I try to generate random sentences from it, less than 30% percent of the runs terminate [2].

The culprit here is the ( EXPR ) production of FACTOR. An expression can get expanded into several factors, each of which can once again result in a whole new expression. Just a couple of such derivations can be enough for the whole generation process to diverge. And there’s no real way to get rid of this, because ( EXPR ) is an essential derivation of FACTOR, allowing us to parse expressions like 5 * (1 + x).

Thus, even for real-world grammars, the simple recursive approach is an inadequate solution. [3]

An improved generator: convergence

We can employ a clever trick to make the generator always converge (in the mathematical sense). Think of the grammar as representing an infinite tree:

http://eli.thegreenplace.net/wp-content/uploads/2010/01/expr1.png

The bluish nodes represent nonterminals, and the greenish nodes represent possible productions. If we think of the grammar this way, it is obvious that the gen_random method presented earlier is a simple n-nary tree walk.

The idea of the algorithm is to attach weights to each possible production and select the production according to these weights. Once a production is selected, its weight is decreased and passed recursively down the tree. Therefore, once the generator runs into the same nonterminal and considers these productions again, there will be a lower chance for the same recursion to occur. A diagram shows this best:

http://eli.thegreenplace.net/wp-content/uploads/2010/01/expr2.png

Note that initially all the productions of expr have the same weight, and will be selected with equal probability. Once term - expr is selected, the algorithm takes note of this. When the same choice is presented again, the weight of term - expr is decreased by some factor (in this case by a factor of 2). Note that it can be selected once again, but then for the next round its weight will be 0.25. This of course only applies to the same tree branch. If term - expr is selected in some other, unrelated branch, its weight is unaffected by this selection.

This improvement solves the divergence problem of the naive recursive algorithm. Here’s its implementation (it’s a method of the same CFG class presented above):

def gen_random_convergent(self,
      symbol,
      cfactor=0.25,
      pcount=defaultdict(int)
  ):
  """ Generate a random sentence from the
      grammar, starting with the given symbol.

      Uses a convergent algorithm - productions
      that have already appeared in the
      derivation on each branch have a smaller
      chance to be selected.

      cfactor - controls how tight the
      convergence is. 0 < cfactor < 1.0

      pcount is used internally by the
      recursive calls to pass on the
      productions that have been used in the
      branch.
  """
  sentence = ''

  # The possible productions of this symbol are weighted
  # by their appearance in the branch that has led to this
  # symbol in the derivation
  #
  weights = []
  for prod in self.prod[symbol]:
      if prod in pcount:
          weights.append(cfactor ** (pcount[prod]))
      else:
          weights.append(1.0)

  rand_prod = self.prod[symbol][weighted_choice(weights)]

  # pcount is a single object (created in the first call to
  # this method) that's being passed around into recursive
  # calls to count how many times productions have been
  # used.
  # Before recursive calls the count is updated, and after
  # the sentence for this call is ready, it is rolled-back
  # to avoid modifying the parent's pcount.
  #
  pcount[rand_prod] += 1

  for sym in rand_prod:
      # for non-terminals, recurse
      if sym in self.prod:
          sentence += self.gen_random_convergent(
                              sym,
                              cfactor=cfactor,
                              pcount=pcount)
      else:
          sentence += sym + ' '

  # backtracking: clear the modification to pcount
  pcount[rand_prod] -= 1
  return sentence

An auxiliary weighted_choice function is used:

def weighted_choice(weights):
    rnd = random.random() * sum(weights)
    for i, w in enumerate(weights):
        rnd -= w
        if rnd < 0:
            return i

Note the cfactor parameter of the algorithm. This is the convergence factor - the probability by which a weight is multiplied each time it’s been selected. Having been selected N times, the weight becomes cfactor to the power N. I’ve plotted the average length of the generated sentence from the expression grammar as a function of cfactor:

http://eli.thegreenplace.net/wp-content/uploads/2010/01/cfactor_plot.png

As expected, the average length grows with cfactor. If we set cfactor to 1.0, this becomes the naive algorithm where all the productions are always of equal weight.

Conclusion

While the naive algorithm is suitable for some simplistic cases, for real-world grammars it’s inadequate. A generalization that employs weighted selection using a convergence factor provides a much better solution that generates sentences from grammars with guaranteed termination. This is a sound and relatively efficient method that can be used in real-world applications to generate complex random test cases for parsers.

http://eli.thegreenplace.net/wp-content/uploads/hline.jpg
[1] Another could be some rudimentary form of random text generation, although the resulting text isn’t very comprehensible. Markov chains are much better for this.
[2] A larger percentage can be achieved by increasing Python’s recursion depth limit, but this is not the point.
[3] Some algorithms, like this one by Randal Schwartz, assign fixed weights to each production. While it could be used to decrease the chances of divergence, it’s not a really good general solution for our problem. However, it works great for simple, non-recursive grammars like the one presented in his article.

Book review: “Matplotlib for Python developers” by Sandro Tosi

January 27th, 2010 at 4:32 pm

Disclaimer: this book (in electronic format) was provided to me for review by Packt publishing

matplotlib is the most popular plotting library for Python, and rightly so. It produces extremely high-quality plots suitable for publication, and thus, coupled with numpy and scipy is one of the major driving forces in the scientific Python community, which gets more and more active in the past few years.

The library has a comprehensive reference documentation, but few high-quality tutorials. This is the niche this book attempts to fill. It is divided into two main parts. The first (about 1/3 of the book) serves as a tutorial to matplotlib, presenting its various features in an increasing level of complexity. The second part consists of:

  • Tutorials on integrating matplotlib with the major GUI frameworks used for Python - there are chapters for GTK+, wxPython and PyQt. These topics are commonly sought by beginning Python programmers (as the logs of my blog clearly show).
  • A chapter about “matplotlib on the web”, which is somewhat useless in my opinion, because it teaches absolutely nothing new about matplotlib.
  • A chapter called “matplotlib in the real world” which is a hodgepodge of data munging and plotting examples, which is either useful or not, depending on your experience and needs.

The book could clearly use some editing of the English (the author is not a native speaker, which is fine, but means that the editors should have done a more thorough job). Also, it has a peculiar organization - sub-chapters and sections aren’t numbered, which is very unusual and confusing, and makes cross references impossible.

All in all, I can see how this book could be useful to some users. Mainly, I think, for scientists who don’t want to google everything and to wade through docs and tutorials and want everything in a single place. But for Python hackers seeking to just make some plots, I doubt it’s of great value. All the information is available online, and if you know how to look for it, there will be no trouble finding what you need, way faster than reading through this book.

Book review: “Linkers & Loaders” by John Levine

January 25th, 2010 at 6:27 am

I want to start with a short conclusion: if you’re seriously interested in compilation and operating systems, this an interesting book on a topic that isn’t frequently covered in books or tutorials. If you have any intentions of implementing low level system software that performs the tasks of, or at least cooperates with, linkers and loaders, this book is absolutely essential. It comes with a handy project that implements a simple linker and loader from scratch.

Now, that done, I have a very strong urge of making fun of the back cover of the book. You know what I’m talking about - those shiny, marketing-hype-full pages the book’s publisher places hoping to increase the book’s sales.

Here’s the “features” section:

  • Covers dynamic linking in Windows, UNIX, Linux, BeOS and other operating systems
  • Explains the Java linking model and how it figures in network applets and extensible Java code
  • Helps you write more elegant and effective code, and build applications that compile, load, and run more efficiently
  • Includes a linker construction project written in Perl, with project files available for download

The last item has a fat “WEB enhanced” image next to it.

Now, each time I see this I just can’t help thinking of grandpa Joe, roaming his town’s Barnes & Noble looking for a Christmas present for his geeky grandson Joe Jr. Jr., and picking this book because of such a sweet selection of totally irrelevant buzzwords.

This book is NOT about all those operating systems, it’s certainly NOT about Java, it will NOT help you write better code. What it is about is the nitty gritty low-level details of linkers and loaders, the almost unknown and non-glamorous blue-collar workers that are nevertheless essential for any compilation and program load. Yes, all those buzzwords are mentioned in the book (about once or twice at most), but it’s not what the book is about. There, Morgan-Kaufman publishers, I’ve taken this off my chest. Luckily, John Levine is a better writer than your salespeople.

Weighted random generation in Python

January 22nd, 2010 at 2:15 pm

A problem I frequently run into is to randomly select an element from some kind of container, with the chances of each element to be selected not being equal, but defined by relative "weights" (or probabilities). This is called weighted random selection [1].

Simple "linear" approach

The following is a simple function to implement weighted random selection in Python. Given a list of weights, it returns an index randomly, according to these weights [2].

For example, given [2, 3, 5] it returns 0 (the index of the first element) with probability 0.2, 1 with probability 0.3 and 2 with probability 0.5. The weights need not sum up to anything in particular, and can actually be arbitrary Python floating point numbers.

import random

def weighted_choice(weights):
    totals = []
    running_total = 0

    for w in weights:
        running_total += w
        totals.append(running_total)

    rnd = random.random() * running_total
    for i, total in enumerate(totals):
        if rnd < total:
            return i

Giving up the temporary list

It turns out that the temporary totals list isn’t required at all. Employing some ingenuity, we can keep track of where we are in the list of weights by subtracting the current weight from the total:

def weighted_choice_sub(weights):
    rnd = random.random() * sum(weights)
    for i, w in enumerate(weights):
        rnd -= w
        if rnd < 0:
            return i

This method is about twice as fast as the binary-search technique, although it has the same complexity overall. Building the temporary list of totals turns out to be a major part of the function’s runtime.

This approach has another interesting property. If we manage to sort the weights in descending order before passing them to weighted_choice_sub, it will run even faster, since the random call returns a uniformly distributed value and larger chunks of the total weight will be skipped in the beginning.

http://eli.thegreenplace.net/wp-content/uploads/2010/01/subweight.png

Indeed, when pre-sorted the runtime is further reduced by about 20%.

King of the hill

All the methods shown so far use the same technique - generate a random number between 0 and the sum of the weights, and find out into which "slice" it falls. Sometimes it’s also called the "roulette method".

There is a different approach however:

def weighted_choice_king(weights):
    total = 0
    winner = 0
    for i, w in enumerate(weights):
        total += w
        if random.random() * total < w:
            winner = i
    return winner

An interesting property of this algorithm is that you don’t need to know the amount of weights in advance in order to use it - so it could be used on some kind of stream.

Cool as this method is, it’s much slower than the others. I suspect this has something to do with Python’s random module not being too speedy, but it’s a fact. Even the simple linear approach is ~25% faster on long lists.

Preprocessing

In some cases you may want to select many random objects from the same weight distribution. In these cases, the totals list of weights can be precomputed and the binary-search approach will be very fast for just selecting the numbers. Something like this can be used:

class WeightedRandomGenerator(object):
    def __init__(self, weights):
        self.totals = []
        running_total = 0

        for w in weights:
            running_total += w
            self.totals.append(running_total)

    def next(self):
        rnd = random.random() * self.totals[-1]
        return bisect.bisect_right(self.totals, rnd)

    def __call__(self):
        return self.next()

As expected, for long lists this approach is about 100 times faster than calling weighted_random all the time. This is hardly a fair comparison, of course, but still a useful one to keep in mind when programming.

Conclusion

Here’s a graph comparing the performance of the various methods presented:

http://eli.thegreenplace.net/wp-content/uploads/2010/01/w_runtime.png

The subtraction method is the fastest when you need one-off selections with different weights. If you can manage to supply a pre-sorted weights list, all the better - you’ll have a nice performance gain.

However, if you need to generate many numbers from the same set of weights, it definitely pays to pre-compute the table of cumulative totals and then only use binary search for each call (the wrg method). This is by far the fastest approach.

Note: after the initial posting on 22.01, this article went through a major revision on 25.01, incorporating information provided in the comments as well as other methods and comparisons.

http://eli.thegreenplace.net/wp-content/uploads/hline.jpg
[1] Or weighted random choice. If you read about this topic online you’ll find that there’s selection with and without replacement. This is irrelevant for this post since I’m only talking about selecting a single element, not a bunch at a time.
[2] There are many variations on this theme. Some find it more useful to have a list of (object, weight) pairs, or a dict mapping objects to weights. The method I present is the most general, and can be readily re-used or modified for different representations of objects and weights.

Pointers to arrays in C

January 11th, 2010 at 6:33 am

Pointers are a great source of confusion in C - newbies have a hard time grasping them. But coupled with arrays, some of the semantics of pointers are complex enough to confound even more seasoned programmers.

Consider this code:

void test(int** p)
{
}

int main()
{
    int arr[] = {30, 450, 14, 5};
    test(&arr);
    return 0;
}

Take a moment to ponder - would you expect this code to compile cleanly?

gcc isn’t very happy about it, and issues a warning: passing arg 1 of test from incompatible pointer type. C++ has stricter type checking, so let’s try running the same code through g++. As expected, we get an error: cannot convert int (*)[4] to int** for argument 1 to void test(int**)

So what’s the problem here? What’s wrong with the code above? Well, everything. It’s simply invalid, and it makes no sense. Some would think it should work because this works:

void test(int* p)
{

}

int main()
{
    int arr[] = {30, 450, 14, 5};
    test(arr);
    return 0;
}

But this one works specifically because the C compilers should follow the C standard, which mandates that arrays "decay" into pointers when used as lvalues. Thus, a pointer to the array’s first element is actually passed to test and everything works.

But the first code snippet is different. While an array name may decay into a pointer, the address of the array does not decay into a pointer to a pointer. And why should it? What sense does it make to treat an array so?

Pointers to pointers are sometimes passed to modify the pointers (simple pointer arguments don’t work here because C passes by value, which would only allow to modify what’s pointed, not the pointer itself). Here’s some imaginary code (won’t compile):

void test(int** p)
{
    *p = malloc ... /* retarget '*p' */
}

int main()
{
    int arr[] = {30, 450, 14, 5};
    int* ptr;

    /* Fine!
    ** test will retarget ptr, and its new value
    ** will appear after this call.
    */
    test(&ptr);

    /* Makes no sense!
    ** You cannot retarget 'arr', since it's a
    ** constant label created by the compiler.
    */
    test(&arr);

    return 0;
}

Pointers to arrays

Note that the original code could be modified a little to make it work:

void test(int (*p)[4])
{
    (*p)[2] = 10;
}

int main()
{
    int arr[] = {30, 450, 14, 5};

    test(&arr);
    printf("%d\n", arr[2]);

    return 0;
}

What is that weird type test accepts now? Say hello to a "pointer to array", one of the useless features of C. This is what the C FAQ has to say about it:

2.12: How do I declare a pointer to an array?

Usually, you don’t want to. When people speak casually of a pointer to an array, they usually mean a pointer to its first element.

Truly, I can’t imagine why one would use a pointer to an array in real life. If you do a web search on the topic, most of what you find is people mistakingly calling the parameter of foo(int* p) "a pointer to array", which of course it isn’t. It looks to me like the whole concept is just an artifact of C’s declaration syntax.

While the test function from the previous snippet compiles and works, it isn’t of much use, since it’s much clearer to write:

void test(int* p)
{
    p[2] = 10;
}

...
...
/* then call */
test(arr);

The main use of pointers as function arguments is to either avoid passing whole structures by value, or to modify the object pointed by the pointers. Both are irrelevant needs for pointers to array. Here’s a clarifying snippet:

int joe[] = {1, 2, 3, 4};

void test(int (*p)[4])
{
    /* Fine: assign to an element through the
    ** pointer.
    */
    (*p)[2] = 10;

    /* Works, but won't be reflected in the
    ** caller since p was passed by value.
    */
    p = &joe;

    /* Error: arrays can't be assigned.
    */
    *p = joe;
}

Arrays are not passed by value anyway, so a pointer to an array is useless for this purpose. Neither can arrays be modified, so that kills the second reason.

Book review: “Coders at work” by Peter Seibel

January 9th, 2010 at 6:17 pm

This rather unusual book collects long interviews the author has conducted with 15 famous programmers, all of which have earned wide acclaim for their skills and contributions (Don Knuth, Peter Norvig, Simon Peyton Jones, Ken Thompson and Guy Steele are some of the best known ones).

Peter Seibel is a talented interviewer, which makes the book an interesting read. He asks insightful questions on interesting topics, and encourages the interviewees to provide long and thoughtful answers. While you can probably find various interviews with most of the people in this book online, the book collects them all together, answering good questions on similar topics, thus providing context of comparison between them.

Some of my favorite question topics were debugging techniques (it’s very interesting to hear how the luminaries work in this aspect), hiring, the design process and reading others’ code. On the other hand, some topics could be safely skipped - Seibel insisted on asking about literate programming and formal proofs, which are borderline topics at best, and indeed generated bland and un-interesting responses from the interviewees.

You have to be really into programming to enjoy this book. I personally think that it’s quite insightful to read how these people approach problems and see things, and compare your own views with theirs.

P.S. My favorite interviews were with Brad Fitzpatrick and Don Knuth.

Top-Down operator precedence parsing

January 2nd, 2010 at 5:08 pm

Introduction

Recursive-descent parsers have always interested me, and in the past year and a half I wrote a few articles on the topic. Here they are in chronological order:

The third article describes a method that combines RD parsing with a different algorithm for parsing expressions to achieve better results. This method is actually used in the real-world, for example in GCC and Parrot (source).

An alternative parsing algorithm was discovered by Vaughan Pratt in 1973. Called Top Down Operator Precedence, it shares some features with the modified RD parser, but promises to simplify the code, as well as provide better performance. Recently it was popularized again by Douglas Crockford in his article, and employed by him in JSLint to parse Javascript.

I encountered Crockford’s article in the Beautiful Code book, but found it hard to understand. I could follow the code, but had a hard time grasping why the thing works. Recently I became interested in the topic again, tried to read the article once more, and again was stumped. Finally, by reading Pratt’s original paper and Frederick Lundh’s excellent Python-based piece [1], I understood the algorithm.

So this article is my usual attempt to explain the topic to myself, making sure that when I forget how it works in a couple of months, I will have a simple way of remembering.

The fundamentals

Top down operator precedence parsing (TDOP from now on) is based on a few fundamental principles:

  • A "binding power" mechanism to handle precedence levels
  • A means of implementing different functionality of tokens depending on their position relative to their neighbors - prefix or infix.
  • As opposed to classic RD, where semantic actions are associated with grammar rules (BNF), TDOP associates them with tokens.

Binding power

Operator precedence and associativity is a fundamental topic to be handled by parsing techniques. TDOP handles this issue by assigning a "binding power" to each token it parses.

Consider a substring AEB where A takes a right argument, B a left, and E is an expression. Does E associate with A or with B? We define a numeric binding power for each operator. The operator with the higher binding power "wins" - gets E associated with it. Let’s examine the expression:

1 + 2 * 4

Here it is once again with A, E, B identified:

1 + 2 * 4
  ^ ^ ^
  A E B

If we want to express the convention of multiplication having a higher precedence than addition, let’s define the binding power (bp) of * to be 20 and that of + to be 10 (the numbers are arbitrary, what’s important is that bp(*) > bp(+)). Thus, by the definition we’ve made above, the 2 will be associated with *, since its binding power is higher than that of +.

Prefix and infix operators

To parse the traditional infix-notation expression languages [2], we have to differentiate between the prefix form and infix form of tokens. The best example is the minus operator (-). In its infix form it is subtraction:

a = b - c  # a is b minus c

In its prefix form, it is negation:

a = -b   # b has a's magnitude but an opposite sign

To accommodate this difference, TDOP allows for different treatment of tokens in prefix and infix contexts. In TDOP terminology the handler of a token as prefix is called nud (for "null denotation") and the handler of a token as infix is called led (for "left denotation").

The TDOP algorithm

Here’s a basic TDOP parser:

def expression(rbp=0):
    global token
    t = token
    token = next()
    left = t.nud()
    while rbp < token.lbp:
        t = token
        token = next()
        left = t.led(left)

    return left

class literal_token(object):
    def __init__(self, value):
        self.value = int(value)
    def nud(self):
        return self.value

class operator_add_token(object):
    lbp = 10
    def led(self, left):
        right = expression(10)
        return left + right

class operator_mul_token(object):
    lbp = 20
    def led(self, left):
        return left * expression(20)

class end_token(object):
    lbp = 0

We only have to augment it with some support code consisting of a simple tokenizer [3] and the parser driver:

import re
token_pat = re.compile("\s*(?:(\d+)|(.))")

def tokenize(program):
    for number, operator in token_pat.findall(program):
        if number:
            yield literal_token(number)
        elif operator == "+":
            yield operator_add_token()
        elif operator == "*":
            yield operator_mul_token()
        else:
            raise SyntaxError('unknown operator: %s', operator)
    yield end_token()

def parse(program):
    global token, next
    next = tokenize(program).next
    token = next()
    return expression()

And we have a complete parser and evaluator for arithmetic expressions involving addition and multiplication.

Now let’s figure out how it actually works. Note that the token classes have several attributes (not all classes have all kinds of attributes):

  • lbp - the left binding power of the operator. For an infix operator, it tells us how strongly the operator binds to the argument at its left.
  • nud - this is the prefix handler we talked about. In this simple parser it exists only for the literals (the numbers)
  • led - the infix handler.

The key to enlightenment here is to notice how the expression function works, and how the operator handlers call it, passing in a binding power.

When expression is called, it is provided the rbp - right binding power of the operator that called it. It consumes tokens until it meets a token whose left binding power is equal or lower than rbp. Specifically, it means that it collects all tokens that bind together before returning to the operator that called it.

Handlers of operators call expression to process their arguments, providing it with their binding power to make sure it gets just the right tokens from the input.

Let’s see, for example, how this parser handles the expression:

3 + 1 * 2 * 4 + 5

Here’s the call trace of the parser’s functions when parsing this expression:

<<expression with rbp 0>>
    <<literal nud = 3>>
    <<led of "+">>
    <<expression with rbp 10>>
       <<literal nud = 1>>
       <<led of "*">>
       <<expression with rbp 20>>
          <<literal nud = 2>>
       <<led of "*">>
       <<expression with rbp 20>>
          <<literal nud = 4>>
    <<led of "+">>
    <<expression with rbp 10>>
       <<literal nud = 5>>

The following diagram shows the calls made to expression on various recursion levels:

http://eli.thegreenplace.net/wp-content/uploads/2010/01/tdop_expr1.png

The arrows show the tokens on which each execution of expression works, and the number above them is the rbp given to expression for this call.

Apart from the initial call (with rbp=0) which spans the whole input, expression is called after each operator (by its led handler) to collect the right-side argument. As the diagram clearly shows, the binding power mechanism makes sure expression doesn’t go "too far" - only as far as the precedence of the invoking operator allows. The best place to see it is the long arrow after the first +, that collects all the multiplication terms (they must be grouped together due to the higher precedence of *) and returns before the second + is encountered (when the precedence ceases being higher than its rbp).

Another way to look at it: at any point in the execution of the parser, there’s an instance of expression for each precedence level that is active at that moment. This instance awaits the results of the higher-precedence instance and keeps going, until it has to stop itself and return its result to its caller.

If you understand this example, you understand TDOP parsing. All the rest is really just more of the same.

Enhancing the parser

The parser presented so far is very rudimentary, so let’s enhance it to be more realistic. First of all, what about unary operators?

As I’ve mentioned in the section on prefix and infix operators, TDOP makes an explicit distinction between the two, encoding it in the difference between the nud and led methods. Adding the subtraction operator handler [4]:

class operator_sub_token(object):
    lbp = 10
    def nud(self):
        return -expression(100)
    def led(self, left):
        return left - expression(10)

nud handles the unary (prefix) form of minus. It has no left argument (since it’s prefix), and it negates its right argument. The binding power passed into expression is high, since infix minus has a high precedence (higher than multiplication). led handles the infix case similarly to the handlers of + and *.

Now we can handle expressions like:

3 - 2 + 4 * -5

And get a correct result (-19).

How about right-associative operators? Let’s implement exponentiation (using the caret sign ^). To make the operation right-associative, we want the parser to treat subsequent exponentiation operators as sub-expressions of the first one. We can do that by calling expression in the handler of exponentiation with a rbp lower than the lbp of exponentiation:

class operator_pow_token(object):
    lbp = 30
    def led(self, left):
        return left ** expression(30 - 1)

When expression gets to the next ^ in its loop, it will find that still rbp < token.lbp and won’t return the result right away, but will collect the value of the sub-expression first.

And how about grouping with parentheses? Since each token can execute actions in TDOP, this can be easily handled by adding actions to the ( token.

class operator_lparen_token(object):
    lbp = 0
    def nud(self):
        expr = expression()
        match(operator_rparen_token)
        return expr

class operator_rparen_token(object):
    lbp = 0

Where match is the usual RD primitive:

def match(tok=None):
    global token
    if tok and tok != type(token):
        raise SyntaxError('Expected %s' % tok)
    token = next()

Note that ( has lbp=0, meaning that it doesn’t bind to any token on its left. It is treated as a prefix, and its nud collects an expression after the (, right until ) is encountered (which stops the expression parser since it also has lbp=0). Then it consumes the ) itself and returns the result of the expression [5].

Here’s the code for the complete parser, handling addition, subtraction, multiplication, division, exponentiation and grouping by parentheses:

import re

token_pat = re.compile("\s*(?:(\d+)|(.))")

def tokenize(program):
    for number, operator in token_pat.findall(program):
        if number:
            yield literal_token(number)
        elif operator == "+":
            yield operator_add_token()
        elif operator == "-":
            yield operator_sub_token()
        elif operator == "*":
            yield operator_mul_token()
        elif operator == "/":
            yield operator_div_token()
        elif operator == "^":
            yield operator_pow_token()
        elif operator == '(':
            yield operator_lparen_token()
        elif operator == ')':
            yield operator_rparen_token()
        else:
            raise SyntaxError('unknown operator: %s', operator)
    yield end_token()

def match(tok=None):
    global token
    if tok and tok != type(token):
        raise SyntaxError('Expected %s' % tok)
    token = next()

def parse(program):
    global token, next
    next = tokenize(program).next
    token = next()
    return expression()

def expression(rbp=0):
    global token
    t = token
    token = next()
    left = t.nud()
    while rbp < token.lbp:
        t = token
        token = next()
        left = t.led(left)
    return left

class literal_token(object):
    def __init__(self, value):
        self.value = int(value)
    def nud(self):
        return self.value

class operator_add_token(object):
    lbp = 10
    def nud(self):
        return expression(100)
    def led(self, left):
        right = expression(10)
        return left + right

class operator_sub_token(object):
    lbp = 10
    def nud(self):
        return -expression(100)
    def led(self, left):
        return left - expression(10)

class operator_mul_token(object):
    lbp = 20
    def led(self, left):
        return left * expression(20)

class operator_div_token(object):
    lbp = 20
    def led(self, left):
        return left / expression(20)

class operator_pow_token(object):
    lbp = 30
    def led(self, left):
        return left ** expression(30 - 1)

class operator_lparen_token(object):
    lbp = 0
    def nud(self):
        expr = expression()
        match(operator_rparen_token)
        return expr

class operator_rparen_token(object):
    lbp = 0

class end_token(object):
    lbp = 0

Sample usage:

>>> parse('3 * (2 + -4) ^ 4')
48

Closing words

When people consider parsing methods to implement, the debate usually goes between hand-coded RD parsers, auto-generated LL(k) parsers, or auto-generated LR parsers. TDOP is another alternative [6]. It’s an original and unusual parsing method that can handle complex grammars (not limited to expressions), relatively easy to code, and is quite fast.

What makes TDOP fast is that it doesn’t need deep recursive descents to parse expressions - only a couple of calls per token are required, no matter how the grammar looks. If you trace the token actions in the example parser I presented in this article, you’ll notice that on average, expression and one nud or led method are called per token, and that’s about it. Frederik Lundh compares the performance of TDOP with several other methods in his article, and gets very favorable results.

http://eli.thegreenplace.net/wp-content/uploads/hline.jpg
[1] Which is also the source for most of the code in this article - so the copyright is Fredrik Lundh’s
[2] Like C, Java, Python. An example of a language that doesn’t have infix notation is Lisp, which has prefix notation for expressions.
[3] This tokenizer just recognizes numbers and single-character operators.
[4] Note that to allow our parser actually recognize -, an appropriate dispatcher should be added to the tokenize function - this is left as an exercise to the reader.
[5] Quiz: is it useful having a led handler for a left paren as well? Hint: how would you implement function calls?
[6] By the way, I have no idea where to categorize it on the LL/LR scale? Any ideas?

Book review: “Memory management: Algorithms and Implementation in C/C++” by Bill Blunden

December 21st, 2009 at 6:33 pm

Memory allocation is a fascinating area, ripe in trade-offs and cutting-edge research. In this book, Bill Blunden manages to provide a pretty-good overview of the topic.

It begins with an introduction of the lowest levels - the hardware, namely the CPU memory management unit. Then it goes on to explain how operating systems manage memory - segmentation, paging, virtual memory and what’s between them. Next, memory is examined on the programming-language level - compiler-level and heap allocation mechanisms in Fortran, COBOL, Pascal, C and finally Java.

The second part of the book is the practice: the author implements several manual memory management schemes (own implementations of malloc/free) in C++, and compares them in terms of performance and other characteristics (like memory fragmentation). Finally, he implements a couple of simple garbage collectors (reference-counting, and mark-sweep), and in the last chapter of the book also briefly mentions the important topic of sub-allocators (also known as “pools” or “arenas”).

Overall, I enjoyed the book. But I do have a few points of (constructive) criticism. First of all, the book is a bit too conversational for such a technical work. It feels like a collection of blog posts, and thus also lacks in depth. For example, the section on memory management of Windows is quite disappointing. As much as I can admire the author’s attempt to show his exploration process armed by various tracing and monitoring tools, much of this information is well known and has been described. Instead, I would expect a more thorough presentation of the topic.

The other problem is the C++ code. C++ code in books is a pet peeve of mine - for some reason it tends to be exceptionally bad in most of them, and this book is no exception. I won’t go into examples because there are simply too many, so just a word of advice: if you intend to follow through this book actually implementing the code (always a good idea!) read about the algorithms the author describes, but write your own implementation. There’s nothing good to learn from the C++ code in this book, so you might as well get some more practice on your own.

Creating a tiny ‘Hello World’ executable in assembly

December 21st, 2009 at 6:53 am

By writing x86 assembly code and assembling it into a .COM file you can get very small executables. The .COM format, originated with 16-bit MS-DOS, is literally nothing except binary code. There’s no relocation information and no import tables, it is loaded by the OS at address 0×100 and starts executing from there. This executable format is still supported by modern Windows OSes, that run it inside a DOS emulator.

Here’s a basic ‘Hello, World!’:

org 100h
mov dx,msg
mov ah,9
int 21h
mov ah,4Ch
int 21h
msg db 'Hello, World!',0Dh,0Ah,'$'

The most interesting part of the code is the int 21h operations. These are software interrupts to call MS-DOS services. The code placed in the AH register prior to calling int 21h specifies the service to call into.

The first service is 9, which asks DOS to print the string pointed by DX into standard output, until a dollar ($) sign is encountered. This is what does the printing in this program. The second service is 4C, which asks DOS to terminate the program.

Assembling this code into a .COM file is easy with Nasm:

nasm -f bin helloworld.asm -o helloworld.com

This results in a 27-byte .COM file that executes and prints ‘Hello, World!’.