Welcome


... to my place on the web - a personal website where I keep articles, freeware programs and code snippets I've written, and a weblog where I share my ideas and insights on programming, technology, books, and life in general.

To browse the website, use the site map on the right-hand side of the screen. Below, you will find the ten most recent posts.

My favorite freeware programs for Windows

March 10th, 2010 at 6:07 am

For a long time I wanted to sit and write down a list of all the great freeware programs I use or have used in the past on my Windows machine. So here it is.

The goal of this post is threefold:

  1. People often ask me about the tools I use – this page will be the answer I can point them to.
  2. To combat the effects of age on my memory – I don’t want to keep looking for hours "for that cool program I love and don’t remember its name" after the next reinstall.
  3. As a way to say thanks to the creators of these programs – kudos to all the diligent programmers who create great tools and release them for free (as in beer, and most of all, as in speech), and also kudos to companies that provide versions of their programs for free.

My main PC at home runs Windows XP Home edition (SP3). It also has MS Office 2007 and MS Visio 2007 installed – these are the only programs that I actually bought for this computer – the rest are free.

The list

Here’s the whole list, sorted alphabetically by program name.

  • 7-Zip – for manipulating all kinds of archives
  • Audacity – editing and recording audio files
  • AutoHotkey – great for automating various aspects of Windows
  • CDBurnerXP – burning CDs and DVDs
  • CDex – for easy ripping of your old music CDs into MP3
  • com0com – virtual serial port driver. Useful for developing applications that communicate on the serial port
  • CurrPorts – monitoring TCP/IP network connections
  • DLL Export Viewer – view a list of exported functions in a Windows DLL
  • Filezilla – FTP client and server
  • Firefox – is my main browser for about 5 years (since one of its first versions). I use other browsers for testing and web development (Chrome, Safari), but Firefox is still the main workhorse of my internet experience. Some of its plugins, like Firebug, are useful applications in their own right
  • Frhed – hex editor
  • Inkscape – for drawing. I wish I had the artistic talents necessary to use it really well
  • Irfanview – photo & image viewer
  • Keepass – for storing all your passwords securely in a single place
  • Launchy – once I’ve found it, I can’t live without it. An amazing productivity tool
  • MinGW – a native Windows port of GCC and related tools
  • µTorrent – a sweet little torrent client. I’m amazed by the amount of functionality packed into a single tiny executable
  • Paint.NET – a replacement of mspaint, capable of much more. I use it mainly for screenshot and image manipulation (cropping, resizing, stitching a few together, etc.)
  • Processexplorer – a powerful tool for examining what the various processes on your Windows box are up to
  • Putty – Telnet/SSH client (with its auxiliary tools like Pageant)
  • Python – quite obvious for the readers of my blog, but I just had to put it here. I use the ActivePython implementation of CPython
  • RapidEE – the ultimate viewer & editor for environment variables
  • RealVNC – a VNC client I use to remotely control my Asus EEE Linux laptop
  • Regscanner – great for finding stuff in the Windows registry quickly
  • Scite – a great text editor suitable for programming. It’s my main editor for 90% of my development needs both at home and at work, heavily scripted and extended with Lua and Python scripts
  • Shexview – Windows shell extensions management
  • SIW – System Information for Windows
  • Spyder – comes as a Python library, but I use it as a stand-alone application so it’s in this list. This is a great Python shell – I use it as a general-purpose calculator and for simple computations that don’t deserve a script of their own
  • SQLite – apart from its various programming language bindings, I also use the SQLite command-line tools to help me with development needs (debugging, updating and creating test DBs, etc.)
  • Startup Control Panel – a lightweight tool for managing the processes that are run on start-up
  • Sygate Personal Firewall – I still use version 5.5 and it works great for my firewalling needs
  • TortoiseSVN – Subversion client embedded into Windows Explorer
  • Treesize – very useful for finding out the sizes of various folders and files on your PC in a simple way
  • Truecrypt – a complete solution for encrypting anything from simple files to whole hidden hard-drive partitions
  • VirtualBox – great free alternative to VMWare. I use it to run an Ubuntu Linux image on top of my Windows PC – works like a charm
  • Virtual Magnifying Glass – for easy zooming of areas on the screen – very useful for any graphic work like GUI development
  • Visual Studio Express – nice of Microsoft to release a very functional version of their excellent Visual Studio IDE for free. Indispensable for serious C/C++ development for Windows
  • VLC – I just can’t believe people are still using the Windows media player with its codec hell. Once I installed VLC (years ago) I forgot about codecs and about any other problems with playing videos
  • Winamp – still using version 5.04 to listen to MP3s, seeing no reason to upgrade to anything else
  • Winmerge – graphical diff and merge tool
  • Wireshark – network protocol analyzer

The server-side Javascript meme

March 6th, 2010 at 5:29 pm

In the past few months a new meme seems to be very vibrant in the world of web-development: server-side Javascript. The idea certainly isn’t new – Netscape had offered JS on the server-side sometime in the 1990s. Since then, powerful dynamic languages like PHP, Python and Ruby along with Java and ASP have taken over the server-side development scene. So what has happened recently to stir the balance?

It’s hard to tell, I can only guess:

  • Javascript is growing in popularity as more and more powerful applications get written in it to run in the browser. This increases the amount of programmers interested in it, as well as the total amount of Javascript code out there in the wild.
  • A zoo of new and powerful implementations has appeared recently. A real arms race between Google’s V8, Mozilla’s TraceMonkey and WebKit’s Squirrelfish boosted the performance of Javascript engines considerably in the past couple of years.

The first reason, in particular, is very important. These days a web-developer has to constantly switch between two programming languages – Javascript for the client, and his server-side language of choice for the server. Many people started feeling that this is a useless split of attention, and using a single language both for the client and the server is a good idea [1].

Why it’s bad

First of all, further fragmentation of the server-side development community. Aren’t there enough ways to write web-applications as it stands? Multiple languages to choose from, multiple frameworks to use once you’ve chosen a language. Do we need another language & platform for server-side development? What’s so bad with having two tools, each suitable for its domain?

The other reason I see is more controversial. Javascript is far from being an ideal language. While it has a good core in it [2], there are also a lot of quirks and design faults. Oh, and there’s a ton of horrible code out there that just gets copy-pasted from project to project. I’ll stop here since I don’t want to turn this post into a language flame-war, but I’ll do say that I’d prefer coding in Python or Ruby over Javascript, and thus it’s disturbing that Javascript starting to take a share of the server-side cake.

Why it’s good

Just as the rise in popularity of JS in the browser brought us faster engines that make the internet experience more pleasant, I think that examining new approaches is generally healthy. There’s a lot of excitement around this issue lately – excitement that brings talented developers in, places them in the zone for productive hacking, and overall provides a fresh perspective on things.

For example, one of the most exciting tools in this scene is Node.js. It’s a Javascript interpreter built on top of V8 [3]. What’s really cool about Node.js is that it’s completely event-based, designed and tuned for asynchronous non-blocking I/O. Asynchronous non-blocking I/O seems to be all the rage recently, as it was found to be greatly enhancing the performance of servers, allowing to handle huge loads of concurrent connections within a single process, without using threads.

This approach isn’t new. Twisted is a venerable and powerful event I/O library for Python, for example. However, when you code with Twisted you have to be careful to use only very specific Python modules – use one that is blocking and you’ve ruined the whole model. Node.js, on the other hand, comes with its own standard library that was designed from the ground-up to be 100% non-blocking.

It’s only fair to note that in this area of event-based programming, Javascript actually has an advantage. The language’s simple built-in way of creating anonymous functions is very helpful in this aspect. As a matter of fact, everyone who’s written AJAX code and has used a library like jQuery (which these days covers a high percentage of serious JS developers), is already used to the event model of programming. AJAX is all about asynchronous events for which you register an anonymous callback function to handle the result when it arrives.

Node.js isn’t the only general-purpose implementation of Javascript suitable for server-side development. Mozilla’s Java-based Rhino is an alternative, and there are many others. As expected, we can see a lot of fragmentation between the communities since Javascript doesn’t even have a commonly accepted standard library. Recent projects like CommonJS come to fix this situation, however.

On yet another hand, making one process run as fast as possible is a lofty goal, but what about true parallel programming? Event loops such as the ones offered by Node.js don’t scale to multi-CPU machines, so I wonder how this situation will be handled. My bet would be on relatively independent processes communicating via message-passing and talking to mutual data stores via TCP.

Conclusion

These are exciting times. The vibrancy reminds me of the Rails hype of a few years ago. As I’ve mentioned, from my point of view this situation has its pros and cons, but the pros outweigh the cons. Being locked in narrow directions hasn’t been historically fruitful. Mutual enrichment between communities and arms races are a much more viable way to progress.

It will be interesting to see where we’ll be a year from now – I wager that many things are yet to change. In any case, if you’re looking for a spot in an exciting niche of programming where you can make a difference, it seems that server-side Javascript, especially with event-based models like Node.js, fits the bill.

http://eli.thegreenplace.net/wp-content/uploads/hline.jpg
[1] As always happens, the solutions come from several sides. Technologies like Silverlight allow running C# and Python in the client, while cool projects like Pyjamas compile Python to Javascript allowing to write client-side code in Python and running it natively in the browser.
[2] If you restrain yourself to the subset described in the "Javascript, the Good Parts" book. A review is due, I know…
[3] Which, by the way, was specifically designed by Google to serve as an embeddable engine.

Zipped dump of a SQLite database with Python

March 5th, 2010 at 12:59 pm

Suppose you manage some data in a SQLite DB within a Python application. How can you dump the DB into a SQL dump file? Better yet, how can you directly create a zipped dump file (dumps tend to be big, and since they’re SQL code, can be compressed very nicely).

Here’s the code:

import sqlite3, sys, zipfile

dbname = sys.argv[1] if len(sys.argv) > 1 else 'testdb.db'

# Open the db and dump all its data into the 'data' buffer
con = sqlite3.connect(dbname)
data = '\n'.join(con.iterdump())
con.close()

# Create a zip file and write add the dump into it as
# a new file
zf = zipfile.ZipFile('dump.zip',
                     mode='w',
                     compression=zipfile.ZIP_DEFLATED)
zf.writestr('dump.sql', data)
zf.close()

It will work with Python 2.6 and later, since the iterdump method of sqlite3 is only available since that version.

Note that the .zip file is created on the fly from a buffer, without a real dump.sql file being created on the disk.

Book review: “Nonzero: The Logic of Human Destiny” by Robert Wright

March 4th, 2010 at 8:27 pm

Robert Wright is the author of “The Moral Animal”, which is a great book on the topic of evolutionary psychology I’ve very much enjoyed to read a few years ago. As a result, I had quite high expectations from “Nonzero”. It’s probably because of these expectations that I was disappointed by the book. It’s not a bad book, but it just isn’t what I expected.

As far as I understand, what Wright tries to do is to lay out a unified theory linking cultural and biological evolution together, using game theory. Nonzero is his buzzword for non-zero games “played” by biological and cultural entities (such as the commonly (ab)used Prisoner’s Dilemma). The first part of the book deals with cultural evolution, examining historical societies on various stages of development (from hunter gatherers to modern times). The author demonstrates how non-zero-sum-game thinking can explain the behavior of these societies, both internal and external.

Next, the book tackles biological evolution. Here again, Wright uses the nonzero theory to explain how organisms interact, and even how genes inside organisms can interact.

I guess I’ve found this book disappointing because there was nothing new in it. The first (historical) part is much better covered by other authors, like Jared Diamond. There’s nothing particularly innovative in the application of game theory to understand the dynamics of societies throughout history. The second (biological) part is just a much smaller-scale (and worse, IMHO) retelling of some chapters from “The Moral Animal”.

So if you’ve never read any book on such topics before, “Nonzero” can be a good choice – it’s written quite well and has enough information to make you think. But if you’ve read other books on the subject, or plan to do a wider reading overall, I think it’s safe to stay away. Read “Guns, Germs and Steel” and “The Moral Animal” instead – it will take thrice as long, but the gain of information and insight is going to be at least tenfold.

A simple canvas-based Javascript game, with a Django back-end

February 24th, 2010 at 7:17 pm

A few years ago I’ve released a clone of the "Colorful lines" game (called GLines on Gnome), named Perlines. It still works quite well and is available from the Programs and code section of the website.

In the past couple of weeks I’ve implemented a web-based version the game, using Javascript and the HTML5 canvas element for the game itself, and a Django back-end server for storing high-scores in a centralized manner.

Here’s a screenshot:

http://eli.thegreenplace.net/wp-content/uploads/2010/02/linesshot.png

And the game itself is available for playing here. Note that due to its use of the canvas element, it doesn’t support Internet Explorer. I tested it in Firefox, Safari and Chrome.

Some observations on the development of the game:

  • It’s really the first time I’ve written a large amount of Javascript code. It was an interesting experience about which I have mixed feelings – so I think it deserves a separate post.
  • The HTML5 canvas element is really nice and works just like any other canvas / graphical device context in a GUI framework. It has nice performance and an intuitive API.
  • It’s also the first time I’ve actually done any serious AJAX. It turned out to be pretty straightforward on the client, with the generous help of jQuery. On the server side it’s even simpler.
  • I was surprised how easy it is to craft a simple DB-backed server for storing the high-scores using Django. It was probably the most effortless part of the project – Django really does make server development in Python ridiculously simple.

I will play with it myself for a few more days and then release all its source code to the public domain. Naturally, the client-side code is accessible even now by viewing the source of the page in your browser.

P.S. There seems to be at least one other Javascript clone of the lines game online, but it doesn’t use canvas and doesn’t have a global high-scores table.

Book review: “Shogun” by James Clavell

February 20th, 2010 at 2:30 pm

A short and informal review for this one.

Nice book, very well written. Lots of fun to read – the plot is gripping, the characters are well developed, and the historical background is fascinating. Japanese customs are very unusual, very interesting, very inspiring. Though they don’t always make sense, the author did a great job promoting mutual tolerance in conflicting issues. Not all western customs make sense either… Moreover, he excellently presents the mind-blowingly complex politics in which the various rulers of Japan were involved. It’s fun to read how smart one should’ve been to succeed in that world.

One gripe though – the book is too long, almost 1200 pages. Could’ve been shorter, although not by much.

Finding out the mouse click position on a canvas with Javascript

February 13th, 2010 at 3:06 pm

I’m playing around with the HTML5 <canvas> element. One interesting thing to do is interact with the canvas using the mouse.

First we attach the mouse click event to the canvas (suppose we have an HTML page with a canvas element named canvas, and Game is a global "namespace object"):

Game.canvas = document.getElementById('canvas');
Game.canvas.addEventListener('click', on_canvas_click, false);

Now, in the event handler function, we can play with the event object. It has the clientX and clientY properties for finding out where the mouse was clicked, but these give imprecise results!

The reason for that is that the canvas itself isn’t placed at 0,0 on the client area. Therefore, we have to subtract its offset from the event’s coordinates, like this:

function on_canvas_click(ev) {
    var x = ev.clientX - Game.canvas.offsetLeft;
    var y = ev.clientY - Game.canvas.offsetTop;

    // ... x,y are the click coordinates relative to the
    // canvas itself
}

Removing epsilon productions from context free grammars

February 8th, 2010 at 6:53 am

Background

epsilon productions are very useful to express many grammars in a compact way. For example, take these simple function call productions in some imaginary C-like language:

func_call:: identifier '(' arguments_opt ')'
arguments_opt:: arguments_list | eps
arguments_list:: argument | argument ',' arguments_list

When composing grammars by hand, simplicity matters. It’s very useful to be able to look at arguments_opt and know that it’s an optional list of arguments. The same non-terminal can be reused in several other productions.

However, epsilon productions pose a problem for several algorithms that act on grammars. Therefore, prior to running these algorithms, epsilon productions have to be removed. Fortunately, this can be done relatively effortlessly in an automatic way.

Here I want to present an algorithm and a simple implementation for epsilon production removal.

The algorithm

Intuitively, it’s quite simple to remove epsilon productions. Consider the grammar for function calls presented above. The argument_opt nonterminal in func_call is just a short way of saying that there either is an argument list inside those parens or nothing. In other words, it can be rewritten as follows:

func_call:: identifier '(' arguments_opt ')'
          | identifier '(' ')'
arguments_opt:: arguments_list
arguments_list:: argument | argument ',' arguments_list

This duplication of productions for func_call will have to be repeated for every other production that had arguments_opt in it. This grammar looks somewhat strange, as arguments_opt is now identical to arguments_list. It is correct, however.

A more interesting case occurs when the epsilon production is in a nonterminal that appears more than once in some other production [1]. Consider:

B:: A z A
A:: a | eps

When we remove the epsilon production from A, we have to duplicate the productions that have A in them, but the production for B has two A. Since either of the A instances in the production can be empty, the only proper way to do this is go over all the combinations:

B:: z | A z | z A | A z A
A:: a

In the general case, if A appears k times in some production, this production will be replicated 2^k times, each time with a different combination [2].

This leads us to the algorithm:

  1. Pick a nonterminal A with an epsilon production
  2. Remove that epsilon production
  3. For each production containing A: Replicate it 2^k times where k is the number of A instances in the production, such that all combinations of A being there or not will be represented.
  4. If there are still epsilon productions in the grammar, go back to step 1.

A couple of points to pay attention to:

  • It’s obvious that a step of the algorithm can create new epsilon productions [3]. This is handled correctly, as it works iteratively until all epsilon productions are removed.
  • The only place where an epsilon production cannot be removed is at the start symbol. If the grammar can generate an empty string, we can’t ruin that. A special case will have to handle this case.

Implementation

Here’s an implementation of this algorithm in Python:

from collections import defaultdict

class CFG(object):
    def __init__(self):
        self.prod = defaultdict(list)
        self.start = None

    def set_start_symbol(self, start):
        """ Set the start symbol of the grammar.
        """
        self.start = start

    def add_prod(self, lhs, rhs):
        """ Add production to the grammar. 'rhs' can
            be several productions separated by '|'.
            Each production is a sequence of symbols
            separated by whitespace.
            Empty strings are interpreted as an eps-production.

            Usage:
                grammar.add_prod('NT', 'VP PP')
                grammar.add_prod('Digit', '1|2|3|4')

                # Optional Digit: digit or eps
                grammar.add_prod('Digit_opt', Digit |')
        """
        # The internal data-structure representing productions.
        # maps a nonterminal name to a list of productions, each
        # a list of symbols. An empty list [] specifies an
        # eps-production.
        #
        prods = rhs.split('|')
        for prod in prods:
            self.prod[lhs].append(prod.split())

    def remove_eps_productions(self):
        """ Removes epsilon productions from the grammar.

            The algorithm:

            1. Pick a nonterminal p_eps with an epsilon production
            2. Remove that epsilon production
            3. For each production containing p_eps, replace it
               with several productions such that all the
               combinations of p_eps being there or not will be
               represented.
            4. If there are still epsilon productions in the
               grammar, go back to step 1

            The replication can be demonstrated with an example.
            Suppose that A contains an epsilon production, and
            we've found a production B:: [A, k, A]
            Then this production of B will be replaced with these:
            [A, k], [k], [k, A], [A, k, A]
        """
        while True:
            # Find an epsilon production
            #
            p_eps, index = self._find_eps_production()

            # No epsilon productions? Then we're done...
            #
            if p_eps is None:
                break

            # Remove the epsilon production
            #
            del self.prod[p_eps][index]

            # Now find all the productions that contain the
            # production that removed.
            # For each such production, replicate it with all
            # the combinations of the removed production.
            #
            for lhs in self.prod:
                prods = []

                for lhs_prod in self.prod[lhs]:
                    num_p_eps = lhs_prod.count(p_eps)
                    if num_p_eps == 0:
                        prods.append(lhs_prod)
                    else:
                        prods.extend(self._create_prod_combinations(
                            prod=lhs_prod,
                            nt=p_eps,
                            count=num_p_eps))

                # Remove duplicates
                #
                prods = sorted(prods)
                prods = [prods[i] for i in xrange(len(prods))
                                  if i == 0 or prods[i] != prods[i-1]]

                self.prod[lhs] = prods

    def _find_eps_production(self):
        """ Finds an epsilon production in the grammar. If such
            a production is found, returns the pair (lhs, index):
            the name of the non-terminal that has an epsilon
            production and its index in lhs's list of productions.
            If no epsilon productions were found, returns the
            pair (None, None).

            Note: eps productions in the start symbol will be
            ignored, because we don't want to remove them.
        """
        for lhs in self.prod:
            if not self.start is None and lhs == self.start:
                continue

            for i, p in enumerate(self.prod[lhs]):
                if len(p) == 0:
                    return lhs, i

        return None, None

    def _create_prod_combinations(self, prod, nt, count):
        """ prod:
                A production (list) that contains at least one
                instance of 'nt'
            nt:
                The non-terminal which should be replicated
            count:
                The amount of times 'nt' appears in 'lhs_prod'.
                Assumed to be >= 1

            Returns the generated list of productions.
        """
        # The combinations are a kind of a powerset. Membership
        # in a powerset can be checked by using the binary
        # representation of a number.
        # There are 2^count possibilities in total.
        #
        numset = 1 << count
        new_prods = []

        for i in xrange(numset):
            nth_nt = 0
            new_prod = []

            for s in prod:
                if s == nt:
                    if i & (1 << nth_nt):
                        new_prod.append(s)
                    nth_nt += 1
                else:
                    new_prod.append(s)

            new_prods.append(new_prod)

        return new_prods

And here are the results with some of the sample grammars presented earlier in the article:

cfg = CFG()
cfg.add_prod('identifier', '( arguments_opt )')
cfg.add_prod('arguments_opt', 'arguments_list | ')
cfg.add_prod('arguments_list', 'argument | argument , arguments_list')

cfg.remove_eps_productions()
for p in cfg.prod:
    print p, ':: ', [' '.join(pr) for pr in cfg.prod[p]]

Produces:

func_call ::  ['identifier ( )', 'identifier ( arguments_opt )']
arguments_list ::  ['argument', 'argument , arguments_list']
arguments_opt ::  ['arguments_list']

As expected. And:

cfg = CFG()
cfg.add_prod('B', 'A z A')
cfg.add_prod('A', 'a | ')

cfg.remove_eps_productions()
for p in cfg.prod:
    print p, ':: ', [' '.join(pr) for pr in cfg.prod[p]]

Produces:

A ::  ['a']
B ::  ['A z', 'A z A', 'z', 'z A']

The implementation isn’t tuned for efficiency, but for simplicity. Luckily, CFGs are usually small enough to make the runtime of this implementation manageable. Note that the preservation of epsilon productions in the start rule is implemented in the _find_eps_production method.

http://eli.thegreenplace.net/wp-content/uploads/hline.jpg
[1] From here on, lowercase letters early in the alphabet (a, b, c…) are terminals. Early uppercase letters (A, B, C…) are nonterminals, and letters late in the alphabet (z, y, x…) are arbitrary strings of terminals and nonterminals.
[2] If this sounds like generating a powerset, you’re right.
[3] Consider the productions:
A:: a | eps
B:: b | A

After removing the epsilon production from A we’ll have:

A:: a
B:: b | A | eps

Generating random sentences from a context free grammar

January 28th, 2010 at 3:13 pm

Sometimes it’s interesting to randomly generate a large amount of valid sentences from a context free grammar. The best use for this is automated stress-testing of parsers for those grammars [1].

So how would you generate sentences?

Simple recursive algorithm

The first approach that springs to mind is a simple recursive traversal of the grammar, choosing productions at random. Here’s the algorithm with some infrastructure:

class CFG(object):
    def __init__(self):
        self.prod = defaultdict(list)

    def add_prod(self, lhs, rhs):
        """ Add production to the grammar. 'rhs' can
            be several productions separated by '|'.
            Each production is a sequence of symbols
            separated by whitespace.

            Usage:
                grammar.add_prod('NT', 'VP PP')
                grammar.add_prod('Digit', '1|2|3|4')
        """
        prods = rhs.split('|')
        for prod in prods:
            self.prod[lhs].append(tuple(prod.split()))

    def gen_random(self, symbol):
        """ Generate a random sentence from the
            grammar, starting with the given
            symbol.
        """
        sentence = ''

        # select one production of this symbol randomly
        rand_prod = random.choice(self.prod[symbol])

        for sym in rand_prod:
            # for non-terminals, recurse
            if sym in self.prod:
                sentence += self.gen_random(sym)
            else:
                sentence += sym + ' '

        return sentence

CFG represents a context free grammar. It holds productions in the prod attribute, which is a dictionary mapping a symbol to a list of its possible productions. Each production is a tuple of symbols. A symbol can either be a terminal or a nonterminal. Those are distinguished as follows: nonterminals have entries in prod, terminals do not.

gen_random is a simple recursive algorithm for generating a sentence starting with the given grammar symbol. It selects one of the productions of symbols randomly and iterates through it, recursing into nonterminals and emitting terminals directly.

Here’s an example usage of the class with a very simple natural-language grammar:

cfg1 = CFG()
cfg1.add_prod('S', 'NP VP')
cfg1.add_prod('NP', 'Det N | Det N')
cfg1.add_prod('NP', 'I | he | she | Joe')
cfg1.add_prod('VP', 'V NP | VP')
cfg1.add_prod('Det', 'a | the | my | his')
cfg1.add_prod('N', 'elephant | cat | jeans | suit')
cfg1.add_prod('V', 'kicked | followed | shot')

for i in xrange(10):
    print cfg1.gen_random('S')

And here are some sample statements generated by it:

the suit kicked my suit
she followed she
she shot a jeans
he shot I
a elephant followed the suit
he followed he
he shot the jeans
his cat kicked his elephant
I followed Joe
a elephant shot Joe

The problem with the simple algorithm

Consider the following grammar:

cfgg = CFG()
cfgg.add_prod('S', 'S S S S | a')

It has a single nonterminal, S and a single terminal a. Trying to generate a random sentence from it sometimes results in a RuntimeError exception being thrown: maximum recursion depth exceeded while calling a Python object. Why is that?

Consider what happens when gen_random runs on this grammar. In the first call, it has a 50% chance of selecting the S S S S production and a 50% chance of selecting a. If S S S S is selected, the algorithm will now recurse 4 times into each S. The chance of all those calls resulting in a is just 0.0625, and there’s a 0.9375 chance that at least one will result in S and generate S S S S again. As this process continues, chances get slimmer and slimmer that the algorithm will ever terminate. This isn’t good.

You may now think that this is a contrived example and real-life grammars are more well-behaved. Unfortunately this isn’t the case. Consider this (rather ordinary) arithmetic expression grammar:

cfg2 = CFG()
cfg2.add_prod('EXPR', 'TERM + EXPR')
cfg2.add_prod('EXPR', 'TERM - EXPR')
cfg2.add_prod('EXPR', 'TERM')
cfg2.add_prod('TERM', 'FACTOR * TERM')
cfg2.add_prod('TERM', 'FACTOR / TERM')
cfg2.add_prod('TERM', 'FACTOR')
cfg2.add_prod('FACTOR', 'ID | NUM | ( EXPR )')
cfg2.add_prod('ID', 'x | y | z | w')
cfg2.add_prod('NUM', '0|1|2|3|4|5|6|7|8|9')

When I try to generate random sentences from it, less than 30% percent of the runs terminate [2].

The culprit here is the ( EXPR ) production of FACTOR. An expression can get expanded into several factors, each of which can once again result in a whole new expression. Just a couple of such derivations can be enough for the whole generation process to diverge. And there’s no real way to get rid of this, because ( EXPR ) is an essential derivation of FACTOR, allowing us to parse expressions like 5 * (1 + x).

Thus, even for real-world grammars, the simple recursive approach is an inadequate solution. [3]

An improved generator: convergence

We can employ a clever trick to make the generator always converge (in the mathematical sense). Think of the grammar as representing an infinite tree:

http://eli.thegreenplace.net/wp-content/uploads/2010/01/expr1.png

The bluish nodes represent nonterminals, and the greenish nodes represent possible productions. If we think of the grammar this way, it is obvious that the gen_random method presented earlier is a simple n-nary tree walk.

The idea of the algorithm is to attach weights to each possible production and select the production according to these weights. Once a production is selected, its weight is decreased and passed recursively down the tree. Therefore, once the generator runs into the same nonterminal and considers these productions again, there will be a lower chance for the same recursion to occur. A diagram shows this best:

http://eli.thegreenplace.net/wp-content/uploads/2010/01/expr2.png

Note that initially all the productions of expr have the same weight, and will be selected with equal probability. Once term - expr is selected, the algorithm takes note of this. When the same choice is presented again, the weight of term - expr is decreased by some factor (in this case by a factor of 2). Note that it can be selected once again, but then for the next round its weight will be 0.25. This of course only applies to the same tree branch. If term - expr is selected in some other, unrelated branch, its weight is unaffected by this selection.

This improvement solves the divergence problem of the naive recursive algorithm. Here’s its implementation (it’s a method of the same CFG class presented above):

def gen_random_convergent(self,
      symbol,
      cfactor=0.25,
      pcount=defaultdict(int)
  ):
  """ Generate a random sentence from the
      grammar, starting with the given symbol.

      Uses a convergent algorithm - productions
      that have already appeared in the
      derivation on each branch have a smaller
      chance to be selected.

      cfactor - controls how tight the
      convergence is. 0 < cfactor < 1.0

      pcount is used internally by the
      recursive calls to pass on the
      productions that have been used in the
      branch.
  """
  sentence = ''

  # The possible productions of this symbol are weighted
  # by their appearance in the branch that has led to this
  # symbol in the derivation
  #
  weights = []
  for prod in self.prod[symbol]:
      if prod in pcount:
          weights.append(cfactor ** (pcount[prod]))
      else:
          weights.append(1.0)

  rand_prod = self.prod[symbol][weighted_choice(weights)]

  # pcount is a single object (created in the first call to
  # this method) that's being passed around into recursive
  # calls to count how many times productions have been
  # used.
  # Before recursive calls the count is updated, and after
  # the sentence for this call is ready, it is rolled-back
  # to avoid modifying the parent's pcount.
  #
  pcount[rand_prod] += 1

  for sym in rand_prod:
      # for non-terminals, recurse
      if sym in self.prod:
          sentence += self.gen_random_convergent(
                              sym,
                              cfactor=cfactor,
                              pcount=pcount)
      else:
          sentence += sym + ' '

  # backtracking: clear the modification to pcount
  pcount[rand_prod] -= 1
  return sentence

An auxiliary weighted_choice function is used:

def weighted_choice(weights):
    rnd = random.random() * sum(weights)
    for i, w in enumerate(weights):
        rnd -= w
        if rnd < 0:
            return i

Note the cfactor parameter of the algorithm. This is the convergence factor – the probability by which a weight is multiplied each time it’s been selected. Having been selected N times, the weight becomes cfactor to the power N. I’ve plotted the average length of the generated sentence from the expression grammar as a function of cfactor:

http://eli.thegreenplace.net/wp-content/uploads/2010/01/cfactor_plot.png

As expected, the average length grows with cfactor. If we set cfactor to 1.0, this becomes the naive algorithm where all the productions are always of equal weight.

Conclusion

While the naive algorithm is suitable for some simplistic cases, for real-world grammars it’s inadequate. A generalization that employs weighted selection using a convergence factor provides a much better solution that generates sentences from grammars with guaranteed termination. This is a sound and relatively efficient method that can be used in real-world applications to generate complex random test cases for parsers.

http://eli.thegreenplace.net/wp-content/uploads/hline.jpg
[1] Another could be some rudimentary form of random text generation, although the resulting text isn’t very comprehensible. Markov chains are much better for this.
[2] A larger percentage can be achieved by increasing Python’s recursion depth limit, but this is not the point.
[3] Some algorithms, like this one by Randal Schwartz, assign fixed weights to each production. While it could be used to decrease the chances of divergence, it’s not a really good general solution for our problem. However, it works great for simple, non-recursive grammars like the one presented in his article.

Book review: “Matplotlib for Python developers” by Sandro Tosi

January 27th, 2010 at 4:32 pm

Disclaimer: this book (in electronic format) was provided to me for review by Packt publishing

matplotlib is the most popular plotting library for Python, and rightly so. It produces extremely high-quality plots suitable for publication, and thus, coupled with numpy and scipy is one of the major driving forces in the scientific Python community, which gets more and more active in the past few years.

The library has a comprehensive reference documentation, but few high-quality tutorials. This is the niche this book attempts to fill. It is divided into two main parts. The first (about 1/3 of the book) serves as a tutorial to matplotlib, presenting its various features in an increasing level of complexity. The second part consists of:

  • Tutorials on integrating matplotlib with the major GUI frameworks used for Python – there are chapters for GTK+, wxPython and PyQt. These topics are commonly sought by beginning Python programmers (as the logs of my blog clearly show).
  • A chapter about “matplotlib on the web”, which is somewhat useless in my opinion, because it teaches absolutely nothing new about matplotlib.
  • A chapter called “matplotlib in the real world” which is a hodgepodge of data munging and plotting examples, which is either useful or not, depending on your experience and needs.

The book could clearly use some editing of the English (the author is not a native speaker, which is fine, but means that the editors should have done a more thorough job). Also, it has a peculiar organization – sub-chapters and sections aren’t numbered, which is very unusual and confusing, and makes cross references impossible.

All in all, I can see how this book could be useful to some users. Mainly, I think, for scientists who don’t want to google everything and to wade through docs and tutorials and want everything in a single place. But for Python hackers seeking to just make some plots, I doubt it’s of great value. All the information is available online, and if you know how to look for it, there will be no trouble finding what you need, way faster than reading through this book.