Weighted random generation in Python

January 22nd, 2010 at 2:15 pm

A problem I frequently run into is to randomly select an element from some kind of container, with the chances of each element to be selected not being equal, but defined by relative "weights" (or probabilities). This is called weighted random selection [1].

Simple "linear" approach

The following is a simple function to implement weighted random selection in Python. Given a list of weights, it returns an index randomly, according to these weights [2].

For example, given [2, 3, 5] it returns 0 (the index of the first element) with probability 0.2, 1 with probability 0.3 and 2 with probability 0.5. The weights need not sum up to anything in particular, and can actually be arbitrary Python floating point numbers.

import random

def weighted_choice(weights):
    totals = []
    running_total = 0

    for w in weights:
        running_total += w
        totals.append(running_total)

    rnd = random.random() * running_total
    for i, total in enumerate(totals):
        if rnd < total:
            return i

Giving up the temporary list

It turns out that the temporary totals list isn’t required at all. Employing some ingenuity, we can keep track of where we are in the list of weights by subtracting the current weight from the total:

def weighted_choice_sub(weights):
    rnd = random.random() * sum(weights)
    for i, w in enumerate(weights):
        rnd -= w
        if rnd < 0:
            return i

This method is about twice as fast as the binary-search technique, although it has the same complexity overall. Building the temporary list of totals turns out to be a major part of the function’s runtime.

This approach has another interesting property. If we manage to sort the weights in descending order before passing them to weighted_choice_sub, it will run even faster, since the random call returns a uniformly distributed value and larger chunks of the total weight will be skipped in the beginning.

http://eli.thegreenplace.net/wp-content/uploads/2010/01/subweight.png

Indeed, when pre-sorted the runtime is further reduced by about 20%.

King of the hill

All the methods shown so far use the same technique – generate a random number between 0 and the sum of the weights, and find out into which "slice" it falls. Sometimes it’s also called the "roulette method".

There is a different approach however:

def weighted_choice_king(weights):
    total = 0
    winner = 0
    for i, w in enumerate(weights):
        total += w
        if random.random() * total < w:
            winner = i
    return winner

An interesting property of this algorithm is that you don’t need to know the amount of weights in advance in order to use it – so it could be used on some kind of stream.

Cool as this method is, it’s much slower than the others. I suspect this has something to do with Python’s random module not being too speedy, but it’s a fact. Even the simple linear approach is ~25% faster on long lists.

Preprocessing

In some cases you may want to select many random objects from the same weight distribution. In these cases, the totals list of weights can be precomputed and the binary-search approach will be very fast for just selecting the numbers. Something like this can be used:

class WeightedRandomGenerator(object):
    def __init__(self, weights):
        self.totals = []
        running_total = 0

        for w in weights:
            running_total += w
            self.totals.append(running_total)

    def next(self):
        rnd = random.random() * self.totals[-1]
        return bisect.bisect_right(self.totals, rnd)

    def __call__(self):
        return self.next()

As expected, for long lists this approach is about 100 times faster than calling weighted_random all the time. This is hardly a fair comparison, of course, but still a useful one to keep in mind when programming.

Conclusion

Here’s a graph comparing the performance of the various methods presented:

http://eli.thegreenplace.net/wp-content/uploads/2010/01/w_runtime.png

The subtraction method is the fastest when you need one-off selections with different weights. If you can manage to supply a pre-sorted weights list, all the better – you’ll have a nice performance gain.

However, if you need to generate many numbers from the same set of weights, it definitely pays to pre-compute the table of cumulative totals and then only use binary search for each call (the wrg method). This is by far the fastest approach.

Note: after the initial posting on 22.01, this article went through a major revision on 25.01, incorporating information provided in the comments as well as other methods and comparisons.

http://eli.thegreenplace.net/wp-content/uploads/hline.jpg
[1] Or weighted random choice. If you read about this topic online you’ll find that there’s selection with and without replacement. This is irrelevant for this post since I’m only talking about selecting a single element, not a bunch at a time.
[2] There are many variations on this theme. Some find it more useful to have a list of (object, weight) pairs, or a dict mapping objects to weights. The method I present is the most general, and can be readily re-used or modified for different representations of objects and weights.

Related posts:

  1. Generating random sentences from a context free grammar
  2. Efficient integer exponentiation algorithms
  3. Horner’s rule: efficient evaluation of polynomials
  4. Rabin-Miller primality test implementation

6 Responses to “Weighted random generation in Python”

  1. Evan FriisNo Gravatar Says:

    There is also a pretty handy way to do this with numpy:

    import numpy as np
    
    def weighted_choice(weights):
        totals = np.cumsum(weights)
        norm = totals[-1]
        throw = np.random.rand()*norm
        return np.searchsorted(totals, throw)
  2. Richard TewNo Gravatar Says:

    This should be in the ‘random’ standard library module. It’s just one of those things that you end up writing in the course of the development of a code base. I used it all over the place in EVE Online, from NPC spawning to loot generation.

  3. Simon DavyNo Gravatar Says:

    Similar to a “roulette” select, it’s used a lot in genetic algorithms[1].

    In the first version, I think this would be better as it needs no extra list.

    choice = random() * sum(weights)
    for i, w in enumerate(weights):
        choice -= w
        if choice < 0:
            return i

    If you can contrive to order the weights in descending order, then this would definitely be quicker I think.

    If you can prebuild the totals list, then the bisect is probably faster.

    Of course, if you can guarantee the weights are ints you can build a simple list and use a single random int look up.

    I think this may be covered in TAOCP – but then, what isn’t? :)

    [1] http://en.wikipedia.org/wiki/Fitness_proportionate_selection

  4. elibenNo Gravatar Says:

    @Evan, interesting, thanks. The numpy version is a bit faster than the bisection method for very long lists, but for short lists it isn’t very good. The method Simon presents is faster in all cases

    @Simon, nice variation! It’s really fast – I think I’ll even add it to the article.

  5. denisNo Gravatar Says:

    hi Eli,
    you might look at Walker’s alias method, in Knuth Graphbase p. 392 or
    http://code.activestate.com/recipes/576564-walkers-alias-method-for-random-objects-with-diffe/

    and keep up the good posts.
    (Care to start a Real Doc project, collecting best sw doc in various categories ?)

    cheers
    — denis

  6. pronosticNo Gravatar Says:

    Thanks for sharing this “Walker’s alias method”. Very interesting !

Leave a Reply

To post code with preserved formatting, enclose it in `backticks` (even multiple lines)