Weighted random generation in Python

January 22nd, 2010 at 2:15 pm

Update (02.09.2013): This article was written with Python 2.6 in mind; Python 3.2 added the itertools.accumulate function, which provides a fast way to build an accumulated list and can be used to approach this problem efficiently. See the comments below for more information.

A problem I frequently run into is randomly selecting an element from some kind of container, where each element's chance of being selected is not equal, but defined by a relative "weight" (or probability). This is called weighted random selection [1].

Simple "linear" approach

The following is a simple function to implement weighted random selection in Python. Given a list of weights, it returns an index randomly, according to these weights [2].

For example, given [2, 3, 5] it returns 0 (the index of the first element) with probability 0.2, 1 with probability 0.3 and 2 with probability 0.5. The weights need not sum up to anything in particular, and can actually be arbitrary Python floating point numbers.

import random

def weighted_choice(weights):
    # build the list of cumulative totals; totals[-1] is the overall sum
    totals = []
    running_total = 0

    for w in weights:
        running_total += w
        totals.append(running_total)

    # pick a uniform point in [0, total) and find the slice it falls into
    rnd = random.random() * running_total
    for i, total in enumerate(totals):
        if rnd < total:
            return i
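
A quick sanity check (not part of the original post) that the empirical frequencies match the weights from the example above:

# a sanity check: with weights [2, 3, 5], roughly 20%/30%/50% of
# 10,000 draws should land on indices 0, 1 and 2 respectively
counts = [0, 0, 0]
for _ in range(10000):
    counts[weighted_choice([2, 3, 5])] += 1
print(counts)  # something like [2015, 2987, 4998]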

Giving up the temporary list

It turns out that the temporary totals list isn't required at all. Employing some ingenuity, we can keep track of where we are in the list of weights by subtracting each weight in turn from the random value; the index at which it drops below zero is the selected one:

def weighted_choice_sub(weights):
    rnd = random.random() * sum(weights)
    # subtract each weight in turn; the index where the random value
    # drops below zero is the selected slice
    for i, w in enumerate(weights):
        rnd -= w
        if rnd < 0:
            return i

This method is about twice as fast as the binary-search technique (building the same totals list, but searching it with the bisect module), although both have the same linear complexity overall. Building the temporary list of totals turns out to be a major part of the function's runtime.
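
For reference, here is a one-off version of that binary-search technique: the same totals list as above, with the final scan replaced by the standard library's bisect module (the function name is mine):

import bisect

def weighted_choice_bisect(weights):
    totals = []
    running_total = 0
    for w in weights:
        running_total += w
        totals.append(running_total)
    # totals is sorted by construction, so binary search finds the slice
    return bisect.bisect_right(totals, random.random() * running_total)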

The subtraction approach has another interesting property. If we manage to sort the weights in descending order before passing them to weighted_choice_sub, it will run even faster: the uniformly distributed random value is then more likely to fall into one of the large slices encountered at the beginning of the list, so the loop tends to exit early.

[Figure: runtime of weighted_choice_sub with pre-sorted vs. unsorted weights: http://eli.thegreenplace.net/wp-content/uploads/2010/01/subweight.png]

Indeed, with pre-sorted weights the runtime is further reduced by about 20%.
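
One caveat: sorting rearranges the weights, so the index returned by weighted_choice_sub then refers to the sorted order. A minimal sketch (the helper is mine, not part of the original benchmarks) that sorts once and maps each selection back to the original position:

def make_sorted_chooser(weights):
    # sort (original index, weight) pairs by weight, descending
    decorated = sorted(enumerate(weights), key=lambda p: p[1], reverse=True)
    indices = [i for i, w in decorated]
    sorted_weights = [w for i, w in decorated]
    def choose():
        # map the index in sorted order back to the original order
        return indices[weighted_choice_sub(sorted_weights)]
    return choose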

King of the hill

All the methods shown so far use the same technique: generate a random number between 0 and the sum of the weights, and find out into which "slice" it falls. This is sometimes called the "roulette wheel" method.

There is a different approach however:

def weighted_choice_king(weights):
    total = 0
    winner = 0
    for i, w in enumerate(weights):
        total += w
        # the i-th element dethrones the current winner
        # with probability w / total
        if random.random() * total < w:
            winner = i
    return winner

An interesting property of this algorithm is that you don't need to know the number of weights in advance in order to use it, so it could be used on some kind of stream. To see why it's fair: element i grabs the crown with probability w/total at its own step, and keeps it only if none of the later elements dethrones it; multiplying these probabilities, the running totals cancel out and leave exactly that element's weight divided by the final total.
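
For example, here is a sketch of feeding the algorithm weights from a generator (the file name is hypothetical), so the weights never have to be held in memory all at once:

def weights_from_file(path):
    # yield one weight per line, lazily
    with open(path) as f:
        for line in f:
            yield float(line)

winner = weighted_choice_king(weights_from_file('weights.txt'))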

Cool as this method is, it's much slower than the others; I suspect that's because it calls random.random() once per weight rather than once per selection, and Python's random module isn't particularly speedy. Even the simple linear approach is ~25% faster on long lists.

Preprocessing

In some cases you may want to select many random objects from the same weight distribution. In these cases, the totals list can be precomputed once, and the binary-search approach then makes each selection very fast. Something like this can be used:

import bisect
import random

class WeightedRandomGenerator(object):
    def __init__(self, weights):
        # precompute the cumulative totals once, up front
        self.totals = []
        running_total = 0

        for w in weights:
            running_total += w
            self.totals.append(running_total)

    def next(self):
        # each selection is just a binary search over the totals
        rnd = random.random() * self.totals[-1]
        return bisect.bisect_right(self.totals, rnd)

    def __call__(self):
        return self.next()
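
A quick usage sketch, assuming the weights from the earlier example:

wrg = WeightedRandomGenerator([2, 3, 5])
samples = [wrg() for _ in range(100000)]  # __call__ delegates to next()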

As expected, for long lists this approach is about 100 times faster than calling weighted_choice from scratch each time. This is hardly a fair comparison, of course, but still a useful one to keep in mind when programming.

Conclusion

Here’s a graph comparing the performance of the various methods presented:

[Figure: runtime comparison of the presented methods: http://eli.thegreenplace.net/wp-content/uploads/2010/01/w_runtime.png]

The subtraction method is the fastest when you need one-off selections with different weights. If you can manage to supply a pre-sorted weights list, all the better – you’ll have a nice performance gain.

However, if you need to generate many numbers from the same set of weights, it definitely pays to pre-compute the table of cumulative totals and then only use binary search for each call (the wrg method). This is by far the fastest approach.

Note: after the initial posting on 22.01, this article went through a major revision on 25.01, incorporating information provided in the comments as well as other methods and comparisons.

[1] Or weighted random choice. If you read about this topic online you’ll find that there’s selection with and without replacement. This is irrelevant for this post since I’m only talking about selecting a single element, not a bunch at a time.
[2] There are many variations on this theme. Some find it more useful to have a list of (object, weight) pairs, or a dict mapping objects to weights. The method I present is the most general, and can be readily re-used or modified for different representations of objects and weights.

19 Responses to “Weighted random generation in Python”

  1. Evan Friis Says:

    There is also a pretty handy way to do this with numpy:

    import numpy as np
    
    def weighted_choice(weights):
        totals = np.cumsum(weights)
        norm = totals[-1]
        throw = np.random.rand()*norm
        return np.searchsorted(totals, throw)
  2. Richard Tew Says:

    This should be in the ‘random’ standard library module. It’s just one of those things that you end up writing in the course of the development of a code base. I used it all over the place in EVE Online, from NPC spawning to loot generation.

  3. Simon Davy Says:

    Similar to "roulette wheel" selection, this is used a lot in genetic algorithms [1].

    In the first version, I think this would be better as it needs no extra list.

    from random import random

    def weighted_choice(weights):
        choice = random() * sum(weights)
        for i, w in enumerate(weights):
            choice -= w
            if choice < 0:
                return i

    If you can contrive to order the weights in descending order, then this would definitely be quicker I think.

    If you can prebuild the totals list, then the bisect is probably faster.

    Of course, if you can guarantee the weights are ints, you can build a simple expanded list and use a single random int lookup (see the sketch at the end of this comment).

    I think this may be covered in TAOCP – but then, what isn’t? :)

    [1] http://en.wikipedia.org/wiki/Fitness_proportionate_selection
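
    A minimal sketch of the integer-weights trick mentioned above (the helper name is mine; it assumes small non-negative integer weights, since memory grows with their sum):

    import random

    def make_int_weight_chooser(weights):
        # expand each index by its (integer) weight, then pick uniformly
        expanded = []
        for i, w in enumerate(weights):
            expanded.extend([i] * w)
        return lambda: random.choice(expanded)

    chooser = make_int_weight_chooser([2, 3, 5])  # 10 entries in total
    print(chooser())  # 0, 1 or 2 with probability 0.2, 0.3, 0.5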

  4. eliben Says:

    @Evan, interesting, thanks. The numpy version is a bit faster than the bisection method for very long lists, but for short lists it isn't very good. The method Simon presents is faster in all cases.

    @Simon, nice variation! It’s really fast – I think I’ll even add it to the article.

  5. denis Says:

    hi Eli,
    you might look at Walker's alias method (a sketch follows below), described in Knuth's GraphBase, p. 392, or
    http://code.activestate.com/recipes/576564-walkers-alias-method-for-random-objects-with-diffe/

    and keep up the good posts.
    (Care to start a Real Doc project, collecting the best sw doc in various categories?)

    cheers
    — denis
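
    A compact sketch of Walker's alias method, in Vose's formulation (an illustration under the usual assumptions of non-negative weights with a positive sum, not code from the linked recipe). Setup is O(n); each draw afterwards is O(1):

    import random

    def build_alias_table(weights):
        n = len(weights)
        total = float(sum(weights))
        scaled = [w * n / total for w in weights]
        small = [i for i, s in enumerate(scaled) if s < 1.0]
        large = [i for i, s in enumerate(scaled) if s >= 1.0]
        prob = [1.0] * n   # leftover entries keep probability 1.0
        alias = [0] * n
        while small and large:
            s, l = small.pop(), large.pop()
            prob[s] = scaled[s]
            alias[s] = l
            # the large entry donates what the small one lacks
            scaled[l] -= 1.0 - scaled[s]
            (small if scaled[l] < 1.0 else large).append(l)
        return prob, alias

    def alias_draw(prob, alias):
        # pick a column uniformly, then a biased coin flip inside it
        i = int(random.random() * len(prob))
        return i if random.random() < prob[i] else alias[i]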

  6. pronostic Says:

    Thanks for sharing this "Walker's alias method". Very interesting!

  7. Toukip Says:

    Very useful, thanks!!!

  8. Don Says:

    Have you created a histogram of your results to see if the distribution maps to the weights? It does not seem to work for me. I used a list of about 100 items with varying weights from 0-9 and then ran the selection in a loop, incrementing each weight's bucket with each selection. Given 10,000 selections, the distribution should certainly have the lower weights selected less often than the higher weights, and this is simply not the case.

  9. eliben Says:

    @Don,

    Which method are you referring to in particular? Can you link to the code you ran?

  10. Doug Says:

    I once used this algorithm in my Monte Carlo simulation for random text generation. Basically, each word has a list of succeeding words sorted by probability. Then I used the weighted random scheme described in your post to pick the succeeding word.

  11. Fuli Says:

    Thanks Eli!
    I created my own solution for this purpose, but yours are more elegant, so I'll use them instead.

  12. Mark Says:

    Good write-up. Here's a version of WeightedRandomGenerator written as a Python generator function rather than an explicit class. Also, a couple of examples of how to conveniently use the produced indices and check that the resulting selections are within the bounds of reasonable randomness for the given initial weights. Note: itemgetter will work with any Python object that has __getitem__. Lastly, I did use numpy's cumsum, but I didn't use searchsorted; np.cumsum is just so convenient for the running totals that I couldn't pass up using it.

    import numpy as np
    import itertools as it
    from collections import Counter
    import random, bisect
    from operator import itemgetter
    
    def wrg(wgts):
        totals = np.cumsum(wgts)
        wgtSum = totals[-1]
    
        # speed up namespace lookups
        ru01 = random.random
        bi_r  = bisect.bisect_right
    
        while True:
            yield bi_r(totals, ru01() * wgtSum)
    
    #
    # verify we're in the ballpark of the distribution
    #
    def displayCounts(container):
        cts = Counter(container)
        for c in cts:
            print c, cts[c]    
    
    if __name__ == "__main__":
        # using numpy indexing on numpy array
        colors = np.array(["yellow", "red", "green"])
        weights = [.8, .19, .01]
        selection = colors[list(it.islice(wrg(weights), 100))]
        print "using itemgetter:"
        displayCounts(selection)
    
        # using itemgetter + python list
        colors = ["yellow", "red", "green"]
        weights = [.8, .19, .01]
        selection = itemgetter(*list(it.islice(wrg(weights), 100)))(colors)
        print "using itemgetter:"
        displayCounts(selection)
  13. John Says:

    How about using a decent language if you are going to look at performance?

  14. eliben Says:

    John,

    Are you serious or just trolling? There are many reasons for choosing Python for certain projects. Once you've chosen it, however, it doesn't mean all your sorting has to be monkey sort and all your data structures singly-linked lists. There's nothing wrong with trying to produce optimal code within the domain you chose to operate in.

  15. Eweiwi Says:

    Interesting posts! I think I will go with Evan's algorithm, as I want to generate K random variables. If I selected Simon's, the performance would be quadratic; with Evan's routine, considering that the first two steps are computed only once, it would stay linear, I guess. Correct me if I am wrong!

  16. Alex Says:

    Thanks for this! I’ve used one of the solutions above in at least 2 different projects, and tend to find many solutions/guidance to python-related problems from this blog. Thanks again.

  17. cvdn Says:

    Thanks for your post! It was almost exactly what I was looking for – something to select from a weighted list with the chance of not selecting an item too. So I made this amendment:

    def weighted_choice_prob(weights):
        weights = list(weights)  # work on a copy; don't mutate the caller's list
        filler = sum(weights) < 100
        if filler:
            weights.append(100 - sum(weights))
        rnd = random.random() * 100
        for i, w in enumerate(weights):
            rnd -= w
            if rnd < 0:
                # landing in the filler slot (if any) means no item selected
                if filler and i == len(weights) - 1:
                    return None
                return i

    Now, there is a chance None will be returned instead of an index, representing no item being selected. This assumes the sum of the weights is less than or equal to 100.

  18. Serhiy Storchaka Says:

    > The subtraction method is the fastest when you need one-off selections with different weights.

    Just use fast accumulation, and binary search will be up to 3 times faster than the subtraction method.

    import random
    import bisect
    import itertools
    
    def weighted_choice_b2(weights):
        partsums = list(itertools.accumulate(weights))
        total = partsums[-1]
        rnd = random.random() * total
        return bisect.bisect_right(partsums, rnd)
  19. eliben Says:

    @Serhiy,

    Thanks. Note that this article is about Python 2.6, and was written well before Python 3.2 was released; itertools.accumulate is new in 3.2. I'll add a note saying that this is a faster method for newer Pythons.
