The other day, while playing with a simple program involving randomness, I
noticed something strange: Python's `random.randint()` function feels quite
slow compared to other randomness-generating functions. Since
`randint()` is the canonical answer for "give me a random integer" in Python,
I decided to dig deeper to understand what's going on.

This is a brief post that dives into the implementation of the `random`
module, and discusses some alternative methods for generating pseudo-random
integers.

First, a basic benchmark (Python 3.6):

```
$ python3 -m timeit -s 'import random' 'random.random()'
10000000 loops, best of 3: 0.0523 usec per loop
$ python3 -m timeit -s 'import random' 'random.randint(0, 128)'
1000000 loops, best of 3: 1.09 usec per loop
```

Whoa! It's about 20x more expensive to generate a random integer in the range
`[0, 128]` than to generate a random float in the range `[0, 1)`. That's
pretty steep, indeed.

To understand why `randint()` is so slow, we'll have to dig into the Python
source. Let's start with `random()`
[1]. In `Lib/random.py`, the exported function `random` is an alias to the
`random` method of the class `Random`, which inherits this method directly
from `_Random`. This is the C companion defined in
`Modules/_randommodule.c`, and it defines its `random` method as follows:

```
static PyObject *
random_random(RandomObject *self, PyObject *Py_UNUSED(ignored))
{
    uint32_t a = genrand_int32(self) >> 5, b = genrand_int32(self) >> 6;
    return PyFloat_FromDouble((a * 67108864.0 + b) * (1.0 / 9007199254740992.0));
}
```

Here `genrand_int32` is defined directly above and implements a step of the
Mersenne Twister PRNG. All
in all, when we call `random.random()` in Python, the C function is directly
invoked and there's not much extra work done beyond converting the result of
`genrand_int32` to a floating point number in a line of C.
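For illustration, the same 53-bit construction can be reproduced in pure Python; this is just a sketch, using `getrandbits(32)` as a stand-in for the raw `genrand_int32` outputs:

```
import random

# Mimic random_random(): truncate two 32-bit draws to 27 and 26 bits,
# combine them into a 53-bit numerator, and scale into [0, 1).
# 67108864 == 2**26 and 9007199254740992 == 2**53, matching the C constants.
a = random.getrandbits(32) >> 5   # top 27 bits
b = random.getrandbits(32) >> 6   # top 26 bits
x = (a * 67108864.0 + b) * (1.0 / 9007199254740992.0)
assert 0.0 <= x < 1.0
```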

Now let's take a look at what `randint()` is up to:

```
def randint(self, a, b):
    """Return random integer in range [a, b], including both end points.
    """
    return self.randrange(a, b+1)
```

It calls `randrange`, fair enough. Here it is:

```
def randrange(self, start, stop=None, step=1, _int=int):
    """Choose a random item from range(start, stop[, step]).

    This fixes the problem with randint() which includes the
    endpoint; in Python this is usually not what you want.
    """
    # This code is a bit messy to make it fast for the
    # common case while still doing adequate error checking.
    istart = _int(start)
    if istart != start:
        raise ValueError("non-integer arg 1 for randrange()")
    if stop is None:
        if istart > 0:
            return self._randbelow(istart)
        raise ValueError("empty range for randrange()")

    # stop argument supplied.
    istop = _int(stop)
    if istop != stop:
        raise ValueError("non-integer stop for randrange()")
    width = istop - istart
    if step == 1 and width > 0:
        return istart + self._randbelow(width)
    if step == 1:
        raise ValueError("empty range for randrange() (%d,%d, %d)" % (istart, istop, width))

    # Non-unit step argument supplied.
    istep = _int(step)
    if istep != step:
        raise ValueError("non-integer step for randrange()")
    if istep > 0:
        n = (width + istep - 1) // istep
    elif istep < 0:
        n = (width + istep + 1) // istep
    else:
        raise ValueError("zero step for randrange()")

    if n <= 0:
        raise ValueError("empty range for randrange()")
    return istart + istep*self._randbelow(n)
```

That's quite a bit of case checking and setting up parameters before we get to
the next level. There are a couple of fast-path cases (for example, when the
`stop` parameter is not supplied, this function will be a bit faster),
but overall after a bunch of checking we get to call the `_randbelow()`
method.

By default, `_randbelow()` gets mapped to `_randbelow_with_getrandbits()`:

```
def _randbelow_with_getrandbits(self, n):
    "Return a random int in the range [0,n).  Raises ValueError if n==0."
    getrandbits = self.getrandbits
    k = n.bit_length()  # don't use (n-1) here because n can be 1
    r = getrandbits(k)  # 0 <= r < 2**k
    while r >= n:
        r = getrandbits(k)
    return r
```

Note that it does a couple more computations and can end up invoking
`getrandbits()` multiple times (especially when `n` is slightly above a power
of two, in which case almost half of the draws get rejected).
`getrandbits()` is implemented in C, but while it also ends up invoking the
PRNG step `genrand_int32()`, it's somewhat heavier than `random()` and runs
about twice as slowly.
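To see the retry cost concretely, here's a small sketch of the same rejection loop, instrumented with a call counter (the counter is my addition, not part of the stdlib):

```
import random

def randbelow_counting(n):
    """Rejection loop like _randbelow_with_getrandbits(), instrumented
    to count how many getrandbits() calls it makes."""
    k = n.bit_length()
    r = random.getrandbits(k)
    calls = 1
    while r >= n:
        r = random.getrandbits(k)
        calls += 1
    return r, calls

# n = 65 needs k = 7 bits, but only 65 of the 128 possible 7-bit values
# are accepted, so on average almost two PRNG calls are made per integer.
avg = sum(randbelow_counting(65)[1] for _ in range(10000)) / 10000
print(avg)  # close to 128/65, i.e. about 1.97
```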

In other words, there's a lot of Python and C code in the way to invoke the same
underlying C function. Since Python is bytecode-interpreted, all of this ends up
being quite a bit slower than simply calling the C function directly. A death
by a thousand cuts. To be fair, `randint()` is also more flexible in that it
can generate pseudo-random numbers of any size; that said, it's not very common
to need huge pseudo-random numbers, and our tests were with small numbers
anyway.

Here are a couple of experiments to help us test this hypothesis. First, let's try
to hit the fast-path we've seen above in `randrange`, by calling `randrange`
without a `stop` parameter:

```
$ python3 -m timeit -s 'import random' 'random.randrange(1)'
1000000 loops, best of 3: 0.784 usec per loop
```

As expected, the run-time is somewhat better than `randint`'s. Another
experiment is to rerun the comparison in PyPy, which is a JIT compiler that
should end up tracing through the Python code and generating efficient machine
code that strips a lot of abstractions.

```
$ pypy -m timeit -s 'import random' 'random.random()'
100000000 loops, best of 3: 0.0139 usec per loop
$ pypy -m timeit -s 'import random' 'random.randint(0, 128)'
100000000 loops, best of 3: 0.0168 usec per loop
```

As expected, the difference between these calls in PyPy is small.

## Faster methods for generating pseudo-random integers

So `randint()` turns out to be very slow. In most cases, no one cares; but
just occasionally, we need many random numbers - so what is there to do?

One tried and true trick is just using `random.random()` instead, multiplying
by our integer limit:

```
$ python3 -m timeit -s 'import random' 'int(128 * random.random())'
10000000 loops, best of 3: 0.193 usec per loop
```

This gives us pseudo-random integers in the range `[0, 128)`, much faster. One
word of caution: Python represents its floats in double-precision, with 53 bits
of accuracy. When the limit is above 53 bits, the numbers we'll be getting using
this method are not quite random - bits will be missing. This is rarely a
problem because we don't usually need such huge integers, but definitely
something to keep in mind [2].
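The experiment from footnote [2] is easy to reproduce; here's a quick sketch showing that with a multiplier of 2^54 this method only ever produces even numbers:

```
import random

# random.random() always returns k / 2**53 for some integer k, so
# 2**54 * random.random() == 2*k exactly -- the lowest bit is never set.
samples = [int(2**54 * random.random()) for _ in range(1000)]
assert all(s % 2 == 0 for s in samples)
```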

Another quick way to generate pseudo-random integers is to use `getrandbits()`
directly:

```
$ python3 -m timeit -s 'import random' 'random.getrandbits(7)'
10000000 loops, best of 3: 0.102 usec per loop
```

This method is fast but limited - it only naturally supports ranges that are
powers of two. If we want a different range, we can't just compute a modulo -
that would skew the distribution; instead, we have to use a rejection loop
similar to what `_randbelow_with_getrandbits()` does in the code shown
earlier. This slows things down, of course.
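The skew introduced by a naive modulo is easy to demonstrate; here's a quick sketch that reduces 7-bit draws modulo 100 and compares bucket frequencies:

```
import random
from collections import Counter

# getrandbits(7) % 100 maps 128 equally likely 7-bit values onto 100
# buckets: results 0..27 have two preimages each (r and r + 100), while
# 28..99 have only one - so small results come up about twice as often.
counts = Counter(random.getrandbits(7) % 100 for _ in range(100000))
ratio = counts[5] / counts[95]
print(ratio)  # close to 2.0
```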

Finally, we can turn away from the `random` module altogether, and use Numpy:

```
$ python3 -m timeit -s 'import numpy.random' 'numpy.random.randint(128)'
1000000 loops, best of 3: 1.21 usec per loop
```

Surprisingly, this is slow; that's because Numpy isn't great at working with
individual values - it prefers to amortize costs over large arrays created and
manipulated in C. To see this in action, let's see how long it takes to
generate 100 random integers:

```
$ python3 -m timeit -s 'import numpy.random' 'numpy.random.randint(128, size=100)'
1000000 loops, best of 3: 1.91 usec per loop
```

Only 60% slower than generating a single one! At about 0.019 usec per integer,
this is the fastest method *by far* - almost 3 times faster than even a bare
call to `random.random()`. The reason this method is so fast is that the Python
call overheads are amortized over all generated integers, and deep inside Numpy
runs an efficient C loop to generate them.
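One practical way to exploit this amortization even when the consuming code wants one integer at a time is to draw in bulk and hand values out from a buffer. The `RandintBuffer` helper below is hypothetical (not part of Numpy or the stdlib), just a sketch of the idea:

```
import numpy as np

class RandintBuffer:
    """Hand out pseudo-random ints one at a time from bulk Numpy draws,
    refilling the buffer in a single C-level call when it runs out.
    (A hypothetical helper, shown only to illustrate amortization.)"""
    def __init__(self, limit, bufsize=1000):
        self.limit = limit
        self.bufsize = bufsize
        self.buf = np.random.randint(limit, size=bufsize)
        self.i = 0

    def next(self):
        if self.i == self.bufsize:
            # Buffer exhausted: one bulk call replaces bufsize scalar calls.
            self.buf = np.random.randint(self.limit, size=self.bufsize)
            self.i = 0
        val = int(self.buf[self.i])
        self.i += 1
        return val

buf = RandintBuffer(128)
vals = [buf.next() for _ in range(2500)]  # triggers two refills
assert all(0 <= v < 128 for v in vals)
```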

To conclude: use Numpy if you need to generate large numbers of random integers; if you only generate them one at a time, it may not help as much (but then, how much do you care about performance, really?)

## Conclusion: performance vs. abstraction

In programming, performance and abstraction/flexibility are often at odds. By
making a certain function more flexible, we inevitably make it slower - and
`randint()` is a great practical example of this problem. 9 times out of 10
we don't care about the performance of these functions, but when we do, it's
useful to know what's going on and how to improve the situation.

In a way, pure Python code itself is one of the slowest abstractions we encounter, since every line gets translated to a bunch of bytecode that has to be interpreted at run-time.

To mitigate these effects, Python programmers who care about performance have many techniques at their disposal. Libraries like Numpy carefully move as much compute as possible to underlying C code; PyPy is a JIT compiler that can speed up most pure Python code sequences considerably. Numba is somewhere in between, while Cython lets us re-write chosen functions in a restricted subset of Python that can be efficiently compiled to machine code.

[1] From this point on, file names point to source files in the CPython
repository. Feel free to follow along on your machine or on Github.

[2] As an experiment, try to generate pseudo-random integers up to 2^54 using
this technique. You'll notice that only even numbers are generated! More
generally, the closer the multiplier is to machine precision, the less random
the result becomes. Knuth has an interesting discussion of this in volume 2 of
TAOCP - it has to do with unbalanced rounding that has to happen every time a
precision-limited float is multiplied by an integer. That said, if the
multiplier is much smaller than the precision, we'll be fine; for generating
numbers up to 2^40, say, the bad effects on the distribution will be
negligible.