Benchmarking utility for Python



Go programmers have it pretty good with the benchmarking capabilities provided by the standard library. Say we want to benchmark a dot product implementation:

func dotProduct(a, b []float32) float32 {
  var dot float32
  for i := range a {
    dot += a[i] * b[i]
  }
  return dot
}

All we need to do is add a Benchmark* function in our test file [1]:

const benchArrSize = 1000000

func BenchmarkDot(b *testing.B) {
  aa := make([]float32, benchArrSize)
  bb := make([]float32, benchArrSize)

  for b.Loop() {
    dotProduct(aa, bb)
  }
}

And then run the benchmarks:

$ go test -bench=.
goos: linux
goarch: amd64
pkg: example.com
cpu: 13th Gen Intel(R) Core(TM) i7-13700K
BenchmarkDot-24           3136            381506 ns/op
PASS
ok    example.com     1.199s

The benchmark runner has a number of interesting features, but in this post we'll just focus on the basics - timing how long a certain computation takes. The number of times to loop is selected automatically [2]. Any setup code that only needs to be run once per benchmark is simply written out before the loop.

Benchmarks in Python

How about Python?

The closest way to replicate the Go experience is using the command-line mode of the timeit module; suppose we have this code in dot.py:

def dotProductLoop(a, b):
    result = 0
    for i in range(len(a)):
        result += a[i] * b[i]
    return result

We can run:

$ python3 -m timeit -s "import dot; a = [1]*1000000; b = [2]*1000000" "dot.dotProductLoop(a, b)"
10 loops, best of 5: 20.7 msec per loop

Here the -s flag takes setup code, and the string passed as the positional argument at the end is the code to repeat.

This method works, but it has some issues, mostly because it requires running a separate Python process. The strings become unwieldy if non-trivial setup or benchmarked code is needed, it's difficult to share code between benchmarks, and so on. We could "metaprogram" this by storing the command-line invocations in a shell script (or another Python script that uses subprocess to invoke them), but that isn't a pleasant development experience, especially if we want to keep a set of benchmarks alongside our code and run them periodically as part of a release process.
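For illustration, such a driver might look something like this (the bench_cases list and its single entry are made-up placeholders, not code from this post):

import subprocess
import sys

# Hypothetical (setup, statement) pairs to benchmark - placeholders for illustration only.
bench_cases = [
    ("import dot; a = [1]*1000000; b = [2]*1000000", "dot.dotProductLoop(a, b)"),
]

for setup, stmt in bench_cases:
    # Each case runs in a fresh Python process via timeit's command-line mode.
    subprocess.run([sys.executable, "-m", "timeit", "-s", setup, stmt], check=True)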

The timeit module also has support for programmatic control - we can import it in a Python script and use the functions and classes it exports to write benchmarks more naturally in code.

Here's a basic example:

import dot
import timeit

a = [1] * 1000000
b = [2] * 1000000

N = 10
print(timeit.timeit("dot.dotProductLoop(a, b)", globals=globals(), number=N))

Over time Python added some niceties to the timeit function, such as the globals parameter - in the past it used to be more cumbersome!
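For a taste of the more cumbersome way: without globals, the setup has to be passed as a string that rebuilds the environment inside timeit's own namespace (here just mirroring the command-line example from earlier):

print(timeit.timeit(
    "dot.dotProductLoop(a, b)",
    setup="import dot; a = [1]*1000000; b = [2]*1000000",
    number=N,
))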

The timeit.timeit call simply runs the benchmarked code (still provided as a string, though there's also an option to pass a callable such as a lambda, with the caveat that the callable's own invocation time is counted in every loop) the specified number of times and returns the total runtime in seconds, printing out something like 0.2081897109. For the per-loop time, we have to divide the result by N.
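Just to illustrate the callable option mentioned above - a minimal sketch, with the lambda's call overhead included in each measured iteration:

total = timeit.timeit(lambda: dot.dotProductLoop(a, b), number=N)
print(total / N)  # per-loop time in seconds; the lambda call itself adds a bit of overhead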

We can get the "automatic range" functionality of the command-line invocation by using the timeit.Timer class explicitly, with its autorange method:

print(timeit.Timer("dot.dotProductLoop(a, b)", globals=globals()).autorange())

This prints out a tuple: how many loops were run, and the total execution time.

(10, 0.20622027607)
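To get a per-loop time out of that, we just divide; for example (a small sketch continuing the same script):

num, total = timeit.Timer("dot.dotProductLoop(a, b)", globals=globals()).autorange()
print(total / num)  # per-loop time in seconds, about 0.0206 for the run above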

Finally, if we're interested in repeating the test multiple times, we can use the repeat function:

print(timeit.repeat("dot.dotProductLoop(a, b)", globals=globals(), number=N, repeat=5))

With the output:

[0.2064882309, 0.20689259003, 0.2068122789, 0.2074350470, 0.20825179701]

Note that we're back to passing number=N explicitly - we've lost the autorange capability - and the times are reported separately for each repetition, requiring an additional step to find the minimum.
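The extra step itself is trivial, of course - something along these lines:

times = timeit.repeat("dot.dotProductLoop(a, b)", globals=globals(), number=N, repeat=5)
print(min(times) / N)  # best per-loop time across the repetitions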

How can we replicate the command-line invocation with the programmatic API? I couldn't find a way to combine the functionality of repeat with that of autorange using the standard knobs [3]; please drop me a note if I've missed something obvious.

A new utility function

Unable to find the combined functionality of the command-line invocation in the timeit module itself, I wrote this simple utility function:

def autobench(stmt, globals=None, repeat=5):
    # Find the number of iterations to run
    timer = timeit.Timer(stmt, globals=globals)
    num, _ = timer.autorange()
    raw_timings = timer.repeat(repeat=repeat, number=num)
    best = min(raw_timings)
    print(f"{num} loops, best of {repeat}: {best/num:.3f}s per loop")

Here's how to use it:

autobench("dot.dotProductLoop(a, b)", globals=globals())

And it prints output similar to the command-line invocation:

10 loops, best of 5: 0.021s per loop

The autobench functionality is very close to what the timeit module itself is doing in its command-line mode. The only thing it doesn't take care of is automatic unit scaling (e.g. reporting the result in ns, us or ms, depending on the actual loop duration).
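If unit scaling is desired, a small formatting helper can be bolted on; here's one possible sketch (the format_time name and thresholds are mine, not something timeit provides):

def format_time(seconds):
    # Pick the largest unit that keeps the value at or above 1;
    # a rough convention, not timeit's exact formatting.
    for unit, factor in (("s", 1.0), ("ms", 1e-3), ("us", 1e-6)):
        if seconds >= factor:
            return f"{seconds / factor:.3f} {unit}"
    return f"{seconds / 1e-9:.3f} ns"

With it, the print in autobench could become print(f"{num} loops, best of {repeat}: {format_time(best/num)} per loop").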

While there are several third-party modules that provide richer benchmarking capabilities for Python code, I find this autobench function to be sufficient for 99% of my needs.

Code

All the code for this post, along with some additional dot product implementation variants and a comparison of their performance using the methods described here, is available on GitHub.


[1] This makes use of the spiffy new B.Loop() functionality, fresh from the oven in Go 1.24!
[2] Benchmarks may have drastically different runtimes, so this is a very important usability feature. For very short benchmarks we typically want to run the loop many times to have a good average; for long benchmarks, running many iterations is unnecessary and inconvenient. The runner starts by invoking the loop a single time, and keeps increasing the count until a "reasonable" total duration is reached (typically something like 1 second in Go, though this can be controlled with flags). Incidentally, this takes care of "warmup" factors that may be important for very quick pieces of code.
[3] By the way, the Go approach doesn't repeat by default, but we can ask for it by passing the -count flag when running the benchmarks. In this mode, each run is reported separately, and Go devs typically use the benchstat tool to aggregate these numbers with statistical analysis and comparisons.
