Faster sorting with Go generics - Eli Bendersky's website

Recently, a new set of sorting functions has landed in Go's golang.org/x/exp/slices package [1]. These functions leverage Go generics to provide a more ergonomic API for sorting (without requiring users to implement sort.Interface), and also deliver a nice performance improvement, as the CL demonstrates.

In this post, I'll dive deep into why these generic functions are faster than the existing ones in the sort package, even though they use precisely the same algorithm and loop structure. This should hopefully be an interesting peek into how Go generics are implemented, in comparison to the existing dynamic dispatch mechanism (interfaces).

To decouple our discussion from the messy specifics of the Go standard library and experimental repositories, and to be able to focus on a smaller, simpler code sample, I've reimplemented this comparison on a simpler sorting algorithm, where the performance difference also manifests [2].

Bubble sort

Let's implement good old bubble sort!

The full code for this post, along with tests and benchmarks is available on GitHub. Here's a simple bubble sort implementation using a Go interface for customization:

func bubbleSortInterface(x sort.Interface) {
  n := x.Len()
  for {
    swapped := false
    for i := 1; i < n; i++ {
      if x.Less(i, i-1) {
        x.Swap(i, i-1)
        swapped = true
      }
    }
    if !swapped {
      return
    }
  }
}

As a reminder, sort.Interface is an interface defined in Go's sort package, and looks like this:

type Interface interface {
  Len() int
  Less(i, j int) bool
  Swap(i, j int)
}

The sort package also provides useful type adapters to imbue sort.Interface onto common slice types. For example, sort.StringSlice makes a []string implement this interface in a way that sorts elements in ascending order, and we can write something like:

var ss []string
// ... code that populates ss
bubbleSortInterface(sort.StringSlice(ss))

Generic bubble sort

Here's a generic bubble sort, for a slice of types that implement constraints.Ordered:

func bubbleSortGeneric[T constraints.Ordered](x []T) {
  n := len(x)
  for {
    swapped := false
    for i := 1; i < n; i++ {
      if x[i] < x[i-1] {
        x[i-1], x[i] = x[i], x[i-1]
        swapped = true
      }
    }
    if !swapped {
      return
    }
  }
}

As you can see, the code is exactly the same except that we can use len, < and swapping with multiple assignment directly instead of deferring to interface methods. This is possible because the function knows important things about what it's sorting: that it's a slice and also that it implements Ordered.

Benchmarking

I've benchmarked these two implementations against each other by sorting a randomly-generated slice of 1000 strings; here are the results on my machine:

$ go test -bench=.
goos: linux
goarch: amd64
pkg: example.com
cpu: Intel(R) Core(TM) i7-4771 CPU @ 3.50GHz
BenchmarkSortStringInterface-8             124           9599141 ns/op
BenchmarkSortStringGeneric-8               158           7433097 ns/op

The generic version is over 20% faster. This is great news, overall. Go generics not only delivers a convenient way to write code that acts on multiple types, but can also provide performance benefits!

In the rest of this post, we'll discuss why generics is faster here.

Analyzing the interface version

In the code accompanying this post, I have a standalone runner program that's useful for profiling. It creates a large slice and sorts it using one of the methods provided on the command-line; it also enables pprof-based CPU profiling. Let's see how this looks for our interface-based bubble sort:

$ ./bubble.out -cpuprofile cpui.out -kind strinterface
$ go tool pprof -list bubbleSortInterface ./bubble.out cpui.out
<...>
ROUTINE ======================== main.bubbleSortInterface
     350ms      1.10s (flat, cum)   100% of Total
         .          .     26:
         .          .     27:func bubbleSortInterface(x sort.Interface) {
         .          .     28: n := x.Len()
         .          .     29: for {
         .          .     30:         swapped := false
      70ms       70ms     31:         for i := 1; i < n; i++ {
     160ms      830ms     32:                 if x.Less(i, i-1) {
      20ms      100ms     33:                         x.Swap(i, i-1)
         .          .     34:                         swapped = true
         .          .     35:                 }
         .          .     36:         }
     100ms      100ms     37:         if !swapped {
         .          .     38:                 return
         .          .     39:         }
         .          .     40: }
         .          .     41:}
         .          .     42:

As expected, the program spends most of its time in the inner loop doing comparisons and swaps, but mostly just comparisons. Bubble sort does O(N^2) comparisons for a sequence of length N.

If the majority of the time is spent on comparisons, it's interesting to examine what that actually entails; what instructions does the CPU execute while calling x.Less in this code? Luckily, I've recently written another blog post on exactly this topic! I recommend that you at least skim it, but the important part for our purpose is that calling x.Less means this sequence of instructions:

MOVQ  0x48(SP), DX
MOVQ  0x20(DX), SI
LEAQ  -0x1(CX), DI
MOVQ  DI, 0x30(SP)
MOVQ  0x50(SP), AX
MOVQ  CX, BX
MOVQ  DI, CX
NOPL
CALL  SI

In our code, we're running bubbleSortInterface(sort.StringSlice(ss)), so we have to turn to sort.StringSlice for the definition of the actual Less method that's going to be invoked:

func (x StringSlice) Less(i, j int) bool { return x[i] < x[j] }

In assembly, it looks like this:

CMPQ  0x10(R14), SP
JBE   0x4664d1
SUBQ  $0x28, SP
MOVQ  BP, 0x20(SP)
LEAQ  0x20(SP), BP
MOVQ  AX, 0x30(SP)
CMPQ  DI, BX
JBE   0x4664c5
SHLQ  $0x4, DI
MOVQ  0(DI)(AX*1), DX
MOVQ  0x8(DI)(AX*1), R8
CMPQ  SI, BX
JBE   0x4664b8
SHLQ  $0x4, SI
MOVQ  0(AX)(SI*1), CX
MOVQ  0x8(AX)(SI*1), DI
MOVQ  DX, AX
MOVQ  R8, BX
CALL  runtime.cmpstring(SB)
TESTQ AX, AX
SETL  AL
MOVQ  0x20(SP), BP
ADDQ  $0x28, SP
RET

If this looks longer than you've expected, it's because slice indexing in Go is protected against out-of-bounds access, and the comparisons we see at the beginning of the function are precisely that - jumping to a portion of the code (which I didn't include, for brevity) that invokes runtime.panicIndex. At last, the actual string comparison is performed with a call to runtime.cmpstring; this, on its own, is a pretty interesting function that's implemented in assembly for the common architectures (for example AMD64), but let's stop the rabbit hole here, since this part will be shared across both the implementations we're comparing.

Now we should have a fairly comprehensive understanding of where the CPU spends its time when running bubble sort using the interface dispatch version. Let's turn our attention to the generic version.

Detour: how generics are implemented in Go 1.18

Russ Cox has a short and excellent blog post titled The Generic Dilemma. Its main claim is that when a language decides on whether to have generics and how to implement them, it faces the following decision:

do you want slow programmers, slow compilers and bloated binaries, or slow execution times?

"Slow compilers and bloated binaries" refers to the C++ approach of implementing templates by full monomorphization - each template invocation is treated a bit like a macro expansion with its full code copy-pasted with the right types.

"Slow execution times" refers to the Java approach of boxing, or to dynamic languages where code is trivially generic due to transparent dynamic dispatch on every call.

Please note that none of these descriptions are meant to be denigratory; these are all real tradeoffs language designers face, Go included.

Specifically, Go has seriously considered both monomorphization (called "stenciling" in the Go world) and dynamic dispatch (called "dictionaries" in the Go world):

Neither approach is perfect on its own, due to the reasons stated above. Therefore, another design was proposed:

Generics implementation - GC Shape Stenciling

This "GC shape" approach is a compromise between the two extremes of stenciling and dictionaries. Depending on the instantiated type we either monomorphize or use dynamic dispatch. There is a more up-to-date document that describes how Go 1.18 does this in detail.

Specifically, different underlying types like ints and strings will get their own GC shape, meaning that a different function will be generated for each, with the types hard-coded (so this is monomorphization). On the other hand, all pointer types will be grouped in the same GC shape and will use dynamic dispatch.

Note that this is the state of the world in Go 1.18; it may, and most likely will change in the future, as the Go team is teaming up with the community to learn what works best for real-life workloads.

Analyzing the generic version

If we use pprof to analyze the generic version, we'll see that it also spends the majority of its time in comparisons, but overall less time than the interface version.

As discussed in the previous section, a string type will get its own GC shape and therefore its own function hard-coded for the string type. Let's see how this looks in the assembly.

First, rummaging through the debug information of the binary we'll find the symbol bubbleSortGeneric[go.shape.string_0] which represents the stenciled version of bubbleSortGeneric for the GC shape that string is currently the only member of. We won't find it as a standalone function to call, though, since it's inlined into its call site. This inlining does not affect performance in any way, so we'll just focus on the instructions for the inner loop that, to remind you, does this:

for i := 1; i < n; i++ {
  if x[i] < x[i-1] {
    x[i-1], x[i] = x[i], x[i-1]
    swapped = true
  }
}

And it translates to this assembly:

MOVQ  0x80(SP), R8
INCQ  R8
MOVQ  0x70(SP), CX
MOVQ  0x78(SP), BX
MOVQ  R8, DX
MOVL  AX, SI
MOVQ  0xb0(SP), AX
CMPQ  DX, BX
JLE   0x4aef20
MOVQ  DX, 0x80(SP)
MOVB  SI, 0x3d(SP)
MOVQ  DX, SI
SHLQ  $0x4, DX
MOVQ  DX, 0x90(SP)
MOVQ  0(DX)(AX*1), R8
MOVQ  0x8(DX)(AX*1), BX
LEAQ  -0x1(SI), R9
SHLQ  $0x4, R9
MOVQ  R9, 0x88(SP)
MOVQ  0(R9)(AX*1), CX
MOVQ  0x8(R9)(AX*1), DI
MOVQ  R8, AX
CALL  runtime.cmpstring(SB)
MOVQ  0xb0(SP), DX
MOVQ  0x90(SP), SI
LEAQ  0(SI)(DX*1), DI
MOVQ  0x88(SP), R8
LEAQ  0(R8)(DX*1), R9
TESTQ AX, AX
JGE   0x4af01a

The first thing to note is that there is no dynamic dispatch to the Less method. Each loop iteration invokes cmpstring directly. Second, the latter part of the assembly resembles the code of Less shown earlier, with one crucial difference - there are no bounds checks! Go includes a bounds-check elimination (BCE) pass which can get rid of the bounds checks for the comparison:

// ... earlier we had n := len(x)
for i := 1; i < n; i++ {
  if x[i] < x[i-1] {

The compiler knows that i is between 1 and len(x) at all times (by looking at the loop description and the fact that i is not otherwise modified [3]) and hence that x[i] and x[i-1] are both safely accessing the slice in-bounds.

In the interface version, the compiler does not eliminate the bounds checks from Less; the function is defined like this:

func (x StringSlice) Less(i, j int) bool { return x[i] < x[j] }

And who knows what indices are passed in! Moreover, because of the dynamic dispatch this function is not inlined its caller, where the compiler could perhaps have more insight into what's going on. The Go compiler has some devurtualization capabilities, but they don't kick in here. This is an additional interesting area of compiler improvements.

Generic sort with a custom comparison function

To verify some of the observations described earlier, I've implemented another version of the generic bubble sort; this time, without relying on constraints.Ordered but using a comparison function instead:

func bubbleSortFunc[T any](x []T, less func(a, b T) bool) {
  n := len(x)
  for {
    swapped := false
    for i := 1; i < n; i++ {
      if less(x[i], x[i-1]) {
        x[i-1], x[i] = x[i], x[i-1]
        swapped = true
      }
    }
    if !swapped {
      return
    }
  }
}

We can use this function to sort strings as follows:

bubbleSortFunc(ss, func(a, b string) bool { return a < b })

Before reading what follows, can you guess how well this approach does in the benchmarks compared to the generic sort that uses < and compared to the interface-based sort?

$ go test -bench=.
goos: linux
goarch: amd64
pkg: example.com
cpu: Intel(R) Core(TM) i7-4771 CPU @ 3.50GHz
BenchmarkSortStringInterface-8             124           9599141 ns/op
BenchmarkSortStringGeneric-8               158           7433097 ns/op
BenchmarkSortStringFunc-8                  138           8584866 ns/op

It's about in the middle! 14% slower than the other generic function, but 10% faster than the interface-based version.

When looking at its assembly code, it's easy to see why. While the function-based version does not avoid the function call for each comparison (it has to call a function provided as an argument), it does avoid the bounds checks, because the access to x[i] and x[i-1] happens within the body of the function (and not in a function it invokes). So we get a partial win.

This comparison has interesting real-life implications because SortFunc is also a variant that was added to the golang.org/exp/slices, to provide more general sorting capabilities (for types that are unconstrained). This version also provides a speedup against sort.Sort.

Another implication is for sorting pointer types; as mentioned earlier, the Go compiler in 1.18 will group all pointer types into a single GC shape, meaning that it will need to pass a dictionary around for dynamic dispatch. This may make the code slower, though BCE should still kick in - so not much slower. I have a benchmark demonstrating this in the repository.

Finally, as a simple exercise - take the bubble sort functions shown here and benchmark sorting a large slice of integers using the generic function vs. the interface-based one. You'll likely see a massive speedup with generics, more than for strings. Can you figure out why?

Parting words

While this post was about 2/3 of the way through its draft process, this article written by Planetscale engineers was published. It describes some scenarios in which converting monomorphized code to generics makes it significantly slower. It's a very good article, and is worth a read (and I love the disassembly widget).

It's worth saying that generics in Go are very new. What we're playing with now is the initial implementation that was published literally 2 weeks ago! The first release focused on making generics work at all, with a large effort spent on correctness for many tricky corner cases. Making it fast is also planned, but it's not the highest priority right now.

That said, the implementation is in flux and being improved all the time; for example, this Go change which will make it into Go 1.19 already fixes some of the issues discussed in the Planetscale article. On the other hand, due to the reasons discussed in The Generic Dilemma, it's unlikely that Go's generics will ever be "zero-cost" in all possible scenarios. Go places a strong priority on fast compile times and compact binary sizes, so it has to make a certain set of tradeoffs with any design.

Give generics a ride for your use cases! Play, explore, try things out. Peek under the hood (hopefully my post helps with that), report issues that you find with correctness or performance. This is just the beginning for Go generics.

[1]

This package was created to incubate some new, generics-based functionality for Go slices with the 1.18 release, and will likely join the Go standard library in some future release.

Update (2023-06-03): this code is moving into the Go standard library as package slices in Go 1.21; before 1.21 is released, you can try it with gotip.

[2]

The standard library uses a variation of quicksort which is split over several functions and is more difficult to follow in a blog post due to its large code size and relative complexity.

It's worth noting that there are proposals to implement more sophisticated sorting algorithms in the standard library; the algorithm itself would be mostly orthogonal to the implementation considerations described in this post, however.

[3]	This is a gross oversimplification of how the compiler works! In reality the Go backend uses SSA form, which is perfect for optimizations like this.