Recently, a new set of sorting functions has landed in Go's golang.org/x/exp/slices package [1]. These functions leverage Go generics to provide a more ergonomic API for sorting (without requiring users to implement sort.Interface), and also deliver a nice performance improvement, as the CL demonstrates.
In this post, I'll dive deep into why these generic functions are faster than the existing ones in the sort package, even though they use precisely the same algorithm and loop structure. This should hopefully be an interesting peek into how Go generics are implemented, in comparison to the existing dynamic dispatch mechanism (interfaces).
To decouple our discussion from the messy specifics of the Go standard library and experimental repositories, and to be able to focus on a smaller, simpler code sample, I've reimplemented this comparison on a simpler sorting algorithm, where the performance difference also manifests [2].
Bubble sort
Let's implement good old bubble sort!
The full code for this post, along with tests and benchmarks is available on GitHub. Here's a simple bubble sort implementation using a Go interface for customization:
func bubbleSortInterface(x sort.Interface) {
n := x.Len()
for {
swapped := false
for i := 1; i < n; i++ {
if x.Less(i, i-1) {
x.Swap(i, i-1)
swapped = true
}
}
if !swapped {
return
}
}
}
As a reminder, sort.Interface is an interface defined in Go's sort package, and looks like this:
type Interface interface {
Len() int
Less(i, j int) bool
Swap(i, j int)
}
The sort package also provides useful type adapters to imbue sort.Interface onto common slice types. For example, sort.StringSlice makes a []string implement this interface in a way that sorts elements in ascending order, and we can write something like:
var ss []string
// ... code that populates ss
bubbleSortInterface(sort.StringSlice(ss))
Generic bubble sort
Here's a generic bubble sort, for a slice of types that implement constraints.Ordered:
func bubbleSortGeneric[T constraints.Ordered](x []T) {
n := len(x)
for {
swapped := false
for i := 1; i < n; i++ {
if x[i] < x[i-1] {
x[i-1], x[i] = x[i], x[i-1]
swapped = true
}
}
if !swapped {
return
}
}
}
As you can see, the code is exactly the same except that we can use len, < and swapping with multiple assignment directly instead of deferring to interface methods. This is possible because the function knows important things about what it's sorting: that it's a slice and also that it implements Ordered.
Benchmarking
I've benchmarked these two implementations against each other by sorting a randomly-generated slice of 1000 strings; here are the results on my machine:
$ go test -bench=.
goos: linux
goarch: amd64
pkg: example.com
cpu: Intel(R) Core(TM) i7-4771 CPU @ 3.50GHz
BenchmarkSortStringInterface-8 124 9599141 ns/op
BenchmarkSortStringGeneric-8 158 7433097 ns/op
The generic version is over 20% faster. This is great news, overall. Go generics not only delivers a convenient way to write code that acts on multiple types, but can also provide performance benefits!
In the rest of this post, we'll discuss why generics is faster here.
Analyzing the interface version
In the code accompanying this post, I have a standalone runner program that's useful for profiling. It creates a large slice and sorts it using one of the methods provided on the command-line; it also enables pprof-based CPU profiling. Let's see how this looks for our interface-based bubble sort:
$ ./bubble.out -cpuprofile cpui.out -kind strinterface
$ go tool pprof -list bubbleSortInterface ./bubble.out cpui.out
<...>
ROUTINE ======================== main.bubbleSortInterface
350ms 1.10s (flat, cum) 100% of Total
. . 26:
. . 27:func bubbleSortInterface(x sort.Interface) {
. . 28: n := x.Len()
. . 29: for {
. . 30: swapped := false
70ms 70ms 31: for i := 1; i < n; i++ {
160ms 830ms 32: if x.Less(i, i-1) {
20ms 100ms 33: x.Swap(i, i-1)
. . 34: swapped = true
. . 35: }
. . 36: }
100ms 100ms 37: if !swapped {
. . 38: return
. . 39: }
. . 40: }
. . 41:}
. . 42:
As expected, the program spends most of its time in the inner loop doing comparisons and swaps, but mostly just comparisons. Bubble sort does O(N^2) comparisons for a sequence of length N.
If the majority of the time is spent on comparisons, it's interesting to examine what that actually entails; what instructions does the CPU execute while calling x.Less in this code? Luckily, I've recently written another blog post on exactly this topic! I recommend that you at least skim it, but the important part for our purpose is that calling x.Less means this sequence of instructions:
MOVQ 0x48(SP), DX
MOVQ 0x20(DX), SI
LEAQ -0x1(CX), DI
MOVQ DI, 0x30(SP)
MOVQ 0x50(SP), AX
MOVQ CX, BX
MOVQ DI, CX
NOPL
CALL SI
In our code, we're running bubbleSortInterface(sort.StringSlice(ss)), so we have to turn to sort.StringSlice for the definition of the actual Less method that's going to be invoked:
func (x StringSlice) Less(i, j int) bool { return x[i] < x[j] }
In assembly, it looks like this:
CMPQ 0x10(R14), SP
JBE 0x4664d1
SUBQ $0x28, SP
MOVQ BP, 0x20(SP)
LEAQ 0x20(SP), BP
MOVQ AX, 0x30(SP)
CMPQ DI, BX
JBE 0x4664c5
SHLQ $0x4, DI
MOVQ 0(DI)(AX*1), DX
MOVQ 0x8(DI)(AX*1), R8
CMPQ SI, BX
JBE 0x4664b8
SHLQ $0x4, SI
MOVQ 0(AX)(SI*1), CX
MOVQ 0x8(AX)(SI*1), DI
MOVQ DX, AX
MOVQ R8, BX
CALL runtime.cmpstring(SB)
TESTQ AX, AX
SETL AL
MOVQ 0x20(SP), BP
ADDQ $0x28, SP
RET
If this looks longer than you've expected, it's because slice indexing in Go is protected against out-of-bounds access, and the comparisons we see at the beginning of the function are precisely that - jumping to a portion of the code (which I didn't include, for brevity) that invokes runtime.panicIndex. At last, the actual string comparison is performed with a call to runtime.cmpstring; this, on its own, is a pretty interesting function that's implemented in assembly for the common architectures (for example AMD64), but let's stop the rabbit hole here, since this part will be shared across both the implementations we're comparing.
Now we should have a fairly comprehensive understanding of where the CPU spends its time when running bubble sort using the interface dispatch version. Let's turn our attention to the generic version.
Detour: how generics are implemented in Go 1.18
Russ Cox has a short and excellent blog post titled The Generic Dilemma. Its main claim is that when a language decides on whether to have generics and how to implement them, it faces the following decision:
do you want slow programmers, slow compilers and bloated binaries, or slow execution times?
"Slow compilers and bloated binaries" refers to the C++ approach of implementing templates by full monomorphization - each template invocation is treated a bit like a macro expansion with its full code copy-pasted with the right types.
"Slow execution times" refers to the Java approach of boxing, or to dynamic languages where code is trivially generic due to transparent dynamic dispatch on every call.
Please note that none of these descriptions are meant to be denigratory; these are all real tradeoffs language designers face, Go included.
Specifically, Go has seriously considered both monomorphization (called "stenciling" in the Go world) and dynamic dispatch (called "dictionaries" in the Go world):
Neither approach is perfect on its own, due to the reasons stated above. Therefore, another design was proposed:
This "GC shape" approach is a compromise between the two extremes of stenciling and dictionaries. Depending on the instantiated type we either monomorphize or use dynamic dispatch. There is a more up-to-date document that describes how Go 1.18 does this in detail.
Specifically, different underlying types like ints and strings will get their own GC shape, meaning that a different function will be generated for each, with the types hard-coded (so this is monomorphization). On the other hand, all pointer types will be grouped in the same GC shape and will use dynamic dispatch.
Note that this is the state of the world in Go 1.18; it may, and most likely will change in the future, as the Go team is teaming up with the community to learn what works best for real-life workloads.
Analyzing the generic version
If we use pprof to analyze the generic version, we'll see that it also spends the majority of its time in comparisons, but overall less time than the interface version.
As discussed in the previous section, a string type will get its own GC shape and therefore its own function hard-coded for the string type. Let's see how this looks in the assembly.
First, rummaging through the debug information of the binary we'll find the symbol bubbleSortGeneric[go.shape.string_0] which represents the stenciled version of bubbleSortGeneric for the GC shape that string is currently the only member of. We won't find it as a standalone function to call, though, since it's inlined into its call site. This inlining does not affect performance in any way, so we'll just focus on the instructions for the inner loop that, to remind you, does this:
for i := 1; i < n; i++ {
if x[i] < x[i-1] {
x[i-1], x[i] = x[i], x[i-1]
swapped = true
}
}
And it translates to this assembly:
MOVQ 0x80(SP), R8
INCQ R8
MOVQ 0x70(SP), CX
MOVQ 0x78(SP), BX
MOVQ R8, DX
MOVL AX, SI
MOVQ 0xb0(SP), AX
CMPQ DX, BX
JLE 0x4aef20
MOVQ DX, 0x80(SP)
MOVB SI, 0x3d(SP)
MOVQ DX, SI
SHLQ $0x4, DX
MOVQ DX, 0x90(SP)
MOVQ 0(DX)(AX*1), R8
MOVQ 0x8(DX)(AX*1), BX
LEAQ -0x1(SI), R9
SHLQ $0x4, R9
MOVQ R9, 0x88(SP)
MOVQ 0(R9)(AX*1), CX
MOVQ 0x8(R9)(AX*1), DI
MOVQ R8, AX
CALL runtime.cmpstring(SB)
MOVQ 0xb0(SP), DX
MOVQ 0x90(SP), SI
LEAQ 0(SI)(DX*1), DI
MOVQ 0x88(SP), R8
LEAQ 0(R8)(DX*1), R9
TESTQ AX, AX
JGE 0x4af01a
The first thing to note is that there is no dynamic dispatch to the Less method. Each loop iteration invokes cmpstring directly. Second, the latter part of the assembly resembles the code of Less shown earlier, with one crucial difference - there are no bounds checks! Go includes a bounds-check elimination (BCE) pass which can get rid of the bounds checks for the comparison:
// ... earlier we had n := len(x)
for i := 1; i < n; i++ {
if x[i] < x[i-1] {
The compiler knows that i is between 1 and len(x) at all times (by looking at the loop description and the fact that i is not otherwise modified [3]) and hence that x[i] and x[i-1] are both safely accessing the slice in-bounds.
In the interface version, the compiler does not eliminate the bounds checks from Less; the function is defined like this:
func (x StringSlice) Less(i, j int) bool { return x[i] < x[j] }
And who knows what indices are passed in! Moreover, because of the dynamic dispatch this function is not inlined its caller, where the compiler could perhaps have more insight into what's going on. The Go compiler has some devurtualization capabilities, but they don't kick in here. This is an additional interesting area of compiler improvements.
Generic sort with a custom comparison function
To verify some of the observations described earlier, I've implemented another version of the generic bubble sort; this time, without relying on constraints.Ordered but using a comparison function instead:
func bubbleSortFunc[T any](x []T, less func(a, b T) bool) {
n := len(x)
for {
swapped := false
for i := 1; i < n; i++ {
if less(x[i], x[i-1]) {
x[i-1], x[i] = x[i], x[i-1]
swapped = true
}
}
if !swapped {
return
}
}
}
We can use this function to sort strings as follows:
bubbleSortFunc(ss, func(a, b string) bool { return a < b })
Before reading what follows, can you guess how well this approach does in the benchmarks compared to the generic sort that uses < and compared to the interface-based sort?
$ go test -bench=.
goos: linux
goarch: amd64
pkg: example.com
cpu: Intel(R) Core(TM) i7-4771 CPU @ 3.50GHz
BenchmarkSortStringInterface-8 124 9599141 ns/op
BenchmarkSortStringGeneric-8 158 7433097 ns/op
BenchmarkSortStringFunc-8 138 8584866 ns/op
It's about in the middle! 14% slower than the other generic function, but 10% faster than the interface-based version.
When looking at its assembly code, it's easy to see why. While the function-based version does not avoid the function call for each comparison (it has to call a function provided as an argument), it does avoid the bounds checks, because the access to x[i] and x[i-1] happens within the body of the function (and not in a function it invokes). So we get a partial win.
This comparison has interesting real-life implications because SortFunc is also a variant that was added to the golang.org/exp/slices, to provide more general sorting capabilities (for types that are unconstrained). This version also provides a speedup against sort.Sort.
Another implication is for sorting pointer types; as mentioned earlier, the Go compiler in 1.18 will group all pointer types into a single GC shape, meaning that it will need to pass a dictionary around for dynamic dispatch. This may make the code slower, though BCE should still kick in - so not much slower. I have a benchmark demonstrating this in the repository.
Finally, as a simple exercise - take the bubble sort functions shown here and benchmark sorting a large slice of integers using the generic function vs. the interface-based one. You'll likely see a massive speedup with generics, more than for strings. Can you figure out why?
Parting words
While this post was about 2/3 of the way through its draft process, this article written by Planetscale engineers was published. It describes some scenarios in which converting monomorphized code to generics makes it significantly slower. It's a very good article, and is worth a read (and I love the disassembly widget).
It's worth saying that generics in Go are very new. What we're playing with now is the initial implementation that was published literally 2 weeks ago! The first release focused on making generics work at all, with a large effort spent on correctness for many tricky corner cases. Making it fast is also planned, but it's not the highest priority right now.
That said, the implementation is in flux and being improved all the time; for example, this Go change which will make it into Go 1.19 already fixes some of the issues discussed in the Planetscale article. On the other hand, due to the reasons discussed in The Generic Dilemma, it's unlikely that Go's generics will ever be "zero-cost" in all possible scenarios. Go places a strong priority on fast compile times and compact binary sizes, so it has to make a certain set of tradeoffs with any design.
Give generics a ride for your use cases! Play, explore, try things out. Peek under the hood (hopefully my post helps with that), report issues that you find with correctness or performance. This is just the beginning for Go generics.
[1] | This package was created to incubate some new, generics-based functionality for Go slices with the 1.18 release, and will likely join the Go standard library in some future release. Update (2023-06-03): this code is moving into the Go standard library as package slices in Go 1.21; before 1.21 is released, you can try it with gotip. |
[2] | The standard library uses a variation of quicksort which is split over several functions and is more difficult to follow in a blog post due to its large code size and relative complexity. It's worth noting that there are proposals to implement more sophisticated sorting algorithms in the standard library; the algorithm itself would be mostly orthogonal to the implementation considerations described in this post, however. |
[3] | This is a gross oversimplification of how the compiler works! In reality the Go backend uses SSA form, which is perfect for optimizations like this. |