Earlier this week I ran into a fun quick blog post named Beating C with 70 lines of Go, which reimplements the basic functionality of wc in Go using various approaches and compares their performance. Apparently it's inspired by an earlier Haskell-based post and several other offshoots.

This reminded me of my earlier post about reimplementing wc in pure x64 assembly, where I also measured the performance of my program against wc.

The optimized approach taken in the Go implementation is very similar to what I did in assembly, so it seemed like an interesting comparison. I started by generating a ~580 MiB file using xmlgen and ran the various versions against each other:

  • LC_TYPE=POSIX wc: 2.13 sec
  • wc-naive.go: 3.53 sec
  • wc-chunks.go: 1.37 sec
  • wcx64: 1.2 sec

Note the LC_TYPE setting for the system's wc. This is important for a fair comparison, because without this wc will attempt to do utf-8 decoding on all bytes in the file, which results in significant slowdowns. Since the Go versions use byte-counts and so does my wcx64, I force the comparison to be fair. In fact, this isn't a bad result for Go - the straightforward solution is almost as fast as the same approach direct-coded in assembly!

The Go blog post follows with parallelized versions which are much faster than the serial one, but I'm excluding it here because all the other competitors are single-threaded. This is not a serious benchmark anyway. If you prefer to be serious, this response using SIMD-optimized C blows everything out of the water:

  • fastlwc: 0.11 sec

The conclusion? Well, there's no real conclusion here, beyond that coding exercises like this are fun in any language :-)