Adventures in JIT compilation: Part 2 - an x64 JIT



In the first part of the series I briefly introduced the BF source language and went on to present four interpreters with an increasing degree of optimization. That post should serve as good background before diving into actual JIT-ing.

Another important part of the background puzzle is my How to JIT - an introduction post from 2013; there, I discuss some of the basic tools needed to emit executable x64 machine code at run-time and actually run it on Linux. Please go through it quickly if these things are new to you.

The two phases of JIT

As I wrote previously, the JIT technique is easier to understand when divided into two distinct phases:

  1. Create machine code at program run-time.
  2. Execute that machine code, also at program run-time.

Phase 2 for our BF JIT is exactly the same as the method described in that introductory post. Take a look at the JitProgram class in jit_utils for details. We'll focus mostly on phase 1, which translates BF to x64 machine code; per the definition quoted in part 1 of the series, we're going to develop an actual BF compiler (compiling from BF source to x64 machine code).
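
If you don't want to revisit that post right now, the gist of phase 2 is: copy the emitted bytes into memory mapped with execute permissions, cast the pointer to a function type, and call it. Here's a minimal sketch of the mechanism (error handling mostly elided); note this is just an illustration of the idea, not the actual JitProgram API:

#include <cstdint>
#include <cstring>
#include <sys/mman.h>
#include <vector>

// Copy 'code' into a freshly mmap'd region that is readable, writable and
// executable, and return a pointer to it. A real implementation (like
// JitProgram) would keep the pointer and size around for munmap, and would
// preferably map W^X: first writable, then re-protected as executable.
void* alloc_executable(const std::vector<uint8_t>& code) {
  void* mem = mmap(nullptr, code.size(), PROT_READ | PROT_WRITE | PROT_EXEC,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (mem == MAP_FAILED) {
    return nullptr;
  }
  std::memcpy(mem, code.data(), code.size());
  return mem;
}

// Usage, assuming the emitted code expects no arguments and returns nothing:
//
//   using JittedFunc = void (*)(void);
//   JittedFunc func = (JittedFunc)alloc_executable(emitted_code);
//   func();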

Compilers, assemblers and instruction encoding

Traditionally, compilation was divided into several stages. The actual source language compiler would translate some higher-level language to target-specific assembly; then, an assembler would translate assembly to actual machine code [1]. There are a number of important benefits assembly language provides over raw machine code. Salient examples include:

  • Instruction encoding: it's certainly nicer to write inc %r13 to increment the contents of register r13 than to write 0x49, 0xFF, 0xC5. Instruction encoding for the popular architectures is notoriously complicated.
  • Naming labels and procedures for jumps/calls: it's easier to write jl loop than to figure out the encoding for the instruction, along with the relative position of the loop label and encoding the delta to it (not to mention this delta changes every time we add instructions in between and needs to be recomputed). Similarly for functions, call foo instead of doing it by address.

One of my guiding principles through the field of programming is that before diving into the possible solutions for a problem (for example, some library for doing X) it's worth working through the problem manually first (doing X by hand, without libraries). Gritting your teeth over the issues for a while is the best way to appreciate what the shrink-wrapped solution/library does for you.

In this spirit, our first JIT is going to be completely hand-written.

Simple JIT - hand-rolling x64 instruction encoding

Our first JIT for this post is simplejit.cpp. Similarly to the interpreters of part 1, all the action happens in a single function (here called simplejit) invoked from main. simplejit goes through the BF source and emits x64 machine code into a memory buffer; in the end, it jumps to this machine code to run the BF program.

Here's its beginning:

std::vector<uint8_t> memory(MEMORY_SIZE, 0);

// Registers used in the program:
//
// r13: the data pointer -- contains the address of memory.data()
//
// rax, rdi, rsi, rdx: used for making system calls, per the ABI.

CodeEmitter emitter;

// Throughout the translation loop, this stack contains offsets (in the
// emitter code vector) of locations for fixup.
std::stack<size_t> open_bracket_stack;

As usual, we have our BF memory buffer in a std::vector. The comments reveal some of the conventions used throughout the emitted program: our "data pointer" will be in r13.

CodeEmitter is a very simple utility to append bytes and words to a vector of bytes. Its full code is here. It's platform independent, except for the assumption of little-endianness (EmitUint64 writes the lowest byte of the 64-bit word first, then the second-lowest byte, and so on).
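
To make the discussion concrete, here's a minimal sketch of what such an emitter boils down to; the method names match the ones used in the snippets below, but the actual CodeEmitter in the repository may differ in details:

#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <vector>

// Minimal CodeEmitter-like helper: appends bytes and little-endian words
// to a growing byte vector that will eventually hold the machine code.
class CodeEmitter {
public:
  void EmitByte(uint8_t b) { code_.push_back(b); }

  void EmitBytes(std::initializer_list<uint8_t> bytes) {
    for (uint8_t b : bytes) {
      EmitByte(b);
    }
  }

  // Little-endian: lowest byte first.
  void EmitUint32(uint32_t v) {
    for (int i = 0; i < 4; ++i) {
      EmitByte((v >> (i * 8)) & 0xFF);
    }
  }

  void EmitUint64(uint64_t v) {
    for (int i = 0; i < 8; ++i) {
      EmitByte((v >> (i * 8)) & 0xFF);
    }
  }

  // Overwrite 4 bytes at 'offset' with a new little-endian value; used for
  // backpatching jump targets.
  void ReplaceUint32AtOffset(size_t offset, uint32_t v) {
    for (int i = 0; i < 4; ++i) {
      code_[offset + i] = (v >> (i * 8)) & 0xFF;
    }
  }

  size_t size() const { return code_.size(); }
  const std::vector<uint8_t>& code() const { return code_; }

private:
  std::vector<uint8_t> code_;
};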

Our first bit of actual machine code emission follows:

// movabs <address of memory.data>, %r13
emitter.EmitBytes({0x49, 0xBD});
emitter.EmitUint64((uint64_t)memory.data());

And it's a cool one, mixing elements from the host (the C++ program doing the emission) and the JITed code. First note the use of movabs, an x64 instruction useful for placing 64-bit immediates in a register. This is exactly what we're doing here - placing the address of memory's data buffer in r13. The call to EmitBytes with a cryptic sequence of hex values is preceded by a snippet of assembly in a comment - the assembly conveys the meaning to human readers; the hex values are the actual encoding the machine will understand.

Then comes the BF compilation loop, which looks at the next BF instruction and emits the appropriate machine code for it. Our compiler works in a single pass; this means that there's a bit of trickiness in handling the jumps, as we will soon see.

for (size_t pc = 0; pc < p.instructions.size(); ++pc) {
  char instruction = p.instructions[pc];
  switch (instruction) {
  case '>':
    // inc %r13
    emitter.EmitBytes({0x49, 0xFF, 0xC5});
    break;
  case '<':
    // dec %r13
    emitter.EmitBytes({0x49, 0xFF, 0xCD});
    break;
  case '+':
    // Our memory is byte-addressable, so using addb/subb for modifying it.
    // addb $1, 0(%r13)
    emitter.EmitBytes({0x41, 0x80, 0x45, 0x00, 0x01});
    break;
  case '-':
    // subb $1, 0(%r13)
    emitter.EmitBytes({0x41, 0x80, 0x6D, 0x00, 0x01});
    break;

These are pretty straightforward; since r13 is the data pointer, > and < increment and decrement it, while + and - increment and decrement what it points to. One slightly subtle aspect is that I chose byte-sized memory cells for our BF implementations; this means we have to be careful when reading or writing memory and use byte-sized accesses (the b suffix on add and sub above) rather than the default 64-bit operand size.

The code emitted for . and , is a bit more exciting; in the effort of avoiding any external dependencies, we're going to invoke Linux system calls directly: write for ., read for ,. We're using the Linux x64 syscall convention here, with the syscall number in rax and the arguments in rdi, rsi and rdx:

case '.':
  // To emit one byte to stdout, call the write syscall with fd=1 (for
  // stdout), buf=address of byte, count=1.
  //
  // mov $1, %rax
  // mov $1, %rdi
  // mov %r13, %rsi
  // mov $1, %rdx
  // syscall
  emitter.EmitBytes({0x48, 0xC7, 0xC0, 0x01, 0x00, 0x00, 0x00});
  emitter.EmitBytes({0x48, 0xC7, 0xC7, 0x01, 0x00, 0x00, 0x00});
  emitter.EmitBytes({0x4C, 0x89, 0xEE});
  emitter.EmitBytes({0x48, 0xC7, 0xC2, 0x01, 0x00, 0x00, 0x00});
  emitter.EmitBytes({0x0F, 0x05});
  break;
case ',':
  // To read one byte from stdin, call the read syscall with fd=0 (for
  // stdin), buf=address of byte, count=1.
  emitter.EmitBytes({0x48, 0xC7, 0xC0, 0x00, 0x00, 0x00, 0x00});
  emitter.EmitBytes({0x48, 0xC7, 0xC7, 0x00, 0x00, 0x00, 0x00});
  emitter.EmitBytes({0x4C, 0x89, 0xEE});
  emitter.EmitBytes({0x48, 0xC7, 0xC2, 0x01, 0x00, 0x00, 0x00});
  emitter.EmitBytes({0x0F, 0x05});
  break;

The comments certainly help, don't they? I hope these snippets are a great motivation for using assembly language rather than encoding instructions manually :-)

The jump instructions are always the most interesting in BF. For [ we do:

case '[':
  // For the jumps we always emit the instruction for a 32-bit pc-relative
  // jump, without worrying about potentially short jumps and relaxation.

  // cmpb $0, 0(%r13)
  emitter.EmitBytes({0x41, 0x80, 0x7d, 0x00, 0x00});

  // Save the location in the stack, and emit JZ (with 32-bit relative
  // offset) with 4 placeholder zeroes that will be fixed up later.
  open_bracket_stack.push(emitter.size());
  emitter.EmitBytes({0x0F, 0x84});
  emitter.EmitUint32(0);
  break;

Note that we don't know where this jump leads at this point - it will go to the matching ], which we haven't encountered yet! Therefore, to keep our compilation in a single pass [2] we use the time-honored technique of backpatching: we emit a placeholder value for the jump and fix it up once we encounter the matching label. Another thing to note is that we always use a 32-bit pc-relative jump, for simplicity; we could save a couple of bytes with a short jump in most cases (see my article on assembler relaxation for the full scoop), but I don't think it's worth the effort here.
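
A word about the compute_relative_32bit_offset helper used below: conceptually it's just the difference between the target offset and the offset right after the jump instruction, truncated to 32 bits. A plausible sketch (the real helper in the repository presumably also verifies the delta fits in a signed 32-bit range):

#include <cstddef>
#include <cstdint>

// Compute the 32-bit pc-relative delta for a jump: 'jump_from' is the code
// offset right after the jump instruction, 'jump_to' is the target offset.
// Backward jumps produce a negative delta, which wraps naturally when
// truncated to 32 bits (two's complement).
uint32_t compute_relative_32bit_offset(size_t jump_from, size_t jump_to) {
  int64_t delta = static_cast<int64_t>(jump_to) - static_cast<int64_t>(jump_from);
  // A real implementation should check that delta fits in [INT32_MIN, INT32_MAX].
  return static_cast<uint32_t>(delta);
}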

Compiling the matching ] is a bit trickier; I hope the comments do a good job explaining what's going on, and the code itself is optimized for readability rather than cleverness:

case ']': {
  if (open_bracket_stack.empty()) {
    DIE << "unmatched closing ']' at pc=" << pc;
  }
  size_t open_bracket_offset = open_bracket_stack.top();
  open_bracket_stack.pop();

  // cmpb $0, 0(%r13)
  emitter.EmitBytes({0x41, 0x80, 0x7d, 0x00, 0x00});

  // open_bracket_offset points to the JZ that jumps to this closing
  // bracket. We'll need to fix up the offset for that JZ, as well as emit a
  // JNZ with a correct offset back. Note that both [ and ] jump to the
  // instruction *after* the matching bracket if their condition is
  // fulfilled.

  // Compute the offset for this jump. The jump start is computed from after
  // the jump instruction, and the target is the instruction after the one
  // saved on the stack.
  size_t jump_back_from = emitter.size() + 6;
  size_t jump_back_to = open_bracket_offset + 6;
  uint32_t pcrel_offset_back =
      compute_relative_32bit_offset(jump_back_from, jump_back_to);

  // jnz <open_bracket_location>
  emitter.EmitBytes({0x0F, 0x85});
  emitter.EmitUint32(pcrel_offset_back);

  // Also fix up the forward jump at the matching [. Note that here we don't
  // need to add the size of this jmp to the "jump to" offset, since the jmp
  // was already emitted and the emitter size was bumped forward.
  size_t jump_forward_from = open_bracket_offset + 6;
  size_t jump_forward_to = emitter.size();
  uint32_t pcrel_offset_forward =
      compute_relative_32bit_offset(jump_forward_from, jump_forward_to);
  emitter.ReplaceUint32AtOffset(open_bracket_offset + 2,
                                pcrel_offset_forward);
  break;
}

This concludes the compiler loop; we end up with a bunch of potentially executable machine code in the emitter's vector. This code refers to the host program (the address of memory.data()), but that's OK since the host program's lifetime wraps the lifetime of the JITed code. What remains is to actually invoke this machine code:

// ... after the compilation loop

// The emitted code will be called as a function from C++; therefore it has to
// use the proper calling convention. Emit a 'ret' for orderly return to the
// caller.
emitter.EmitByte(0xC3);

// Load the emitted code to executable memory and run it.
std::vector<uint8_t> emitted_code = emitter.code();
JitProgram jit_program(emitted_code);

// JittedFunc is the C++ type for the JIT function emitted here. The emitted
// function is callable from C++ and follows the x64 System V ABI.
using JittedFunc = void (*)(void);

JittedFunc func = (JittedFunc)jit_program.program_memory();
func();

The call should be familiar from reading the How to JIT post. Note that here we opted for the simplest function possible - no arguments, no return value; in future sections we'll spice it up a bit.

Taking our JIT for a spin

In part 1, I presented a trivial BF program that prints the numbers 1 to 5 to the screen:

++++++++ ++++++++ ++++++++ ++++++++ ++++++++ ++++++++
>+++++
[<+.>-]

Let's see what our compiler translates it to. Even though the code vector inside simplejit is ephemeral (lives only temporarily in memory), we can serialize it to a binary file which we can then disassemble (with objdump -D -b binary -mi386:x86-64). The following is the disassembly listing with comments I embedded to explain what's going on:

 # The runtime address of memory.data() goes into r13; note that this will
 # likely be a different value in every invocation of the JIT.

  0:   49 bd f0 54 e3 00 00    movabs $0xe354f0,%r13
  7:   00 00 00

 # A sequence of 48 instructions that all do the same, for the initial sequence
 # of +s; this makes me miss our optimizing interpreter, but worry not - we'll
 # make this go away later in the post.

  a:   41 80 45 00 01          addb   $0x1,0x0(%r13)
  f:   41 80 45 00 01          addb   $0x1,0x0(%r13)

 # [...] 46 more 'addb'

 # >+++++

 fa:   49 ff c5                inc    %r13
 fd:   41 80 45 00 01          addb   $0x1,0x0(%r13)
102:   41 80 45 00 01          addb   $0x1,0x0(%r13)
107:   41 80 45 00 01          addb   $0x1,0x0(%r13)
10c:   41 80 45 00 01          addb   $0x1,0x0(%r13)
111:   41 80 45 00 01          addb   $0x1,0x0(%r13)

 # Here comes the loop! Note that the relative jump offset is already inserted
 # into the 'je' instruction by the backpatching process.

116:   41 80 7d 00 00          cmpb   $0x0,0x0(%r13)
11b:   0f 84 35 00 00 00       je     0x156
121:   49 ff cd                dec    %r13
124:   41 80 45 00 01          addb   $0x1,0x0(%r13)

 # The '.' is translated into a syscall to WRITE

129:   48 c7 c0 01 00 00 00    mov    $0x1,%rax
130:   48 c7 c7 01 00 00 00    mov    $0x1,%rdi
137:   4c 89 ee                mov    %r13,%rsi
13a:   48 c7 c2 01 00 00 00    mov    $0x1,%rdx
141:   0f 05                   syscall
143:   49 ff c5                inc    %r13
146:   41 80 6d 00 01          subb   $0x1,0x0(%r13)
14b:   41 80 7d 00 00          cmpb   $0x0,0x0(%r13)

 # Jump back to beginning of loop

150:   0f 85 cb ff ff ff       jne    0x121

 # We're done

156:   c3                      retq
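
By the way, the serialization step mentioned above is nothing fancy - dumping the raw bytes of the code vector to a file is all objdump needs. A quick sketch (the output path is arbitrary):

#include <cstdint>
#include <fstream>
#include <vector>

// Dump the emitted machine code to a raw binary file, which can then be
// disassembled with:
//   objdump -D -b binary -mi386:x86-64 /tmp/bfcode.bin
void dump_emitted_code(const std::vector<uint8_t>& code) {
  std::ofstream out("/tmp/bfcode.bin", std::ios::binary);
  out.write(reinterpret_cast<const char*>(code.data()), code.size());
}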

How does it perform?

It's time to measure the performance of our JIT against the interpreters from part 1. optinterp3 was about 10x faster than the naive interpreter - how will this JIT measure up? Note that it has no optimizations (except not having to recompute the jump destination for every loop iteration as the naive interpreter did). Can you guess? The results may surprise you...

The simple JIT runs mandelbrot in 2.89 seconds and factor in 0.94 seconds - still much faster than optinterp3; here's the comparison plot (omitting the slower interpreters since they skew the scale):

BF opt3 vs simplejit

Why is this so? optinterp3 is heavily optimized - it folds entire loops into a single operation; simplejit does none of this - we've just seen the embarrassing sequence of addbs it emits for a long run of +s.

The reason is that the baseline performance of the JIT is vastly better. I've mentioned this briefly in part 1 - imagine what's needed to interpret a single instruction in the fastest interpreter.

  1. Advance pc and compare it to program size.
  2. Grab the instruction at pc.
  3. Switch on the value of the instruction to the right case.
  4. Execute the case.

This requires a whole sequence of machine instructions, with at least two branches (one for the loop, one for the switch). On the other hand, the JITed code runs a single instruction for most BF operations - no dispatch branches at all. I would say that - depending on what the compiler did while compiling the interpreter - the JIT is between 4 and 8 times faster at running any given BF operation. It has to run many more BF operations because it doesn't optimize, but this difference is insufficient to close the huge baseline gap. Later in this post we're going to see an optimized JIT which performs even better.

But first, let's talk about this painful instruction encoding business.

Manually encoding instructions

As promised, simplejit is completely self-contained. It doesn't use any external libraries, and encodes all the instructions by hand. It's not hard to see how painful that process is, and the code is absolutely unreadable unless accompanied by detailed comments; moreover, changing the code is a pain, and changes happen in unexpected ways. For example, if we want to use some other register in an instruction, the change to the emitted bytes won't be intuitive. add %r8, %r9 is encoded as 0x4D, 0x01, 0xC1, but add %r8, %r10 is 0x4D, 0x01, 0xC2; since registers are specified in sub-byte fields (with an extra bit in the REX prefix), one needs very good memory and tons of experience to predict what goes where.

Would you expect related instructions to look somewhat similar? Not necessarily: inc %rax is encoded as 0x48, 0xFF, 0xC0, while inc %r13 is 0x49, 0xFF, 0xC5 - switching the register changes two of the three bytes. To put it bluntly - unless you're Mel, you're going to have a hard time. Now imagine that you have to support emitting code for multiple architectures!

This is why all compilers, VMs and related projects have their own layers to help with this encoding task, along with related tasks like labels and jump computations. Most are not exposed for easy usage outside their project; others, like DynASM (developed as part of the LuaJIT project) are packaged for separate usage. DynASM is an example of a low-level framework - providing instruction encoding and not much else; some frameworks are higher-level, doing more compiler-y things like register allocation. One example is libjit; another is LLVM.

asmjit

While looking for a library to help me encode instructions, I initially tried DynASM. It's an interesting approach - and you can see Josh Haberman's post about using it for a simple BF JIT, but I found it to be a bit too abandonware-ish for my taste. Besides, I don't like the funky preprocessor approach with a dependency on Lua.

So I found another project that seemed to fit the bill - asmjit - a pure C++ library without any preprocessing. asmjit began about 3 years ago to ease its author's development of fast kernels for graphics code. Its documentation isn't much better than DynASM's, but being just a C++ library I found it easier to dive into the source when questions arose that the docs couldn't answer. Besides, the author is very active and quick to answer questions on Github and add missing features. Therefore, the rest of this post shows BF JITs that use asmjit - these can also serve as a non-trivial tutorial for the library.

simpleasmjit - JIT with sane instruction encoding

Enter simpleasmjit.cpp - the same simple JIT (no optimizations) as simplejit, but using asmjit for the instruction encoding, labels and so on. Just for fun, we'll mix things up a bit. First, we'll change the JITed function signature from void (*)(void) to void (*)(uint64_t); the address of the BF memory buffer will be passed as argument into the JITed function rather than hard-coded into it.

Second, we'll use actual C functions to emit / read characters, rather than system calls. Moreover, since putchar and getchar may be macros on some systems, taking their address can be unsafe. So we'll wrap them in actual C++ functions, whose addresses are safe to take in emitted code:

void myputchar(uint8_t c) {
  putchar(c);
}

uint8_t mygetchar() {
  return getchar();
}

simpleasmjit starts by initializing an asmjit runtime, code holder and assembler [3]:

asmjit::JitRuntime jit_runtime;
asmjit::CodeHolder code;
code.init(jit_runtime.getCodeInfo());
asmjit::X86Assembler assm(&code);

Next, we'll give a mnemonic name to our data pointer, and emit a copy of the address of the memory buffer into it (it's in rdi initially, as the first function argument in the x64 ABI):

// We pass the data pointer as an argument to the JITed function, so it's
// expected to be in rdi. Move it to r13.
asmjit::X86Gp dataptr = asmjit::x86::r13;
assm.mov(dataptr, asmjit::x86::rdi);

Then we get to the usual BF processing loop that emits code for every BF op:

for (size_t pc = 0; pc < p.instructions.size(); ++pc) {
  char instruction = p.instructions[pc];
  switch (instruction) {
  case '>':
    // inc %r13
    assm.inc(dataptr);
    break;
  case '<':
    // dec %r13
    assm.dec(dataptr);
    break;
  case '+':
    // addb $1, 0(%r13)
    assm.add(asmjit::x86::byte_ptr(dataptr), 1);
    break;
  case '-':
    // subb $1, 0(%r13)
    assm.sub(asmjit::x86::byte_ptr(dataptr), 1);
    break;

Notice the difference! No more obscure hex codes - assm.inc(dataptr) is so much nicer than 0x49, 0xFF, 0xC5, isn't it?

For input and output we emit calls to our wrapper functions:

case '.':
  // call myputchar [dataptr]
  assm.movzx(asmjit::x86::rdi, asmjit::x86::byte_ptr(dataptr));
  assm.call(asmjit::imm_ptr(myputchar));
  break;
case ',':
  // [dataptr] = call mygetchar
  // Store only the low byte to memory to avoid overwriting unrelated data.
  assm.call(asmjit::imm_ptr(mygetchar));
  assm.mov(asmjit::x86::byte_ptr(dataptr), asmjit::x86::al);
  break;

The magic is in the imm_ptr modifier, which places the address of the function in the emitted code.

Finally, the code handling [ and ] is also much simpler due to asmjit's labels, which can be used before they're actually emitted:

case '[': {
  assm.cmp(asmjit::x86::byte_ptr(dataptr), 0);
  asmjit::Label open_label = assm.newLabel();
  asmjit::Label close_label = assm.newLabel();

  // Jump past the closing ']' if [dataptr] = 0; close_label wasn't bound
  // yet (it will be bound when we handle the matching ']'), but asmjit lets
  // us emit the jump now and will handle the back-patching later.
  assm.jz(close_label);

  // open_label is bound past the jump; all in all, we're emitting:
  //
  //    cmpb 0(%r13), 0
  //    jz close_label
  // open_label:
  //    ...
  assm.bind(open_label);

  // Save both labels on the stack.
  open_bracket_stack.push(BracketLabels(open_label, close_label));
  break;
}
case ']': {
  if (open_bracket_stack.empty()) {
    DIE << "unmatched closing ']' at pc=" << pc;
  }
  BracketLabels labels = open_bracket_stack.top();
  open_bracket_stack.pop();

  //    cmpb 0(%r13), 0
  //    jnz open_label
  // close_label:
  //    ...
  assm.cmp(asmjit::x86::byte_ptr(dataptr), 0);
  assm.jnz(labels.open_label);
  assm.bind(labels.close_label);
  break;
}

We just have to remember which label we used for the jump and emit the exact same Label object - asmjit handles the backpatching on its own! Moreover, all the jump offset computations are performed automatically.

Finally, after emitting the code we can call it:

using JittedFunc = void (*)(uint64_t);

JittedFunc func;
asmjit::Error err = jit_runtime.add(&func, &code);
// [...]
// Call it, passing the address of memory as a parameter.
func((uint64_t)memory.data());

That's it. This JIT emits virtually the same code as simplejit, and thus we don't expect it to perform any differently. The main point of this exercise is to show how much simpler and more pleasant emitting code is with a library like asmjit. It hides all the icky encoding and offset computations, letting us focus on what's actually unique to our program - the sequence of instructions emitted.

optasmjit - combining BF optimizations with a JIT

Finally, it's time to combine the clever optimizations we've developed in part 1 with the JIT. Here, I'm essentially taking optinterp3 from part 1 and bolting a JIT backend onto it. The result is optasmjit.cpp.

Recall that instead of the 8 BF ops, we have an extended set, with integer arguments, that conveys higher-level ops in some cases:

enum class BfOpKind {
  INVALID_OP = 0,
  INC_PTR,
  DEC_PTR,
  INC_DATA,
  DEC_DATA,
  READ_STDIN,
  WRITE_STDOUT,
  LOOP_SET_TO_ZERO,
  LOOP_MOVE_PTR,
  LOOP_MOVE_DATA,
  JUMP_IF_DATA_ZERO,
  JUMP_IF_DATA_NOT_ZERO
};

The translation phase from BF ops to a sequence of BfOpKind is exactly the same as it was in optinterp3. Let's take a look at how a couple of the new ops are implemented now:

case BfOpKind::INC_PTR:
  assm.add(dataptr, op.argument);
  break;

As before with the interpreters, an increment of 1 is replaced by the addition of an argument. We use a different instruction for this - add instead of inc [4]. How about something more interesting:

case BfOpKind::LOOP_MOVE_DATA: {
  // Only move if the current data isn't 0:
  //
  //   cmpb 0(%r13), 0
  //   jz skip_move
  //   <...> move data
  // skip_move:
  asmjit::Label skip_move = assm.newLabel();
  assm.cmp(asmjit::x86::byte_ptr(dataptr), 0);
  assm.jz(skip_move);

  assm.mov(asmjit::x86::r14, dataptr);
  if (op.argument < 0) {
    assm.sub(asmjit::x86::r14, -op.argument);
  } else {
    assm.add(asmjit::x86::r14, op.argument);
  }
  // Use rax as a temporary holding the value at the original pointer;
  // then use al to add it to the new location, so that only the target
  // byte is affected: addb %al, 0(%r14)
  assm.movzx(asmjit::x86::rax, asmjit::x86::byte_ptr(dataptr));
  assm.add(asmjit::x86::byte_ptr(asmjit::x86::r14), asmjit::x86::al);
  assm.mov(asmjit::x86::byte_ptr(dataptr), 0);
  assm.bind(skip_move);
  break;
}

I'll just note again how much simpler this code is to write with asmjit than without it. Also note the careful handling of the byte-sized cells when touching memory - I ran into a number of nasty bugs while developing this. In fact, using the native machine word size (64 bits in this case) for BF memory cells would have made everything much simpler; 8-bit cells are closer to the common semantics of the language and provide an extra challenge.
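
For contrast, the simplest of the new loop ops needs no labels or temporaries at all; LOOP_SET_TO_ZERO presumably boils down to a single byte store. A sketch of what that case could look like (the actual optasmjit code may differ slightly):

case BfOpKind::LOOP_SET_TO_ZERO:
  // [-] just clears the current cell: movb $0, 0(%r13)
  assm.mov(asmjit::x86::byte_ptr(dataptr), 0);
  break;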

Performance

Let's see how optasmjit fares against the fastest interpreter and the unoptimized JIT - 0.93 seconds for mandelbrot, 0.3 seconds for factor - another factor of 3 in performance:

BF opt3 vs simplejit vs optasmjit

Notably, the performance delta with the optimized interpreter is huge: the JIT is more than 4x faster. If we compare it all the way to the initial simple interpreter, optasmjit is about 40x faster - making it hard to even compare on the same chart :-)

BF full performance comparison for part 2

JITs are fun!

I find writing JITs lots of fun. It's really nice to be able to hand-craft every instruction emitted by the compiler. While this is quite painful to do without any encoding help, libraries like asmjit make the process much more pleasant.

We've done quite a bit in this part of the series. optasmjit is a genuine optimizing JIT for BF! It:

  • Parses BF source
  • Translates it to a sequence of higher-level ops
  • Optimizes these ops
  • Compiles the ops to tight x64 assembly in memory and runs them

Let's connect these steps to some real compiler jargon. BfOpKind ops can be seen as the compiler IR. Translation of human-readable source code to IR is often the first step in compilation (though it is itself sometimes divided into multiple steps for realistic languages). The translation/compilation of ops to assembly is often called "lowering"; in some compilers this involves multiple steps and intermediate IRs.

I left a lot of code out of the blog post - otherwise it would be huge! I encourage you to go back through the full source files discussed here and understand what's going on - every JIT is a single standalone C++ file.


[1]I said traditionally because many modern compilers no longer work this way. For example, LLVM compiles IR to another, much lower-level IR that represents machine-code level instructions; assembly can be emitted from this IR, but also machine code directly - so the assembler is integrated into the compiler.
[2]Some compilers would do two passes; this is similar to our first interpreter optimization in part 1: the first pass collects information (such as location of all matching ]s), so the second pass already knows what offsets to emit.
[3]Please refer to asmjit's documentation for the full scoop. I'll also mention that asmjit has a "compiler" layer which does more sophisticated things like register allocation; in this post I'm only using the base assembly layer.
[4]Wondering whether we could have just used add 1 instead of inc in the first place? Certainly! In fact, while there probably used to be a good reason for a separate inc instruction, in these days of complex multi-port pipelined x64 CPUs, it's not clear which one is faster. I just wanted to show both for diversity.

Adventures in JIT compilation: Part 1 - an interpreter



This is the first post in a series about JIT compilers. The plan is to take a simple input language and develop some interpreters and JITs for it, in roughly increasing order of complexity. It's my hope that by the end of the series readers will have a good understanding of what it takes to develop a JIT compiler and what tools are available to assist with the task.

The input language will be Brainfuck, or BF as I'll be referring to it from now on. I think it's a good language for the purpose since it really boils programmability down to the essentials. Even though it's very tedious to program in, BF is fairly "mainstream" as far as programming languages go, with concepts like memory pointers and loops mapping directly to familiar C constructs.

As the implementation language I'll be using C++. This is, perhaps, not the most commonly used "starter" language; that said, most compilers I know are written in C++ (or C), and hence many of the most popular low-level code-generation libraries in existence are in these languages. In later parts of this series we'll be using some C++ libraries, and this is by far the easiest to do from C++ itself. Besides, I try to keep my code straightforward throughout the series - there is very little use of advanced C++ features here.

The BF language

The BF language is simple to describe, but I won't do that in full here. Please take a look at the spec, read the Wikipedia page, or one of the other existing resources. An in-browser interpreter such as this one can be very useful.

I'll just give an example to develop a taste for the language. The following BF program prints the numbers 1 to 5 to the screen:

++++++++ ++++++++ ++++++++ ++++++++ ++++++++ ++++++++
>+++++
[<+.>-]

Here's what it does:

  • Line 1 initializes memory cell 0 to the value 48, which happens to be the ASCII code for 0.
  • Line 2 initializes memory cell 1 to 5, which is our loop counter.
  • Line 3 is a loop that, at each iteration, increments cell 0 and prints its value out, then decrements cell 1 and checks if it has reached the value 0.

A simple interpreter

To get an initial feel for the language and to have a reliable reference implementation, we'll start with a simple interpreter that processes one BF character at a time and does what's necessary to "execute" it.

One of the reasons for my choosing BF as the source language is its simplicity. You'll find a lot of tutorials online that purport to develop interpreters or compilers but end up focusing 90% of their time on writing the parser. I think the later stages of compilation are much more interesting, so my "parser" for BF looks like this:

struct Program {
  std::string instructions;
};

Program parse_from_stream(std::istream& stream) {
  Program program;

  for (std::string line; std::getline(stream, line);) {
    for (auto c : line) {
      if (c == '>' || c == '<' || c == '+' || c == '-' || c == '.' ||
          c == ',' || c == '[' || c == ']') {
        program.instructions.push_back(c);
      }
    }
  }
  return program;
}

Note that this is a valid implementation according to the BF spec: all characters except the 8 supported ones are to be treated as comments and ignored [1]. This parser is going to serve us throughout the series.

With that out of the way, here's the actual interpreter:

constexpr int MEMORY_SIZE = 30000;

void simpleinterp(const Program& p, bool verbose) {
  // Initialize state.
  std::vector<uint8_t> memory(MEMORY_SIZE, 0);
  size_t pc = 0;
  size_t dataptr = 0;

  while (pc < p.instructions.size()) {
    char instruction = p.instructions[pc];
    switch (instruction) {
    case '>':
      dataptr++;
      break;
    case '<':
      dataptr--;
      break;
    case '+':
      memory[dataptr]++;
      break;
    case '-':
      memory[dataptr]--;
      break;
    case '.':
      std::cout.put(memory[dataptr]);
      break;
    case ',':
      memory[dataptr] = std::cin.get();
      break;

    // [...]

All these cases are rather trivial. The more interesting ones are the control flow ops - [ and ]. We'll start with [, which jumps forward past the matching ] if the current data location is zero. This op makes it possible to skip a loop or implement a simple if-like condition.

case '[':
  if (memory[dataptr] == 0) {
    int bracket_nesting = 1;
    size_t saved_pc = pc;

    while (bracket_nesting && ++pc < p.instructions.size()) {
      if (p.instructions[pc] == ']') {
        bracket_nesting--;
      } else if (p.instructions[pc] == '[') {
        bracket_nesting++;
      }
    }

    if (!bracket_nesting) {
      break;
    } else {
      DIE << "unmatched '[' at pc=" << saved_pc;
    }
  }
  break;

The most important thing to note here is that the [ and ] brackets in BF can be nested; therefore, when figuring out where to jump, we have to find the matching bracket. If this seems like something wasteful to do at run-time, you're right - keep reading!

For ] we do something very similar. In BF, ] jumps back to the matching [ if the current data location is not zero. This is how loops advance to the next iteration (or stop).

case ']':
  if (memory[dataptr] != 0) {
    int bracket_nesting = 1;
    size_t saved_pc = pc;

    while (bracket_nesting && pc > 0) {
      pc--;
      if (p.instructions[pc] == '[') {
        bracket_nesting--;
      } else if (p.instructions[pc] == ']') {
        bracket_nesting++;
      }
    }

    if (!bracket_nesting) {
      break;
    } else {
      DIE << "unmatched ']' at pc=" << saved_pc;
    }
  }
  break;

And that's it! The interpreter loop concludes with:

  default: { DIE << "bad char '" << instruction << "' at pc=" << pc; }
  }

  pc++;
}

The full code for this simple interpreter can be found in simpleinterp.cpp.

Measuring the performance of BF programs

Whenever we develop something like an interpreter or compiler, execution speed is a paramount concern. Therefore, it's common for compiler writers to have benchmark suites they refer to for measurements. For BF, I'll be using a couple of programs throughout the series to measure how fast our implementation is. One is a Mandelbrot generator; another is a factorization program invoked on the large-ish prime 179424691 [2].

The simple interpreter shown above takes 38.6 seconds on mandelbrot and 16.5 seconds on factor [3]. Now let's see how we can greatly improve these numbers.

Optimized interpreter - take 1

The most obvious optimization opportunity for the simple interpreter is to avoid laboriously looking for the matching bracket every time a [ or ] is encountered. Imagine a realistic program with a hot inner loop (by "hot" here I mean it runs many, many - possibly billions - of times throughout the execution of the program). Is it really necessary to scan the source to find the matching bracket every single time? Of course not. We can just precompute these jump destinations ahead of time, since the BF program doesn't change throughout its execution.

This is the idea behind optinterp.cpp - our first optimized interpreter. Much of the code is the same as for the simple interpreter, so I'll just highlight the differences. A crucial addition is this function, which is run before the actual interpretation happens:

std::vector<size_t> compute_jumptable(const Program& p) {
  size_t pc = 0;
  size_t program_size = p.instructions.size();
  std::vector<size_t> jumptable(program_size, 0);

  while (pc < program_size) {
    char instruction = p.instructions[pc];
    if (instruction == '[') {
      int bracket_nesting = 1;
      size_t seek = pc;

      while (bracket_nesting && ++seek < program_size) {
        if (p.instructions[seek] == ']') {
          bracket_nesting--;
        } else if (p.instructions[seek] == '[') {
          bracket_nesting++;
        }
      }

      if (!bracket_nesting) {
        jumptable[pc] = seek;
        jumptable[seek] = pc;
      } else {
        DIE << "unmatched '[' at pc=" << pc;
      }
    }
    pc++;
  }

  return jumptable;
}

It computes the jump destinations for all the [ and ] ops in the program. Its operation is essentially identical to the scanning forward / backward for a matching bracket in the main loop of the simple interpreter. The result is the vector jumptable, where for every [ and ] at offset i in the program, jumptable[i] holds the offset of the matching bracket. For any other op at offset i, jumptable[i] is simply 0.

The actual main loop of optinterp is the same as in simpleinterp, except for the clauses for brackets, which become simply:

case '[':
  if (memory[dataptr] == 0) {
    pc = jumptable[pc];
  }
  break;
case ']':
  if (memory[dataptr] != 0) {
    pc = jumptable[pc];
  }
  break;

As you'd expect, optinterp is quite a bit faster; it takes only 18.4 seconds to run mandelbrot and 6.7 seconds to run factor - more than a factor of 2 improvement!

BF interpreter runtime plot

Optimized interpreter - take 2

The optimization applied in the previous section was very beneficial, but it's also fairly trivial - we avoid completely unnecessary work at run-time, if we can just precompute it at compile time. To make our interpreter even faster, we'll have to get more creative.

The first step in optimizing anything is measuring and profiling the current code. Some past experience helps avoid needless steps in this process. For example, it's fairly clear that almost 100% of the run-time of the interpreter will be spent in the single function that interprets the program; therefore, function/call profiling won't be of much help.

The main loop is fairly small, however, and there doesn't appear to be much to optimize at first glance (disregarding micro-optimizations, which I won't worry about here). Well, except that this loop runs for every BF instruction executed, so it can run tons of times. So what we'll do is get a breakdown of the ops that execute during a typical program run. The code for optinterp already has this tracing included - it's protected with the BFTRACE preprocessor macro because it's costly and we want to avoid doing it in "real" runs.
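
The op-counting part of that tracing is conceptually tiny - a per-op counter bumped inside the main loop and dumped at exit. A rough sketch of the idea (the actual tracing code in optinterp may be organized differently):

#include <cstddef>
#include <iostream>
#include <map>

// Histogram of executed BF ops, filled only when tracing is enabled.
std::map<char, size_t> op_count;

// Inside the main interpreter loop, for every instruction executed:
//
//   #ifdef BFTRACE
//     op_count[instruction]++;
//   #endif

// At the end of the run, dump the histogram and the total count.
void dump_op_trace() {
  size_t total = 0;
  for (const auto& kv : op_count) {
    std::cout << kv.first << "  -->  " << kv.second << "\n";
    total += kv.second;
  }
  std::cout << ".. Total: " << total << "\n";
}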

Here's the execution profile we get from a typical run of the factor benchmark on the prime 179424691. On the left is the operation, and on the right the number of times it was executed by the interpreter for the program at hand:

.  -->  21
,  -->  10
+  -->  212,428,900
]  -->  242,695,606
<  -->  1,220,387,704
-  -->  212,328,376
>  -->  1,220,387,724
[  -->  118,341,127
.. Total: 3,226,569,468

A couple of immediate observations:

  1. The total number of operations is huge: over 3 billion times around the main interpreter loop. It's a good thing we're using C++ for the interpreter - running 3 billion iterations of anything in a higher level language would be painful!
  2. The ratio of pointer movement instructions to loops is suspiciously high. There's something like 242 million loop iterations executed (the count for ]) but a total of 2.4 billion pointer moves: < and >.

We'd expect and hope for the hot loops to be short and tight - why is every loop doing so much?

A cursory glance at the source of factor.bf provides a clue. Here's a representative snippet:

  <<<<<<<<<<<<<<<<<[<<<<<<<<<<]>>>>>>>>>>
  [>>>>>>>>[-]<<[->+<]<[->>>+<<<]>>>>>]<<<<<<<<<<
  [+>>>>>>>[-<<<<<<<+>>>>>>>[-<<<<<<<->>>>>>+>
           [-<<<<<<<+>>>>>>>[-<<<<<<<->>>>>>+>
           [-<<<<<<<+>>>>>>>[-<<<<<<<->>>>>>+>
           [-<<<<<<<+>>>>>>>[-<<<<<<<->>>>>>+>
           [-<<<<<<<+>>>>>>>]]]]]]]]]<<<<<<<
           [->>>>>>>+<<<<<<<]-<<<<<<<<<<]
  >>>>>>>
  [-<<<<<<<<<<<+>>>>>>>>>>>]
    >>>[>>>>>>>[-<<<<<<<<<<<+++++>>>>>>>>>>>]>>>]<<<<<<<<<<
  [+>>>>>>>>[-<<<<<<<<+>>>>>>>>[-<<<<<<<<->>>>>+>>>
            [-<<<<<<<<+>>>>>>>>[-<<<<<<<<->>>>>+>>>
            [-<<<<<<<<+>>>>>>>>[-<<<<<<<<->>>>>+>>>
            [-<<<<<<<<+>>>>>>>>[-<<<<<<<<->>>>>+>>>
            [-<<<<<<<<+>>>>>>>>]]]]]]]]]<<<<<<<<
            [->>>>>>>>+<<<<<<<<]-<<<<<<<<<<]
  >>>>>>>>[-<<<<<<<<<<<<<+>>>>>>>>>>>>>]>>
  [>>>>>>>>[-<<<<<<<<<<<<<+++++>>>>>>>>>>>>>]>>]<<<<<<<<<<
  [<<<<<<<<<<]>>>>>>>>>>
  >>>>>>
]
<<<<<<

Note the fairly long sequences of <<<<<<<, >>>>>>>. Just a bit of thought about the semantics of BF makes this clear - these are necessary to get anything done because we want to be able to get from cell to cell to update data.

Now let's think what it means to execute a sequence such as <<<<<<< in our interpreter. Our main loop executes 7 times, each time doing:

  1. Advance pc and compare it to program size.
  2. Grab the instruction at pc.
  3. Switch on the value of the instruction to the right case.
  4. Execute the case.

That's quite expensive. What if we could compress all the long sequences of <<<<<<<? After all, what we do for a single < is:

case '<':
  dataptr--;
  break;

So for seven <s we could do:

case ... something representing 7 <s ...:
  dataptr -= 7;
  break;

This is easy to generalize. We can detect any consecutive run of the same op in the BF source and encode it as a pair: the operation, and a repetition count. Then at execution time we simply repeat the op the required number of times.

The full code for this interpreter is optinterp2.cpp. Previously, we kept a separate jump table correlated to the [ and ] instructions in the input program. Now we need extra information for every BF instruction, so we'll just translate the Program into a sequence of ops of the type:

enum class BfOpKind {
  INVALID_OP = 0,
  INC_PTR,
  DEC_PTR,
  INC_DATA,
  DEC_DATA,
  READ_STDIN,
  WRITE_STDOUT,
  JUMP_IF_DATA_ZERO,
  JUMP_IF_DATA_NOT_ZERO
};

// Every op has a single numeric argument. For JUMP_* ops it's the offset to
// which a jump should be made; for all other ops, it's the number of times the
// op is to be repeated.
struct BfOp {
  BfOp(BfOpKind kind_param, size_t argument_param)
      : kind(kind_param), argument(argument_param) {}

  BfOpKind kind = BfOpKind::INVALID_OP;
  size_t argument = 0;
};

The interpretation happens in two steps. First we run translate_program to read the program and generate a std::vector<BfOp>. This translation is pretty straight-forward: it detects repetitions in ops like < and encodes them in the argument field. A slightly tricky aspect here is handling the jumps, since the offsets of all ops in the program change (a run of seven consecutive <s turns into a single DEC_PTR, for example). Take a look at the code for the full details.
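
To give a flavor of the repetition detection, here's a sketch of the core idea for the non-bracket ops, assuming p is the parsed Program, ops is the output vector and pc is the current position; this is not the actual translate_program, which also has to handle [ and ], whose targets must be recomputed in terms of the new op offsets:

// Count how many times the current instruction repeats, starting at pc,
// then emit a single op carrying the repetition count as its argument.
char c = p.instructions[pc];
size_t start = pc;
while (pc < p.instructions.size() && p.instructions[pc] == c) {
  pc++;
}
size_t num_repeats = pc - start;

switch (c) {
case '>':
  ops.push_back(BfOp(BfOpKind::INC_PTR, num_repeats));
  break;
case '<':
  ops.push_back(BfOp(BfOpKind::DEC_PTR, num_repeats));
  break;
case '+':
  ops.push_back(BfOp(BfOpKind::INC_DATA, num_repeats));
  break;
case '-':
  ops.push_back(BfOp(BfOpKind::DEC_DATA, num_repeats));
  break;
case '.':
  ops.push_back(BfOp(BfOpKind::WRITE_STDOUT, num_repeats));
  break;
case ',':
  ops.push_back(BfOp(BfOpKind::READ_STDIN, num_repeats));
  break;
}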

As planned, the main interpreter loop becomes:

switch (kind) {
case BfOpKind::INC_PTR:
  dataptr += op.argument;
  break;
case BfOpKind::DEC_PTR:
  dataptr -= op.argument;
  break;
case BfOpKind::INC_DATA:
  memory[dataptr] += op.argument;
  break;
case BfOpKind::DEC_DATA:
  memory[dataptr] -= op.argument;
  break;
// [...] etc.

How fast is it? The mandelbrot benchmark now takes 11.9 seconds, and factor takes 3.7 seconds; another 40% reduction in run-time.

BF interpreter runtime plot with opt2

Optimized interpreter - take 3

Our optimized interpreter now runs the mandelbrot benchmark more than 3x faster than the original, naive interpreter. Can we do even better?

First, let's take a look at instruction tracing for optinterp2, repeating the previous experiment:

.  -->  21
]  -->  242,695,606
,  -->  10
+  -->  191,440,613
<  -->  214,595,790
-  -->  205,040,514
>  -->  270,123,690
[  -->  118,341,127
.. Total: 1,242,237,371

The total instruction count went down almost 3x. Also, now the number of BF loop executions is more comparable to the number of other instructions, meaning that we don't do too much work in every iteration. This was our goal with the optimization of repetitions, after all.

In fact, this execution profile is annoyingly flat. Performance gurus don't like flat profiles because there's nothing in particular sticking out that one could optimize. This usually means we should measure / trace something else as well.

An interesting question worth answering is: what is every BF loop doing? In other words, what are the hottest loops we run, and can we spend some more specialized effort to optimize them? This requires more sophisticated tracing machinery, which I've already included in the code of optinterp2. This machinery traces loops and records the instruction sequence executed by each loop iteration in the program. It then sorts them by the number of appearances and shows the most common (hottest) loops. Here is the result for the factor benchmark:

-1<10+1>10      --> 32,276,219
-1              --> 28,538,377
-1<4+1>4        --> 15,701,515
-1>3+1>1+1<4    --> 12,581,941
-1>3+1>2+1<5    --> 9,579,970
-1<3+1>3        --> 9,004,028
>3              --> 8,911,600
-1<1-1>1        --> 6,093,976
-1>3+1<3        --> 6,085,735
-1<1+1<3+1>4    --> 5,853,530
-1>3+2<3        --> 5,586,229
>2              --> 5,416,630
-1>1+1<1        --> 5,104,333

What do these traces mean? The first, most common one says:

  1. Decrement current memory cell
  2. Move 10 cells to the left
  3. Increment current memory cell
  4. Move 10 cells to the right

The loop doing this was executed 32 million times! Similarly, a loop doing the simple "decrement the current cell" was executed 28 million times. If you look in the source of factor.bf, these loops are easy to spot. The first one is [-<<<<<<<<<<+>>>>>>>>>>]; the second one is just [-].

What if we could optimize these loops away entirely? After all, they are doing something that is much easier to express in a higher-level language. [-] merely sets the current memory cell to 0. [-<<<<<<<<<<+>>>>>>>>>>] is more involved, but not much more: it adds the value of the current memory cell to the cell 10 positions to the left, and clears the current cell. The trace shown above features many loops of this kind, along with another family: loops like [>>>] move the pointer to the right in jumps of 3 until it lands on a zero cell.
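
Expressed directly in terms of the interpreter's state (memory and dataptr), these hot loops compute the following; a small illustrative sketch, not code from the repository:

// What the hot loops compute, in terms of the interpreter's state
// (memory is the vector of uint8_t cells, dataptr the current cell index).

// [-]  -- clear the current cell:
memory[dataptr] = 0;

// [-<<<<<<<<<<+>>>>>>>>>>]  -- add the current cell into the cell 10
// positions to the left, then clear the current cell (cell arithmetic
// wraps around at 256, just like in BF):
memory[dataptr - 10] += memory[dataptr];
memory[dataptr] = 0;

// [>>>]  -- skip right in steps of 3 until a zero cell is reached:
while (memory[dataptr] != 0) {
  dataptr += 3;
}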

In optinterp2 we've added higher-level ops to the interpreter. We can add some even higher-level ops to optimize away these loops. optinterp3.cpp does just that. It adds a few more operation kinds for encoding common loops:

enum class BfOpKind {
  INVALID_OP = 0,
  INC_PTR,
  DEC_PTR,
  INC_DATA,
  DEC_DATA,
  READ_STDIN,
  WRITE_STDOUT,
  LOOP_SET_TO_ZERO,
  LOOP_MOVE_PTR,
  LOOP_MOVE_DATA,
  JUMP_IF_DATA_ZERO,
  JUMP_IF_DATA_NOT_ZERO
};

The new ops are LOOP_SET_TO_ZERO which replaces [-], LOOP_MOVE_PTR for loops like [>>>] and LOOP_MOVE_DATA for loops like [-<<<+>>>]. We'll now need a slightly more sophisticated translation step that detects these loops in the input program and emits the proper LOOP_* ops. For an example of how it's done, here's the translation for [-]:

std::vector<BfOp> optimize_loop(const std::vector<BfOp>& ops,
                                size_t loop_start) {
  std::vector<BfOp> new_ops;

  if (ops.size() - loop_start == 2) {
    BfOp repeated_op = ops[loop_start + 1];
    if (repeated_op.kind == BfOpKind::INC_DATA ||
        repeated_op.kind == BfOpKind::DEC_DATA) {
      new_ops.push_back(BfOp(BfOpKind::LOOP_SET_TO_ZERO, 0));

  // [...]

This function is called when translating the BF program to a sequence of ops. loop_start is the index in ops where the most recent loop starts. The code shown above detects the case where the only content of the loop is a single - (or +, since in BF memory cells hold unsigned values with wrap-around). In such cases, a LOOP_SET_TO_ZERO op is emitted. When the interpreter itself runs into a LOOP_SET_TO_ZERO, it does just what you'd expect:

case BfOpKind::LOOP_SET_TO_ZERO:
  memory[dataptr] = 0;
  break;

The other loop optimizations are a tiny bit more involved, but all follow the same basic idea.
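
To make "the same basic idea" a bit more concrete, here's roughly what the interpreter cases for the other two loop ops could look like; this is a sketch (in particular, it assumes op.argument is a signed offset here), and the actual optinterp3 code may differ in details:

case BfOpKind::LOOP_MOVE_PTR:
  // [>>>]-style loop: move the pointer in steps of op.argument until a
  // zero cell is reached.
  while (memory[dataptr]) {
    dataptr += op.argument;
  }
  break;
case BfOpKind::LOOP_MOVE_DATA:
  // [-<<<+>>>]-style loop: add the current cell into the cell at
  // dataptr + op.argument, then clear the current cell.
  if (memory[dataptr]) {
    memory[dataptr + op.argument] += memory[dataptr];
    memory[dataptr] = 0;
  }
  break;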

We expect this optimization to be very significant - we've just taken some of the hottest loops the program runs and folded them into a single, efficient instruction (or a sequence of efficient instructions for pointer-movement loops). And indeed, optinterp3 is very fast: 3.9 seconds on mandelbrot and 1.97 seconds on factor.

BF interpreter runtime plot with opt3

The overall speedup is dramatic. optinterp3 is almost 10x faster than simpleinterp on our benchmarks. While we could certainly make it even faster, I think these optimizations are sufficient for our needs; let's talk about what we can learn from them instead.

On compilers, bytecode and tracing JITs

It turns out there's a surprising amount of insight to be gained from the exercise this post went through.

First, let's start with the distinction between compilers and interpreters. According to Wikipedia, a compiler is:

a computer program (or a set of programs) that transforms source code written in a programming language (the source language) into another computer language (the target language), with the latter often having a binary form known as object code.

gcc would be the canonical example of this: it transforms source code written in C (or C++) into assembly language for, say, Intel CPUs. But there are many other kinds of compilers: Chicken compiles Scheme into C; Rhino compiles JavaScript to JVM bytecode; the Clang frontend compiles C++ to LLVM IR; CPython, the canonical Python implementation, compiles Python source code to bytecode; and so on. In general, the term bytecode refers to any intermediate representation / virtual instruction set designed for efficient interpretation.

Based on this definition, while simpleinterp is indeed just a BF interpreter, the optimized interpreters described here are more like compilers + bytecode interpreters. Consider optinterp3 for example. The source language is BF; the target language is bytecode with the following instruction set:

enum class BfOpKind {
  INVALID_OP = 0,
  INC_PTR,
  DEC_PTR,
  INC_DATA,
  DEC_DATA,
  READ_STDIN,
  WRITE_STDOUT,
  LOOP_SET_TO_ZERO,
  LOOP_MOVE_PTR,
  LOOP_MOVE_DATA,
  JUMP_IF_DATA_ZERO,
  JUMP_IF_DATA_NOT_ZERO
};

... where each instruction has a single argument. optinterp3 first compiles BF to this bytecode, and only then executes the bytecode. So if we squint a bit, there's a JIT compiler here already - with the caveat that the compilation target is not executable machine code but rather this specialized bytecode. Worry not - we'll get to a real JIT in the next part of the series.

Finally, I'd like to point out that the loop optimizations performed in optinterp3 are the static version of a tracing JIT. We used tracing to observe which loops occur most commonly in our benchmarks, and optimized these loops. While the loops we optimized were very generic and surely appear in most large BF programs, we could take it further. We could optimize even more of the hottest loops, but with time we'd get into more specialized paths in our particular benchmarks.

To be fully generic, we'd have to defer this optimization to run-time, which is what a tracing JIT does. A tracing JIT interprets code in the source language and keeps track of the hottest loops (and for dynamic languages, of the actual run-time types of values flowing through the loops). When the same loop goes over some threshold (say, invoked more than a million times) the loop is optimized and compiled into efficient machine code.

Parting words for part 1

In this post, we've seen an interpreter for BF being gradually refined from a naive approach to a compile-to-optimized-bytecode approach, speeding it up 10x in the process. Hopefully this provides a good feel for the source language, as well as some of the tradeoffs involved in optimizing for it.

In the next part of the series I'll present an actual JIT for BF - compiling BF into tight x64 machine code and invoking it, all at run-time. I'll show how to construct such a JIT compiler entirely from scratch (using nothing but the standard system libraries) and also how to use assembly encoding libraries for easier development. Stay tuned - it's going to be fun!


[1]An observant reader will note I could have used a switch statement or a lookup table here for more efficiency. I'm an avid adherent to the "until you've measured it, it ain't slow" philosophy. The parsing stage of BF takes negligible time for any realistic program, so it's hardly important to optimize this part. On my machine, the largest BF program I could find (Mandelbrot) is ~11,000 instructions and takes 360 microseconds to parse, most of which is almost certainly dominated by the file read time.
[2]

I'm using two different programs to prevent overfitting for one particular benchmark. Naturally it would be more professional to use a whole suite of benchmarks, but this is just a hobby blog post so let's not overdo it B-)

In fact, even real-world benchmark suites for large projects tend to be poor approximations for the real world.

[3]

All the performance numbers for this series were collected on my Haswell Linux (Ubuntu 14.04) box; the C++ code was compiled with the default gcc 4.8.4. I got slightly different results with Clang in some cases, but this series is not about comparing host C++ compilers, so I'll just use gcc throughout.

It's also worth noting these execution times are end-to-end for the whole binary and include loading the interpreter binary, reading the BF file, doing any pre-processing / optimization / code emission and actually running the benchmark. For any run time beyond ~100 ms, all secondary factors are negligible compared to the actual benchmark run-time (the time it took to execute the BF program at hand).


reStructuredText vs. Markdown for technical documentation



Markdown and reStructuredText are two markup languages with plain text formatting syntax designed for easy input with any text editor. Each has a whole host of tools that can convert marked up text to publishing formats like HTML or PDF.

Software developers these days have to be familiar with such markup languages because they serve as the basis for many documentation systems. In this post I want to examine the tradeoffs between Markdown and reStructuredText from the point of view of a programmer.

Where Markdown shines

The history of markup languages for describing complex document structure with rich formatting through textual input is long and illustrious, dating back at least to the early 1970s with troff and later on TeX. In the 1990s these formats escaped the specialized dominion of mathematicians and programmers, as a multitude of people went online and wanted to interact through mediums like forums. This led to the birth of markup languages like BBCode in 1998.

Markdown came a bit later, in 2004, and really pushed the concept over the brink of ubiquitousness. Was it because of the relative fame of its inventors (John Gruber and Aaron Swartz)? Or maybe it was just a good idea in the right place at the right time, coupled with a catchy name? It's hard to say now, but one thing is certain - these days Markdown is the big gorilla in any discussion of textual markup languages. It's likely to be the first thing on one's mind when thinking of a technology to use for, say, documentation or textual entry into some program.

Markdown logo

So the best thing about Markdown is, IMHO, its popularity. It's a natural choice since it's so familiar, and one can find tools in almost any conceivable programming language and environment for parsing and munging it. Just by virtue of being the default markup language for StackOverflow, Reddit and Github, Markdown is probably well familiar to most developers these days.

Where reST shines

reStructuredText's initial release dates back to 2002, actually predating Markdown. The problem is, it lived in relative obscurity for most of its life, confined to some parts of the Python community. The core Python documentation has been written in reST for quite a while, but only after the release of Sphinx has it seen serious uptake outside that community. These days reST is taken more seriously - Github supports it for pages and wikis, and some major projects use it by default for their documentation - including the Linux kernel, OpenCV and LLVM/Clang.

To me, reST stands out against Markdown in three main aspects, which I'll cover in detail:

  1. It's more fully-featured.
  2. It's much more standardized and uniform.
  3. It has built-in support for extensions.

More features

reST comes with more built-in features for writing more complex documents. Some examples I personally use the most: footnotes, tables, citations, tables of contents. There is no standard way of doing these in Markdown, which is a problem because these and other features are important for implementing complete documentation systems. Sure, these can be added as extensions; but Markdown doesn't have a standard extension mechanism, which means that every system develops its own non-standard way of doing things. Which leads me to...

More standardized and uniform

The original Markdown syntax was defined de-facto by its initial implementation; there was no real standard to speak of, and the built-in assumptions and bugs of the initial implementation became unofficially baked in. There's a long and fascinating background story about the standardization attempt of Markdown led by Jeff Atwood (for the sake of StackOverflow); you can easily google for it.

I'll just point to the CommonMark spec, which is the result of this attempt. It has a section named "Why is a spec needed" listing some of the underspecified aspects of Markdown; it's worth reading.

Due to this, what ends up happening is that there's no single Markdown. There is a multitude of related markup languages with a common core, some more conformant than others. When features are missing, sites/tools usually roll their own custom extension, which isn't coordinated with other sites/tools.

reST, on the contrary, has a fairly comprehensive spec and a single canonical implementation that is still being actively developed - the docutils project. Sure, there are alternative implementations (such as a JS one for client-side rendering), but these at least can follow the written-down spec. Therefore, there's really just one reST, and the source you write is likely to work in multiple systems.

A class diagram for docutils parser

Built-in support for extensions

As discussed before, Markdown implementations are all over the place when it comes to features beyond the commonly-agreed-upon core. reST is very different. Extension is a core design principle, and both custom roles (for inline elements) and directives (for block elements) can be easily added. It's therefore straightforward to add extensions for commonly-needed stuff like syntax-highlighted code blocks, math equations with LaTeX rendering, and so on.

With Markdown, to add an extension one has to modify the parser, which makes every Markdown implementation out there an island of its own. In reST, adding an extension is just an API call in docutils. With this in hand, documentation systems like Sphinx and static website generators like Pelican heavily customize their reST input language while using the original docutils parser.

Conclusion

So, which one to choose? I'd say this depends on the use case. For fully-fledged documentation of a large (or small) software project, I'd definitely go with reST, most likely using Sphinx. I hope this post managed to convey why reST is a better choice for this scenario.

For a simple markup system for use in things like forum comments or marking up chat messages, the decision is trickier. Markdown is a good choice because more users would be familiar with it. On the other hand, if you're already using reST for something else, consistency is important too.