On the uses and misuses of panics in Go



Go has a unique approach to error handling, with a combination of explicit error values and an exception-like panic mechanism. In this post I'm looking at the philosophical aspects of panic, trying to understand some of the conflicting guidelines coming from the Go team.

Most of all, it's a story of pragmatism in language design, with some musings on the merits and dangers of a pragmatic approach.

Introduction - errors and exceptions

Many programming languages support exceptions as a standard way of handling errors - for example Java or Python. While certainly convenient, exceptions have many issues as well, which is why they're frowned upon by other languages or their style guides. The main criticism of exceptions is that they introduce a "side channel" for control flow; when reading the code, you also have to keep in mind the path in which exceptions can flow and this makes reasoning about some code very difficult [1].

Let's talk about error handling in Go to make this concrete. I assume you know how "standard" error handling in Go works - it's quite hard to miss! Here's how we open a file:

f, err := os.Open("file.txt")
if err != nil {
  // handle error here
}
// do stuff with f here

If the file is missing, say, os.Open will return a non-nil error. In some other languages, errors are done differently. For example, Python's built-in open function will raise an exception if something is wrong:

try:
  with open("file.txt") as f:
    # do stuff with f here
except OSError as err:
  # handle exception err here

While error handling via exceptions in Python is consistent, it's also a target of criticism because of its pervasiveness. Even iterators use exceptions to signal the end of a sequence (StopIteration). The main question is "what does exceptional really mean?". Here's a relevant comment by Rob Pike from the mailing list discussion where the modern incarnation of Go's panic/recover mechanism was proposed:

This is exactly the kind of thing the proposal tries to avoid. Panic and recover are not an exception mechanism as usually defined because the usual approach, which ties exceptions to a control structure, encourages fine-grained exception handling that makes code unreadable in practice. There really is a difference between an error and what we call a panic, and we want that difference to matter. Consider Java, in which opening a file can throw an exception. In my experience few things are less exceptional than failing to open a file, and requiring me to write inside-out code to handle such a quotidian operation feels like a Procrustean imposition.

To be fair, exceptions have proponents who scoff at Go's explicit error handling, for several reasons. For one, note the order of the code in the two minimal samples above. In Python the primary flow of the program immediately follows the open call, and the error handling is delegated to a later stage (not to mention that in many cases the exception will be caught higher up the call stack and not in this function at all). In Go, on the other hand, error handling comes first and may obscure the main flow of the program. Moreover, Go's error handling is very verbose - this being one of the major criticisms of the language. I'll mention one potential way to address this later in the post.

In addition to Rob's quote above, Go's philosophy towards exceptions is summarized well in the FAQ:

We believe that coupling exceptions to a control structure, as in the try-catch-finally idiom, results in convoluted code. It also tends to encourage programmers to label too many ordinary errors, such as failing to open a file, as exceptional.

However, in some cases having an exception-like mechanism is actually useful; in a high-level language like Go it's even essential, I'd say. This is why panic and recover exist.

The need to occasionally panic

Go is a safe language with run-time checks for some serious programming errors. For example, when you try to access a slice out of bounds, the result is not undefined behavior; rather, the Go runtime panics. Consider this minimal program:

package main

import (
  "fmt"
)

func main() {
  s := make([]string, 3)
  fmt.Println(s[5])
}

It dies with a runtime error followed by a stack trace:

panic: runtime error: index out of range

goroutine 1 [running]:
main.main()
  /tmp/sandbox209906601/main.go:9 +0x40

Other common things that panic include accessing nil structure fields through pointers, closing an already-closed channel, etc. What are the alternatives to panicking here? We could make slice access return a result, err pair, and we could also make slice element assignment a function that potentially returns an error, but this would result in cumbersome code. Imagine rewriting this snippet, where foo, bar and baz are all slices of strings:

foo[i] = bar[i] + baz[i]

Into something like:

br, err := bar[i]
if err != nil {
  return err
}
bz, err := baz[i]
if err != nil {
  return err
}
err = assign_slice_element(foo, i, br + bz)
if err != nil {
  return err
}

Looks like fun? Nope. Different languages handle this in different ways: Python and Java throw an exception if i points out of bounds in either of the slices/lists/arrays. C will emit code without bounds checking, with a good chance of accessing/trampling the wrong memory area, crashing the process or exposing it to security vulnerabilities. C++ takes the middle way, with some "performance oriented" modes that are unsafe and other modes (std::vector::at) that could throw an exception.

Since the verbosity of the rewritten snippet above is unacceptable, Go chose to have panics, which is an exception-like mechanism reserved for truly exceptional conditions such as bugs in the code.

This isn't restricted to built-ins; user code can also panic if needed. Sometimes code encounters errors that can only mean something went horribly wrong - a bug, or some key invariant being violated; for example, a switch case that just can't happen in the current context. The panic function exists for just such cases; it's morally equivalent to Python's raise and C++'s throw. That said, subtle but powerful restrictions on how panics can be recovered make Go's mechanism unique.

Restrictions on panic recovery in Go

When a panic is not caught/recovered anywhere in the calling stack, it will end up crashing the program with a stack dump, as shown above. This behavior is pretty useful for debugging, but not ideal in realistic scenarios. If we're writing a server that serves many clients, we don't want it to immediately crash because of an internal bug in some data parsing library it's using. It would be much better to catch such an error, log it, and keep serving other clients. Go's recover function can help with that. Here's a code sample from Effective Go demonstrating this:

func server(workChan <-chan *Work) {
    for work := range workChan {
        go safelyDo(work)
    }
}

func safelyDo(work *Work) {
    defer func() {
        if err := recover(); err != nil {
            log.Println("work failed:", err)
        }
    }()
    do(work)
}

Even though Go's panic and recover resemble exceptions at first glance, they come with some important and deliberate limitations that make them less prone to the common problems of exception-heavy code. Here's another quote from the FAQ:

Go also has a couple of built-in functions to signal and recover from truly exceptional conditions. The recovery mechanism is executed only as part of a function's state being torn down after an error, which is sufficient to handle catastrophe but requires no extra control structures and, when used well, can result in clean error-handling code.

Rob Pike's quote from the thread I linked to earlier is also relevant:

Our proposal instead ties the handling to a function - a dying function - and thereby, deliberately, makes it harder to use. We want you to think of panics as, well, panics! They are rare events that very few functions should ever need to think about. If you want to protect your code, one or two recover calls should do it for the whole program. If you're already worrying about discriminating different kinds of panics, you've lost sight of the ball.

The specific limitation is that recover can only be called in a defer code block, which cannot return control to an arbitrary point, but can only do clean-ups and tweak the function's return values. The Python file-opening exception handling shown above can't work in Go - we can't just catch OSError and try to open another file (or create the file we failed to open) without significant restructuring of the code.

This limitation is coupled with an important coding guideline - keep panics within package boundaries. It's a good practice for packages not to panic in their public interfaces. Rather, the public-facing functions and methods should recover from panics internally and translate them to error messages. This makes panics friendlier to use, though high-availability servers will still likely want to have outer recovers installed just in case.

The siren call of panic

Every language feature is destined to be misused; this is just a fact of programming life, and it's no different for Go's panic. This is not to say that all misuses are categorically wrong, though - just that the feature ends up being used for goals it was not originally designed to fulfill.

Consider this real example from the scanInt method in fmt/scan.go of the Go (1.10) standard library:

func (s *ss) scanInt(verb rune, bitSize int) int64 {
  if verb == 'c' {
    return s.scanRune(bitSize)
  }
  s.SkipSpace()
  s.notEOF()
  base, digits := s.getBase(verb)
  // ... other code
}

Each one of the methods SkipSpace, notEOF and getBase can fail, but where is the error handling? In fact, this package - like several others in the standard library - is using panics for some of its error handling internally. A panic from each of these will be recovered in the public API (like the Token method) and converted to an error. If we had to rewrite this code with explicit error handling, it would be more cumbersome, for sure [2]:

if err := s.SkipSpace(); err != nil {
  return err
}
if err := s.notEOF(); err != nil {
  return err
}
base, digits, err := s.getBase(verb)
if err != nil {
  return err
}
// ... other code

Of course, panic is not the only way to solve this. As Rob Pike says, Errors are Values and thus they are programmable, and we could devise some clever way to make the code flow better without using an exception-like escape mechanism. Other languages have useful features that would make it much simpler; for example Rust has the ? operator [3] that propagates an error returned from a given expression automatically, so in a hypothetical syntax we could write:

s.SkipSpace()?
s.notEOF()?
base, digits := s.getBase(verb)?

But we don't have this in Go (yet?), so the core Go team made the choice to use panics instead. They even condone this pattern in Effective Go:

With our recovery pattern in place, the do function (and anything it calls) can get out of any bad situation cleanly by calling panic. We can use that idea to simplify error handling in complex software.

And it's being used in several more places; a few I found with a quick search:

  • fmt/scan.go
  • json/encode.go
  • text/template/parse/parser.go

But isn't this... wrong?

I empathize with folks lured by the siren; its call is strong here! But I also can't shake off the feeling that this goes against the principles designed into the language. In the quote shown above Rob Pike says:

In my experience few things are less exceptional than failing to open a file

But what is less exceptional than running into an unexpected character while parsing? Isn't it the most common kind of error a parser encounters? Pike goes on to say:

We want you to think of panics as, well, panics! They are rare events that very few functions should ever need to think about.

But is a parsing error rare? And very many functions in fmt/scan.go have to "think about" panics because that's what they use for signaling errors!

If you're already worrying about discriminating different kinds of panics, you've lost sight of the ball.

But here is errorHandler from fmt/scan.go:

func errorHandler(errp *error) {
  if e := recover(); e != nil {
    if se, ok := e.(scanError); ok { // catch local error
      *errp = se.err
    } else if eof, ok := e.(error); ok && eof == io.EOF { // out of input
      *errp = eof
    } else {
      panic(e)
    }
  }
}

Is this not "worrying about discriminating different kinds of panics"?

Conclusion - pragmatism vs. purity

It's not my intention to attack the Go standard library developers here. As I've mentioned, I fully see why panics are attractive in some cases where call stacks are deep and sequences of error-signaling operations are common. I really hope Go will introduce some syntax that will make propagating an error easier, which would render this discussion moot.

Sometimes it's better to be a pragmatist than a zealot. If a certain language feature is really helpful in solving a problem, even outside of its classical domain of use, it may be better to use it than to stick ardently to principles and end up with convoluted code. It's a bit like my old defense of using goto for error handling in C. The Go guidelines are clear and the restrictions on recover are craftily placed; even when used for control flow in parsers, panics are much harder to misuse than classical exceptions.

Interestingly, when this problem first drew my attention I was looking into the source of the encoding/json package. It turns out that its decoder was recently fixed to use classical error handling! Yes, some code turned more verbose, from:

if destring {
  switch qv := d.valueQuoted().(type) {
  case nil:
    d.literalStore(nullLiteral, subv, false)
  case string:
    d.literalStore([]byte(qv), subv, true)
  // ... other code

To:

if destring {
  q, err := d.valueQuoted()
  if err != nil {
    return err
  }
  switch qv := q.(type) {
  case nil:
    if err := d.literalStore(nullLiteral, subv, false); err != nil {
      return err
    }
  case string:
    if err := d.literalStore([]byte(qv), subv, true); err != nil {
      return err
    }

But overall it's not that bad and certainly wouldn't look unfamiliar to a Go coder. And it gives me hope :-)


[1] C++'s set of exception safety guarantees is a good example of some of the complexities involved.
[2] If you spend some time reading the mailing list discussion where the recover mechanism was proposed, you'll find Russ Cox mentioning a similar issue with parsing a binary stream and how to propagate errors through the process.
[3] Even C++ has a similar pattern that you will find in some codebases where a standard return type is used. Macros commonly named ASSIGN_OR_RETURN are popular in C++ code released by Google and show up in other places like LLVM.

Launching Linux threads and processes with clone



Due to variation between operating systems and the way OS courses are taught, some programmers may have an outdated mental model about the difference between processes and threads in Linux. Even the name "thread" suggests something extremely lightweight compared to a heavy "process" - a mostly wrong intuition.

In fact, for the Linux kernel itself there's absolutely no difference between what userspace sees as processes (the result of fork) and as threads (the result of pthread_create). Both are represented by the same data structures and scheduled similarly. In kernel nomenclature these are called tasks (the main structure representing a task in the kernel is task_struct), and I'll be using this term from now on.

In Linux, threads are just tasks that share some resources, most notably their memory space; processes, on the other hand, are tasks that don't share resources. For application programmers, processes and threads are created and managed in very different ways. For processes there's a slew of process-management APIs like fork, wait and so on. For threads there's the pthread library. However, deep in the guts of these APIs and libraries, both processes and threads come into existence through a single Linux system call - clone.

The clone system call

We can think of clone as the unifying implementation shared between processes and threads. Whatever perceived difference there is between processes and threads on Linux is achieved through passing different flags to clone. Therefore, it's most useful to think of processes and threads not as two completely different concepts, but rather as two variants of the same concept - starting a concurrent task. The differences are mostly about what is shared between this new task and the task that started it.

Here is a code sample demonstrating the most important sharing aspect of threads - memory. It uses clone in two ways, once with the CLONE_VM flag and once without. CLONE_VM tells clone to share the virtual memory between the calling task and the new task clone is about to create [1]. As we'll see later on, this is the flag used by pthread_create:

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>

static int child_func(void* arg) {
  char* buf = (char*)arg;
  printf("Child sees buf = \"%s\"\n", buf);
  strcpy(buf, "hello from child");
  return 0;
}

int main(int argc, char** argv) {
  // Allocate stack for child task.
  const int STACK_SIZE = 65536;
  char* stack = malloc(STACK_SIZE);
  if (!stack) {
    perror("malloc");
    exit(1);
  }

  // When called with the command-line argument "vm", set the CLONE_VM flag on.
  unsigned long flags = 0;
  if (argc > 1 && !strcmp(argv[1], "vm")) {
    flags |= CLONE_VM;
  }

  char buf[100];
  strcpy(buf, "hello from parent");
  if (clone(child_func, stack + STACK_SIZE, flags | SIGCHLD, buf) == -1) {
    perror("clone");
    exit(1);
  }

  int status;
  if (wait(&status) == -1) {
    perror("wait");
    exit(1);
  }

  printf("Child exited with status %d. buf = \"%s\"\n", status, buf);
  return 0;
}

Some things to note when clone is invoked:

  1. It takes a function pointer to the code the new task will run, similarly to threading APIs, and unlike the fork API. This is the glibc wrapper for clone. There's also a raw system call which is discussed below.
  2. The stack for the new task has to be allocated by the parent and passed into clone.
  3. The SIGCHLD flag tells the kernel to send the SIGCHLD signal to the parent when the child terminates, which lets the parent use the plain wait call to wait for the child to exit. This is the only flag the sample passes into clone by default.

This code sample passes a buffer into the child, and the child writes a string into it. When called without the vm command-line argument, the CLONE_VM flag is off, and the parent's virtual memory is copied into the child. The child sees the message the parent placed in buf, but whatever it writes into buf goes into its own copy and the parent can't see it. Here's the output:

$ ./clone-vm-sample
Child sees buf = "hello from parent"
Child exited with status 0. buf = "hello from parent"

But when the vm argument is passed, CLONE_VM is set and the child task shares the parent's memory. Its writing into buf will now be observable from the parent:

$ ./clone-vm-sample vm
Child sees buf = "hello from parent"
Child exited with status 0. buf = "hello from child"

A bunch of other CLONE_* flags can specify other things that will be shared with the parent: CLONE_FILES will share the open file descriptors, CLONE_SIGHAND will share the signal dispositions, and so on.
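For instance, here's a sketch of my own (not from the post's repository), in the spirit of the sample above, that sets CLONE_FILES - along with CLONE_VM, so the child can report the descriptor number through a global variable. A file opened by the child is directly usable by the parent, because the two tasks share a single descriptor table:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static int child_fd = -1;

static int child_func(void* arg) {
  // With CLONE_FILES, this descriptor lands in the table shared with the parent.
  child_fd = open("/dev/null", O_WRONLY);
  return 0;
}

int main(int argc, char** argv) {
  const int STACK_SIZE = 65536;
  char* stack = malloc(STACK_SIZE);
  if (!stack) {
    perror("malloc");
    exit(1);
  }

  if (clone(child_func, stack + STACK_SIZE, CLONE_VM | CLONE_FILES | SIGCHLD,
            NULL) == -1) {
    perror("clone");
    exit(1);
  }
  if (wait(NULL) == -1) {
    perror("wait");
    exit(1);
  }

  // The descriptor opened by the child is valid here too.
  if (write(child_fd, "x", 1) == 1) {
    printf("parent wrote through fd %d opened by child\n", child_fd);
  }
  return 0;
}

Dropping CLONE_FILES (but keeping CLONE_VM) would make the parent's write fail with EBADF, since the descriptor would then exist only in the child's own table.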

Other flags are there to implement the semantics required by POSIX threads. For example, CLONE_THREAD asks the kernel to assign the same thread group id to the child as to the parent, in order to comply with POSIX's requirement of all threads in a process sharing a single process ID [2].
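This is easy to observe with plain pthreads. In the following sketch of mine (compile with -pthread), every thread reports the same PID but a distinct TID; gettid is invoked through syscall because glibc only gained a wrapper for it relatively recently:

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

// Each task reports its process ID (shared, thanks to CLONE_THREAD) and its
// kernel task ID (unique per task).
static void* report(void* arg) {
  printf("thread: pid=%d tid=%ld\n", getpid(), (long)syscall(SYS_gettid));
  return NULL;
}

int main(void) {
  printf("main:   pid=%d tid=%ld\n", getpid(), (long)syscall(SYS_gettid));
  pthread_t t1, t2;
  pthread_create(&t1, NULL, report, NULL);
  pthread_create(&t2, NULL, report, NULL);
  pthread_join(t1, NULL);
  pthread_join(t2, NULL);
  return 0;
}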

Calling clone in process and thread creation

Let's dig through some code in glibc to see how clone is invoked, starting with fork, which is routed to __libc_fork in sysdeps/nptl/fork.c. The actual implementation is specific to the threading library, hence the location in the nptl folder. The first thing __libc_fork does is invoke the fork handlers potentially registered beforehand with pthread_atfork.

The actual cloning happens with:

pid = ARCH_FORK ();

Where ARCH_FORK is a macro defined per architecture (exact syscall ABIs are architecture-specific). For x86_64 it maps to:

#define ARCH_FORK() \
  INLINE_SYSCALL (clone, 4,                                                   \
                  CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID | SIGCHLD, 0,     \
                  NULL, &THREAD_SELF->tid)

The CLONE_CHILD_* flags are useful for some threading libraries (though not for NPTL, the default threading library on Linux today). Otherwise, the invocation is very similar to the clone code sample shown in the previous section.

You may wonder where the function pointer is in this call. Nice catch! This is the raw-call version of clone, where execution continues from the point of the call in both parent and child - close to the usual semantics of fork.

Now let's turn to pthread_create. Through a dizzying chain of macros it reaches a function named create_thread (defined in sysdeps/unix/sysv/linux/createthread.c) that calls clone with:

const int clone_flags = (CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SYSVSEM
                       | CLONE_SIGHAND | CLONE_THREAD
                       | CLONE_SETTLS | CLONE_PARENT_SETTID
                       | CLONE_CHILD_CLEARTID
                       | 0);

ARCH_CLONE (&start_thread, STACK_VARIABLES_ARGS,
            clone_flags, pd, &pd->tid, tp, &pd->tid)

Browse through man 2 clone to understand the flags passed into the call. Briefly, it is asked to share the virtual memory, file system information, open files, System V semaphore state and signal handlers with the parent thread/process. Additional flags are passed to implement proper identification - all threads launched from a single process have to share its process ID to be POSIX compliant.

Reading the glibc source code is quite an exercise in mental resilience, but it's really interesting to see how everything fits together "in the real world".

Benchmarking process vs. thread creation

Given the information presented earlier in the post, I would expect process creation to be somewhat more expensive than thread creation, but not dramatically so. Since fork and pthread_create route to the same system call in Linux, the difference would come from the different flags they pass in. When pthread_create passes all these CLONE_* flags, it tells the kernel there's no need to copy the virtual memory image, the open files, the signal handlers, and so on. Obviously, this saves time.

For processes, there's a bit of copying to be done when fork is invoked, which costs time. The biggest chunk of time probably goes to copying the memory image due to the lack of CLONE_VM. Note, however, that the whole memory image is not simply copied; Linux uses an important optimization here - COW (Copy On Write) pages. The child's memory pages are initially mapped to the same pages shared by the parent, and the copy only happens when they are modified. This is very important because processes will often use a lot of shared read-only memory (think of the global structures used by the standard library, for example).

That said, the page tables still have to be copied. The size of a process's page tables can be observed by looking in /proc/<pid>/status - the VmPTE indicator. These can be around tens of kilobytes for small processes, and higher for larger processes. Not a lot of data to copy, but definitely some extra work for the CPU.
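For example, here's a quick sketch that prints its own page-table footprint:

#include <stdio.h>
#include <string.h>

int main(int argc, char** argv) {
  FILE* f = fopen("/proc/self/status", "r");
  if (!f) {
    perror("fopen");
    return 1;
  }
  char line[256];
  while (fgets(line, sizeof(line), f)) {
    // The VmPTE line reports the memory consumed by this process's page tables.
    if (strncmp(line, "VmPTE:", 6) == 0) {
      fputs(line, stdout);
      break;
    }
  }
  fclose(f);
  return 0;
}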

I wrote a benchmark that times process and thread launches as a function of the virtual memory allocated before fork or pthread_create. Launch times are averaged over 10,000 instances to remove warm-up effects and jitter:
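The benchmark itself isn't reproduced here, but its core is simple. Here's a rough sketch of my own for the fork side (the thread side would time pthread_create/pthread_join similarly); note that it times fork plus waiting for a trivially exiting child:

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static unsigned long long time_ns(void) {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (unsigned long long)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

int main(int argc, char** argv) {
  const int N = 10000;
  unsigned long long t0 = time_ns();
  for (int i = 0; i < N; ++i) {
    pid_t pid = fork();
    if (pid < 0) {
      perror("fork");
      exit(1);
    } else if (pid == 0) {
      _exit(0);  // The child exits immediately; we only care about launch time.
    }
    waitpid(pid, NULL, 0);
  }
  printf("average fork + wait: %llu ns\n", (time_ns() - t0) / N);
  return 0;
}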

Launch time for fork/thread as function of memory image

Several things to note:

  1. Indeed, launching processes is slower than launching threads: 35 vs. 5 microseconds for a 2-MB heap. But it's still very fast! 35 microseconds is not a lot of time at all. If your latency budget is willing to tolerate a 5 us overhead, it will almost certainly be fine with a 35 us overhead, unless you're working on some super-tight hard realtime system (in which case you shouldn't be using Linux!)
  2. As expected, the time to launch a process grows when the heap is larger; the time delta is the time needed to copy the extra page table entries. For threads, on the other hand, there is absolutely no difference, since the memory is completely shared.

Interestingly, it's easy to observe from these numbers that the whole memory image is not being copied. On the same machine this benchmark was run on, just a simple memcpy of 2 MB takes over 60 us, so it couldn't have copied 2 MB of heap to the child within the 30 us difference. Copying 64K (a reasonable size for a page table) takes 3 us, which makes sense because the cloning involves more than a simple memcpy. To me this is another sign of how fast these launches are, since we're in the same ballpark of performance as modestly sized memory copies.
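For reference, that kind of ballpark figure comes from a quick measurement like the following sketch (numbers will obviously vary between machines):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(int argc, char** argv) {
  const size_t SIZE = 2 * 1024 * 1024;
  char* src = malloc(SIZE);
  char* dst = malloc(SIZE);
  if (!src || !dst) {
    perror("malloc");
    return 1;
  }
  memset(src, 'x', SIZE);  // Touch the pages so they're actually mapped.

  const int N = 1000;
  struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (int i = 0; i < N; ++i) {
    memcpy(dst, src, SIZE);
  }
  clock_gettime(CLOCK_MONOTONIC, &t1);

  double total_ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
  printf("average memcpy of %zu bytes: %.0f ns\n", SIZE, total_ns / N);
  free(src);
  free(dst);
  return 0;
}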

Creation time is not the only performance benchmark of importance. It's also interesting to measure how long it takes to switch context between tasks when using threads or processes. I plan to cover this in a future post.


[1] It may be just me, but I find this terminology a bit confusing. In my mind the word clone is synonymous with copy, so when we turn on a flag named "clone the VM" I'd expect the VM to be copied rather than shared. IMHO it would be clearer if this flag was named SHARE_VM.
[2] It's certainly interesting to see this evolution of concepts over time. Thread APIs were defined in times when there was a real difference between processes and threads, and their design reflects that. In modern Linux the kernel has to bend over backwards to provide the illusion of the difference although very little of it exists.

Basics of Futexes



The futex (short for "Fast userspace mutex") mechanism was proposed by Linux contributors from IBM in 2002 [1]; it was integrated into the kernel in late 2003. The main idea is to enable a more efficient way for userspace code to synchronize multiple threads, with minimal kernel involvement.

In this post I want to provide a basic overview of futexes, how they work, and how they're used to implement the more familiar synchronization primitives in higher-level APIs and languages.

An important disclaimer: futexes are a very low-level feature of the Linux kernel, suitable for use in foundational runtime components like the C/C++ standard libraries. It is extremely unlikely that you will ever need to use them in application code.

Motivation

Before the introduction of futexes, system calls were required for locking and unlocking shared resources (for example semop). System calls are relatively expensive, however, requiring a context switch from userspace to kernel space; as programs became increasingly concurrent, locks started showing up on profiles as a significant percentage of the run time. This is very unfortunate, given that locks accomplish no real work ("business logic") but are only there to guarantee that access to shared resources is safe.

The futex proposal is based on a clever observation: in most cases, locks are actually not contended. If a thread comes upon a free lock, locking it can be cheap because most likely no other thread is trying to lock it at the exact same time. So we can get by without a system call, attempting much cheaper atomic operations first [2]. There's a very high chance that the atomic instruction will succeed.
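With C11 atomics, the fast path could look something like this sketch (a hypothetical helper of mine, not taken from any library):

#include <stdatomic.h>
#include <stdbool.h>

// Fast path of a futex-based lock: an uncontended lock is acquired with a
// single atomic compare-and-swap; no system call needed.
bool try_lock(atomic_int* lock) {
  int expected = 0;
  // Atomically: if *lock == 0, store 1 and return true; otherwise return false.
  return atomic_compare_exchange_strong(lock, &expected, 1);
}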

However, in the unlikely event that another thread did try to take the lock at the same time, the atomic approach may fail. In this case there are two options. We can busy-loop on the atomic until the lock is cleared; while this stays 100% in userspace, it can also be extremely wasteful: spinning hogs a core, and the lock can be held for a long time. The alternative is to "sleep" until the lock is free (or at least there's a high chance that it's free); we need the kernel to help with that, and this is where futexes come in.

Simple futex use - waiting and waking

The futex(2) system call multiplexes a lot of functionality on top of a single interface. I will not discuss any of the advanced options here (some of them are so esoteric they're not even officially documented) but will focus on just FUTEX_WAIT and FUTEX_WAKE. The man page description starts with a good introduction:

The futex() system call provides a method for waiting until a certain condition becomes true. It is typically used as a blocking construct in the context of shared-memory synchronization. When using futexes, the majority of the synchronization operations are performed in user space. A user-space program employs the futex() system call only when it is likely that the program has to block for a longer time until the condition becomes true. Other futex() operations can be used to wake any processes or threads waiting for a particular condition.

Simply stated, a futex is a kernel construct that helps userspace code synchronize on shared events. Some userspace processes (or threads) can wait on an event (FUTEX_WAIT), while another userspace process can signal the event (FUTEX_WAKE) to notify waiters. The waiting is efficient - the waiters are suspended by the kernel and are only scheduled anew when there's a wake-up signal.

Be sure to read the futex man page beyond the introduction; blog posts are not a substitute for documentation! At the very least read about the FUTEX_WAIT and FUTEX_WAKE calls, the arguments they take, their return values and possible errors.

Let's study a simple example demonstrating basic usage of futexes to coordinate two processes. The main function sets up the machinery and launches a child process that:

  1. Waits for 0xA to be written into a shared memory slot.
  2. Writes 0xB into the same memory slot.

Meanwhile, the parent:

  1. Writes 0xA into the shared memory slot.
  2. Waits for 0xB to be written into the slot.

This is a simple handshake between two processes. Here's the code:

#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

// The futex helpers are defined further down in the post.
void wait_on_futex_value(int* futex_addr, int val);
void wake_futex_blocking(int* futex_addr);

int main(int argc, char** argv) {
  int shm_id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0666);
  if (shm_id < 0) {
    perror("shmget");
    exit(1);
  }
  int* shared_data = shmat(shm_id, NULL, 0);
  *shared_data = 0;

  int forkstatus = fork();
  if (forkstatus < 0) {
    perror("fork");
    exit(1);
  }

  if (forkstatus == 0) {
    // Child process

    printf("child waiting for A\n");
    wait_on_futex_value(shared_data, 0xA);

    printf("child writing B\n");
    // Write 0xB to the shared data and wake up parent.
    *shared_data = 0xB;
    wake_futex_blocking(shared_data);
  } else {
    // Parent process.

    printf("parent writing A\n");
    // Write 0xA to the shared data and wake up child.
    *shared_data = 0xA;
    wake_futex_blocking(shared_data);

    printf("parent waiting for B\n");
    wait_on_futex_value(shared_data, 0xB);

    // Wait for the child to terminate.
    wait(NULL);
    shmdt(shared_data);
  }

  return 0;
}

Note that we use System V shared memory APIs (shmget and shmat) to create a memory location mapped into both processes. We can't just use a regular pointer here, because the two processes have separate address spaces and ordinary memory isn't shared between them after fork [3].

Note that this is not a canonical usage of futex, which would be better employed to wait until a value changes from something rather than to something. It's just here to show the various possibilities in return values from futex. Later in the post a more canonical usage is demonstrated when we implement a mutex.
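One housekeeping note about these samples: glibc provides no futex() wrapper function, so they presumably rely on a small helper around the raw system call, along the lines of this sketch:

#include <linux/futex.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

// Minimal wrapper for the raw futex system call; glibc has no futex()
// function of its own. The argument order mirrors man 2 futex.
int futex(int* uaddr, int futex_op, int val,
          const struct timespec* timeout, int* uaddr2, int val3) {
  return syscall(SYS_futex, uaddr, futex_op, val, timeout, uaddr2, val3);
}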

Here is wait_on_futex_value:

void wait_on_futex_value(int* futex_addr, int val) {
  while (1) {
    int futex_rc = futex(futex_addr, FUTEX_WAIT, val, NULL, NULL, 0);
    if (futex_rc == -1) {
      if (errno != EAGAIN) {
        perror("futex");
        exit(1);
      }
    } else if (futex_rc == 0) {
      if (*futex_addr == val) {
        // This is a real wakeup.
        return;
      }
    } else {
      abort();
    }
  }
}

This function's main added value on top of the futex system call is looping around when the wakeup is spurious. This can happen when val is not the expected value (yet) and also when another process was woken up before this one (can't really happen in this code sample, but is a real possibility in other scenarios).

Futex semantics are tricky [4]! FUTEX_WAIT will immediately return if the value at the futex address is not equal to val. In our case this can happen if the child issued a wait before the parent wrote 0xA, for example. The futex call will return an error with EAGAIN in this case.

Here is wake_futex_blocking:

void wake_futex_blocking(int* futex_addr) {
  while (1) {
    int futex_rc = futex(futex_addr, FUTEX_WAKE, 1, NULL, NULL, 0);
    if (futex_rc == -1) {
      perror("futex wake");
      exit(1);
    } else if (futex_rc > 0) {
      return;
    }
  }
}

It's a blocking wrapper around FUTEX_WAKE, which will normally return quickly regardless of how many waiters it has woken up. In our sample, this waiting is part of the handshake, but in many cases you won't see it.

Futexes are kernel queues for userspace code

Simply stated, a futex is a queue the kernel manages for userspace convenience. It lets userspace code ask the kernel to suspend until a certain condition is satisfied, and lets other userspace code signal that condition and wake up waiting processes. Earlier we mentioned busy-looping as one approach to wait on the success of atomic operations; a kernel-managed queue is the much more efficient alternative, absolving userspace code from the need to burn billions of CPU cycles on pointless spinning.

Here's a diagram from LWN's "A futex overview and update":

Futex implementation diagram from LWN

In the Linux kernel, futexes are implemented in kernel/futex.c. The kernel keeps a hash table keyed by the address to quickly find the proper queue data structure and adds the calling process to the wait queue. There's quite a bit of complication, of course, due to using fine-grained locking within the kernel itself and the various advanced options of futexes.

Timed blocking with FUTEX_WAIT

The futex system call has a timeout parameter which lets user code implement waiting with a time-out.

The futex-wait-timeout sample shows this in action. Here is the relevant part of the child process which waits on a futex:

printf("child waiting for A\n");
struct timespec timeout = {.tv_sec = 0, .tv_nsec = 500000000};
while (1) {
  unsigned long long t1 = time_ns();
  int futex_rc = futex(shared_data, FUTEX_WAIT, 0xA, &timeout, NULL, 0);
  printf("child woken up rc=%d errno=%s, elapsed=%llu\n", futex_rc,
         futex_rc ? strerror(errno) : "", time_ns() - t1);
  if (futex_rc == 0 && *shared_data == 0xA) {
    break;
  }
}

If the wait takes longer than 500 ms, the process will loop and wait again. The sample lets you configure the length of time the parent process keeps the child waiting and observe the effects.

Using a futex to implement a simple mutex

In the motivation section that started this post, I explained how futexes help implement efficient locking in the common low-contention case. It's time to show a realistic implementation of a mutex using futexes and atomics. This is based on the second implementation in Ulrich Drepper's "Futexes are Tricky" paper.

For this sample I'm switching to C++, to use its standardized atomics (available since C++11). The full code is here; here is the important part:

class Mutex {
public:
  Mutex() : atom_(0) {}

  void lock() {
    int c = cmpxchg(&atom_, 0, 1);
    // If the lock was previously unlocked, there's nothing else for us to do.
    // Otherwise, we'll probably have to wait.
    if (c != 0) {
      do {
        // If the mutex is locked, we signal that we're waiting by setting the
        // atom to 2. A shortcut checks if it's already 2 and avoids the atomic
        // operation in this case.
        if (c == 2 || cmpxchg(&atom_, 1, 2) != 0) {
          // Here we have to actually sleep, because the mutex is actually
          // locked. Note that it's not necessary to loop around this syscall;
          // a spurious wakeup will do no harm since we only exit the do...while
          // loop when atom_ is indeed 0.
          syscall(SYS_futex, (int*)&atom_, FUTEX_WAIT, 2, 0, 0, 0);
        }
        // We're here when either:
        // (a) the mutex was in fact unlocked (by an intervening thread).
        // (b) we slept waiting for the atom and were awoken.
        //
        // So we try to lock the atom again. We set the state to 2 because we
        // can't be certain there's no other thread at this exact point. So we
        // prefer to err on the safe side.
      } while ((c = cmpxchg(&atom_, 0, 2)) != 0);
    }
  }

  void unlock() {
    if (atom_.fetch_sub(1) != 1) {
      atom_.store(0);
      syscall(SYS_futex, (int*)&atom_, FUTEX_WAKE, 1, 0, 0, 0);
    }
  }

private:
  // 0 means unlocked
  // 1 means locked, no waiters
  // 2 means locked, there are waiters in lock()
  std::atomic<int> atom_;
};

Where cmpxchg is a simple wrapper that adapts C++'s atomic primitive to the interface expected by the paper:

// An atomic_compare_exchange wrapper with semantics expected by the paper's
// mutex - return the old value stored in the atom.
int cmpxchg(std::atomic<int>* atom, int expected, int desired) {
  int* ep = &expected;
  std::atomic_compare_exchange_strong(atom, ep, desired);
  return *ep;
}

The code snippet is heavily commented to explain how it works; reading Drepper's paper is recommended in any case, as it builds up to this implementation by first examining a simpler one which is subtly incorrect. One slightly non-kosher thing this code does is access the internal representation of std::atomic by casting the address of atom_ to int* when passing it to the futex syscall. This is because futex expects a simple address, while C++ atomics wrap their actual data in opaque types. This works on Linux on x64, but isn't generally portable. To make std::atomic play well with futex portably, we'd have to add a portability layer. But it's not a need that comes up in practice - mixing futex with C++11 is not something anyone should do; these snippets are for demonstration only!

It's interesting to dwell on the meaning of the value sitting in the atom_ member. Recall that the futex syscall doesn't assign any meaning to the value - it's up to the user to do that. The 0/1/2 convention is useful for mutexes, and is also the one used by the glibc implementation for its low-level locks.

glibc mutex and low-level lock

This brings us to the glibc implementation of POSIX threads, which provides the pthread_mutex_t type. As I've mentioned in the beginning of the post, futexes are not really for regular user code; rather, they are used by low-level runtimes and libraries to implement other, higher-level primitives. In this context, it's interesting to see how a mutex is implemented for NPTL. In the glibc source tree, this code is in nptl/pthread_mutex_lock.c.

The code is significantly complicated by all the different types of mutexes it has to support, but we can discover some familiar building blocks if we dig deep enough. In addition to the file mentioned above, other files to look at (for x86) are sysdeps/unix/sysv/linux/x86_64/lowlevellock.h and nptl/lowlevellock.c. The code is dense, but the combination of atomic compare-and-exchange operations and futex invocations is apparent. The low-level lock machinery (lll_ or LLL_ prefixes) is used throughout the glibc code-base, not just in the implementation of POSIX threads.

The beginning of the comment at the top of sysdeps/nptl/lowlevellock.h should be familiar by now:

/* Low-level locks use a combination of atomic operations (to acquire and
   release lock ownership) and futex operations (to block until the state
   of a lock changes).  A lock can be in one of three states:
   0:  not acquired,
   1:  acquired with no waiters; no other threads are blocked or about to block
       for changes to the lock state,
   >1: acquired, possibly with waiters; there may be other threads blocked or
       about to block for changes to the lock state.

   We expect that the common case is an uncontended lock, so we just need
   to transition the lock between states 0 and 1; releasing the lock does
   not need to wake any other blocked threads.  If the lock is contended
   and a thread decides to block using a futex operation, then this thread
   needs to first change the state to >1; if this state is observed during
   lock release, the releasing thread will wake one of the potentially
   blocked threads.
 ..
 */

Futexes in the Go runtime

The Go runtime does not use libc, in most cases. Therefore, it cannot rely on the POSIX thread implementation in its own code. It invokes the underlying OS's system calls directly instead.

That makes it a good alternative candidate to study for its use of futexes. Since it can't just use a pthread_mutex_t for its locking, it has to roll its own lock. Let's see how this is done, by starting with the user-visible sync.Mutex type (in src/sync/mutex.go).

The Lock method of sync.Mutex is quite involved, as you might imagine. It first tries to use an atomic swap to quickly acquire a lock. If it turns out it has to wait, it defers to runtime_SemacquireMutex, which in turn calls runtime.lock. That function is defined in src/runtime/lock_futex.go [5], and defines some constants that will appear familiar:

const (
  mutex_unlocked = 0
  mutex_locked   = 1
  mutex_sleeping = 2

...
)

// Possible lock states are mutex_unlocked, mutex_locked and mutex_sleeping.
// mutex_sleeping means that there is presumably at least one sleeping thread.

runtime.lock also tries to speculatively grab a lock with an atomic; this function is used in a bunch of places in the Go runtime, so that makes sense, but I wonder if they couldn't have somehow optimized away the two consecutive atomics that occur when it's called from Mutex.Lock.

If it discovers it has to sleep, it defers to futexsleep, which is OS-specific and lives in src/runtime/os_linux.go. This function invokes the futex system call directly with FUTEX_WAIT_PRIVATE (the private variant is sufficient when all waiters belong to a single process, which is the case for the Go runtime).


[1]See "Fuss, Futexes and Furwocks: Fast Userlevel Locking in Linux" by Franke, Russell, Kirkwood. Published in 2002 for the Ottawa Linux Symposium.
[2]Most modern processors have built-in atomic instructions implemented in HW. For example on Intel architectures cmpxhg is an instruction. While it's not as cheap as non-atomic instructions (especially in multi-core systems), it's significantly cheaper than system calls.
[3]The code repository for this post also contains an equivalent sample using threads instead of processes. There we don't need to use shared memory but can instead use the address of a stack variable.
[4]There's a paper written by Ulrich Drepper named "Futexes are Tricky" that explores some of the nuances. I'll be using it later on for the mutex discussion. It's a very good paper - please read it if you're interested in the topic.
[5]For OSes that expose the futex(2) system call. The Go runtime has a fallback onto the semaphore system calls if futex is not supported.