Basics of Futexes

The futex (short for "Fast userspace mutex") mechanism was proposed by Linux contributors from IBM in 2002 [1]; it was integrated into the kernel in late 2003. The main idea is to enable a more efficient way for userspace code to synchronize multiple threads, with minimal kernel involvement.

In this post I want to provide a basic overview of futexes, how they work, and how they're used to implement the more familiar synchronization primitives in higher-level APIs and languages.

An important disclaimer: futexes are a very low-level feature of the Linux kernel, suitable for use in foundational runtime components like the C/C++ standard libraries. It is extremely unlikely that you will ever need to use them in application code.


Before the introduction of futexes, system calls were required for locking and unlocking shared resources (for example semop). System calls are relatively expensive, however, requiring a context switch from userspace to kernel space; as programs became increasingly concurrent, locks started showing up on profiles as a significant percentage of the run time. This is very unfortunate, given that locks accomplish no real work ("business logic") but are only there to guarantee that access to shared resources is safe.

The futex proposal is based on a clever observation: in most cases, locks are actually not contended. If a thread comes upon a free lock, locking it can be cheap because most likely no other thread is trying to lock it at the exact same time. So we can get by without a system call, attemping much cheaper atomic operations first [2]. There's a very high chance that the atomic instruction will succeed.

However, in the unlikely event that another thread did try to take the lock at the same time, the atomic approach may fail. In this case there are two options. We can busy-loop using the atomic until the lock is cleared; while this is 100% userspace, it can also be extremely wasteful since looping can significantly occupy a core, and the lock can be held for a long time. The alternative is to "sleep" until the lock is free (or at least there's a high chance that it's free); we need the kernel to help with that, and this is where futexes come in.

Simple futex use - waiting and waking

The futex(2) system call multiplexes a lot of functionality on top of a single interface. I will not discuss any of the advanced options here (some of them are so esoteric they're not even officially documented) but will focus on just FUTEX_WAIT and FUTEX_WAKE. The man page description starts with a good introduction:

The futex() system call provides a method for waiting until a certain condition becomes true. It is typically used as a blocking construct in the context of shared-memory synchronization. When using futexes, the majority of the synchronization operations are performed in user space. A user-space program employs the futex() system call only when it is likely that the program has to block for a longer time until the condition becomes true. Other futex() operations can be used to wake any processes or threads waiting for a particular condition.

Simply stated, a futex is a kernel construct that helps userspace code synchronize on shared events. Some userspace processes (or threads) can wait on an event (FUTEX_WAIT), while another userspace process can signal the event (FUTEX_WAKE) to notify waiters. The waiting is efficient - the waiters are suspended by the kernel and are only scheduled anew when there's a wake-up signal.

Be sure to read the futex man page beyond the introduction; blog posts are not a substitute for documentation! At the very least read about the FUTEX_WAIT and FUTEX_WAKE calls, the arguments they take, their return values and possible errors.

Let's study a simple example demonstrating basic usage of futexes to coordinate two processes. The main function sets up the machinery and launches a child process that:

  1. Waits for 0xA to be written into a shared memory slot.
  2. Writes 0xB into the same memory slot.

Meanwhile, the parent:

  1. Writes 0xA into the shared memory slot.
  2. Waits for 0xB to be written into the slot.

This is a simple handshake between two processes. Here's the code:

int main(int argc, char** argv) {
  int shm_id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0666);
  if (shm_id < 0) {
  int* shared_data = shmat(shm_id, NULL, 0);
  *shared_data = 0;

  int forkstatus = fork();
  if (forkstatus < 0) {

  if (forkstatus == 0) {
    // Child process

    printf("child waiting for A\n");
    wait_on_futex_value(shared_data, 0xA);

    printf("child writing B\n");
    // Write 0xB to the shared data and wake up parent.
    *shared_data = 0xB;
  } else {
    // Parent process.

    printf("parent writing A\n");
    // Write 0xA to the shared data and wake up child.
    *shared_data = 0xA;

    printf("parent waiting for B\n");
    wait_on_futex_value(shared_data, 0xB);

    // Wait for the child to terminate.

  return 0;

Note that we use POSIX shared memory APIs to create a memory location mapped into both processes. We can't just use a regular pointer here, because the address spaces of the two processes will be different [3].

Note that this is not a canonical usage of futex, which would be better employed to wait until a value changes from something rather than to something. It's just here to show the various possibilities in return values from futex. Later in the post a more canonical usage is demonstrated when we implement a mutex.

Here is wait_on_futex_value:

void wait_on_futex_value(int* futex_addr, int val) {
  while (1) {
    int futex_rc = futex(futex_addr, FUTEX_WAIT, val, NULL, NULL, 0);
    if (futex_rc == -1) {
      if (errno != EAGAIN) {
    } else if (futex_rc == 0) {
      if (*futex_addr == val) {
        // This is a real wakeup.
    } else {

This function's main added value on top of the futex system call is looping around when the wakeup is spurious. This can happen when val is not the expected value (yet) and also when another process was woken up before this one (can't really happen in this code sample, but is a real possibility in other scenarios).

Futex semantics are tricky [4]! FUTEX_WAIT will immediately return if the value at the futex address is not equal to val. In our case this can happen if the child issued a wait before the parent wrote 0xA, for example. The futex call will return an error with EAGAIN in this case.

Here is wake_futex_blocking:

void wake_futex_blocking(int* futex_addr) {
  while (1) {
    int futex_rc = futex(futex_addr, FUTEX_WAKE, 1, NULL, NULL, 0);
    if (futex_rc == -1) {
      perror("futex wake");
    } else if (futex_rc > 0) {

It's a blocking wrapper around FUTEX_WAKE, which will normally return quickly regardless of how many waiters it has woken up. In our sample, this waiting is part of the handshake, but in many cases you won't see it.

Futexes are kernel queues for userspace code

Simply stated, a futex is a queue the kernel manages for userspace convenience. It lets usespace code ask the kernel to suspend until a certain condition is satisfied, and lets other userspace code signal that condition and wake up waiting processes. Earlier we've menioned busy-looping as one approach to wait on success of atomic operations; a kernel-managed queue is the much more efficient alternative, absolving userspace code from the need to burn billions of CPU cycles on pointless spinning.

Here's a diagram from LWN's "A futex overview and update":

Futex implementation diagram from LWN

In the Linux kernel, futexes are implemented in kernel/futex.c. The kernel keeps a hash table keyed by the address to quickly find the proper queue data structure and adds the calling process to the wait queue. There's quite a bit of complication, of course, due to using fine-grained locking within the kernel itself and the various advanced options of futexes.

Timed blocking with FUTEX_WAIT

The futex system call has a timeout parameter which lets user code implement waiting with a time-out.

The futex-wait-timeout sample shows this in action. Here is the relevant part of the child process which waits on a futex:

printf("child waiting for A\n");
struct timespec timeout = {.tv_sec = 0, .tv_nsec = 500000000};
while (1) {
  unsigned long long t1 = time_ns();
  int futex_rc = futex(shared_data, FUTEX_WAIT, 0xA, &timeout, NULL, 0);
  printf("child woken up rc=%d errno=%s, elapsed=%llu\n", futex_rc,
         futex_rc ? strerror(errno) : "", time_ns() - t1);
  if (futex_rc == 0 && *shared_data == 0xA) {

If the wait takes longer than 500 ms, the process will loop and wait again. The sample lets you configure the length of time the parent process keeps the child waiting and observe the effects.

Using a futex to implement a simple mutex

In the motivation section that started this post, I explained how futexes help implement efficient locking in the common low-contention case. It's time to show a realistic implementation of a mutex using futexes and atomics. This is based on the second implementation in Ulrich Drepper's "Futexes are Tricky" paper.

For this sample I'm switching to C++, to use its standardized atomics (available since C++11). The full code is here; here is the important part:

class Mutex {
  Mutex() : atom_(0) {}

  void lock() {
    int c = cmpxchg(&atom_, 0, 1);
    // If the lock was previously unlocked, there's nothing else for us to do.
    // Otherwise, we'll probably have to wait.
    if (c != 0) {
      do {
        // If the mutex is locked, we signal that we're waiting by setting the
        // atom to 2. A shortcut checks is it's 2 already and avoids the atomic
        // operation in this case.
        if (c == 2 || cmpxchg(&atom_, 1, 2) != 0) {
          // Here we have to actually sleep, because the mutex is actually
          // locked. Note that it's not necessary to loop around this syscall;
          // a spurious wakeup will do no harm since we only exit the do...while
          // loop when atom_ is indeed 0.
          syscall(SYS_futex, (int*)&atom_, FUTEX_WAIT, 2, 0, 0, 0);
        // We're here when either:
        // (a) the mutex was in fact unlocked (by an intervening thread).
        // (b) we slept waiting for the atom and were awoken.
        // So we try to lock the atom again. We set teh state to 2 because we
        // can't be certain there's no other thread at this exact point. So we
        // prefer to err on the safe side.
      } while ((c = cmpxchg(&atom_, 0, 2)) != 0);

  void unlock() {
    if (atom_.fetch_sub(1) != 1) {;
      syscall(SYS_futex, (int*)&atom_, FUTEX_WAKE, 1, 0, 0, 0);

  // 0 means unlocked
  // 1 means locked, no waiters
  // 2 means locked, there are waiters in lock()
  std::atomic<int> atom_;

Where cmpxhg is a simple wrapper to subdue C++'s atomic primitive to the expected interface:

// An atomic_compare_exchange wrapper with semantics expected by the paper's
// mutex - return the old value stored in the atom.
int cmpxchg(std::atomic<int>* atom, int expected, int desired) {
  int* ep = &expected;
  std::atomic_compare_exchange_strong(atom, ep, desired);
  return *ep;

The code snippet is heavily commented to explain how it works; reading Drepper's paper is recommended in any case, as it builds up to this implementation by first examining a simpler one which is subtly incorrect. One slightly non-kosher thing this code does is access the internal representation of std::atomic by casting the address of atom_ to int* when passing it to the futex syscall. This is because futex expects a simple address, while C++ atomics wrap their actual data in opaque types. This works on Linux on x64, but isn't generally portable. To make std::atomic play well with futex in a portable we'd have to add a portability layer. But it's not a need that comes up in practice - mixing futex with C++11 is not something anyone should do - these snippets are just demonstrational!

An interesting observation is about the meaning of the value sitting in the atom_ member. Recall that the futex syscall doesn't assign any meaning to the value - it's up to the user to do that. The 0,1,2 convention is useful for mutexes, and also the one used by the glibc implementation for low-level locks.

glibc mutex and low-level lock

This brings us to the glibc implementation of POSIX threads, which have the pthread_mutex_t type. As I've mentioned in the beginning of the post, futexes are not really for regular user code; rather, they are used by low-level runtimes and libraries to implement other, higher-level primitives. In this context, it's interesting to see how a mutex is implemented for NPTL. In the glibc source tree, this code is in nptl/pthread_mutex_lock.c

The code is significantly complicated by all the different types of mutexes it has to support, but we can discover some familiar building blocks if we dig deep enough. In addition to the file mentioned above, other files to look at (for x86) are sysdeps/unix/sysv/linux/x86_64/lowlevellock.h and nptl/lowlevellock.c. The code is dense, but the combination of atomic compare-and-exchange operations and futex invocations is apparent. The low-level lock machinery (lll_ or LLL_ prefixes) is used throughout the glibc code-base, not just in the implementation of POSIX threads.

The beginning of the comment at the top of systeds/nptl/lowlevellock.h should be familiar by now:

/* Low-level locks use a combination of atomic operations (to acquire and
   release lock ownership) and futex operations (to block until the state
   of a lock changes).  A lock can be in one of three states:
   0:  not acquired,
   1:  acquired with no waiters; no other threads are blocked or about to block
       for changes to the lock state,
   >1: acquired, possibly with waiters; there may be other threads blocked or
       about to block for changes to the lock state.

   We expect that the common case is an uncontended lock, so we just need
   to transition the lock between states 0 and 1; releasing the lock does
   not need to wake any other blocked threads.  If the lock is contended
   and a thread decides to block using a futex operation, then this thread
   needs to first change the state to >1; if this state is observed during
   lock release, the releasing thread will wake one of the potentially
   blocked threads.

Futexes in the Go runtime

The Go runtime does not use libc, in most cases. Therefore, it cannot rely on the POSIX thread implementation in its own code. It invokes the underlying OS's system calls directly instead.

That makes it a good alternative candidate to study for its use of futexes. Since it can't just use a pthread_mutex_t for its locking, it has to roll its own lock. Let's see how this is done, by starting with the user-visible sync.Mutex type (in src/sync/mutex.go).

The Lock method of sync.Mutex is quite involved, as you might imagine. It first tries to use an atomic swap to quickly acquire a lock. If it turns out it has to wait, it defers to runtime_SemacquireMutex, which in turn calls runtime.lock. That function is defined in src/runtime/lock_futex.go [5], and defines some constants that will appear familiar:

const (
  mutex_unlocked = 0
  mutex_locked   = 1
  mutex_sleeping = 2


// Possible lock states are mutex_unlocked, mutex_locked and mutex_sleeping.
// mutex_sleeping means that there is presumably at least one sleeping thread.

runtime.lock also tries to speculatively grab a lock with an atomic; this function is used in a bunch of places in the Go runtime, so that makes sense, but I wonder if they couldn't have optimized the two consecutive atomics that occur when it's called by Mutex.lock, somehow.

If it discovers it has to sleep, it defers to futexsleep, which is OS-specific and lives in src/runtime/os_linux.go. This function calls invokes the futex system call directly with FUTEX_WAIT_PRIVATE (recall that this is sufficient for a single process, which the Go runtime fulfills).

[1]See "Fuss, Futexes and Furwocks: Fast Userlevel Locking in Linux" by Franke, Russell, Kirkwood. Published in 2002 for the Ottawa Linux Symposium.
[2]Most modern processors have built-in atomic instructions implemented in HW. For example on Intel architectures cmpxhg is an instruction. While it's not as cheap as non-atomic instructions (especially in multi-core systems), it's significantly cheaper than system calls.
[3]The code repository for this post also contains an equivalent sample using threads instead of processes. There we don't need to use shared memory but can instead use the address of a stack variable.
[4]There's a paper written by Ulrich Drepper named "Futexes are Tricky" that explores some of the nuances. I'll be using it later on for the mutex discussion. It's a very good paper - please read it if you're interested in the topic.
[5]For OSes that expose the futex(2) system call. The Go runtime has a fallback onto the semaphore system calls if futex is not supported.

Elegant Python code for a Markov chain text generator

While preparing the post on minimal char-based RNNs, I coded a simple Markov chain text generator to serve as a comparison for the quality of the RNN model. That code turned out to be consice and quite elegant (IMHO!), so it seemsed like I should write a few words about it.

It's so short I'm just going to paste it here in its entirety, but this link should have it in a Python file with some extra debugging information for tinkering, along with a sample input file.

from collections import defaultdict, Counter
import random
import sys

# This is the length of the "state" the current character is predicted from.
# For Markov chains with memory, this is the "order" of the chain. For n-grams,
# n is STATE_LEN+1 since it includes the predicted character as well.

data =
model = defaultdict(Counter)

print('Learning model...')
for i in range(len(data) - STATE_LEN):
    state = data[i:i + STATE_LEN]
    next = data[i + STATE_LEN]
    model[state][next] += 1

state = random.choice(list(model))
out = list(state)
for i in range(400):
    out.extend(random.choices(list(model[state]), model[state].values()))
    state = state[1:] + out[-1]

Without going into too much details, a Markov Chain is a model describing the probabilities of events based on the current state only (without having to recall all past states). It's very easy to implement and "train".

In the code shown above, the most important part to grok is the data structure model. It's a dictionary mapping a string state to the probabilities of characters following this state. The size of that string is configurable, but let's just assume it's 4 for the rest of the discussion. This is the order of the Markov chain. For every string seen in the input, we look at the character following it and increment a counter for that character; the end result is a dictionary mapping the alphabet to integers. For example, we may find that for the state "foob", 'a' appeared 75 times right after it, 'b' appeared 25 times, 'e' 44 times and so on.

The learning process is simply sliding a "window" of 4 characters over the input, recording these appearances:

Markov chain sliding window diagram

The learning loop is extremely concise; this is made possible by the right choice of Python data structures. First, we use a defaultdict for the model itself; this lets us avoid existence checks or try for states that don't appear in the model at all.

Second, the objects contained inside model are of type Counter, which is a subclass of dict with some special sauce. In its most basic usage, a counter is meant to store an integer count for its keys - exactly what we need here. So a lot of power is packed into this simple statement:

model[state][next] += 1

If you try to rewrite it with model being a dict of dicts, it will become much more complicated to keep track of the corner cases.

With the learning loop completed, we have in model every 4-letter string encountered in the text, mapped to its Counter of occurrences for the character immediately following it. We're ready to generate text, or "sample from the model".

We start by picking a random state that was seen in the training text. Then, we loop for an arbitrary bound and at every step we randomly select the following character, and update the current state. The following character is selected using weighted random selection - precisely the right idiom here, as we already have in each counter the "weights" - the more often some char was observed after a given state, the higher the chance to select it for sampling will be.

Starting with Python 3.6, the standard library has random.choices to implement weighted random selection. Before Python 3.6 we'd have to write that function on our own (Counter has the most_common() method that would make it easier to write an efficient version).

Summary of reading: April - June 2018

  • "Tao-teching" by Lao Tzu - since it's a very short book with many very different translations, I read a couple. One by Stephen Mitchell and one by Red Pine, which also has commentaries. It's interesting to contrast the different translations - they're very different! I found the English in Mitchell's more flowy, though Red Pine seems to approach accuracy very academically and collects from multiple sources, so his may be the closer to the original (?). In any case, it's quite a unique book which tries to convey the Tao without explicitly saying what it is. It's curious to find a bunch of advice in it that withstood the test of time because it's being told by self-help gurus and materials to this day.
  • "Ready Player One" by Ernest Cline - an original fantasy sci-fi story about a dystopian future where people find solace in an engrossing VR simulation, and a group of protagonists with an RPG-like quest. Full of references to games, movies and geek culture from the 1980s, so somewhat earlier than my time. I imagine it really speaks to folks of a certain age though. Not bad for a sci-fi book, overall.
  • "The Quest: Energy, Security, and the Remaking of the Modern World" by Daniel Yergin - an epic treatise of energy - oil, coal, natural gas, solar, wind, etc. Covers many aspects including history and geo-politics in many parts of the world, economics, the transportation market and much more. At over 800 pages it's long-winded and a bit repetitive in places, but overall a very well-written and interesting book.
  • "Gorilla, My Love" by Toni Cade Bambara - a collection of short stories about the lives of blacks in Brooklyn, told from the vantage point of a young girl. In theory, sounds great. In practice, however, there's a strong mismatch between this book's artistic style and what I prefer reading; I didn't like the writing at all. Some stories are nice, but many are borderline unreadable gibberish.
  • "Deep Learning with Python" by Francois Chollet - a very good introduction to deep learning and the Keras library. I really like the intuitions and hard-won experience shared by the author for choosing the right kinds of model for a given problem. That said, the book avoids getting into math and describing the inner workings of DNNs in too much detail; I understand it's a deliberate choice and respect it, but IMHO it makes it more suitable for complete beginners than for folks interested in getting to the next level.
  • "A Crack in Creation" by Jennifer Doudna and Samuel Sternberg - some of the pioneers of CRISPR technology discuss its origins, developments and possible future. Very interesting topic, and the book is well written. I personally found the biology parts much more interesting than the ethics parts and would have liked there to be more detail. Will have to look for more technical literature, I guess.
  • "Stranger in a Strange Land" by Robert Heinlein - finally got to read this iconic work of sci-fi and grok the origins of the word "grok". Though it's occasionally tiresome, overall it's a brilliant book. Unfortunately, the blatant sexism and chauvinism stands out and would likely get a much different reception if published today. It's always amusing to see how unimaginative future predictions of technology are, and how much they reflect the scientific know-how at the time of publishing. In this book people easily travel to Mars and have flying cars, but still use public pay phones and cassette tapes.
  • "How Linux Works, 2nd ed." by Brian Ward - an intermediate-level book describing how the Linux operating system works, aimed at power users and system administrators, with some basic detours into programming. I would be happy if it went into more depth on some topics, but that would probably blow the size up 10x so I understand the author's constraints. Chapter 8 on resource management/utilization was my favorite.
  • "Evolving Ourselves: How Unnatural Selection and Nonrandom Mutation are Changing Life on Earth" by Juan Enriquez & Steve Gullans - nice snapshot of the current state of bioscience and how recent advances in technology skew human evolution. Unfortunately, too much focus is given to speculation, ethics and other meta-topics; I'd prefer more discussion of actual science.


  • "Billions an Billions" by Carl Sagan