Eli Bendersky's website - Linux

Replacing my home desktop computer

2022-11-02T20:31:00-07:00

After 9 years, it's time to retire my old and faithful home Linux machine. It has served me extremely well, by far the longest time I've used a home computer without any replacements. The fans were starting to show their age, to the extent that the CPU fan sometimes needs a nudge to start after the machine has been off. While this can be fixed, I also wanted a somewhat faster machine with more memory.

I've long been keeping a curious eye on system76; they build and sell HW that was designed for Linux from the start. While my old machine - which was just a custom build - handled Linux fine over the years, it did exhibit some snags here and there (Bluetooth, for example). It would be nice not to deal with this and have HW that just works.

So I went ahead and ordered a "Meerkat" model, which is a really tiny 4.5"x4.5" box (with a comically large power adapter) that manages to pack 32 GiB of memory, a sizable SSD and a 4-core (8-thread) i5-1135G7 CPU. It came with Ubuntu 22.04 preinstalled and was very quick and easy to set up. I moved all my daily operations to this machine a couple of hours after unpacking it. Just the computer itself changed - it's still the same monitor, keyboard and mouse. I could finally use my Bluetooth earphones with it, without any issues.

It's about 33% faster than my older machine (at least for compiling the Go toolchain and medium-sized Rust projects), and it's nice to have twice as much memory and a larger SSD. I'm quite happy with it so far. It's not the cheapest option out there, but it is a "pay for high quality with minimum fuss" option.

Unix domain sockets in Go

2019-02-12T05:27:00-08:00

When it comes to inter-process communication (IPC) between processes on the same Linux host, there are multiple options: FIFOs, pipes, shared memory, sockets and so on. One of the most interesting options is Unix Domain Sockets that combine the convenient API of sockets with the higher performance of the other single-host methods.

This post demonstrates some basic examples of using Unix domain sockets with Go and explores some benchmarks comparing them to TCP loop-back sockets.

Unix domain sockets (UDS)

Unix domain sockets (UDS) have a long history, going back to the original BSD socket specification in the 1980s. The Wikipedia definition is:

A Unix domain socket or IPC socket (inter-process communication socket) is a data communications endpoint for exchanging data between processes executing on the same host operating system.

UDS support streams (TCP equivalent) and datagrams (UDP equivalent); this post focuses on the stream APIs.

IPC with UDS looks very similar to IPC with regular TCP sockets using the loop-back interface (localhost or 127.0.0.1), but there is a key difference: performance. While the TCP loop-back interface can skip some of the complexities of the full TCP/IP network stack, it retains many others (ACKs, TCP flow control, and so on). These complexities are designed for reliable cross-machine communication, but on a single host they're an unnecessary burden. This post will explore some of the performance advantages of UDS.

There are some additional differences. For example, since UDS use paths in the filesystem as their addresses, we can use directory and file permissions to control access to sockets, simplifying authentication. I won't list all the differences here; for more information feel free to check out the Wikipedia link and additional resources like Beej's UNIX IPC guide.

The big disadvantage of UDS compared to TCP sockets is the single-host restriction, of course. For code written to use TCP sockets we only have to change the address from local to remote and everything keeps working. That said, the performance advantages of UDS are significant enough, and the API is similar enough to TCP sockets that it's quite possible to write code that supports both (UDS on a single host, TCP for remote IPC) with very little difficulty.

Using Unix domain sockets in Go

Let's start with a basic example of a server in Go that listens on a UNIX domain socket:

const SockAddr = "/tmp/echo.sock"

func echoServer(c net.Conn) {
    log.Printf("Client connected [%s]", c.RemoteAddr().Network())
    io.Copy(c, c)
    c.Close()
}

func main() {
    if err := os.RemoveAll(SockAddr); err != nil {
        log.Fatal(err)
    }

    l, err := net.Listen("unix", SockAddr)
    if err != nil {
        log.Fatal("listen error:", err)
    }
    defer l.Close()

    for {
        // Accept new connections, dispatching them to echoServer
        // in a goroutine.
        conn, err := l.Accept()
        if err != nil {
            log.Fatal("accept error:", err)
        }

        go echoServer(conn)
    }
}

UDS are identified with paths in the file system; for our server here we use /tmp/echo.sock. The server begins by removing this file if it exists, what is that about?

When servers shut down, the file representing the socket can remain in the file system unless the server did orderly cleanup after itself. If we re-run another server with the same socket path, we may get the error:

$ go run simple-echo-server.go
2019/02/08 05:41:33 listen error:listen unix /tmp/echo.sock: bind: address already in use

To prevent that, the server begins by removing the socket file, if it exists [1].

Now that the server is running, we can interact with it using Netcat, which can be asked to connect to UDS with the -U flag:

$ nc -U /tmp/echo.sock

Whatever you type in, the server will echo back. Press ^D to terminate the session. Alternatively, we can write a simple client in Go that connects to the server, sends it a message, waits for a response and exits. The full code for the client is here, but the important part is the connection:

c, err := net.Dial("unix", "/tmp/echo.sock")

We can see that writing UDS servers and clients is very similar to writing regular socket servers and clients. The only difference is having to pass "unix" as the network parameter of net.Listen and net.Dial; the rest of the code remains the same. Obviously, this makes it very easy to write generic server and client code that's independent of the actual kind of socket it's using.

HTTP and RPC protocols over UDS

Network protocols compose by design. High-level protocols, such as HTTP and various forms of RPC, don't particularly care about how the lower levels of the stack are implemented as long as certain guarantees are maintained.

Go's standard library comes with a small and useful rpc package that makes it trivial to throw together quick RPC servers and clients. Here's a simple server that has a single procedure defined:

const SockAddr = "/tmp/rpc.sock"

type Greeter struct {
}

func (g Greeter) Greet(name *string, reply *string) error {
    *reply = "Hello, " + *name
    return nil
}

func main() {
    if err := os.RemoveAll(SockAddr); err != nil {
        log.Fatal(err)
    }

    greeter := new(Greeter)
    rpc.Register(greeter)
    rpc.HandleHTTP()
    l, e := net.Listen("unix", SockAddr)
    if e != nil {
        log.Fatal("listen error:", e)
    }
    fmt.Println("Serving...")
    http.Serve(l, nil)
}

Note that we use the HTTP version of the server. It registers a HTTP handler with the http package, and the actual serving is done with the standard http.Serve. The network stack here looks something like this:

An RPC client that can connect to the server shown above is available here. It uses the standard rpc.Client.Call method to connect to the server.

Benchmarking UDS compared to loop-back TCP sockets

Note: benchmarking is hard, so please take these results with a grain of salt. There's some more information on benchmarking different socket types on the Redis benchmarks page and in this paper, as well as many other resources online. I also found this set of benchmarks (written in C) instructive.

I'm running two kinds of benchmarks: one for latency, and one for throughput.

For latency, the full code of the benchmark is here. Run it with -help to see what the flags are, and the code should be very straightforward to grok. The idea is to ping-pong a small packet of data (128 bytes by default) between a server and a client. The client measures how long it takes to send one such message and receive one back, and takes that combined time as "twice the latency", averaging it over many messages.

On my machine, I see average latency of ~3.6 microseconds for TCP loop-back sockets, and ~2.3 microseconds for UDS.

The throughput/bandwidth benchmark is conceptually simpler than the latency benchmark. The server listens on a socket and grabs all the data it can get (and discards it). The client sends large packets (hundreds of KB or more) and measures how long each packet takes to send; the send is done synchronously and the client expects the whole message to be sent in a single call, so it's a good approximation of bandwidth if the packet size is large enough.

Obviously, the throughput measurement is more representative with larger messages. I tried increasing them until the throughput improvements tapered off.

For smaller packet sizes, I see UDS winning over TCP: 10 GB/sec compared to 9.4 GB/sec for 512K. For much larger packet sizes 16-32 MB, the difference becomes negligible (both taper off at about 13 GB/sec). Interestingly, for some packet sizes (like 64K), TCP sockets are winning on my machine.

For very small message sizes we're getting back to latency-dominated performance, so UDS is considerably faster (more than 2x the number of packets per second compared to TCP). In most cases I'd say that the latency measurements are more important - they're more applicable to things like RPC servers and databases. In some cases like streaming video or other "big data" over sockets, you may want to pick the packet sizes carefully to optimize the performance for the specific machine you're using.

This discussion has some really insightful information about why we should expect UDS to be faster. However, beware - it's from 2005 and much in Linux has changed since then.

Unix domain sockets in the real-world Go projects

I was curious to see if UDS are actually used in real-world Go projects. They sure are! A few minutes of browsing/searching GitHub quickly uncovered UDS servers in many components of the new Go-dominated cloud infrastructure: runc, moby (Docker), k8s, lstio - pretty much every project I looked at.

That makes sense - as the benchmarks demonstrate, there are significant performance advantages to using a UDS when the client and server are both on the same host. And the API of UDS and TCP sockets is so similar that the cost of supporting both interchangeably is quite small.

[1]	For internet-domain socket, the same issue exists with ports that are marked taken by processes that die without cleanup. The `SO_REUSEADDR` socket option exists to address this problem.

Measuring context switching and memory overheads for Linux threads

2018-09-04T05:35:00-07:00

In this post I want to explore the costs of threads on modern Linux machines, both in terms of time and space. The background context is designing high-load concurrent servers, where using threads is one of the common schemes.

Important disclaimer: it's not my goal here to provide an opinion in the threads vs. event-driven models debate. Ultimately, both are tools that work well in some scenarios and less well in others. That said, one of the major criticisms of a thread-based model is the cost - comments like "but context switches are expensive!" or "but a thousand threads will eat up all your RAM!", and I do intend to study the data underlying such claims in more detail here. I'll do this by presenting multiple code samples and programs that make it easy to explore and experiment with these measurements.

Linux threads and NPTL

In the dark, old ages before version 2.6, the Linux kernel didn't have much specific support for threads, and they were more-or-less hacked on top of process support. Before futexes there was no dedicated low-latency synchronization solution (it was done using signals); neither was there much good use of the capabilities of multi-core systems [1].

The Native POSIX Thread Library (NPTL) was proposed by Ulrich Drepper and Ingo Molnar from Red Hat, and integrated into the kernel in version 2.6, circa 2005. I warmly recommend reading its design paper. With NPTL, thread creation time became about 7x faster, and synchronization became much faster as well due to the use of futexes. Threads and processes became more lightweight, with strong emphasis on making good use of multi-core processors. This roughly coincided with a much more efficient scheduler, which made juggling many threads in the Linux kernel even more efficient.

Even though all of this happened 13 years ago, the spirit of NPTL is still easily observable in some system programming code. For example, many thread and synchronization-related paths in glibc have nptl in their name.

Threads, processes and the clone system call

This was originally meant to be a part of this larger article, but it was getting too long so I split off a separate post on launching Linux processes and threads with clone, where you can learn about the clone system call and some measurements of how expensive it is to launch new processes and threads.

The rest of this post will assume this is familiar information and will focus on context switching and memory usage.

What happens in a context switch?

In the Linux kernel, this question has two important parts:

When does a kernel switch happen
How it happens

The following deals mostly with (2), assuming the kernel has already decided to switch to a different user thread (for example because the currently running thread went to sleep waiting for I/O).

The first thing that happens during a context switch is a switch to kernel mode, either through an explicit system call (such as write to some file or pipe) or a timer interrupt (when the kernel preempts a user thread whose time slice has expired). This requires saving the user space thread's registers and jumping into kernel code.

Next, the scheduler kicks in to figure out which thread should run next. When we know which thread runs next, there's the important bookkeeping of virtual memory to take care of; the page tables of the new thread have to be loaded into memory, etc.

Finally, the kernel restores the new thread's registers and cedes control back to user space.

All of this takes time, but how much time, exactly? I encourage you to read some additional online resources that deal with this question, and try to run benchmarks like lm_bench; what follows is my attempt to quantify thread switching time.

How expensive are context switches?

To measure how long it takes to switch between two threads, we need a benchmark that deliberatly triggers a context switch and avoids doing too much work in addition to that. This would be measuring just the direct cost of the switch, when in reality there is another cost - the indirect one, which could even be larger. Every thread has some working set of memory, all or some of which is in the cache; when we switch to another thread, all this cache data becomes unneeded and is slowly flushed out, replaced by the new thread's data. Frequent switches back and forth between the two threads will cause a lot of such thrashing.

In my benchmarks I am not measuring this indirect cost, because it's pretty difficult to avoid in any form of multi-tasking. Even if we "switch" between different asynchronous event handlers within the same thread, they will likely have different memory working sets and will interfere with each other's cache usage if those sets are large enough. I strongly recommend watching this talk on fibers where a Google engineer explains their measurement methodology and also how to avoid too much indirect switch costs by making sure closely related tasks run with temporal locality.

These code samples measure context switching overheads using two different techniques:

A pipe which is used by two threads to ping-pong a tiny amount of data. Every read on the pipe blocks the reading thread, and the kernel switches to the writing thread, and so on.
A condition variable used by two threads to signal an event to each other.

There are additional factors context switching time depends on; for example, on a multi-core CPU, the kernel can occasionally migrate a thread between cores because the core a thread has been previously using is occupied. While this helps utilize more cores, such switches cost more than staying on the same core (again, due to cache effects). Benchmarks can try to restrict this by running with taskset pinning affinity to one core, but it's important to keep in mind this only models a lower bound.

Using the two techniques I'm getting fairly similar results: somewhere between 1.2 and 1.5 microseconds per context switch, accounting only for the direct cost, and pinning to a single core to avoid migration costs. Without pinning, the switch time goes up to ~2.2 microseconds [2]. These numbers are largely consistent with the reports in the fibers talk mentioned above, and also with other benchmarks found online (like lat_ctx from lmbench).

What does this mean in practice?

So we have the numbers now, but what do they mean? Is 1-2 us a long time? As I have mentioned in the post on launch overheads, a good comparison is memcpy, which takes 3 us for 64 KiB on the same machine. In other words, a context switch is a bit quicker than copying 64 KiB of memory from one location to another.

1-2 us is not a long time by any measure, except when you're really trying to optimize for extremely low latencies or high loads.

As an example of an artificially high load, here is another benchmark which writes a short message into a pipe and expects to read it from another pipe. On the other end of the two pipes is a thread that echoes one into the other.

Running the benchmark on the same machine I used to measure the context switch times, I get ~400,000 iterations per second (this is with taskset to pin to a single core). This makes perfect sense given the earlier measurements, because each iteration of this test performs two context switches, and at 1.2 us per switch this is 2.4 us per iteration.

You could claim that the two threads compete for the same CPU, but if I don't pin the benchmark to a single core, the number of iterations per second halves. This is because the vast majority of time in this benchmark is spent in the kernel switching from one thread to the other, and the core migrations that occur when it's not pinned greatly ouweigh the loss of (the very minimal) parallelism.

Just for fun, I rewrote the same benchmark in Go; two goroutines ping-ponging short message between themselves over a channel. The throughput this achieves is dramatically higher - around 2.8 million iterations per second, which leads to an estimate of ~170 ns switching between goroutines [3]. Since switching between goroutines doesn't require an actual kernel context switch (or even a system call), this isn't too surprising. For comparison, Google's fibers use a new Linux system call that can switch between two tasks in about the same time, including the kernel time.

A word of caution: benchmarks tend to be taken too seriously. Please take this one only for what it demonstrates - a largely synthetical workload used to poke into the cost of some fundamental concurrency primitives.

Remember - it's quite unlikely that the actual workload of your task will be negligible compared to the 1-2 us context switch; as we've seen, even a modest memcpy takes longer. Any sort of server logic such as parsing headers, updating state, etc. is likely to take orders of magnitude longer. If there's one takeaway to remember from these sections is that context switching on modern Linux systems is super fast.

Memory usage of threads

Now it's time to discuss the other overhead of a large number of threads - memory. Even though all threads in a process share their, there are still areas of memory that aren't shared. In the post about clone we've mentioned page tables in the kernel, but these are comparatively small. A much larger memory area that it private to each thread is the stack.

The default per-thread stack size on Linux is usually 8 MiB, and we can check what it is by invoking ulimit:

$ ulimit -s
8192

To see this in action, let's start a large number of threads and observe the process's memory usage. This sample launches 10,000 threads and sleeps for a bit to let us observe its memory usage with external tools. Using tools like top (or preferably htop) we see that the process uses ~80 GiB of virtual memory, with about 80 MiB of resident memory. What is the difference, and how can it use 80 GiB of memory on a machine that only has 16 available?

Virtual vs. Resident memory

A short interlude on what virtual memory means. When a Linux program allocates memory (with malloc) or otherwise, this memory initially doesn't really exist - it's just an entry in a table the OS keeps. Only when the program actually accesses the memory is the backing RAM for it found; this is what virtual memory is all about.

Therefore, the "memory usage" of a process can mean two things - how much virtual memory it uses overall, and how much actual memory it uses. While the former can grow almost without bounds - the latter is obviously limited to the system's RAM capacity (with swapping to disk being the other mechanism of virtual memory to assist here if usage grows above the side of physical memory). The actual physical memory on Linux is called "resident" memory, because it's actually resident in RAM.

There's a good StackOverflow discussion of this topic; here I'll just limit myself to a simple example:

int main(int argc, char** argv) {
  report_memory("started");

  int N = 100 * 1024 * 1024;
  int* m = malloc(N * sizeof(int));
  escape(m);
  report_memory("after malloc");

  for (int i = 0; i < N; ++i) {
    m[i] = i;
  }
  report_memory("after touch");

  printf("press ENTER\n");
  (void)fgetc(stdin);
  return 0;
}

This program starts by allocating 400 MiB of memory (assuming an int size of 4) with malloc, and later "touches" this memory by writing a number into every element of the allocated array. It reports its own memory usage at every step - see the full code sample for the reporting code [4]. Here's the output from a sample run:

$ ./malloc-memusage
started: max RSS = 4780 kB; vm size = 6524 kB
after malloc: max RSS = 4780 kB; vm size = 416128 kB
after touch: max RSS = 410916 kB; vm size = 416128 kB

The most interesting thing to note is how vm size remains the same between the second and third steps, while max RSS grows from the initial value to 400 MiB. This is precisely because until we touch the memory, it's fully "virtual" and isn't actually counted for the process's RAM usage.

Therefore, distinguishing between virtual memory and RSS in realistic usage is very important - this is why the thread launching sample from the previous section could "allocate" 80 GiB of virtual memory while having only 80 MiB of resident memory.

Back to memory overhead for threads

As we've seen, a new thread on Linux is created with 8 MiB of stack space, but this is virtual memory until the thread actually uses it. If the thread actually uses its stack, resident memory usage goes up dramatically for a large number of threads. I've added a configuration option to the sample program that launches a large number of threads; with it enabled, the thread function actually uses stack memory and from the RSS report it is easy to observe the effects. Curiously, if I make each of 10,000 threads use 400 KiB of memory, the total RSS is not 4 GiB but around 2.6 GiB [5].

How to we control the stack size of threads? One option is using the ulimit command, but a better option is with the pthread_attr_setstacksize API. The latter is invoked programatically and populates a pthread_attr_t structure that's passed to thread creation. The more interesting question is - what should the stack size be set to?

As we have seen above, just creating a large stack for a thread doesn't automatically eat up all the machine's memory - not before the stack is being used. If our threads actually use large amounts of stack memory, this is a problem, because this severely limits the number of threads we can run concurrently. Note that this is not really a problem with threads - but with concurrency; if our program uses some event-driven approach to concurrency and each handler uses a large amount of memory, we'd still have the same problem.

If the task doesn't actually use a lot of memory, what should we set the stack size to? Small stacks keep the OS safe - a deviant program may get into an infinite recursion and a small stack will make sure it's killed early. Moreover, virtual memory is large but not unlimited; especially on 32-bit OSes, we might not have 80 GiB of virtual address space for the process, so a 8 MiB stack for 10,000 threads makes no sense. There's a tradeoff here, and the default chosen by 32-bit Linux is 2 MiB; the maximal virtual address space available is 3 GiB, so this imposes a limit of ~1500 threads with the default settings. On 64-bit Linux the virtual address space is vastly larger, so this limitation is less serious (though other limits kick in - on my machine the maximal number of threads the OS lets one process start is about 32K).

Therefore I think it's more important to focus on how much actual memory each concurrent task is using than on the OS stack size limit, as the latter is simply a safety measure.

Conclusion

The numbers reported here paint an interesting picture on the state of Linux multi-threaded performance in 2018. I would say that the limits still exist - running a million threads is probably not going to make sense; however, the limits have definitely shifted since the past, and a lot of folklore from the early 2000s doesn't apply today. On a beefy multi-core machine with lots of RAM we can easily run 10,000 threads in a single process today, in production. As I've mentioned above, it's highly recommended to watch Google's talk on fibers; through careful tuning of the kernel (and setting smaller default stacks) Google is able to run an order of magnitude more threads in parallel.

Whether this is sufficient concurrency for your application is very obviously project-specific, but I'd say that for higher concurrencies you'd probably want to mix in some asynchronous processing. If 10,000 threads can provide sufficient concurrency - you're in luck, as this is a much simpler model - all the code within the threads is serial, there are no issues with blocking, etc.

[1]	For example, in order to implement POSIX semantics properly, a single thread was designated as a "manager" and managed operations like "create a new thread". This created an unfortunate serialization point and a bottleneck.

[2]	These numbers also vary greatly between CPUs. The numbers reported herein are on my Haswell i7-4771. On a different contemporary machine (a low-end Xeon) I measured switch times that were about 50-75% longer.

[3] Curiously, pinning the Go program to a single core (by means of setting GOMAXPROCS=1 and running with taskset) increases the throughput by only by 10% or so. The Go scheduler is not optimized for this strange use case of endless hammering between two goroutine, but it performs very well regardless.

[4]	Note that while for resident memory there's a convenient `getrusage` API, to report virtual memory size we have to parse `/proc/PID/status`.

[5]	According to Tom Dryer, recent Linux version only approximate this usage, which could explain the discrepancy - see this explanation.

Launching Linux threads and processes with clone

2018-08-01T06:14:00-07:00

Due to variation between operating systems and the way OS courses are taught, some programmers may have an outdated mental model about the difference between processes and threads in Linux. Even the name "thread" suggests something extremely lightweight compared to a heavy "process" - a mostly wrong intuition.

In fact, for the Linux kernel itself there's absolutely no difference between what userspace sees as processes (the result of fork) and as threads (the result of pthread_create). Both are represented by the same data structures and scheduled similarly. In kernel nomenclature this is called tasks (the main structure representing a task in the kernel is task_struct), and I'll be using this term from now on.

In Linux, threads are just tasks that share some resources, most notably their memory space; processes, on the other hand, are tasks that don't share resources. For application programmers, proceses and threads are created and managed in very different ways. For processes there's a slew of process-management APIs like fork, wait and so on. For threads there's the pthread library. However, deep in the guts of these APIs and libraries, both processes and threads come into existence through a single Linux system call - clone.

The `clone` system call

We can think of clone as the unifying implementation shared between processes and threads. Whatever perceived difference there is between processes and threads on Linux is achieved through passing different flags to clone. Therefore, it's most useful to think of processes and threads not as two completely different concepts, but rather as two variants of the same concept - starting a concurrent task. The differences are mostly about what is shared between this new task and the task that started it.

Here is a code sample demonstrating the most important sharing aspect of threads - memory. It uses clone in two ways, once with the CLONE_VM flag and once without. CLONE_VM tells clone to share the virtual memory between the calling task and the new task clone is about to create [1]. As we'll see later on, this is the flag used by pthread_create:

static int child_func(void* arg) {
  char* buf = (char*)arg;
  printf("Child sees buf = \"%s\"\n", buf);
  strcpy(buf, "hello from child");
  return 0;
}

int main(int argc, char** argv) {
  // Allocate stack for child task.
  const int STACK_SIZE = 65536;
  char* stack = malloc(STACK_SIZE);
  if (!stack) {
    perror("malloc");
    exit(1);
  }

  // When called with the command-line argument "vm", set the CLONE_VM flag on.
  unsigned long flags = 0;
  if (argc > 1 && !strcmp(argv[1], "vm")) {
    flags |= CLONE_VM;
  }

  char buf[100];
  strcpy(buf, "hello from parent");
  if (clone(child_func, stack + STACK_SIZE, flags | SIGCHLD, buf) == -1) {
    perror("clone");
    exit(1);
  }

  int status;
  if (wait(&status) == -1) {
    perror("wait");
    exit(1);
  }

  printf("Child exited with status %d. buf = \"%s\"\n", status, buf);
  return 0;
}

Some things to note when clone is invoked:

It takes a function pointer to the code the new task will run, similarly to threading APIs, and unlike the fork API. This is the glibc wrapper for clone. There's also a raw system call which is discussed below.
The stack for the new task has to be allocated by the parent and passed into clone.
The SIGCHLD flag tells the kernel to send the SIGCHLD to the parent when the child terminates, which lets the parent use the plain wait call to wait for the child to exit. This is the only flag the sample passes into clone by default.

This code sample passes a buffer into the child, and the child writes a string into it. When called without the vm command-line argument, the CLONE_VM flag is off, and the parent's virtual memory is copied into the child. The child sees the message the parent placed in buf, but whatever it writes into buf goes into its own copy and the parent can't see it. Here's the output:

$ ./clone-vm-sample
Child sees buf = "hello from parent"
Child exited with status 0. buf = "hello from parent"

But when the vm argument is passed, CLONE_VM is set and the child task shares the parent's memory. Its writing into buf will now be observable from the parent:

$ ./clone-vm-sample vm
Child sees buf = "hello from parent"
Child exited with status 0. buf = "hello from child"

A bunch of other CLONE_* flags can specify other things that will be shared with the parent: CLONE_FILES will share the open file descriptors, CLONE_SIGHAND will share the signal dispositions, and so on.

Other flags are there to implement the semantics required by POSIX threads. For example, CLONE_THREAD asks the kernel to assign the same thread group id to the child as to the parent, in order to comply with POSIX's requirement of all threads in a process sharing a single process ID [2].

Calling `clone` in process and thread creation

Let's dig through some code in glibc to see how clone is invoked, starting with fork, which is routed to __libc_fork in sysdeps/nptl/fork.c. The actual implementation is specific to the threading library, hence the location in the nptl folder. The first thing __libc_fork does is invoke the fork handlers potentially registered beforehead with pthread_atfork.

The actual cloning happens with:

pid = ARCH_FORK ();

Where ARCH_FORK is a macro defined per architecture (exact syscall ABIs are architecture-specific). For x86_64 it maps to:

#define ARCH_FORK() \
  INLINE_SYSCALL (clone, 4,                                                   \
                  CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID | SIGCHLD, 0,     \
                  NULL, &THREAD_SELF->tid)

The CLONE_CHILD_* flags are useful for some threading libraries (though not the default on Linux today - NPTL). Otherwise, the invocation is very similar to the clone code sample shown in the previous section.

You may wonder where is the function pointer in this call. Nice catch! This is the raw call version of clone, where execution continues from the point of the call in both parent and child - close to the usual semantics of fork.

Now let's turn to pthread_create. Through a dizzying chain of macros it reaches a function named create_thread (defined in sysdeps/unix/sysv/linux/createthread.c) that calls clone with:

const int clone_flags = (CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SYSVSEM
                       | CLONE_SIGHAND | CLONE_THREAD
                       | CLONE_SETTLS | CLONE_PARENT_SETTID
                       | CLONE_CHILD_CLEARTID
                       | 0);

ARCH_CLONE (&start_thread, STACK_VARIABLES_ARGS,
            clone_flags, pd, &pd->tid, tp, &pd->tid)

Browse through man 2 clone to understand the flags passed into the call. Briefly, it is asked to share the virtual memory, file system, open files, shared memory and signal handlers with the parent thread/process. Additional flags are passed to implement proper identification - all threads launched from a single process have to share its process ID to be POSIX compliant.

Reading the glibc source code is quite an exercise in mental resilience, but it's really interesting to see how everything fits together "in the real world".

Benchmarking process vs. thread creation

Given the information presented earlier in the post, I would expect process creation to be somewhat more expensive than thread creation, but not dramatically so. Since fork and pthread_create route to the same system call in Linux, the difference would come from the different flags they pass in. When pthread_create passes all these CLONE_* flags, it tells the kernel there's no need to copy the virtual memory image, the open files, the signal handlers, and so on. Obviously, this saves time.

For processes, there's a bit of copying to be done when fork is invoked, which costs time. The biggest chunk of time probably goes to copying the memory image due to the lack of CLONE_VM. Note, however, that it's not just copying the whole memory; Linux has an important optimization by using COW (Copy On Write) pages. The child's memory pages are initially mapped to the same pages shared by the parent, and only when we modify them the copy happens. This is very important because processes will often use a lot of shared read-only memory (think of the global structures used by the standard library, for example).

That said, the page tables still have to be copied. The size of a process's page tables can be observed by looking in /proc/<pid>/status - the VmPTE indicator. These can be around tens of kilobytes for small processes, and higher for larger processes. Not a lot of data to copy, but definitely some extra work for the CPU.

I wrote a benchmark that times process and threads launches, as a function of the virtual memory allocated before fork or pthread_create. The launch is averaged over 10,000 instances to remove warm-up effects and jitter:

Several things to note:

Indeed, launching processes is slower than threads, 35 vs. 5 microseconds for a 2-MB heap. But it's still very fast! 35 micro-seconds is not a lot of time at all. If your latency budget is willing to tolerate a 5 us overhead, it will almost certainly be fine with a 35 us overhead, unless you're working on some super-tight hard realtime system (in which case you shouldn't be using Linux!)
As expected, the time to launch a process when the heap is larger grows. The time delta is the time needed to copy the extra page table entries. For threads, on the other hand, there is absolutely no difference since the memory is completely shared.

Interestingly, it's easy to observe from these numbers that not the whole memory image is being copied. On the same machine this benchmark was run on, just a simple memcpy of 2 MB takes over 60 us, so it couldn't have copied 2 MB of heap to the child in the 30 us difference. Copying 64K (a reasonable size for a page table) takes 3 us, which makes sense because the cloning involves more than a simple memcpy. To me this is another sign of how fast these launches are, since we're in the same ballpark of performance with modestly sized memory copies.

Creation time is not the only performance benchmark of importance. It's also interesting to measure how long it takes to switch context between tasks when using threads or processes. This is covered in another post.

[1]	It may be just me, but I find this terminology a bit confusing. In my mind the word clone is synonymous to copy, so when we turn on a flag named "clone the VM" I'd expect the VM to be copied rather than shared. IMHO it would be clearer if this flag was named `SHARE_VM`.

[2]

It's certainly interesting to see this evolution of concepts over time. Thread APIs were defined in times where there was a real difference between processes and threads and their design reflects that. In modern Linux the kernel has to bend over backwards to provide the illusion of the difference although very little of it exists.

Basics of Futexes

2018-07-13T05:53:00-07:00

The futex (short for "Fast userspace mutex") mechanism was proposed by Linux contributors from IBM in 2002 [1]; it was integrated into the kernel in late 2003. The main idea is to enable a more efficient way for userspace code to synchronize multiple threads, with minimal kernel involvement.

In this post I want to provide a basic overview of futexes, how they work, and how they're used to implement the more familiar synchronization primitives in higher-level APIs and languages.

An important disclaimer: futexes are a very low-level feature of the Linux kernel, suitable for use in foundational runtime components like the C/C++ standard libraries. It is extremely unlikely that you will ever need to use them in application code.

Motivation

Before the introduction of futexes, system calls were required for locking and unlocking shared resources (for example semop). System calls are relatively expensive, however, requiring a context switch from userspace to kernel space; as programs became increasingly concurrent, locks started showing up on profiles as a significant percentage of the run time. This is very unfortunate, given that locks accomplish no real work ("business logic") but are only there to guarantee that access to shared resources is safe.

The futex proposal is based on a clever observation: in most cases, locks are actually not contended. If a thread comes upon a free lock, locking it can be cheap because most likely no other thread is trying to lock it at the exact same time. So we can get by without a system call, attemping much cheaper atomic operations first [2]. There's a very high chance that the atomic instruction will succeed.

However, in the unlikely event that another thread did try to take the lock at the same time, the atomic approach may fail. In this case there are two options. We can busy-loop using the atomic until the lock is cleared; while this is 100% userspace, it can also be extremely wasteful since looping can significantly occupy a core, and the lock can be held for a long time. The alternative is to "sleep" until the lock is free (or at least there's a high chance that it's free); we need the kernel to help with that, and this is where futexes come in.

Simple futex use - waiting and waking

The futex(2) system call multiplexes a lot of functionality on top of a single interface. I will not discuss any of the advanced options here (some of them are so esoteric they're not even officially documented) but will focus on just FUTEX_WAIT and FUTEX_WAKE. The man page description starts with a good introduction:

The futex() system call provides a method for waiting until a certain condition becomes true. It is typically used as a blocking construct in the context of shared-memory synchronization. When using futexes, the majority of the synchronization operations are performed in user space. A user-space program employs the futex() system call only when it is likely that the program has to block for a longer time until the condition becomes true. Other futex() operations can be used to wake any processes or threads waiting for a particular condition.

Simply stated, a futex is a kernel construct that helps userspace code synchronize on shared events. Some userspace processes (or threads) can wait on an event (FUTEX_WAIT), while another userspace process can signal the event (FUTEX_WAKE) to notify waiters. The waiting is efficient - the waiters are suspended by the kernel and are only scheduled anew when there's a wake-up signal.

Be sure to read the futex man page beyond the introduction; blog posts are not a substitute for documentation! At the very least read about the FUTEX_WAIT and FUTEX_WAKE calls, the arguments they take, their return values and possible errors.

Let's study a simple example demonstrating basic usage of futexes to coordinate two processes. The main function sets up the machinery and launches a child process that:

Waits for 0xA to be written into a shared memory slot.
Writes 0xB into the same memory slot.

Meanwhile, the parent:

Writes 0xA into the shared memory slot.
Waits for 0xB to be written into the slot.

This is a simple handshake between two processes. Here's the code:

int main(int argc, char** argv) {
  int shm_id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0666);
  if (shm_id < 0) {
    perror("shmget");
    exit(1);
  }
  int* shared_data = shmat(shm_id, NULL, 0);
  *shared_data = 0;

  int forkstatus = fork();
  if (forkstatus < 0) {
    perror("fork");
    exit(1);
  }

  if (forkstatus == 0) {
    // Child process

    printf("child waiting for A\n");
    wait_on_futex_value(shared_data, 0xA);

    printf("child writing B\n");
    // Write 0xB to the shared data and wake up parent.
    *shared_data = 0xB;
    wake_futex_blocking(shared_data);
  } else {
    // Parent process.

    printf("parent writing A\n");
    // Write 0xA to the shared data and wake up child.
    *shared_data = 0xA;
    wake_futex_blocking(shared_data);

    printf("parent waiting for B\n");
    wait_on_futex_value(shared_data, 0xB);

    // Wait for the child to terminate.
    wait(NULL);
    shmdt(shared_data);
  }

  return 0;
}

Note that we use POSIX shared memory APIs to create a memory location mapped into both processes. We can't just use a regular pointer here, because the address spaces of the two processes will be different [3].

Note that this is not a canonical usage of futex, which would be better employed to wait until a value changes from something rather than to something. It's just here to show the various possibilities in return values from futex. Later in the post a more canonical usage is demonstrated when we implement a mutex.

Here is wait_on_futex_value:

void wait_on_futex_value(int* futex_addr, int val) {
  while (1) {
    int futex_rc = futex(futex_addr, FUTEX_WAIT, val, NULL, NULL, 0);
    if (futex_rc == -1) {
      if (errno != EAGAIN) {
        perror("futex");
        exit(1);
      }
    } else if (futex_rc == 0) {
      if (*futex_addr == val) {
        // This is a real wakeup.
        return;
      }
    } else {
      abort();
    }
  }
}

This function's main added value on top of the futex system call is looping around when the wakeup is spurious. This can happen when val is not the expected value (yet) and also when another process was woken up before this one (can't really happen in this code sample, but is a real possibility in other scenarios).

Futex semantics are tricky [4]! FUTEX_WAIT will immediately return if the value at the futex address is not equal to val. In our case this can happen if the child issued a wait before the parent wrote 0xA, for example. The futex call will return an error with EAGAIN in this case.

Here is wake_futex_blocking:

void wake_futex_blocking(int* futex_addr) {
  while (1) {
    int futex_rc = futex(futex_addr, FUTEX_WAKE, 1, NULL, NULL, 0);
    if (futex_rc == -1) {
      perror("futex wake");
      exit(1);
    } else if (futex_rc > 0) {
      return;
    }
  }
}

It's a blocking wrapper around FUTEX_WAKE, which will normally return quickly regardless of how many waiters it has woken up. In our sample, this waiting is part of the handshake, but in many cases you won't see it.

Futexes are kernel queues for userspace code

Simply stated, a futex is a queue the kernel manages for userspace convenience. It lets userspace code ask the kernel to suspend until a certain condition is satisfied, and lets other userspace code signal that condition and wake up waiting processes. Earlier we've menioned busy-looping as one approach to wait on success of atomic operations; a kernel-managed queue is the much more efficient alternative, absolving userspace code from the need to burn billions of CPU cycles on pointless spinning.

Here's a diagram from LWN's "A futex overview and update":

In the Linux kernel, futexes are implemented in kernel/futex.c. The kernel keeps a hash table keyed by the address to quickly find the proper queue data structure and adds the calling process to the wait queue. There's quite a bit of complication, of course, due to using fine-grained locking within the kernel itself and the various advanced options of futexes.

Timed blocking with FUTEX_WAIT

The futex system call has a timeout parameter which lets user code implement waiting with a time-out.

The futex-wait-timeout sample shows this in action. Here is the relevant part of the child process which waits on a futex:

printf("child waiting for A\n");
struct timespec timeout = {.tv_sec = 0, .tv_nsec = 500000000};
while (1) {
  unsigned long long t1 = time_ns();
  int futex_rc = futex(shared_data, FUTEX_WAIT, 0xA, &timeout, NULL, 0);
  printf("child woken up rc=%d errno=%s, elapsed=%llu\n", futex_rc,
         futex_rc ? strerror(errno) : "", time_ns() - t1);
  if (futex_rc == 0 && *shared_data == 0xA) {
    break;
  }
}

If the wait takes longer than 500 ms, the process will loop and wait again. The sample lets you configure the length of time the parent process keeps the child waiting and observe the effects.

Using a futex to implement a simple mutex

In the motivation section that started this post, I explained how futexes help implement efficient locking in the common low-contention case. It's time to show a realistic implementation of a mutex using futexes and atomics. This is based on the second implementation in Ulrich Drepper's "Futexes are Tricky" paper.

For this sample I'm switching to C++, to use its standardized atomics (available since C++11). The full code is here; here is the important part:

class Mutex {
public:
  Mutex() : atom_(0) {}

  void lock() {
    int c = cmpxchg(&atom_, 0, 1);
    // If the lock was previously unlocked, there's nothing else for us to do.
    // Otherwise, we'll probably have to wait.
    if (c != 0) {
      do {
        // If the mutex is locked, we signal that we're waiting by setting the
        // atom to 2. A shortcut checks is it's 2 already and avoids the atomic
        // operation in this case.
        if (c == 2 || cmpxchg(&atom_, 1, 2) != 0) {
          // Here we have to actually sleep, because the mutex is actually
          // locked. Note that it's not necessary to loop around this syscall;
          // a spurious wakeup will do no harm since we only exit the do...while
          // loop when atom_ is indeed 0.
          syscall(SYS_futex, (int*)&atom_, FUTEX_WAIT, 2, 0, 0, 0);
        }
        // We're here when either:
        // (a) the mutex was in fact unlocked (by an intervening thread).
        // (b) we slept waiting for the atom and were awoken.
        //
        // So we try to lock the atom again. We set teh state to 2 because we
        // can't be certain there's no other thread at this exact point. So we
        // prefer to err on the safe side.
      } while ((c = cmpxchg(&atom_, 0, 2)) != 0);
    }
  }

  void unlock() {
    if (atom_.fetch_sub(1) != 1) {
      atom_.store(0);
      syscall(SYS_futex, (int*)&atom_, FUTEX_WAKE, 1, 0, 0, 0);
    }
  }

private:
  // 0 means unlocked
  // 1 means locked, no waiters
  // 2 means locked, there are waiters in lock()
  std::atomic<int> atom_;
};

Where cmpxhg is a simple wrapper to subdue C++'s atomic primitive to the expected interface:

// An atomic_compare_exchange wrapper with semantics expected by the paper's
// mutex - return the old value stored in the atom.
int cmpxchg(std::atomic<int>* atom, int expected, int desired) {
  int* ep = &expected;
  std::atomic_compare_exchange_strong(atom, ep, desired);
  return *ep;
}

The code snippet is heavily commented to explain how it works; reading Drepper's paper is recommended in any case, as it builds up to this implementation by first examining a simpler one which is subtly incorrect. One slightly non-kosher thing this code does is access the internal representation of std::atomic by casting the address of atom_ to int* when passing it to the futex syscall. This is because futex expects a simple address, while C++ atomics wrap their actual data in opaque types. This works on Linux on x64, but isn't generally portable. To make std::atomic play well with futex in a portable we'd have to add a portability layer. But it's not a need that comes up in practice - mixing futex with C++11 is not something anyone should do - these snippets are just demonstrational!

An interesting observation is about the meaning of the value sitting in the atom_ member. Recall that the futex syscall doesn't assign any meaning to the value - it's up to the user to do that. The 0,1,2 convention is useful for mutexes, and also the one used by the glibc implementation for low-level locks.

glibc mutex and low-level lock

This brings us to the glibc implementation of POSIX threads, which have the pthread_mutex_t type. As I've mentioned in the beginning of the post, futexes are not really for regular user code; rather, they are used by low-level runtimes and libraries to implement other, higher-level primitives. In this context, it's interesting to see how a mutex is implemented for NPTL. In the glibc source tree, this code is in nptl/pthread_mutex_lock.c

The code is significantly complicated by all the different types of mutexes it has to support, but we can discover some familiar building blocks if we dig deep enough. In addition to the file mentioned above, other files to look at (for x86) are sysdeps/unix/sysv/linux/x86_64/lowlevellock.h and nptl/lowlevellock.c. The code is dense, but the combination of atomic compare-and-exchange operations and futex invocations is apparent. The low-level lock machinery (lll_ or LLL_ prefixes) is used throughout the glibc code-base, not just in the implementation of POSIX threads.

The beginning of the comment at the top of sysdeps/nptl/lowlevellock.h should be familiar by now:

/* Low-level locks use a combination of atomic operations (to acquire and
   release lock ownership) and futex operations (to block until the state
   of a lock changes).  A lock can be in one of three states:
   0:  not acquired,
   1:  acquired with no waiters; no other threads are blocked or about to block
       for changes to the lock state,
   >1: acquired, possibly with waiters; there may be other threads blocked or
       about to block for changes to the lock state.

   We expect that the common case is an uncontended lock, so we just need
   to transition the lock between states 0 and 1; releasing the lock does
   not need to wake any other blocked threads.  If the lock is contended
   and a thread decides to block using a futex operation, then this thread
   needs to first change the state to >1; if this state is observed during
   lock release, the releasing thread will wake one of the potentially
   blocked threads.
 ..
 */

Futexes in the Go runtime

The Go runtime does not use libc, in most cases. Therefore, it cannot rely on the POSIX thread implementation in its own code. It invokes the underlying OS's system calls directly instead.

That makes it a good alternative candidate to study for its use of futexes. Since it can't just use a pthread_mutex_t for its locking, it has to roll its own lock. Let's see how this is done, by starting with the user-visible sync.Mutex type (in src/sync/mutex.go).

The Lock method of sync.Mutex is quite involved, as you might imagine. It first tries to use an atomic swap to quickly acquire a lock. If it turns out it has to wait, it defers to runtime_SemacquireMutex, which in turn calls runtime.lock. That function is defined in src/runtime/lock_futex.go [5], and defines some constants that will appear familiar:

const (
  mutex_unlocked = 0
  mutex_locked   = 1
  mutex_sleeping = 2

...
)

// Possible lock states are mutex_unlocked, mutex_locked and mutex_sleeping.
// mutex_sleeping means that there is presumably at least one sleeping thread.

runtime.lock also tries to speculatively grab a lock with an atomic; this function is used in a bunch of places in the Go runtime, so that makes sense, but I wonder if they couldn't have optimized the two consecutive atomics that occur when it's called by Mutex.lock, somehow.

If it discovers it has to sleep, it defers to futexsleep, which is OS-specific and lives in src/runtime/os_linux.go. This function calls invokes the futex system call directly with FUTEX_WAIT_PRIVATE (recall that this is sufficient for a single process, which the Go runtime fulfills).

[1]	See "Fuss, Futexes and Furwocks: Fast Userlevel Locking in Linux" by Franke, Russell, Kirkwood. Published in 2002 for the Ottawa Linux Symposium.

[2]	Most modern processors have built-in atomic instructions implemented in HW. For example on Intel architectures `cmpxhg` is an instruction. While it's not as cheap as non-atomic instructions (especially in multi-core systems), it's significantly cheaper than system calls.

[3]	The code repository for this post also contains an equivalent sample using threads instead of processes. There we don't need to use shared memory but can instead use the address of a stack variable.

[4]	There's a paper written by Ulrich Drepper named "Futexes are Tricky" that explores some of the nuances. I'll be using it later on for the mutex discussion. It's a very good paper - please read it if you're interested in the topic.

[5]	For OSes that expose the `futex(2)` system call. The Go runtime has a fallback onto the semaphore system calls if `futex` is not supported.

Persistent history in Bash - redux

2016-01-26T05:07:00-08:00

A couple of years ago I wrote about saving all the commands I ever ran in the terminal into a "persistent history" file, for later lookup. Since some people asked me whether this ended up being worthwhile, here's a short redux.

The TL;DR version is - keeping persistent history has been one of the best productivity hacks I ever put to use; I rely on it daily, and would be much less productive without it.

Before doing this, the only way I had to remember which commands/flags are needed to run something was to write it down in all kinds of notes files, personal wikis and so on. It was cumbersome, unorganized and time-consuming to reuse. With the .persistent_history file automatically populated by Bash from any terminal I'm typing into, and being kept in a Git repository for safekeeping, I have quick access to any command I ever ran. It's a life safer for someone who spends as much time in the terminal as me. I warmly recommend it, or some equivalent approach, to anyone who is using Linux daily.

Interestingly, at the time of the original post I was worried that with time this file will grow too long and will have to be trimmed. That turned out to be a completely needless worry. In over two years of using it at work, my .persistent_history is somewhat over 6 MB long, with ~60000 lines [1]. It takes a negligible amount of time to append to it and to search within it (15 milliseconds for a full search is the most I was able to measure). It doesn't even matter if you have a SSD or a hard drive as your main storage device; since the file is continously written to, it's almost certainly paged into memory most of the time anyway.

Also, I posted a histogram of the 10 most commonly used commands on my home machine for hobby hacking, so it's interesting to revisit that. Here's a histogram for the past year:

git          : 1564
ls           : 861
gs           : 669
cd           : 546
vi           : 543
make         : 538
ll           : 388
pssc         : 379
PYTHONPATH=. : 337
python       : 286

As the original post foresaw, the impending switch from Mercurial to Git for my personal projects, along with spending much less time on CPython core development have pushed hg to the fringes, and Git is certainly the most used command now (gs is my alias for git status). Python should be higher than it appears because commands starting with PYTHONPATH=. always precede python. The rest is a fairly expected bunch from a terminal hermit. pssc is one of the aliases I use for pss, which is why you don't see grep or find in the list.

I placed the Bash code enabling persistent history, along with the Python script I used to compute the command usage histogram shown above on GitHub.

[1] In reality there were likely many more commands, but the script does some amount of de-duplication - it won't write down a command if it's exactly the same as the last one written. For example, if you spend the whole day hacking in an editor and rerunning python foo.py every couple of minutes, the only commands that will be written in the history are opening the editor and then a single instance of python foo.py.

Redirecting all kinds of stdout in Python

2015-02-20T05:46:00-08:00

A common task in Python (especially while testing or debugging) is to redirect sys.stdout to a stream or a file while executing some piece of code. However, simply "redirecting stdout" is sometimes not as easy as one would expect; hence the slightly strange title of this post. In particular, things become interesting when you want C code running within your Python process (including, but not limited to, Python modules implemented as C extensions) to also have its stdout redirected according to your wish. This turns out to be tricky and leads us into the interesting world of file descriptors, buffers and system calls.

But let's start with the basics.

Pure Python

The simplest case arises when the underlying Python code writes to stdout, whether by calling print, sys.stdout.write or some equivalent method. If the code you have does all its printing from Python, redirection is very easy. With Python 3.4 we even have a built-in tool in the standard library for this purpose - contextlib.redirect_stdout. Here's how to use it:

from contextlib import redirect_stdout

f = io.StringIO()
with redirect_stdout(f):
    print('foobar')
    print(12)
print('Got stdout: "{0}"'.format(f.getvalue()))

When this code runs, the actual print calls within the with block don't emit anything to the screen, and you'll see their output captured by in the stream f. Incidentally, note how perfect the with statement is for this goal - everything within the block gets redirected; once the block is done, things are cleaned up for you and redirection stops.

If you're stuck on an older and uncool Python, prior to 3.4 [1], what then? Well, redirect_stdout is really easy to implement on your own. I'll change its name slightly to avoid confusion:

from contextlib import contextmanager

@contextmanager
def stdout_redirector(stream):
    old_stdout = sys.stdout
    sys.stdout = stream
    try:
        yield
    finally:
        sys.stdout = old_stdout

So we're back in the game:

f = io.StringIO()
with stdout_redirector(f):
    print('foobar')
    print(12)
print('Got stdout: "{0}"'.format(f.getvalue()))

Redirecting C-level streams

Now, let's take our shiny redirector for a more challenging ride:

import ctypes
libc = ctypes.CDLL(None)

f = io.StringIO()
with stdout_redirector(f):
    print('foobar')
    print(12)
    libc.puts(b'this comes from C')
    os.system('echo and this is from echo')
print('Got stdout: "{0}"'.format(f.getvalue()))

I'm using ctypes to directly invoke the C library's puts function [2]. This simulates what happens when C code called from within our Python code prints to stdout - the same would apply to a Python module using a C extension. Another addition is the os.system call to invoke a subprocess that also prints to stdout. What we get from this is:

this comes from C
and this is from echo
Got stdout: "foobar
12
"

Err... no good. The prints got redirected as expected, but the output from puts and echo flew right past our redirector and ended up in the terminal without being caught. What gives?

To grasp why this didn't work, we have to first understand what sys.stdout actually is in Python.

Detour - on file descriptors and streams

This section dives into some internals of the operating system, the C library, and Python [3]. If you just want to know how to properly redirect printouts from C in Python, you can safely skip to the next section (though understanding how the redirection works will be difficult).

Files are opened by the OS, which keeps a system-wide table of open files, some of which may point to the same underlying disk data (two processes can have the same file open at the same time, each reading from a different place, etc.)

File descriptors are another abstraction, which is managed per-process. Each process has its own table of open file descriptors that point into the system-wide table. Here's a schematic, taken from The Linux Programming Interface:

File descriptors allow sharing open files between processes (for example when creating child processes with fork). They're also useful for redirecting from one entry to another, which is relevant to this post. Suppose that we make file descriptor 5 a copy of file descriptor 4. Then all writes to 5 will behave in the same way as writes to 4. Coupled with the fact that the standard output is just another file descriptor on Unix (usually index 1), you can see where this is going. The full code is given in the next section.

File descriptors are not the end of the story, however. You can read and write to them with the read and write system calls, but this is not the way things are typically done. The C runtime library provides a convenient abstraction around file descriptors - streams. These are exposed to the programmer as the opaque FILE structure with a set of functions that act on it (for example fprintf and fgets).

FILE is a fairly complex structure, but the most important things to know about it is that it holds a file descriptor to which the actual system calls are directed, and it provides buffering, to ensure that the system call (which is expensive) is not called too often. Suppose you emit stuff to a binary file, a byte or two at a time. Unbuffered writes to the file descriptor with write would be quite expensive because each write invokes a system call. On the other hand, using fwrite is much cheaper because the typical call to this function just copies your data into its internal buffer and advances a pointer. Only occasionally (depending on the buffer size and flags) will an actual write system call be issued.

With this information in hand, it should be easy to understand what stdout actually is for a C program. stdout is a global FILE object kept for us by the C library, and it buffers output to file descriptor number 1. Calls to functions like printf and puts add data into this buffer. fflush forces its flushing to the file descriptor, and so on.

But we're talking about Python here, not C. So how does Python translate calls to sys.stdout.write to actual output?

Python uses its own abstraction over the underlying file descriptor - a file object. Moreover, in Python 3 this file object is further wrapped in an io.TextIOWrapper, because what we pass to print is a Unicode string, but the underlying write system calls accept binary data, so encoding has to happen en route.

The important take-away from this is: Python and a C extension loaded by it (this is similarly relevant to C code invoked via ctypes) run in the same process, and share the underlying file descriptor for standard output. However, while Python has its own high-level wrapper around it - sys.stdout, the C code uses its own FILE object. Therefore, simply replacing sys.stdout cannot, in principle, affect output from C code. To make the replacement deeper, we have to touch something shared by the Python and C runtimes - the file descriptor.

Redirecting with file descriptor duplication

Without further ado, here is an improved stdout_redirector that also redirects output from C code [4]:

from contextlib import contextmanager
import ctypes
import io
import os, sys
import tempfile

libc = ctypes.CDLL(None)
c_stdout = ctypes.c_void_p.in_dll(libc, 'stdout')

@contextmanager
def stdout_redirector(stream):
    # The original fd stdout points to. Usually 1 on POSIX systems.
    original_stdout_fd = sys.stdout.fileno()

    def _redirect_stdout(to_fd):
        """Redirect stdout to the given file descriptor."""
        # Flush the C-level buffer stdout
        libc.fflush(c_stdout)
        # Flush and close sys.stdout - also closes the file descriptor (fd)
        sys.stdout.close()
        # Make original_stdout_fd point to the same file as to_fd
        os.dup2(to_fd, original_stdout_fd)
        # Create a new sys.stdout that points to the redirected fd
        sys.stdout = io.TextIOWrapper(os.fdopen(original_stdout_fd, 'wb'))

    # Save a copy of the original stdout fd in saved_stdout_fd
    saved_stdout_fd = os.dup(original_stdout_fd)
    try:
        # Create a temporary file and redirect stdout to it
        tfile = tempfile.TemporaryFile(mode='w+b')
        _redirect_stdout(tfile.fileno())
        # Yield to caller, then redirect stdout back to the saved fd
        yield
        _redirect_stdout(saved_stdout_fd)
        # Copy contents of temporary file to the given stream
        tfile.flush()
        tfile.seek(0, io.SEEK_SET)
        stream.write(tfile.read())
    finally:
        tfile.close()
        os.close(saved_stdout_fd)

There are a lot of details here (such as managing the temporary file into which output is redirected) that may obscure the key approach: using dup and dup2 to manipulate file descriptors. These functions let us duplicate file descriptors and make any descriptor point at any file. I won't spend more time on them - go ahead and read their documentation, if you're interested. The detour section should provide enough background to understand it.

Let's try this:

f = io.BytesIO()

with stdout_redirector(f):
    print('foobar')
    print(12)
    libc.puts(b'this comes from C')
    os.system('echo and this is from echo')
print('Got stdout: "{0}"'.format(f.getvalue().decode('utf-8')))

Gives us:

Got stdout: "and this is from echo
this comes from C
foobar
12
"

Success! A few things to note:

The output order may not be what we expected. This is due to buffering. If it's important to preserve order between different kinds of output (i.e. between C and Python), further work is required to disable buffering on all relevant streams.
You may wonder why the output of echo was redirected at all? The answer is that file descriptors are inherited by subprocesses. Since we rigged fd 1 to point to our file instead of the standard output prior to forking to echo, this is where its output went.
We use a BytesIO here. This is because on the lowest level, the file descriptors are binary. It may be possible to do the decoding when copying from the temporary file into the given stream, but that can hide problems. Python has its in-memory understanding of Unicode, but who knows what is the right encoding for data printed out from underlying C code? This is why this particular redirection approach leaves the decoding to the caller.
The above also makes this code specific to Python 3. There's no magic involved, and porting to Python 2 is trivial, but some assumptions made here don't hold (such as sys.stdout being a io.TextIOWrapper).

Redirecting the stdout of a child process

We've just seen that the file descriptor duplication approach lets us grab the output from child processes as well. But it may not always be the most convenient way to achieve this task. In the general case, you typically use the subprocess module to launch child processes, and you may launch several such processes either in a pipe or separately. Some programs will even juggle multiple subprocesses launched this way in different threads. Moreover, while these subprocesses are running you may want to emit something to stdout and you don't want this output to be captured.

So, managing the stdout file descriptor in the general case can be messy; it is also unnecessary, because there's a much simpler way.

The subprocess module's swiss knife Popen class (which serve as the basis for much of the rest of the module) accepts a stdout parameter, which we can use to ask it to get access to the child's stdout:

import subprocess

echo_cmd = ['echo', 'this', 'comes', 'from', 'echo']
proc = subprocess.Popen(echo_cmd, stdout=subprocess.PIPE)
output = proc.communicate()[0]
print('Got stdout:', output)

The subprocess.PIPE argument can be used to set up actual child process pipes (a la the shell), but in its simplest incarnation it captures the process's output.

If you only launch a single child process at a time and are interested in its output, there's an even simpler way:

output = subprocess.check_output(echo_cmd)
print('Got stdout:', output)

check_output will capture and return the child's standard output to you; it will also raise an exception if the child exist with a non-zero return code.

Conclusion

I hope I covered most of the common cases where "stdout redirection" is needed in Python. Naturally, all of the same applies to the other standard output stream - stderr. Also, I hope the background on file descriptors was sufficiently clear to explain the redirection code; squeezing this topic in such a short space is challenging. Let me know if any questions remain or if there's something I could have explained better.

Finally, while it is conceptually simple, the code for the redirector is quite long; I'll be happy to hear if you find a shorter way to achieve the same effect.

[1]	Do not despair. As of February 2015, a sizable chunk of the worldwide Python programmers are in the same boat.

[2]	Note that bytes passed to `puts`. This being Python 3, we have to be careful since `libc` doesn't understand Python's unicode strings.

[3]	The following description focuses on Unix/POSIX systems; also, it's necessarily partial. Large book chapters have been written on this topic - I'm just trying to present some key concepts relevant to stream redirection. For a variant that works on Windows, take a look at this gist.

[4]	The approach taken here is inspired by this Stack Overflow answer.

Highlighting the active tab in GNOME terminal

2014-09-17T22:54:00-07:00

On Ubuntu, I like using the default GNOME terminal for all my command-line needs, and I'm a big fan of its tabs. One of the problems with tabs, however, is that it's not always easy to tell which tab you're currently in - which tab is the active one. By default, the terminal application makes a very slight visual distinction between the active and inactive tabs, and it would be really nice if it was more prominent.

Luckily, it isn't difficult to configure GNOME to do this:

For Ubuntu 14.04, create a file named ~/.config/gtk-3.0/gtk.css, and place the following into it:

TerminalWindow .notebook tab:active {
    background-color: #b0c0f0;
}

The color itself can be customized, of course. This technique supposedly works on all the latest Ubuntu versions starting with 12.10; it doesn't work on 12.04, though. What did work on 12.04 for me is the technique described in this forum post.

pyelftools 0.22 released

2014-03-30T06:57:45-07:00

Version 0.22 of pyelftools was released today (on PyPI too). Some nice new features in this release - most notably initial support for DWARF v4, as well as better support for ARM objects (including 64-bit ARM). By the way, I recently noticed that pyelftools appears in the open source credits page of Chrome (chrome://credits). I knew the Chrome and Chrome OS teams are using it for some of their tools & testing, but hadn't bothered to check the credits page until recently. This is probably the most high-profile (at least publicly visible) user of pyelftools.

Building gcc 4.8 from source on Ubunu 12.04

2014-01-16T06:08:15-08:00

The default gcc (4.6.x) on Ubuntu 12.04 is quite old, especially given the quick advance in C++11 capabilities in gcc 4.7 and 4.8 (and, importantly, their respective libstdc++ libraries).

The LLVM project has recently decided (and implemented, earlier this week) to set gcc 4.7 and Clang 3.1 as the minimal versions LLVM & Clang themselves would build with, in order to be able to use C++11 capabilities in the implementation. Therefore, if you want to build trunk LLVM & Clang on Ubuntu 12.04, you need a newer gcc (even if you use Clang to self-build, still libstdc++ version 4.7 or later is required).

Luckily, building a new gcc and installing it locally (to not mess with the system installation) is fairly easy. Here's a short sequences of steps.

First, create a place to hold the installation, like $HOME/install/gcc-4.8.2.

Install some dependencies needed to build gcc:

$ sudo apt-get install libmpfr-dev libgmp3-dev libmpc-dev flex bison

Then, get a gcc 4.8 tarball and unpack it:

$ wget http://ftp.gnu.org/gnu/gcc/gcc-4.8.2/gcc-4.8.2.tar.bz2
$ bunzip2  gcc-4.8.2.tar.bz2
$ tar xvf gcc-4.8.2.tar

Enter the untarred gcc directory:

$ cd gcc-4.8.2/
$ mkdir build
$ cd build

Now, configure and build:

$ ../configure --disable-checking --enable-languages=c,c++ \
  --enable-multiarch --enable-shared --enable-threads=posix \
  --program-suffix=4.8 --with-gmp=/usr/local/lib --with-mpc=/usr/lib \
  --with-mpfr=/usr/lib --without-included-gettext --with-system-zlib \
  --with-tune=generic \
  --prefix=$HOME/install/gcc-4.8.2
$ make -j8
$ make install

This places an installation of gcc 4.8.2 in $HOME/install/gcc-4.8.2. For example, you should be able to see this:

$ cd $HOME/install/gcc-4.8.2
$ find . -name gcc4.8
./bin/gcc4.8

That's it. You can try it:

$ cd /tmp
$ cat > test.c
int main() { return 42; }
$ $HOME/install/gcc-4.8.2/bin/gcc4.8 test.c
$ ./a.out; echo $?
42

You can also run gcc4.8 with the -### flag to see the exact compilation steps and notice which libraries get picked up, etc. (note that to see -lstdc++ a C++ source compiled with g++4.8 is needed instead).

As a bonus, trunk LLVM & Clang can be now build by passing this gcc to the configure script (the exact same idea works with the CMake-based configuration flow too):

$ CC=$HOME/install/gcc-4.8.2/bin/gcc4.8 \
  CXX=$HOME/install/gcc-4.8.2/bin/g++4.8 \
  ../../llvm-src/configure

Eli Bendersky's website - Linux

Replacing my home desktop computer

Unix domain sockets in Go

Unix domain sockets (UDS)

Using Unix domain sockets in Go

HTTP and RPC protocols over UDS

Benchmarking UDS compared to loop-back TCP sockets

Unix domain sockets in the real-world Go projects

Measuring context switching and memory overheads for Linux threads

Linux threads and NPTL

Threads, processes and the clone system call

What happens in a context switch?

How expensive are context switches?

What does this mean in practice?

Memory usage of threads

Virtual vs. Resident memory

Back to memory overhead for threads

Conclusion

Launching Linux threads and processes with clone

The clone system call

Calling clone in process and thread creation

Benchmarking process vs. thread creation

Basics of Futexes

Motivation

Simple futex use - waiting and waking

Futexes are kernel queues for userspace code

Timed blocking with FUTEX_WAIT

Using a futex to implement a simple mutex

glibc mutex and low-level lock

Futexes in the Go runtime

Persistent history in Bash - redux

Redirecting all kinds of stdout in Python

Pure Python

Redirecting C-level streams

Detour - on file descriptors and streams

Redirecting with file descriptor duplication

Redirecting the stdout of a child process

Conclusion

Highlighting the active tab in GNOME terminal

pyelftools 0.22 released

Building gcc 4.8 from source on Ubunu 12.04

The `clone` system call

Calling `clone` in process and thread creation