<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Eli Bendersky's website - Linux</title><link href="https://eli.thegreenplace.net/" rel="alternate"></link><link href="https://eli.thegreenplace.net/feeds/linux.atom.xml" rel="self"></link><id>https://eli.thegreenplace.net/</id><updated>2024-11-04T14:08:15-08:00</updated><entry><title>Building static binaries with Go on Linux</title><link href="https://eli.thegreenplace.net/2024/building-static-binaries-with-go-on-linux/" rel="alternate"></link><published>2024-07-30T14:35:00-07:00</published><updated>2024-07-30T21:35:34-07:00</updated><author><name>Eli Bendersky</name></author><id>tag:eli.thegreenplace.net,2024-07-30:/2024/building-static-binaries-with-go-on-linux/</id><summary type="html">&lt;p&gt;One of Go's advantages is being able to produce statically-linked
binaries &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;. This doesn't mean that Go always produces such binaries by default,
however; in some scenarios it requires extra work to make this happen.
The specifics are OS-dependent; in this post we focus on Unix systems.&lt;/p&gt;
&lt;div class="section" id="basics-hello-world"&gt;
&lt;h2&gt;Basics - hello world&lt;/h2&gt;
&lt;p&gt;This post …&lt;/p&gt;&lt;/div&gt;</summary><content type="html">&lt;p&gt;One of Go's advantages is being able to produce statically-linked
binaries &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;. This doesn't mean that Go always produces such binaries by default,
however; in some scenarios it requires extra work to make this happen.
The specifics are OS-dependent; in this post we focus on Unix systems.&lt;/p&gt;
&lt;div class="section" id="basics-hello-world"&gt;
&lt;h2&gt;Basics - hello world&lt;/h2&gt;
&lt;p&gt;This post goes over a series of experiments: we take simple programs and use
&lt;tt class="docutils literal"&gt;go build&lt;/tt&gt; to produce binaries on a Linux machine. We then examine whether
the produced binary is statically or dynamically linked. The first example is
a simple &amp;quot;hello, world&amp;quot;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;package&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;fmt&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="kd"&gt;func&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;fmt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;hello world&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;After building it with &lt;tt class="docutils literal"&gt;go build&lt;/tt&gt;, we get a binary. There are a few ways on
Linux to determine whether a binary is statically or dynamically linked. One
is the &lt;tt class="docutils literal"&gt;file&lt;/tt&gt; tool:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;$ file ./helloworld
helloworld: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, Go BuildID=Flm7stIXKLPfvBhTgXmR/PPwdjFUEkc9NCSPRC7io/PofU_qoulSqJ0Ktvgx5g/eQXbAL15zCEIXOBSPZgY, with debug_info, not stripped
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You can see it says &amp;quot;statically linked&amp;quot;. Another way is to use &lt;tt class="docutils literal"&gt;ldd&lt;/tt&gt;, which
prints the shared object dependencies of a given binary:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;$ ldd ./helloworld
  not a dynamic executable
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Alternatively, we can use the ubiquitous &lt;tt class="docutils literal"&gt;nm&lt;/tt&gt; tool, asking it to list the
undefined symbols in a binary (these are symbols the binary expects the dynamic
linker to provide at run-time from shared objects):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;$ nm -u ./helloworld
&amp;lt;empty output&amp;gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;All of these tell us that a simple &lt;tt class="docutils literal"&gt;helloworld&lt;/tt&gt; is a statically-linked binary.
Throughout the post I'll mostly be using &lt;tt class="docutils literal"&gt;ldd&lt;/tt&gt; (out of habit), but you can
use any approach you like.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="dns-and-user-groups"&gt;
&lt;h2&gt;DNS and user groups&lt;/h2&gt;
&lt;p&gt;There are two pieces of functionality the Go standard library defers to the
system's &lt;tt class="docutils literal"&gt;libc&lt;/tt&gt; on Unix machines, when some conditions are met. When cgo
is enabled (as it often - but not always - is), Go will call
the C library for DNS lookups in the &lt;tt class="docutils literal"&gt;net&lt;/tt&gt; package and for user and group
ID lookups in the &lt;tt class="docutils literal"&gt;os/user&lt;/tt&gt; package.&lt;/p&gt;
&lt;p&gt;Let's observe this with an experiment:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;package&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;fmt&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;net&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="kd"&gt;func&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;fmt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;net&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;LookupHost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;go.dev&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;If we build this program, we notice it's &lt;em&gt;dynamically&lt;/em&gt; linked, expecting to
load a &lt;tt class="docutils literal"&gt;libc&lt;/tt&gt; shared object at run-time:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;$ go build lookuphost.go
$ ldd ./lookuphost
  linux-vdso.so.1 (0x00007b50cb22a000)
  libc.so.6 =&amp;gt; /lib/x86_64-linux-gnu/libc.so.6 (0x00007b50cae00000)
  /lib64/ld-linux-x86-64.so.2 (0x00007b50cb22c000)
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This is explained in the &lt;a class="reference external" href="https://pkg.go.dev/net#hdr-Name_Resolution"&gt;net package documentation&lt;/a&gt; in some detail. The Go
standard library does have a pure Go implementation of this functionality
(although it may lack some advanced features). We can ask the toolchain to use
it in a couple of ways. First, we can set the &lt;tt class="docutils literal"&gt;netgo&lt;/tt&gt; build tag:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;$ go build -tags netgo lookuphost.go
$ ldd ./lookuphost
  not a dynamic executable
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Second, we can disable cgo entirely with the &lt;tt class="docutils literal"&gt;CGO_ENABLED&lt;/tt&gt; env var,
which is typically set to 1 by default on Unix systems:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;$ go env CGO_ENABLED
1
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;If we disable it explicitly for our build, we'll get a static binary again:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;$ CGO_ENABLED=0 go build lookuphost.go
$ ldd ./lookuphost
  not a dynamic executable
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Similarly, some of the functionality of the &lt;tt class="docutils literal"&gt;os/user&lt;/tt&gt; package uses &lt;tt class="docutils literal"&gt;libc&lt;/tt&gt;
by default. Here's an example:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;package&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;encoding/json&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;log&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;os&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;os/user&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="kd"&gt;func&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Lookup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;bob&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;nil&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;je&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;NewEncoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Stdout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;je&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This produces a dynamically-linked binary:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;$ go build userlookup.go
$ ldd ./userlookup
  linux-vdso.so.1 (0x0000708301084000)
  libc.so.6 =&amp;gt; /lib/x86_64-linux-gnu/libc.so.6 (0x0000708300e00000)
  /lib64/ld-linux-x86-64.so.2 (0x0000708301086000)
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;As with &lt;tt class="docutils literal"&gt;net&lt;/tt&gt;, we can ask the Go toolchain to use the pure Go implementation
of this user lookup functionality. The build tag for this is &lt;tt class="docutils literal"&gt;osusergo&lt;/tt&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;$ go build -tags osusergo userlookup.go
$ ldd ./userlookup
  not a dynamic executable
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Or, we can disable cgo:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;$ CGO_ENABLED=0 go build userlookup.go
$ ldd ./userlookup
  not a dynamic executable
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="section" id="linking-c-into-our-go-binary"&gt;
&lt;h2&gt;Linking C into our Go binary&lt;/h2&gt;
&lt;p&gt;We've seen that the standard library has some functionality that may require
dynamic linking by default, but this is relatively easy to override. What
happens when we actually have C code as part of our Go program, though?&lt;/p&gt;
&lt;p&gt;Go supports C extensions and FFI using &lt;a class="reference external" href="https://pkg.go.dev/cmd/cgo"&gt;cgo&lt;/a&gt;.
For example:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;package&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="c1"&gt;// #include &amp;lt;stdio.h&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="c1"&gt;// void helloworld() {&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="c1"&gt;//   printf(&amp;quot;hello, world from C\n&amp;quot;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="c1"&gt;// }&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;C&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="kd"&gt;func&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;C&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;helloworld&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;A program built from this source will be dynamically linked, due to cgo:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;$ go build cstdio.go
$ ldd ./cstdio
  linux-vdso.so.1 (0x00007bc6d68e3000)
  libc.so.6 =&amp;gt; /lib/x86_64-linux-gnu/libc.so.6 (0x00007bc6d6600000)
  /lib64/ld-linux-x86-64.so.2 (0x00007bc6d68e5000)
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In our C code, &lt;tt class="docutils literal"&gt;printf&lt;/tt&gt; is a call to &lt;tt class="docutils literal"&gt;libc&lt;/tt&gt;; and even if
we didn't explicitly call into the C runtime, cgo may do so in the scaffolding
code it generates.&lt;/p&gt;
&lt;p&gt;Note that cgo may be involved even if your project has no C code of its own;
several dependencies may bring in cgo. Some popular packages - like the
&lt;a class="reference external" href="https://pkg.go.dev/github.com/mattn/go-sqlite3"&gt;go-sqlite3&lt;/a&gt; driver - depend
on cgo, and importing them will impose a cgo requirement on a program.&lt;/p&gt;
&lt;p&gt;Obviously, building with &lt;tt class="docutils literal"&gt;CGO_ENABLED=0&lt;/tt&gt; is no longer an option.
So what's the recourse?&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="linking-a-libc-statically"&gt;
&lt;h2&gt;Linking a &lt;tt class="docutils literal"&gt;libc&lt;/tt&gt; statically&lt;/h2&gt;
&lt;p&gt;To recap, once we have C code as part of our Go binary, it's going to be
dynamically linked on Unix, because:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;The C code calls into &lt;tt class="docutils literal"&gt;libc&lt;/tt&gt; (the C runtime)&lt;/li&gt;
&lt;li&gt;The &lt;tt class="docutils literal"&gt;libc&lt;/tt&gt; typically used on Unix systems is &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Glibc"&gt;glibc&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;The recommended way to link to &lt;tt class="docutils literal"&gt;glibc&lt;/tt&gt; is dynamically (for various
technical and license-related reasons that are outside the scope of this
post)&lt;/li&gt;
&lt;li&gt;Therefore, &lt;tt class="docutils literal"&gt;go build&lt;/tt&gt; produces dynamically-linked Go binaries&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;To change this flow of events, we can interpose at step (2) - use a &lt;em&gt;different&lt;/em&gt;
&lt;tt class="docutils literal"&gt;libc&lt;/tt&gt; implementation, one that's statically linked. Luckily, such an
implementation exists and is well used and tested - &lt;a class="reference external" href="https://wiki.musl-libc.org/"&gt;musl&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To follow along, start by installing musl. The standard instructions using
&lt;tt class="docutils literal"&gt;./configure &lt;span class="pre"&gt;--prefix=&amp;lt;MUSLDIR&amp;gt;&lt;/span&gt;&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;make&lt;/tt&gt; / &lt;tt class="docutils literal"&gt;make install&lt;/tt&gt; work well.
We'll use &lt;tt class="docutils literal"&gt;$MUSLDIR&lt;/tt&gt; to refer to the directory where musl is installed.
musl comes with a &lt;tt class="docutils literal"&gt;gcc&lt;/tt&gt; wrapper that makes it easy to pass all the right
flags. To re-build our &lt;tt class="docutils literal"&gt;cstdio&lt;/tt&gt; example using musl, run:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;$ CC=$MUSLDIR/bin/musl-gcc go build --ldflags &amp;#39;-linkmode external -extldflags &amp;quot;-static&amp;quot;&amp;#39; cstdio.go
$ ldd ./cstdio
  not a dynamic executable
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The &lt;tt class="docutils literal"&gt;CC&lt;/tt&gt; env var tells &lt;tt class="docutils literal"&gt;go build&lt;/tt&gt; which C compiler to use for cgo; the
linker flags instruct it to use an external linker for the final build
(&lt;a class="reference external" href="https://cs.opensource.google/go/go/+/refs/tags/go1.22.0:src/cmd/cgo/doc.go;l=830"&gt;read this for the gory details&lt;/a&gt;)
and then to perform a static link.&lt;/p&gt;
&lt;p&gt;This approach works for more complex use cases as well! I won't paste the code
here, but the &lt;a class="reference external" href="https://github.com/eliben/code-for-blog/tree/main/2024/go-static-linking"&gt;sample repository accompanying this post&lt;/a&gt; has a file
called &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;use-sqlite.go&lt;/span&gt;&lt;/tt&gt;; it uses the &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;go-sqlite3&lt;/span&gt;&lt;/tt&gt; package. Try
&lt;tt class="docutils literal"&gt;go build&lt;/tt&gt;-ing it normally and observe the dynamically linked binary produced;
next, try to build it with the flags shown above to use musl, and observe
that the produced binary will be statically linked.&lt;/p&gt;
&lt;p&gt;A curious tidbit: we now have another way to build a statically-linked
&lt;tt class="docutils literal"&gt;lookuphost&lt;/tt&gt; program - by linking it with musl:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;$ CC=$MUSLDIR/bin/musl-gcc go build --ldflags &amp;#39;-linkmode external -extldflags &amp;quot;-static&amp;quot;&amp;#39; lookuphost.go
$ ldd ./lookuphost
  not a dynamic executable
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Since we didn't provide &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;-tags&lt;/span&gt; netgo&lt;/tt&gt; and didn't disable cgo, the Go toolchain
uses calls into &lt;tt class="docutils literal"&gt;libc&lt;/tt&gt; to implement DNS lookup; however, since these calls
end up in the statically-linked musl, the final binary is statically linked!&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="using-zig-as-our-c-compiler"&gt;
&lt;h2&gt;Using Zig as our C compiler&lt;/h2&gt;
&lt;p&gt;Recently, another alternative for achieving what we want has emerged: using the Zig
toolchain. &lt;a class="reference external" href="https://ziglang.org/"&gt;Zig&lt;/a&gt; is a new systems programming language,
which uses a bundled toolchain approach similar to Go. Its toolchain bundles
together a Zig compiler, C/C++ compiler, linker and &lt;tt class="docutils literal"&gt;libc&lt;/tt&gt; for static linking.
Therefore, Zig can actually be used to link Go binaries statically with C code!&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Instead&lt;/em&gt; of installing musl, we could install Zig and use its
&lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;x86_64-linux-musl&lt;/span&gt;&lt;/tt&gt; target (adjust the architecture if needed). This is
done by pointing to the &lt;tt class="docutils literal"&gt;zig&lt;/tt&gt; binary as our &lt;tt class="docutils literal"&gt;CC=&lt;/tt&gt; env var; assuming Zig
is installed in &lt;tt class="docutils literal"&gt;$ZIGDIR&lt;/tt&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;$ CC=&amp;quot;$ZIGDIR/zig cc -target x86_64-linux-musl&amp;quot; go build cstdio.go
$ CC=&amp;quot;$ZIGDIR/zig cc -target x86_64-linux-musl&amp;quot; go build use-sqlite.go
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;These will produce statically-linked Go binaries; the &lt;tt class="docutils literal"&gt;zig&lt;/tt&gt; driver takes
care of setting the right linker flags automatically, so the command-line ends
up being slightly simpler than invoking &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;musl-gcc&lt;/span&gt;&lt;/tt&gt;. Another advantage of Zig
here is that it enables cross-compilation of Go programs that include C code &lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I did find some issues with this approach, however; for example, attempting to
link the &lt;tt class="docutils literal"&gt;lookuphost.go&lt;/tt&gt; sample fails with a slew of linker errors.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="summary"&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;Making sure Go produces a statically-linked binary on Linux takes a little
bit of effort, but works well overall.&lt;/p&gt;
&lt;p&gt;There's a &lt;a class="reference external" href="https://github.com/golang/go/issues/26492"&gt;long standing accepted proposal&lt;/a&gt;
about adding a &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;-static&lt;/span&gt;&lt;/tt&gt; flag to &lt;tt class="docutils literal"&gt;go build&lt;/tt&gt; that would take care of setting
up all the flags required for a static build. AFAICT, the proposal is just
waiting for someone with enough grit and dedication to implement and test it
in all the interesting scenarios.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="code"&gt;
&lt;h2&gt;Code&lt;/h2&gt;
&lt;p&gt;The code for all the experiments described in this post
&lt;a class="reference external" href="https://github.com/eliben/code-for-blog/tree/main/2024/go-static-linking"&gt;is available on GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;A &lt;em&gt;statically-linked&lt;/em&gt; binary doesn't have run-time dependencies on
other libraries (typically in the form of shared objects), not even
the C runtime library (&lt;tt class="docutils literal"&gt;libc&lt;/tt&gt;). I wrote much more about this topic
&lt;a class="reference external" href="https://eli.thegreenplace.net/2012/08/13/how-statically-linked-programs-run-on-linux"&gt;in the past&lt;/a&gt;.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Go is well-known for its cross-compilation capabilities, but it
depends on the C toolchain to compile C code. Therefore, when cgo is
involved, cross-compilation is challenging. Zig can help with this
because &lt;em&gt;its&lt;/em&gt; toolchain supports cross compilation for Zig &lt;em&gt;and&lt;/em&gt; C! It
does so by bundling LLVM with a bunch of targets linked in.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
</content><category term="misc"></category><category term="Go"></category><category term="Compilation"></category><category term="Linkers and Loaders"></category><category term="Linux"></category></entry><entry><title>Replacing my home desktop computer</title><link href="https://eli.thegreenplace.net/2022/replacing-my-home-desktop-computer/" rel="alternate"></link><published>2022-11-02T20:31:00-07:00</published><updated>2024-11-02T13:43:25-07:00</updated><author><name>Eli Bendersky</name></author><id>tag:eli.thegreenplace.net,2022-11-02:/2022/replacing-my-home-desktop-computer/</id><summary type="html">&lt;p&gt;After 9 years, it's time to retire my &lt;a class="reference external" href="https://eli.thegreenplace.net/2013/11/23/a-new-ubuntu-machine-for-home/"&gt;old and faithful home Linux machine&lt;/a&gt;. It
has served me extremely well, by far the longest time I've used a home computer
without any replacements. The fans were starting to show their age, to the
extent that the CPU fan sometimes needs …&lt;/p&gt;</summary><content type="html">&lt;p&gt;After 9 years, it's time to retire my &lt;a class="reference external" href="https://eli.thegreenplace.net/2013/11/23/a-new-ubuntu-machine-for-home/"&gt;old and faithful home Linux machine&lt;/a&gt;. It
has served me extremely well, by far the longest time I've used a home computer
without any replacements. The fans were starting to show their age, to the
extent that the CPU fan sometimes needs a nudge to start after the machine has
been off. While this can be fixed, I also wanted a somewhat faster machine with
more memory.&lt;/p&gt;
&lt;p&gt;I've long been keeping a curious eye on &lt;a class="reference external" href="https://system76.com/"&gt;system76&lt;/a&gt;;
they build and sell HW that was designed for Linux from the start. While my
old machine - which was just a custom build - handled Linux fine over the years,
it did exhibit some snags here and there (Bluetooth, for example). It would be
nice not to deal with this and have HW that &lt;em&gt;just works&lt;/em&gt;.&lt;/p&gt;
&lt;img alt="System76 meerkat machine" class="align-center" src="https://eli.thegreenplace.net/images/2022/meerkat.webp" style="width: 400px;" /&gt;
&lt;p&gt;So I went ahead and ordered a &amp;quot;Meerkat&amp;quot; model, which is a really tiny 4.5&amp;quot;x4.5&amp;quot;
box (with a comically large power adapter) that manages to pack 32 GiB of
memory, a sizable SSD and a 4-core (8-thread) i5-1135G7 CPU. It came with
Ubuntu 22.04 preinstalled and was very quick and easy to set up. I
moved all my daily operations to this machine a couple of hours after unpacking
it. Just the computer itself changed - it's still the same monitor, keyboard
and mouse. I could finally use my Bluetooth earphones with it, without any
issues.&lt;/p&gt;
&lt;p&gt;It's about 33% faster than my older machine (at least for compiling the Go
toolchain and medium-sized Rust projects), and it's nice to have twice as much
memory and a larger SSD. I'm quite happy with it so far. It's not
the cheapest option out there, but it &lt;em&gt;is&lt;/em&gt; a &amp;quot;pay for high quality with minimum
fuss&amp;quot; option.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 2024.11.02&lt;/strong&gt;: I replaced this computer after two years. It's still
working great, but was becoming a bit underpowered for my needs, especially
because it lacks a GPU. Being pretty happy with System76, I went for a &lt;a class="reference external" href="https://system76.com/desktops/thelio-mira"&gt;Thelio Mira&lt;/a&gt;
with an i7-13700K CPU (16 cores / 24 threads, 8P+8E), 64 GiB of memory and
an inexpensive gaming-class Nvidia GPU.&lt;/p&gt;
</content><category term="misc"></category><category term="Hardware &amp; Gadgets"></category><category term="Linux"></category></entry><entry><title>Unix domain sockets in Go</title><link href="https://eli.thegreenplace.net/2019/unix-domain-sockets-in-go/" rel="alternate"></link><published>2019-02-12T05:27:00-08:00</published><updated>2024-11-04T14:08:15-08:00</updated><author><name>Eli Bendersky</name></author><id>tag:eli.thegreenplace.net,2019-02-12:/2019/unix-domain-sockets-in-go/</id><summary type="html">&lt;p&gt;When it comes to inter-process communication (IPC) between processes on the same
Linux host, there are multiple options: FIFOs, pipes, shared memory, sockets and
so on. One of the most interesting options is &lt;em&gt;Unix Domain Sockets&lt;/em&gt; that combine
the convenient API of sockets with the higher performance of the other …&lt;/p&gt;</summary><content type="html">&lt;p&gt;When it comes to inter-process communication (IPC) between processes on the same
Linux host, there are multiple options: FIFOs, pipes, shared memory, sockets and
so on. One of the most interesting options is &lt;em&gt;Unix Domain Sockets&lt;/em&gt; that combine
the convenient API of sockets with the higher performance of the other
single-host methods.&lt;/p&gt;
&lt;p&gt;This post demonstrates some basic examples of using Unix domain sockets with Go
and explores some benchmarks comparing them to TCP loop-back sockets.&lt;/p&gt;
&lt;div class="section" id="unix-domain-sockets-uds"&gt;
&lt;h2&gt;Unix domain sockets (UDS)&lt;/h2&gt;
&lt;p&gt;Unix domain sockets (UDS) have a long history, going back to the original BSD
socket specification in the 1980s. The &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Unix_domain_socket"&gt;Wikipedia definition&lt;/a&gt; is:&lt;/p&gt;
&lt;blockquote&gt;
A Unix domain socket or IPC socket (inter-process communication socket) is a
data communications endpoint for exchanging data between processes executing
on the same host operating system.&lt;/blockquote&gt;
&lt;p&gt;UDS support streams (TCP equivalent) and datagrams (UDP equivalent); this
post focuses on the stream APIs.&lt;/p&gt;
&lt;p&gt;IPC with UDS looks very similar to IPC with regular TCP sockets using the
loop-back interface (&lt;tt class="docutils literal"&gt;localhost&lt;/tt&gt; or &lt;tt class="docutils literal"&gt;127.0.0.1&lt;/tt&gt;), but there is a key
difference: performance. While the TCP loop-back interface can skip some of the
complexities of the full TCP/IP network stack, it retains many others (ACKs, TCP
flow control, and so on). These complexities are designed for reliable
cross-machine communication, but on a single host they're an unnecessary burden.
This post will explore some of the performance advantages of UDS.&lt;/p&gt;
&lt;p&gt;There are some additional differences. For example, since UDS use paths in the
filesystem as their addresses, we can use directory and file permissions to
control access to sockets, simplifying authentication. I won't list all the
differences here; for more information feel free to check out the Wikipedia link
and additional resources like &lt;a class="reference external" href="https://beej.us/guide/bgipc/html/split/unixsock.html"&gt;Beej's UNIX IPC guide&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The big disadvantage of UDS compared to TCP sockets is the single-host
restriction, of course. Code written for TCP sockets keeps working across
machines with nothing but an address change from local to remote. That said,
the performance advantages of UDS are significant enough, and the API is similar
enough to TCP sockets that it's quite possible to write code that supports both
(UDS on a single host, TCP for remote IPC) with very little difficulty.&lt;/p&gt;
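&lt;p&gt;To make the last point concrete, here is one possible way to dispatch
between the two socket kinds from a single code path - treating addresses that
look like filesystem paths as UDS and everything else as TCP. This heuristic is
my own illustration, not a standard library convention:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"log"
	"net"
	"os"
	"path/filepath"
	"strings"
)

// pickNetwork guesses the network type from the address: anything that
// looks like a filesystem path is treated as a Unix domain socket,
// everything else as TCP.
func pickNetwork(addr string) string {
	if strings.HasPrefix(addr, "/") || strings.HasPrefix(addr, ".") {
		return "unix"
	}
	return "tcp"
}

// listen dispatches to net.Listen with the guessed network; the rest of
// the server code can stay completely socket-kind agnostic.
func listen(addr string) (net.Listener, error) {
	return net.Listen(pickNetwork(addr), addr)
}

func main() {
	sock := filepath.Join(os.TempDir(), "dual-demo.sock")
	os.Remove(sock)
	// The same call serves both a UDS path and a TCP host:port.
	for _, addr := range []string{sock, "127.0.0.1:0"} {
		l, err := listen(addr)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%-4s listener at %s\n", pickNetwork(addr), l.Addr())
		l.Close()
	}
}
```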
&lt;/div&gt;
&lt;div class="section" id="using-unix-domain-sockets-in-go"&gt;
&lt;h2&gt;Using Unix domain sockets in Go&lt;/h2&gt;
&lt;p&gt;Let's start with a basic example of a server in Go that listens on a UNIX
domain socket:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;SockAddr&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;/tmp/echo.sock&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="kd"&gt;func&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;echoServer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;net&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Conn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Client connected [%s]&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;RemoteAddr&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nx"&gt;Network&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;io&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Copy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="kd"&gt;func&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;RemoveAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;SockAddr&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;nil&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;l&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;net&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Listen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;unix&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;SockAddr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;nil&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;listen error:&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;defer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="c1"&gt;// Accept new connections, dispatching them to echoServer&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="c1"&gt;// in a goroutine.&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nx"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Accept&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;nil&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;accept error:&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="k"&gt;go&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;echoServer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;UDS are identified with paths in the file system; for our server here we use
&lt;tt class="docutils literal"&gt;/tmp/echo.sock&lt;/tt&gt;. The server begins by removing this file if it exists, what
is that about?&lt;/p&gt;
&lt;p&gt;When servers shut down, the file representing the socket can remain in the
file system unless the server did orderly cleanup after itself. If we re-run
another server with the same socket path, we may get the error:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;$ go run simple-echo-server.go
2019/02/08 05:41:33 listen error:listen unix /tmp/echo.sock: bind: address already in use
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;To prevent that, the server begins by removing the socket file, if it exists
&lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Now that the server is running, we can interact with it using Netcat, which can
be asked to connect to UDS with the &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;-U&lt;/span&gt;&lt;/tt&gt; flag:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;$ nc -U /tmp/echo.sock
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Whatever you type in, the server will echo back. Press &lt;tt class="docutils literal"&gt;^D&lt;/tt&gt; to terminate the
session. Alternatively, we can write a simple client in Go that connects to the
server, sends it a message, waits for a response and exits. The full code for
the client &lt;a class="reference external" href="https://github.com/eliben/code-for-blog/blob/main/2019/unix-domain-sockets-go/simple-client.go"&gt;is here&lt;/a&gt;,
but the important part is the connection:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;net&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Dial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;unix&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;/tmp/echo.sock&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We can see that writing UDS servers and clients is very similar to writing
regular socket servers and clients. The only difference is having to pass
&lt;tt class="docutils literal"&gt;&amp;quot;unix&amp;quot;&lt;/tt&gt; as the &lt;tt class="docutils literal"&gt;network&lt;/tt&gt; parameter of &lt;tt class="docutils literal"&gt;net.Listen&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;net.Dial&lt;/tt&gt;; the
rest of the code remains the same. Obviously, this makes it very easy to
write generic server and client code that's independent of the actual kind of
socket it's using.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="http-and-rpc-protocols-over-uds"&gt;
&lt;h2&gt;HTTP and RPC protocols over UDS&lt;/h2&gt;
&lt;p&gt;Network protocols compose by design. High-level protocols, such as HTTP and
various forms of RPC, don't particularly care about how the lower levels of the
stack are implemented as long as certain guarantees are maintained.&lt;/p&gt;
&lt;p&gt;Go's standard library comes with a small and useful &lt;tt class="docutils literal"&gt;rpc&lt;/tt&gt; package that makes
it trivial to throw together quick RPC servers and clients. Here's a simple
server that has a single procedure defined:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;SockAddr&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;/tmp/rpc.sock&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="kd"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Greeter&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;struct&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="kd"&gt;func&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;g&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Greeter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Greet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;reply&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="nx"&gt;reply&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Hello, &amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;nil&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="kd"&gt;func&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;RemoveAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;SockAddr&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;nil&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;greeter&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Greeter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;rpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;greeter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;rpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;HandleHTTP&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;l&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;net&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Listen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;unix&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;SockAddr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;nil&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;listen error:&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;fmt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Serving...&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;http&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Serve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;l&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Note that we use the HTTP version of the server. It registers an HTTP handler
with the &lt;tt class="docutils literal"&gt;http&lt;/tt&gt; package, and the actual serving is done with the standard
&lt;tt class="docutils literal"&gt;http.Serve&lt;/tt&gt;. The network stack here looks something like this:&lt;/p&gt;
&lt;img alt="RPC / HTTP / Unix domain socket stack" class="align-center" src="https://eli.thegreenplace.net/images/2019/rpc-http-uds.png" /&gt;
&lt;p&gt;An RPC client that can connect to the server shown above is &lt;a class="reference external" href="https://github.com/eliben/code-for-blog/blob/main/2019/unix-domain-sockets-go/rpc-client.go"&gt;available here&lt;/a&gt;.
It uses the standard &lt;tt class="docutils literal"&gt;rpc.Client.Call&lt;/tt&gt; method to connect to the server.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="benchmarking-uds-compared-to-loop-back-tcp-sockets"&gt;
&lt;h2&gt;Benchmarking UDS compared to loop-back TCP sockets&lt;/h2&gt;
&lt;p&gt;Note: benchmarking is hard, so please take these results with a grain of salt.
There's some more information on benchmarking different socket types
on the &lt;a class="reference external" href="https://redis.io/topics/benchmarks"&gt;Redis benchmarks page&lt;/a&gt; and in
&lt;a class="reference external" href="http://osnet.cs.binghamton.edu/publications/TR-20070820.pdf"&gt;this paper&lt;/a&gt;, as
well as many other resources online. I also found &lt;a class="reference external" href="https://github.com/rigtorp/ipc-bench/"&gt;this set of benchmarks&lt;/a&gt; (written in C) instructive.&lt;/p&gt;
&lt;p&gt;I'm running two kinds of benchmarks: one for latency, and one for throughput.&lt;/p&gt;
&lt;p&gt;For latency, the &lt;a class="reference external" href="https://github.com/eliben/code-for-blog/blob/main/2019/unix-domain-sockets-go/local-latency-benchmark.go"&gt;full code of the benchmark is here&lt;/a&gt;.
Run it with &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;-help&lt;/span&gt;&lt;/tt&gt; to see what the flags are, and the code should be very
straightforward to grok. The idea is to ping-pong a small packet of data (128
bytes by default) between a server and a client. The client measures how long it
takes to send one such message and receive one back, and takes that combined
time as &amp;quot;twice the latency&amp;quot;, averaging it over many messages.&lt;/p&gt;
&lt;p&gt;On my machine, I see average latency of ~3.6 microseconds for TCP loop-back
sockets, and ~2.3 microseconds for UDS.&lt;/p&gt;
&lt;p&gt;The &lt;a class="reference external" href="https://github.com/eliben/code-for-blog/blob/main/2019/unix-domain-sockets-go/local-throughput-benchmark.go"&gt;throughput/bandwidth benchmark&lt;/a&gt;
is conceptually simpler than the latency benchmark. The server listens on a
socket and grabs all the data it can get (and discards it). The client sends
&lt;em&gt;large&lt;/em&gt; packets (hundreds of KB or more) and measures how long each packet takes
to send; the send is done synchronously and the client expects the whole message
to be sent in a single call, so it's a good approximation of bandwidth if the
packet size is large enough.&lt;/p&gt;
&lt;p&gt;Obviously, the throughput measurement is more representative with larger
messages. I tried increasing them until the throughput improvements tapered off.&lt;/p&gt;
&lt;p&gt;For smaller packet sizes, I see UDS winning over TCP: 10 GB/sec compared to 9.4
GB/sec for 512 KB packets. For much larger packets (16-32 MB), the difference becomes
negligible (both taper off at about 13 GB/sec). Interestingly, for some packet
sizes (like 64K), TCP sockets are winning on my machine.&lt;/p&gt;
&lt;p&gt;For very small message sizes we're getting back to latency-dominated
performance, so UDS is considerably faster (more than 2x the number of packets
per second compared to TCP). In most cases I'd say that the latency measurements
are more important - they're more applicable to things like RPC servers
and databases. In some cases like streaming video or other &amp;quot;big data&amp;quot; over
sockets, you may want to pick the packet sizes carefully to optimize the
performance for the specific machine you're using.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://lists.freebsd.org/pipermail/freebsd-performance/2005-February/001143.html"&gt;This discussion&lt;/a&gt;
has some really insightful information about why we should expect UDS to be
faster. However, beware - it's from 2005 and much in Linux has changed since
then.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="unix-domain-sockets-in-the-real-world-go-projects"&gt;
&lt;h2&gt;Unix domain sockets in real-world Go projects&lt;/h2&gt;
&lt;p&gt;I was curious to see if UDS are actually used in real-world Go projects. They
sure are! A few minutes of browsing/searching GitHub quickly uncovered UDS
servers in many components of the new Go-dominated cloud infrastructure:
runc, moby (Docker), Kubernetes, Istio - pretty much every project I looked at.&lt;/p&gt;
&lt;p&gt;That makes sense - as the benchmarks demonstrate, there are significant
performance advantages to using a UDS when the client and server are both on the
same host. And the API of UDS and TCP sockets is so similar that the cost of
supporting both interchangeably is quite small.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;For internet-domain socket, the same issue exists with ports that are
marked taken by processes that die without cleanup. The &lt;tt class="docutils literal"&gt;SO_REUSEADDR&lt;/tt&gt;
socket option exists to address this problem.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
</content><category term="misc"></category><category term="Go"></category><category term="Linux"></category><category term="Network Programming"></category></entry><entry><title>Measuring context switching and memory overheads for Linux threads</title><link href="https://eli.thegreenplace.net/2018/measuring-context-switching-and-memory-overheads-for-linux-threads/" rel="alternate"></link><published>2018-09-04T05:35:00-07:00</published><updated>2024-05-04T19:46:23-07:00</updated><author><name>Eli Bendersky</name></author><id>tag:eli.thegreenplace.net,2018-09-04:/2018/measuring-context-switching-and-memory-overheads-for-linux-threads/</id><summary type="html">&lt;p&gt;In this post I want to explore the costs of threads on modern Linux machines,
both in terms of time and space. The background context is designing high-load
concurrent servers, where &lt;a class="reference external" href="https://eli.thegreenplace.net/2017/concurrent-servers-part-2-threads/"&gt;using threads is one of the common schemes&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Important disclaimer: it's not my goal here to provide an opinion …&lt;/p&gt;</summary><content type="html">&lt;p&gt;In this post I want to explore the costs of threads on modern Linux machines,
both in terms of time and space. The background context is designing high-load
concurrent servers, where &lt;a class="reference external" href="https://eli.thegreenplace.net/2017/concurrent-servers-part-2-threads/"&gt;using threads is one of the common schemes&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Important disclaimer: it's not my goal here to provide an opinion in the threads
vs. event-driven models debate. Ultimately, both are tools that work well in
some scenarios and less well in others. That said, one of the major criticisms
of a thread-based model is the cost - comments like &amp;quot;but context switches are
expensive!&amp;quot; or &amp;quot;but a thousand threads will eat up all your RAM!&amp;quot;, and I do
intend to study the data underlying such claims in more detail here. I'll do
this by presenting multiple code samples and programs that make it easy to
explore and experiment with these measurements.&lt;/p&gt;
&lt;div class="section" id="linux-threads-and-nptl"&gt;
&lt;h2&gt;Linux threads and NPTL&lt;/h2&gt;
&lt;p&gt;In the dark, old ages before version 2.6, the Linux kernel didn't have much
specific support for threads, and they were more-or-less hacked on top of
process support. Before &lt;a class="reference external" href="https://eli.thegreenplace.net/2018/basics-of-futexes/"&gt;futexes&lt;/a&gt; there was no dedicated
low-latency synchronization solution (it was done using signals); neither was
there much good use of the capabilities of multi-core systems &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Native POSIX Thread Library (NPTL) was proposed by Ulrich Drepper and Ingo
Molnar from Red Hat, and integrated into the kernel in version 2.6, circa 2005.
I warmly recommend reading its &lt;a class="reference external" href="https://www.akkadia.org/drepper/nptl-design.pdf"&gt;design paper&lt;/a&gt;. With NPTL, thread creation
time became about 7x faster, and synchronization became much faster as well due
to the use of futexes. Threads and processes became more lightweight, with
strong emphasis on making good use of multi-core processors. This roughly
coincided with a &lt;a class="reference external" href="https://en.wikipedia.org/wiki/O(1)_scheduler"&gt;much more efficient scheduler&lt;/a&gt;, which made juggling many
threads in the Linux kernel even more efficient.&lt;/p&gt;
&lt;p&gt;Even though all of this happened 13 years ago, the spirit of NPTL is still
easily observable in some system programming code. For example, many thread and
synchronization-related paths in &lt;tt class="docutils literal"&gt;glibc&lt;/tt&gt; have &lt;tt class="docutils literal"&gt;nptl&lt;/tt&gt; in their name.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="threads-processes-and-the-clone-system-call"&gt;
&lt;h2&gt;Threads, processes and the clone system call&lt;/h2&gt;
&lt;p&gt;This was originally meant to be a part of this larger article, but it was
getting too long so I split off a separate post on &lt;a class="reference external" href="https://eli.thegreenplace.net/2018/launching-linux-threads-and-processes-with-clone/"&gt;launching Linux processes
and threads with clone&lt;/a&gt;,
where you can learn about the &lt;tt class="docutils literal"&gt;clone&lt;/tt&gt; system call and some measurements of how
expensive it is to launch new processes and threads.&lt;/p&gt;
&lt;p&gt;The rest of this post will assume this is familiar information and will focus
on context switching and memory usage.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="what-happens-in-a-context-switch"&gt;
&lt;h2&gt;What happens in a context switch?&lt;/h2&gt;
&lt;p&gt;In the Linux kernel, this question has two important parts:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;&lt;em&gt;When&lt;/em&gt; does a kernel switch happen?&lt;/li&gt;
&lt;li&gt;&lt;em&gt;How&lt;/em&gt; does it happen?&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The following deals mostly with (2), assuming the kernel has already decided to
switch to a different user thread (for example because the currently running
thread went to sleep waiting for I/O).&lt;/p&gt;
&lt;p&gt;The first thing that happens during a context switch is a switch to kernel mode,
either through an explicit system call (such as &lt;tt class="docutils literal"&gt;write&lt;/tt&gt; to some file or pipe)
or a timer interrupt (when the kernel preempts a user thread whose time slice
has expired). This requires saving the user space thread's registers and
jumping into kernel code.&lt;/p&gt;
&lt;p&gt;Next, the scheduler kicks in to figure out which thread should run next. When
we know which thread runs next, there's the important bookkeeping of virtual
memory to take care of; the page tables of the new thread have to be loaded into
memory, etc.&lt;/p&gt;
&lt;p&gt;Finally, the kernel restores the new thread's registers and cedes control back
to user space.&lt;/p&gt;
&lt;p&gt;All of this takes time, but how much time, exactly? I encourage you to read some
additional online resources that deal with this question, and try to run
benchmarks like &lt;a class="reference external" href="http://www.bitmover.com/lmbench/"&gt;lmbench&lt;/a&gt;; what follows is
my attempt to quantify thread switching time.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="how-expensive-are-context-switches"&gt;
&lt;h2&gt;How expensive are context switches?&lt;/h2&gt;
&lt;p&gt;To measure how long it takes to switch between two threads, we need a benchmark
that deliberately triggers a context switch and avoids doing too much work in
addition to that. This would be measuring just the &lt;em&gt;direct&lt;/em&gt; cost of the switch,
when in reality there is another cost - the &lt;em&gt;indirect&lt;/em&gt; one, which could even
be larger. Every thread has some working set of memory, all or some of which
is in the cache; when we switch to another thread, all this cache data becomes
unneeded and is slowly flushed out, replaced by the new thread's data. Frequent
switches back and forth between the two threads will cause a lot of such
thrashing.&lt;/p&gt;
&lt;p&gt;In my benchmarks I am not measuring this indirect cost, because it's pretty
difficult to avoid in any form of multi-tasking. Even if we &amp;quot;switch&amp;quot; between
different asynchronous event handlers within the same thread, they will likely
have different memory working sets and will interfere with each other's cache
usage if those sets are large enough. I strongly recommend watching &lt;a class="reference external" href="https://youtu.be/KXuZi9aeGTw"&gt;this
talk on fibers&lt;/a&gt; where a Google engineer explains
their measurement methodology and also how to avoid too much indirect switch
costs by making sure closely related tasks run with temporal locality.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://github.com/eliben/code-for-blog/tree/main/2018/threadoverhead"&gt;These code samples&lt;/a&gt;
measure context switching overheads using two different techniques:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;A pipe which is used by two threads to ping-pong a tiny amount of data.
Every &lt;tt class="docutils literal"&gt;read&lt;/tt&gt; on the pipe blocks the reading thread, and the kernel switches
to the writing thread, and so on.&lt;/li&gt;
&lt;li&gt;A condition variable used by two threads to signal an event to each other.&lt;/li&gt;
&lt;/ol&gt;
&lt;img alt="Ping pong paddles and ball" class="align-center" src="https://eli.thegreenplace.net/images/2018/ping-pong.png" /&gt;
&lt;p&gt;There are additional factors context switching time depends on; for example,
on a multi-core CPU, the kernel can occasionally migrate a thread between cores
because the core a thread has been previously using is occupied. While this
helps utilize more cores, such switches cost more than staying on the same core
(again, due to cache effects). Benchmarks can try to restrict this by running
with &lt;tt class="docutils literal"&gt;taskset&lt;/tt&gt; pinning affinity to one core, but it's important to keep in
mind this only models a lower bound.&lt;/p&gt;
&lt;p&gt;Using the two techniques I'm getting fairly similar results: somewhere between
1.2 and 1.5 microseconds per context switch, accounting only for the direct
cost, and pinning to a single core to avoid migration costs. Without pinning,
the switch time goes up to ~2.2 microseconds &lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt;. These numbers are largely
consistent with the reports in the fibers talk mentioned above, and also with
other benchmarks found online (like &lt;tt class="docutils literal"&gt;lat_ctx&lt;/tt&gt; from &lt;tt class="docutils literal"&gt;lmbench&lt;/tt&gt;).&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="what-does-this-mean-in-practice"&gt;
&lt;h2&gt;What does this mean in practice?&lt;/h2&gt;
&lt;p&gt;So we have the numbers now, but what do they mean? Is 1-2 us a long time? As I
have mentioned in &lt;a class="reference external" href="https://eli.thegreenplace.net/2018/launching-linux-threads-and-processes-with-clone/"&gt;the post on launch overheads&lt;/a&gt;,
a good comparison is &lt;tt class="docutils literal"&gt;memcpy&lt;/tt&gt;, which takes 3 us for 64 KiB on the same
machine. In other words, a context switch is a bit quicker than copying 64 KiB
of memory from one location to another.&lt;/p&gt;
&lt;img alt="Plot of thread/process launch and context switch" class="align-center" src="https://eli.thegreenplace.net/images/2018/plot-launch-switch.png" /&gt;
&lt;p&gt;1-2 us is not a long time by any measure, except when you're really trying to
optimize for extremely low latencies or high loads.&lt;/p&gt;
&lt;p&gt;As an example of an artificially high load, &lt;a class="reference external" href="https://github.com/eliben/code-for-blog/blob/main/2018/threadoverhead/thread-pipe-msgpersec.c"&gt;here is another benchmark&lt;/a&gt;
which writes a short message into a pipe and expects to read it from another
pipe. On the other end of the two pipes is a thread that echoes one into the
other.&lt;/p&gt;
&lt;p&gt;Running the benchmark on the same machine I used to measure the context switch
times, I get ~400,000 iterations per second (this is with &lt;tt class="docutils literal"&gt;taskset&lt;/tt&gt; to pin
to a single core). This makes perfect sense given the earlier measurements,
because each iteration of this test performs two context switches, and at 1.2 us
per switch this is 2.4 us per iteration.&lt;/p&gt;
&lt;p&gt;You could claim that the two threads compete for the same CPU, but if I don't
pin the benchmark to a single core, the number of iterations per second
&lt;em&gt;halves&lt;/em&gt;. This is because the vast majority of time in this benchmark is spent
in the kernel switching from one thread to the other, and the core migrations
that occur when it's not pinned greatly outweigh the loss of (the very minimal)
parallelism.&lt;/p&gt;
&lt;p&gt;Just for fun, I rewrote the &lt;a class="reference external" href="https://github.com/eliben/code-for-blog/blob/main/2018/threadoverhead/channel-msgpersec.go"&gt;same benchmark in Go&lt;/a&gt;;
two goroutines ping-ponging a short message between themselves over a channel. The
throughput this achieves is &lt;em&gt;dramatically&lt;/em&gt; higher - around 2.8 million
iterations per second, which leads to an estimate of ~170 ns switching between
goroutines &lt;a class="footnote-reference" href="#footnote-3" id="footnote-reference-3"&gt;[3]&lt;/a&gt;. Since switching between goroutines doesn't require an actual
kernel context switch (or even a system call), this isn't too surprising. For
comparison, &lt;a class="reference external" href="https://youtu.be/KXuZi9aeGTw"&gt;Google's fibers&lt;/a&gt; use a new Linux
system call that can switch between two tasks in about the same time,
&lt;em&gt;including&lt;/em&gt; the kernel time.&lt;/p&gt;
&lt;p&gt;A word of caution: benchmarks tend to be taken too seriously. Please take this
one only for what it demonstrates - a largely synthetic workload used to
probe the cost of some fundamental concurrency primitives.&lt;/p&gt;
&lt;p&gt;Remember - it's quite unlikely that the actual workload of your task will be
negligible compared to the 1-2 us context switch; as we've seen, even a modest
&lt;tt class="docutils literal"&gt;memcpy&lt;/tt&gt; takes longer. Any sort of server logic such as parsing headers,
updating state, etc. is likely to take orders of magnitude longer. If there's
one takeaway to remember from these sections, it's that context switching on
modern Linux systems is &lt;em&gt;super fast&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="memory-usage-of-threads"&gt;
&lt;h2&gt;Memory usage of threads&lt;/h2&gt;
&lt;p&gt;Now it's time to discuss the other overhead of a large number of threads -
memory. Even though all threads in a process share their memory space, there are
still areas of memory that aren't shared. In the &lt;a class="reference external" href="https://eli.thegreenplace.net/2018/launching-linux-threads-and-processes-with-clone/"&gt;post about clone&lt;/a&gt;
we've mentioned &lt;em&gt;page tables&lt;/em&gt; in the kernel, but these are comparatively small.
A much larger memory area that is private to each thread is the &lt;em&gt;stack&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;The default per-thread stack size on Linux is usually 8 MiB, and we can check
what it is by invoking &lt;tt class="docutils literal"&gt;ulimit&lt;/tt&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;$ ulimit -s
8192
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;To see this in action, let's start a large number of threads and observe the
process's memory usage. &lt;a class="reference external" href="https://github.com/eliben/code-for-blog/blob/main/2018/threadoverhead/threadspammer.c"&gt;This sample&lt;/a&gt;
launches 10,000 threads and sleeps for a bit to let us observe its memory usage
with external tools. Using tools like &lt;tt class="docutils literal"&gt;top&lt;/tt&gt; (or preferably &lt;tt class="docutils literal"&gt;htop&lt;/tt&gt;) we see
that the process uses ~80 GiB of &lt;em&gt;virtual&lt;/em&gt; memory, with about 80 MiB of
&lt;em&gt;resident&lt;/em&gt; memory. What is the difference, and how can it use 80 GiB of memory
on a machine that only has 16 GiB available?&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="virtual-vs-resident-memory"&gt;
&lt;h2&gt;Virtual vs. Resident memory&lt;/h2&gt;
&lt;p&gt;A short interlude on what virtual memory means. When a Linux program allocates
memory (with &lt;tt class="docutils literal"&gt;malloc&lt;/tt&gt;) or otherwise, this memory initially doesn't really
exist - it's just an entry in a table the OS keeps. Only when the program
actually accesses the memory is the backing RAM for it found; this is what
virtual memory is all about.&lt;/p&gt;
&lt;p&gt;Therefore, the &amp;quot;memory usage&amp;quot; of a process can mean two things - how much
&lt;em&gt;virtual&lt;/em&gt; memory it uses overall, and how much &lt;em&gt;actual&lt;/em&gt; memory it uses. While
the former can grow almost without bounds - the latter is obviously limited to
the system's RAM capacity (with swapping to disk being the other mechanism of
virtual memory to assist here if usage grows above the size of physical memory).
The actual physical memory on Linux is called &amp;quot;resident&amp;quot; memory, because it's
actually resident in RAM.&lt;/p&gt;
&lt;p&gt;There's a &lt;a class="reference external" href="https://stackoverflow.com/q/7880784"&gt;good StackOverflow discussion&lt;/a&gt;
of this topic; here I'll just limit myself to a simple example:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;argc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;char&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;report_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;started&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;malloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;escape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;report_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;after malloc&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;report_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;after touch&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;press ENTER&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;fgetc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stdin&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This program starts by allocating 400 MiB of memory (assuming an &lt;tt class="docutils literal"&gt;int&lt;/tt&gt; size of
4) with &lt;tt class="docutils literal"&gt;malloc&lt;/tt&gt;, and later &amp;quot;touches&amp;quot; this memory by writing a number into
every element of the allocated array. It reports its own memory usage at every
step - see &lt;a class="reference external" href="https://github.com/eliben/code-for-blog/blob/main/2018/threadoverhead/malloc-memusage.c"&gt;the full code sample&lt;/a&gt;
for the reporting code &lt;a class="footnote-reference" href="#footnote-4" id="footnote-reference-4"&gt;[4]&lt;/a&gt;. Here's the output from a sample run:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;$ ./malloc-memusage
started: max RSS = 4780 kB; vm size = 6524 kB
after malloc: max RSS = 4780 kB; vm size = 416128 kB
after touch: max RSS = 410916 kB; vm size = 416128 kB
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The most interesting thing to note is how &lt;tt class="docutils literal"&gt;vm size&lt;/tt&gt; remains the same between
the second and third steps, while &lt;tt class="docutils literal"&gt;max RSS&lt;/tt&gt; grows from the initial value to
400 MiB. This is precisely because until we touch the memory, it's fully
&amp;quot;virtual&amp;quot; and isn't actually counted for the process's RAM usage.&lt;/p&gt;
&lt;p&gt;Therefore, distinguishing between virtual memory and RSS in realistic usage is
very important - this is why the thread launching sample from the previous section
could &amp;quot;allocate&amp;quot; 80 GiB of virtual memory while having only 80 MiB of resident
memory.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="back-to-memory-overhead-for-threads"&gt;
&lt;h2&gt;Back to memory overhead for threads&lt;/h2&gt;
&lt;p&gt;As we've seen, a new thread on Linux is created with 8 MiB of stack space, but
this is virtual memory until the thread actually uses it. If the thread actually
uses its stack, resident memory usage goes up dramatically for a large number of
threads. I've added a configuration option to the sample program that launches a
large number of threads; with it enabled, the thread function actually &lt;em&gt;uses&lt;/em&gt;
stack memory and from the RSS report it is easy to observe the effects.
Curiously, if I make each of 10,000 threads use 400 KiB of memory, the total RSS
is not 4 GiB but around 2.6 GiB &lt;a class="footnote-reference" href="#footnote-5" id="footnote-reference-5"&gt;[5]&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;How do we control the stack size of threads? One option is using the &lt;tt class="docutils literal"&gt;ulimit&lt;/tt&gt;
command, but a better option is with the &lt;tt class="docutils literal"&gt;pthread_attr_setstacksize&lt;/tt&gt; API. The
latter is invoked programmatically and populates a &lt;tt class="docutils literal"&gt;pthread_attr_t&lt;/tt&gt; structure
that's passed to thread creation. The more interesting question is - what should
the stack size be set to?&lt;/p&gt;
&lt;p&gt;As we have seen above, just creating a large stack for a thread doesn't
automatically eat up all the machine's memory - not until the stack is
actually used. If our threads &lt;em&gt;use&lt;/em&gt; large amounts of stack memory, this is
a problem, because this severely limits the number of threads we can run
concurrently. Note that this is not really a problem with threads - but with
concurrency; if our program uses some event-driven approach to concurrency and
each handler uses a large amount of memory, we'd still have the same problem.&lt;/p&gt;
&lt;p&gt;If the task doesn't actually use a lot of memory, what should we set the stack
size to? Small stacks keep the OS safe - an errant program may get into
infinite recursion, and a small stack will make sure it's killed early. Moreover,
virtual memory is large but not unlimited; especially on 32-bit OSes, we might
not have 80 GiB of virtual address space for the process, so an 8 MiB stack for
10,000 threads makes no sense. There's a tradeoff here, and the default chosen
by 32-bit Linux is 2 MiB; the maximal virtual address space available is 3 GiB,
so this imposes a limit of ~1500 threads with the default settings. On 64-bit
Linux the virtual address space is vastly larger, so this limitation is less
serious (though other limits kick in - on my machine the maximal number of
threads the OS lets one process start is about 32K).&lt;/p&gt;
&lt;p&gt;Therefore I think it's more important to focus on how much actual memory each
concurrent task is using than on the OS stack size limit, as the latter is
simply a safety measure.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="conclusion"&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The numbers reported here paint an interesting picture on the state of Linux
multi-threaded performance in 2018. I would say that the limits still
exist - running a million threads is probably not going to make sense; however,
the limits have definitely shifted since the past, and a lot of folklore from
the early 2000s doesn't apply today. On a beefy multi-core machine with lots
of RAM we can easily run 10,000 threads in a single process today, in
production. As I've mentioned above, it's highly recommended to watch Google's
&lt;a class="reference external" href="https://youtu.be/KXuZi9aeGTw"&gt;talk on fibers&lt;/a&gt;; through careful tuning of
the kernel (and setting smaller default stacks) Google is able to run an order
of magnitude more threads in parallel.&lt;/p&gt;
&lt;p&gt;Whether this is sufficient concurrency for your application is very obviously
project-specific, but I'd say that for higher concurrencies you'd probably want
to mix in some asynchronous processing. If 10,000 threads can provide sufficient
concurrency - you're in luck, as this is a much simpler model - all the code
within the threads is serial, there are no issues with blocking, etc.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;For example, in order to implement POSIX semantics properly, a
single thread was designated as a &amp;quot;manager&amp;quot; and managed operations like
&amp;quot;create a new thread&amp;quot;. This created an unfortunate serialization point
and a bottleneck.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;These numbers also vary greatly between CPUs. The numbers reported herein
are on my Haswell i7-4771. On a different contemporary machine (a low-end
Xeon) I measured switch times that were about 50-75% longer.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-3" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-3"&gt;[3]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Curiously, pinning the Go program to a single core (by means of
setting &lt;tt class="docutils literal"&gt;GOMAXPROCS=1&lt;/tt&gt; and running with &lt;tt class="docutils literal"&gt;taskset&lt;/tt&gt;) increases the
throughput by only 10% or so. The Go scheduler is not optimized for
this strange use case of endless hammering between two goroutines, but it
performs very well regardless.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-4" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-4"&gt;[4]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Note that while for resident memory there's a convenient &lt;tt class="docutils literal"&gt;getrusage&lt;/tt&gt;
API, to report virtual memory size we have to parse &lt;tt class="docutils literal"&gt;/proc/PID/status&lt;/tt&gt;.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-5" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-5"&gt;[5]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;According to Tom Dryer, recent Linux version only approximate this usage,
which could explain the discrepancy - see &lt;a class="reference external" href="https://gist.github.com/tdryer/7ef02a89169252552978b6773c731109"&gt;this explanation&lt;/a&gt;.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
</content><category term="misc"></category><category term="Concurrency"></category><category term="C &amp; C++"></category><category term="Linux"></category></entry><entry><title>Launching Linux threads and processes with clone</title><link href="https://eli.thegreenplace.net/2018/launching-linux-threads-and-processes-with-clone/" rel="alternate"></link><published>2018-08-01T06:14:00-07:00</published><updated>2024-05-04T19:46:23-07:00</updated><author><name>Eli Bendersky</name></author><id>tag:eli.thegreenplace.net,2018-08-01:/2018/launching-linux-threads-and-processes-with-clone/</id><summary type="html">&lt;p&gt;Due to variation between operating systems and the way OS courses are taught,
some programmers may have an outdated mental model about the difference between
processes and threads in Linux. Even the name &amp;quot;thread&amp;quot; suggests something
extremely lightweight compared to a heavy &amp;quot;process&amp;quot; - a mostly wrong intuition.&lt;/p&gt;
&lt;p&gt;In fact, for …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Due to variation between operating systems and the way OS courses are taught,
some programmers may have an outdated mental model about the difference between
processes and threads in Linux. Even the name &amp;quot;thread&amp;quot; suggests something
extremely lightweight compared to a heavy &amp;quot;process&amp;quot; - a mostly wrong intuition.&lt;/p&gt;
&lt;p&gt;In fact, for the Linux kernel itself there's absolutely no difference between
what userspace sees as processes (the result of &lt;tt class="docutils literal"&gt;fork&lt;/tt&gt;) and as threads (the
result of &lt;tt class="docutils literal"&gt;pthread_create&lt;/tt&gt;). Both are represented by the same data structures
and scheduled similarly. In kernel nomenclature this is called &lt;em&gt;tasks&lt;/em&gt; (the
main structure representing a task in the kernel is
&lt;a class="reference external" href="https://github.com/torvalds/linux/blob/master/include/linux/sched.h"&gt;task_struct&lt;/a&gt;),
and I'll be using this term from now on.&lt;/p&gt;
&lt;p&gt;In Linux, threads are just tasks that share some resources, most notably their
memory space; processes, on the other hand, are tasks that don't share
resources. For application programmers, processes and threads are created and
managed in very different ways. For processes there's a slew of
process-management APIs like &lt;tt class="docutils literal"&gt;fork&lt;/tt&gt;, &lt;tt class="docutils literal"&gt;wait&lt;/tt&gt; and so on. For threads there's
the &lt;tt class="docutils literal"&gt;pthread&lt;/tt&gt; library. However, deep in the guts of these APIs and libraries,
both processes and threads come into existence through a single Linux system
call - &lt;tt class="docutils literal"&gt;clone&lt;/tt&gt;.&lt;/p&gt;
&lt;div class="section" id="the-clone-system-call"&gt;
&lt;h2&gt;The &lt;tt class="docutils literal"&gt;clone&lt;/tt&gt; system call&lt;/h2&gt;
&lt;p&gt;We can think of &lt;tt class="docutils literal"&gt;clone&lt;/tt&gt; as the unifying implementation shared between
processes and threads. Whatever perceived difference there is between
processes and threads on Linux is achieved through passing different flags to
&lt;tt class="docutils literal"&gt;clone&lt;/tt&gt;. Therefore, it's most useful to think of processes and threads not
as two completely different concepts, but rather as two variants of the same
concept - starting a concurrent task. The differences are mostly about what is
shared between this new task and the task that started it.&lt;/p&gt;
&lt;p&gt;&lt;a class="reference external" href="https://github.com/eliben/code-for-blog/blob/main/2018/clone/clone-vm-sample.c"&gt;Here is a code sample&lt;/a&gt;
demonstrating the most important sharing aspect of threads - memory. It uses
&lt;tt class="docutils literal"&gt;clone&lt;/tt&gt; in two ways, once with the &lt;tt class="docutils literal"&gt;CLONE_VM&lt;/tt&gt; flag and once without.
&lt;tt class="docutils literal"&gt;CLONE_VM&lt;/tt&gt; tells &lt;tt class="docutils literal"&gt;clone&lt;/tt&gt; to share the virtual memory between the calling
task and the new task &lt;tt class="docutils literal"&gt;clone&lt;/tt&gt; is about to create &lt;a class="footnote-reference" href="#footnote-1" id="footnote-reference-1"&gt;[1]&lt;/a&gt;. As we'll see later on,
this is the flag used by &lt;tt class="docutils literal"&gt;pthread_create&lt;/tt&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;static&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;child_func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;arg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kt"&gt;char&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;char&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;arg&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Child sees buf = &lt;/span&gt;&lt;span class="se"&gt;\&amp;quot;&lt;/span&gt;&lt;span class="s"&gt;%s&lt;/span&gt;&lt;span class="se"&gt;\&amp;quot;\n&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;strcpy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;hello from child&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;argc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;char&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// Allocate stack for child task.&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;STACK_SIZE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;65536&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kt"&gt;char&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;malloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;STACK_SIZE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;perror&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;malloc&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// When called with the command-line argument &amp;quot;vm&amp;quot;, set the CLONE_VM flag on.&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kt"&gt;unsigned&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;long&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;argc&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;strcmp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;vm&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;CLONE_VM&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kt"&gt;char&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;strcpy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;hello from parent&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clone&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;child_func&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;STACK_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;SIGCHLD&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;-1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;perror&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;clone&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;-1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;perror&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;wait&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Child exited with status %d. buf = &lt;/span&gt;&lt;span class="se"&gt;\&amp;quot;&lt;/span&gt;&lt;span class="s"&gt;%s&lt;/span&gt;&lt;span class="se"&gt;\&amp;quot;\n&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Some things to note when &lt;tt class="docutils literal"&gt;clone&lt;/tt&gt; is invoked:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;It takes a function pointer to the code the new task will run, similar
to threading APIs and unlike the &lt;tt class="docutils literal"&gt;fork&lt;/tt&gt; API. This function-pointer
interface belongs to the glibc wrapper for &lt;tt class="docutils literal"&gt;clone&lt;/tt&gt;; there's also a raw
system call, which is discussed below.&lt;/li&gt;
&lt;li&gt;The stack for the new task has to be allocated by the parent and passed into
&lt;tt class="docutils literal"&gt;clone&lt;/tt&gt;.&lt;/li&gt;
&lt;li&gt;The &lt;tt class="docutils literal"&gt;SIGCHLD&lt;/tt&gt; flag tells the kernel to send the &lt;tt class="docutils literal"&gt;SIGCHLD&lt;/tt&gt; to the parent
when the child terminates, which lets the parent use the plain &lt;tt class="docutils literal"&gt;wait&lt;/tt&gt; call
to wait for the child to exit. This is the only flag the sample passes into
&lt;tt class="docutils literal"&gt;clone&lt;/tt&gt; by default.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This code sample passes a buffer into the child, and the child writes a string
into it. When called without the &lt;tt class="docutils literal"&gt;vm&lt;/tt&gt; command-line argument, the &lt;tt class="docutils literal"&gt;CLONE_VM&lt;/tt&gt;
flag is off, and the parent's virtual memory is copied into the child. The child
sees the message the parent placed in &lt;tt class="docutils literal"&gt;buf&lt;/tt&gt;, but whatever it writes into
&lt;tt class="docutils literal"&gt;buf&lt;/tt&gt; goes into its own copy and the parent can't see it. Here's the output:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;$ ./clone-vm-sample
Child sees buf = &amp;quot;hello from parent&amp;quot;
Child exited with status 0. buf = &amp;quot;hello from parent&amp;quot;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;But when the &lt;tt class="docutils literal"&gt;vm&lt;/tt&gt; argument is passed, &lt;tt class="docutils literal"&gt;CLONE_VM&lt;/tt&gt; is set and the child
task shares the parent's memory. Its writes into &lt;tt class="docutils literal"&gt;buf&lt;/tt&gt; are now observable
by the parent:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;$ ./clone-vm-sample vm
Child sees buf = &amp;quot;hello from parent&amp;quot;
Child exited with status 0. buf = &amp;quot;hello from child&amp;quot;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;A number of other &lt;tt class="docutils literal"&gt;CLONE_*&lt;/tt&gt; flags specify additional resources to be shared
with the parent: &lt;tt class="docutils literal"&gt;CLONE_FILES&lt;/tt&gt; shares the open file descriptors,
&lt;tt class="docutils literal"&gt;CLONE_SIGHAND&lt;/tt&gt; shares the signal dispositions, and so on.&lt;/p&gt;
&lt;p&gt;Other flags are there to implement the semantics required by POSIX threads. For
example, &lt;tt class="docutils literal"&gt;CLONE_THREAD&lt;/tt&gt; asks the kernel to assign the same &lt;em&gt;thread group id&lt;/em&gt;
to the child as to the parent, in order to comply with POSIX's requirement of
all threads in a process sharing a single process ID &lt;a class="footnote-reference" href="#footnote-2" id="footnote-reference-2"&gt;[2]&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="calling-clone-in-process-and-thread-creation"&gt;
&lt;h2&gt;Calling &lt;tt class="docutils literal"&gt;clone&lt;/tt&gt; in process and thread creation&lt;/h2&gt;
&lt;p&gt;Let's dig through some code in glibc to see how &lt;tt class="docutils literal"&gt;clone&lt;/tt&gt; is invoked, starting
with &lt;tt class="docutils literal"&gt;fork&lt;/tt&gt;, which is routed to &lt;tt class="docutils literal"&gt;__libc_fork&lt;/tt&gt; in &lt;tt class="docutils literal"&gt;sysdeps/nptl/fork.c&lt;/tt&gt;. The
actual implementation is specific to the threading library, hence the location
in the &lt;tt class="docutils literal"&gt;nptl&lt;/tt&gt; folder. The first thing &lt;tt class="docutils literal"&gt;__libc_fork&lt;/tt&gt; does is invoke the
&lt;em&gt;fork handlers&lt;/em&gt; potentially registered beforehand with &lt;tt class="docutils literal"&gt;pthread_atfork&lt;/tt&gt;.&lt;/p&gt;
&lt;p&gt;The actual cloning happens with:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ARCH_FORK&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Where &lt;tt class="docutils literal"&gt;ARCH_FORK&lt;/tt&gt; is a macro defined per architecture (exact syscall ABIs are
architecture-specific). For &lt;tt class="docutils literal"&gt;x86_64&lt;/tt&gt; it maps to:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;#define ARCH_FORK() \
  INLINE_SYSCALL (clone, 4,                                                   \
                  CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID | SIGCHLD, 0,     \
                  NULL, &amp;amp;THREAD_SELF-&amp;gt;tid)
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The &lt;tt class="docutils literal"&gt;CLONE_CHILD_*&lt;/tt&gt; flags are useful for some threading libraries (though not
the default on Linux today - NPTL). Otherwise, the invocation is very similar
to the &lt;tt class="docutils literal"&gt;clone&lt;/tt&gt; code sample shown in the previous section.&lt;/p&gt;
&lt;p&gt;You may wonder where the function pointer is in this call. Nice catch! This is
the &lt;em&gt;raw call&lt;/em&gt; version of &lt;tt class="docutils literal"&gt;clone&lt;/tt&gt;, in which execution continues from the point
of the call in both parent and child, close to the usual semantics of &lt;tt class="docutils literal"&gt;fork&lt;/tt&gt;.&lt;/p&gt;
&lt;p&gt;Now let's turn to &lt;tt class="docutils literal"&gt;pthread_create&lt;/tt&gt;. Through a dizzying chain of macros it
reaches a function named &lt;tt class="docutils literal"&gt;create_thread&lt;/tt&gt; (defined in
&lt;tt class="docutils literal"&gt;sysdeps/unix/sysv/linux/createthread.c&lt;/tt&gt;) that calls &lt;tt class="docutils literal"&gt;clone&lt;/tt&gt; with:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;clone_flags&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CLONE_VM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;CLONE_FS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;CLONE_FILES&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;CLONE_SYSVSEM&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                       &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;CLONE_SIGHAND&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;CLONE_THREAD&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                       &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;CLONE_SETTLS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;CLONE_PARENT_SETTID&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                       &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;CLONE_CHILD_CLEARTID&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;                       &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;

&lt;span class="n"&gt;ARCH_CLONE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;start_thread&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;STACK_VARIABLES_ARGS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;clone_flags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;tid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;tid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Browse through &lt;tt class="docutils literal"&gt;man 2 clone&lt;/tt&gt; to understand the flags passed into the call.
Briefly, it is asked to share the virtual memory, file system, open files,
shared memory and signal handlers with the parent thread/process. Additional
flags are passed to implement proper identification: all threads launched from
a single process have to share its &lt;em&gt;process ID&lt;/em&gt; to be POSIX compliant.&lt;/p&gt;
&lt;p&gt;Reading the glibc source code is quite an exercise in mental resilience, but
it's really interesting to see how everything fits together &amp;quot;in the real world&amp;quot;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="benchmarking-process-vs-thread-creation"&gt;
&lt;h2&gt;Benchmarking process vs. thread creation&lt;/h2&gt;
&lt;p&gt;Given the information presented earlier in the post, I would expect process
creation to be somewhat more expensive than thread creation, but not
dramatically so. Since &lt;tt class="docutils literal"&gt;fork&lt;/tt&gt; and &lt;tt class="docutils literal"&gt;pthread_create&lt;/tt&gt; route to the same system
call in Linux, the difference would come from the different
flags they pass in. When &lt;tt class="docutils literal"&gt;pthread_create&lt;/tt&gt; passes all these &lt;tt class="docutils literal"&gt;CLONE_*&lt;/tt&gt; flags,
it tells the kernel there's no need to copy the virtual memory image, the open
files, the signal handlers, and so on. Obviously, this saves time.&lt;/p&gt;
&lt;p&gt;For processes, there's a bit of copying to be done when &lt;tt class="docutils literal"&gt;fork&lt;/tt&gt; is invoked,
which costs time. The biggest chunk of time probably goes to copying the memory
image due to the lack of &lt;tt class="docutils literal"&gt;CLONE_VM&lt;/tt&gt;. Note, however, that the whole
memory is not simply copied; Linux has an important optimization in COW (Copy On
Write) pages. The child's memory pages are initially mapped to the same pages
as the parent's, and an actual copy of a page happens only when one of the two
tasks modifies it. This is
very important because processes will often use a lot of shared read-only memory
(think of the global structures used by the standard library, for example).&lt;/p&gt;
&lt;p&gt;That said, the &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Page_table"&gt;page tables&lt;/a&gt; still
have to be copied. The size of a process's page tables can be observed by
looking in &lt;tt class="docutils literal"&gt;&lt;span class="pre"&gt;/proc/&amp;lt;pid&amp;gt;/status&lt;/span&gt;&lt;/tt&gt; - the &lt;tt class="docutils literal"&gt;VmPTE&lt;/tt&gt; indicator. These can be around
tens of kilobytes for small processes, and higher for larger processes. Not a
lot of data to copy, but definitely some extra work for the CPU.&lt;/p&gt;
&lt;p&gt;I wrote a &lt;a class="reference external" href="https://github.com/eliben/code-for-blog/blob/main/2018/clone/launch-benchmark.c"&gt;benchmark&lt;/a&gt;
that times process and threads launches, as a function of the virtual memory
allocated before &lt;tt class="docutils literal"&gt;fork&lt;/tt&gt; or &lt;tt class="docutils literal"&gt;pthread_create&lt;/tt&gt;. The launch is averaged over
10,000 instances to remove warm-up effects and jitter:&lt;/p&gt;
&lt;img alt="Launch time for fork/thread as function of memory image" class="align-center" src="https://eli.thegreenplace.net/images/2018/launch-fork-thread.png" /&gt;
&lt;p&gt;Several things to note:&lt;/p&gt;
&lt;ol class="arabic simple"&gt;
&lt;li&gt;Indeed, launching processes is slower than launching threads: 35 vs. 5 microseconds
for a 2 MB heap. But it's still very fast! 35 &lt;em&gt;micro&lt;/em&gt;-seconds is not a lot of
time at all. If your latency budget can tolerate a 5 us overhead, it can
almost certainly tolerate a 35 us one, unless you're working
on some super-tight hard realtime system (in which case you shouldn't be
using Linux!).&lt;/li&gt;
&lt;li&gt;As expected, the time to launch a process grows as the heap gets larger; the
delta is the time needed to copy the extra page table entries. For
threads, on the other hand, there is no difference at all, since the
memory is fully shared.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Interestingly, these numbers make it easy to observe that the whole memory
image is not being copied. On the machine this benchmark was run on, a
simple &lt;tt class="docutils literal"&gt;memcpy&lt;/tt&gt; of 2 MB takes over 60 us, so the 30 us difference couldn't
possibly cover copying 2 MB of heap to the child. Copying 64 KB (a reasonable size for a
page table) takes about 3 us, which is consistent, since the cloning involves more
bookkeeping than a simple &lt;tt class="docutils literal"&gt;memcpy&lt;/tt&gt;. To me this is another sign of how fast these
launches are: they're in the same performance ballpark as modestly sized memory
copies.&lt;/p&gt;
&lt;p&gt;Creation time is not the only performance benchmark of importance. It's also
interesting to measure how long it takes to switch context between tasks when
using threads or processes. &lt;a class="reference external" href="https://eli.thegreenplace.net/2018/measuring-context-switching-and-memory-overheads-for-linux-threads/"&gt;This is covered in another post&lt;/a&gt;.&lt;/p&gt;
&lt;hr class="docutils" /&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-1" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-1"&gt;[1]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;It may be just me, but I find this terminology a bit confusing. In my
mind the word &lt;em&gt;clone&lt;/em&gt; is synonymous to &lt;em&gt;copy&lt;/em&gt;, so when we turn on
a flag named &amp;quot;clone the VM&amp;quot; I'd expect the VM to be copied rather than
shared. IMHO it would be clearer if this flag was named &lt;tt class="docutils literal"&gt;SHARE_VM&lt;/tt&gt;.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class="docutils footnote" frame="void" id="footnote-2" rules="none"&gt;
&lt;colgroup&gt;&lt;col class="label" /&gt;&lt;col /&gt;&lt;/colgroup&gt;
&lt;tbody valign="top"&gt;
&lt;tr&gt;&lt;td class="label"&gt;&lt;a class="fn-backref" href="#footnote-reference-2"&gt;[2]&lt;/a&gt;&lt;/td&gt;&lt;td&gt;It's certainly interesting to see this evolution of concepts over time.
Thread APIs were defined in times where there was a real difference
between processes and threads and their design reflects that. In modern
Linux the kernel has to bend over backwards to provide the &lt;em&gt;illusion&lt;/em&gt;
of the difference although very little of it exists.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
</content><category term="misc"></category><category term="Concurrency"></category><category term="C &amp; C++"></category><category term="Linux"></category></entry></feed>