Welcome


... to my place on the web - a personal website where I keep articles, freeware programs and code snippets I've written, and a weblog where I share my ideas and insights on programming, technology, books, and life in general.

To browse the website, use the site map on the right-hand side of the screen. Below, you will find the ten most recent posts.

pyelftools ported to Python 3

January 27th, 2012 at 11:18 am

I’ve released version 0.20 of pyelftools, with support for Python 3. Now pyelftools supports Python 2.6, 2.7 and 3.2 in a single code-base. The new version is also available on PyPI.

This was surprisingly painless for a project that has close to 6KLOC of Python code, probably because I planned for eventual Python 3 support from the start, and the minimal required version of Python 2 is 2.6, which is much more compatible with Python 3 than earlier versions in the 2.x line.

If you’re curious, here’s a direct link to the py3compat module which pyelftools uses for various compatibility issues between Python 2 and 3. It’s partially based on the six library.

By far the hardest part of this port to Python 3 was porting the construct library on which pyelftools relies for the low-level binary stream parsing. I’ve forked the construct Github repository to perform the port – it’s available here. After construct was ported, it took around 2 hours to port the rest of pyelftools. construct is also being distributed with pyelftools, so there are no actual external dependencies.

Distributed computing in Python with multiprocessing

January 24th, 2012 at 5:23 am

In the previous post, I discussed how the multiprocessing package can be used to run CPU-bound computation tasks in parallel on a multi-core machine. But the utility of multiprocessing doesn’t end here. It can also be used to run computations distributed over several machines.

This enters the exciting domain of distributed computing. There are many tools available for addressing various aspects of this domain, but here I want to specifically focus on what Python offers right in the standard library, with multiprocessing. The part of the package that makes distributed computing possible is called "managers".

The documentation of multiprocessing.managers leaves something to be desired. It’s not entirely clear what the full capabilities of this tool are from just skimming the docs. For example, it starts by saying:

Managers provide a way to create data which can be shared between different processes. A manager object controls a server process which manages shared objects. Other processes can access the shared objects by using proxies.

Which is somewhat confusing, since multiprocessing already has synchronization primitives available without using managers (for example Value and Lock). So why are managers needed?

For two main reasons:

  1. Managers provide additional synchronization tools, such as a list or a dictionary that can be shared between processes.
  2. Managers allow their synchronized objects to be used between processes running across a network, and not just on the same machine. This is why, for example, managers also provide a Lock, which at first sight appears to be a duplication of the multiprocessing.Lock. It isn’t a duplication, because multiprocessing.Lock is only available for processes running on the same machine, while the multiprocessing.SyncManager.Lock can be shared across machines (which is why it’s also slower).

I don’t want to delve too far into the synchronization primitives, and instead focus on the distributing computing made possible by managers.

The task will be the same as before – factoring lists of numbers. The worker function is:

def factorizer_worker(job_q, result_q):
    """ A worker function to be launched in a separate process. Takes jobs from
        job_q - each job a list of numbers to factorize. When the job is done,
        the result (dict mapping number -> list of factors) is placed into
        result_q. Runs until job_q is empty.
    """
    while True:
        try:
            job = job_q.get_nowait()
            outdict = {n: factorize_naive(n) for n in job}
            result_q.put(outdict)
        except Queue.Empty:
            return

This worker runs in a single process on some machine. All it cares about is that it has a queue of "jobs" to look at, and a queue of results to write into. Each job in the queue is a list of numbers to factorize. Once the worker has finished factoring these numbers, it puts a result dict into the result queue. The worker stops and returns when it notices that the job queue is empty. That’s it.

Adding an abstraction level, here’s how this worker can be used:

def mp_factorizer(shared_job_q, shared_result_q, nprocs):
    """ Split the work with jobs in shared_job_q and results in
        shared_result_q into several processes. Launch each process with
        factorizer_worker as the worker function, and wait until all are
        finished.
    """
    procs = []
    for i in range(nprocs):
        p = multiprocessing.Process(
                target=factorizer_worker,
                args=(shared_job_q, shared_result_q))
        procs.append(p)
        p.start()

    for p in procs:
        p.join()

mp_factorizer takes the same pair of queues and the number of processes to create. It then uses multiprocessing.Process to spawn several workers, each into a process of its own. When all the workers are done, mp_factorizer exits. Note how this code is still independent of where it actually executes – its interface with the world is via the job and result queues.

There’s nothing really new here, so let’s get to the interesting stuff – starting a server that manages the shared queues:

def runserver():
    # Start a shared manager server and access its queues
    manager = make_server_manager(PORTNUM, AUTHKEY)
    shared_job_q = manager.get_job_q()
    shared_result_q = manager.get_result_q()

    N = 999
    nums = make_nums(N)

    # The numbers are split into chunks. Each chunk is pushed into the job
    # queue.
    chunksize = 43
    for i in range(0, len(nums), chunksize):
        shared_job_q.put(nums[i:i + chunksize])

    # Wait until all results are ready in shared_result_q
    numresults = 0
    resultdict = {}
    while numresults < N:
        outdict = shared_result_q.get()
        resultdict.update(outdict)
        numresults += len(outdict)

    # Sleep a bit before shutting down the server - to give clients time to
    # realize the job queue is empty and exit in an orderly way.
    time.sleep(2)
    manager.shutdown()

What this code does is:

  1. Create the manager (which actually starts the server running in the background) – more on this step later
  2. Generate some input numbers and break them to chunks
  3. Feed the job queue with chunks of numbers for the workers to churn on
  4. Wait until the expected amount of results has been placed in the result queue
  5. Shut down the server and exit

Note that no computation is actually performed in the server – it just manages the sharing for clients. And this is make_server_manager:

def make_server_manager(port, authkey):
    """ Create a manager for the server, listening on the given port.
        Return a manager object with get_job_q and get_result_q methods.
    """
    job_q = Queue.Queue()
    result_q = Queue.Queue()

    # This is based on the examples in the official docs of multiprocessing.
    # get_{job|result}_q return synchronized proxies for the actual Queue
    # objects.
    class JobQueueManager(SyncManager):
        pass

    JobQueueManager.register('get_job_q', callable=lambda: job_q)
    JobQueueManager.register('get_result_q', callable=lambda: result_q)

    manager = JobQueueManager(address=('', port), authkey=authkey)
    manager.start()
    print 'Server started at port %s' % port
    return manager

I won’t explain what each line of this code does – it’s all available in the documentation page of multiprocessing. I’ll just note that the manager starts a TCP server at the given port, running in the background, and uses this server to let clients access its internal objects – in this case a couple of queues.

Finally, to complete the puzzle here’s the make_nums utility function. Nothing smart about it:

def make_nums(N):
    """ Create N large numbers to factorize.
    """
    nums = [999999999999]
    for i in xrange(N):
        nums.append(nums[-1] + 2)
    return nums

Alright, so this is the server. It will run, put input into the job queue and then wait for results to start trickling into the result queue. How would they get there though? From clients. Here’s a simple client:

def runclient():
    manager = make_client_manager(IP, PORTNUM, AUTHKEY)
    job_q = manager.get_job_q()
    result_q = manager.get_result_q()
    mp_factorizer(job_q, result_q, 4)

The client accesses the server by means of another manager object. It then asks for the queues and just runs mp_factorizer (with nprocs=4). The client’s manager is this:

def make_client_manager(ip, port, authkey):
    """ Create a manager for a client. This manager connects to a server on the
        given address and exposes the get_job_q and get_result_q methods for
        accessing the shared queues from the server.
        Return a manager object.
    """
    class ServerQueueManager(SyncManager):
        pass

    ServerQueueManager.register('get_job_q')
    ServerQueueManager.register('get_result_q')

    manager = ServerQueueManager(address=(ip, port), authkey=authkey)
    manager.connect()

    print 'Client connected to %s:%s' % (ip, port)
    return manager

This manager is simpler. Instead of starting a server, it connects to one (given an IP address, port and authorization key). A similar method has to be used to register the get_*_q methods, just to let the manager know they are part of the protocol. Think of it as a kind of RPC.

This client can be executed on the same machine with the server, or on a different machine, which can be located anywhere as long as it can reach the server by IP address. It will connect to the server and start pulling work from the job queue, placing results into the result queue. Theoretically, any amount of clients can connect simultaneously. The beauty of this method is that it only uses the Python standard library, so the code is very much platform independent. I had a Windows client machine connecting a Linux server, which also had a client running, happily sharing the work between them. It just works.

To summarize, I want to stress once again the goal of this post. Lest there be any misunderstanding, I’m not claiming this is the best way to do distributed programming in Python. It wouldn’t be easy to find the "best way", since it’s a complex problem domain with many tradeoffs, and many solutions that optimize for different things.

However, it is useful to know that such capabilities exist in the Python standard library. The multiprocessing package provides many useful building blocks. These can be used together or separately to implement all kinds of interesting solutions both for paralellizing work across multiple processes and distributing it across different machines. All of this, as you saw above, without writing too much code.

Python – parallelizing CPU-bound tasks with multiprocessing

January 16th, 2012 at 7:51 pm

In a previous post on Python threads, I briefly mentioned that threads are unsuitable for CPU-bound tasks, and multiprocessing should be used instead. Here I want to demonstrate this with benchmark numbers, also showing that creating multiple processes in Python is just as simple as creating multiple threads.

First, let’s pick a simple computation to use for the benchmarking. I don’t want it to be completely artificial, so I’ll use a dumbed-down version of factorization – breaking a number to its prime factors. Here is a very naive and un-optimized function that takes a number and returns a list of factors:

def factorize_naive(n):
    """ A naive factorization method. Take integer 'n', return list of
        factors.
    """
    if n < 2:
        return []
    factors = []
    p = 2

    while True:
        if n == 1:
            return factors

        r = n % p
        if r == 0:
            factors.append(p)
            n = n / p
        elif p * p >= n:
            factors.append(n)
            return factors
        elif p > 2:
            # Advance in steps of 2 over odd numbers
            p += 2
        else:
            # If p == 2, get to 3
            p += 1
    assert False, "unreachable"

Now, as the base for benchmarking, I’ll be using the following serial (single-thread) factorizer that takes a list of numbers to factorize, and returns a dict mapping a number to its list of factors:

def serial_factorizer(nums):
    return {n: factorize_naive(n) for n in nums}

The threaded version follows. It also takes a list of numbers to factorize, as well as the amount of threads to create. It then divides the list into chunks and assigns each chunk to a separate thread:

def threaded_factorizer(nums, nthreads):
    def worker(nums, outdict):
        """ The worker function, invoked in a thread. 'nums' is a
            list of numbers to factor. The results are placed in
            outdict.
        """
        for n in nums:
            outdict[n] = factorize_naive(n)

    # Each thread will get 'chunksize' nums and its own output dict
    chunksize = int(math.ceil(len(nums) / float(nthreads)))
    threads = []
    outs = [{} for i in range(nthreads)]

    for i in range(nthreads):
        # Create each thread, passing it its chunk of numbers to factor
        # and output dict.
        t = threading.Thread(
                target=worker,
                args=(nums[chunksize * i:chunksize * (i + 1)],
                      outs[i]))
        threads.append(t)
        t.start()

    # Wait for all threads to finish
    for t in threads:
        t.join()

    # Merge all partial output dicts into a single dict and return it
    return {k: v for out_d in outs for k, v in out_d.iteritems()}

Note that the interface between the main and worker threads is very simple. Each worker thread has some amount of work to do, after which it simply returns. Thus the only thing the main thread does is fire up nthreads threads with suitable arguments and then waits for them to finish.

I ran a benchmark of the serial vs. threaded factorizer with 2, 4 and 8 threads. The benchmark was to factorize a constant set of large numbers, to minimize differences due to random chance. All the tests were run on my Ubuntu 10.04 laptop with a Intel Core i7-2820MQ CPU (4 physical cores, hyper-threaded).

Here are the results:

http://eli.thegreenplace.net/wp-content/uploads/2012/01/serial_vs_threaded.png

The horizontal axis is time in seconds, hence shorter bars mean faster execution. Yes, splitting the computation to several threads is actually slower than the serial implementation, and the more threads are used, the slower it gets.

This may be a bit surprising if you’re not familiar with the way Python threads are implemented and the GIL (Global Interpreter Lock). To understand why this is happening, you can hardly do better than read Dave Beazley’s articles and presentations on this topic. His work is so comprehensive and accessible that I see absolutely no point repeating any of it (except the conclusions) here.

Let’s now do the same, just with processes instead of threads. Python’s excellent multiprocessing module makes processes as simple to launch and manage as threads. In fact, it provides very similar APIs to the threading module. Here’s multi-process factorizer:

def mp_factorizer(nums, nprocs):
    def worker(nums, out_q):
        """ The worker function, invoked in a process. 'nums' is a
            list of numbers to factor. The results are placed in
            a dictionary that's pushed to a queue.
        """
        outdict = {}
        for n in nums:
            outdict[n] = factorize_naive(n)
        out_q.put(outdict)

    # Each process will get 'chunksize' nums and a queue to put his out
    # dict into
    chunksize = int(math.ceil(len(nums) / float(nprocs)))
    procs = []
    out_qs = [Queue() for i in range(nprocs)]

    for i in range(nprocs):
        p = multiprocessing.Process(
                target=worker,
                args=(nums[chunksize * i:chunksize * (i + 1)],
                      out_qs[i]))
        procs.append(p)
        p.start()

    # Wait for all processes to finish
    for p in procs:
        p.join()

    # Merge all partial output dicts into a single dict. At this point we
    # know that all processes have finished running, so we can just take
    # their output from the queues.
    return {k: v for out_q in out_qs for k,v in out_q.get().iteritems()}

The only real difference here vs. the thread solution is the way output is passed back from the worker to the main thread/process. With multiprocessing, we can’t simply pass a dict to the sub-process and expect its modifications to be visible in another process. There are several approaches to solve this problem. One is use a synchronized dictionary from multiprocessing.managers.SyncManager. The one I’ve chosen is to simply create a Queue per process. A worker process puts its output dict into its queue, and this queue is visible from the main process. I chose to create one queue per worker process as opposed to just a single queue shared between all processes to minimize possible contention.

I’ve run the same benchmark, adding the runtimes from mp_factorizer to the chart:

http://eli.thegreenplace.net/wp-content/uploads/2012/01/serial_vs_threaded_vs_mp.png

As you can see, there are nice speedups. The fastest multi-process version (split to 8 processes) runs 3.1 times as fast as the serial version. Although my CPU only has 4 physical cores (and the pair of hardware "threads" in each core share a lot of execution resources), the 8-process version runs faster, which is probably due to the fact that the OS doesn’t allocate the CPUs optimally between "heavy" tasks. Another reason for the speedup being a bit far from 4x is that the work isn’t evenly divided between sub-processes. Some numbers are dramatically faster to factorize than others, and currently no attention is being paid to load-balancing the tasks between the workers.

These are interesting topics to explore, but way beyond the scope of this post. For our needs the best advice is to run benchmarks and decide on the best parallellization strategy according to the results.

The goal of this post was two-fold. One, to provide an easy demonstration of how Python threads are bad for speeding up CPU-bound computations (they’re actually pretty good for slowing them down!), while multiprocessing does use the multi-core CPU in a parallel manner, as expected. Two, to show that multiprocessing makes writing parallel code as easy as using threading. There’s a bit more work to be done when synchronizing objects between processes than between threads, but otherwise it’s very similar code. And if you ask me, that object synchronization is more difficult is a good thing, because the less objects are shared the better. This is the main reason why multi-process programming is often considered safer and less bug-prone than multi-threaded programming.

A Vim plugin for pss

January 12th, 2012 at 10:01 am

If you enjoy using pss and belong to the cult of Vi, you may like the the pss.vim plugin by Bernhard Leiner. It’s a very simple-to-use and functional plugin for running pss searches from inside Vim and showing the results in the Quickfix window.

If you find yourself running greps a lot from inside Vi, and/or using plugins like EasyGrep, give pss.vim a try – it’s a much better searching experience for source code.

pyelftools – Python library for parsing ELF and DWARF

January 6th, 2012 at 8:59 am

I’m happy and proud to announce the release of a new open-source Python package to the world. pyelftools is a pure-Python library for parsing and analyzing ELF files and DWARF debugging information. It provides both low-level and high-level APIs for querying ELF and DWARF, and is mostly feature-complete. As a proof of capability, pyelftools ships with a fairly powerful clone of readelf.

Some basic information:

  • Website – managed as an open-source project on Bitbucket. It’s the place for documentation, opening issues and closely following the development in general.
  • Downloading – from PyPI, or the Bitbucket site.
  • Documentation – there’s a detailed user guide, and the source distribution contains several examples.
  • License – public domain.
  • Pre-requisites – Python version 2.6 or 2.7; 3.x support is in the works.

The goal of this project was two-fold. First, to better understand the ELF and DWARF formats. Second, to have a feature-complete pure-Python parser for these formats with a sufficiently high-level API to be generally useful.

Although the initial release of pyelftools (version 0.10) is formally "beta", it’s quite well validated with a comprehensive test-suite. It should also be simple to learn and tweak, due to the detailed user’s and hacker’s guides on the Bitbucket site Wiki, along with several functioning examples that ship with the library.

There are some existing tools with overlapping functionality:

  • pydevtools – an ambitious project by Emilio Monti. I initially planned to build my project on top of it (I wanted a much higher-level API than pydevtools aims to provide), but was deterred by its lack of support for the 64-bit ELF format. Adding 64-bit support appeared like a large work, so I preferred to go my own way. pyelftools is designed from scratch to support both 32-bit and 64-bit formats (as well as endianness).
  • libelf and libdwarf – very authoritative and complete implementations, but are essentially C libraries with a C API, which leaves much to be desired in terms of usability and convenience.
  • The LLVM project has an ELF reader (part of its libObject library) and recently started adding partial DWARF parsing capabilities. These parts of the project are still rather experimental and evolving, and I’m following its progress with interest.

Shared counter with Python’s multiprocessing

January 4th, 2012 at 5:52 am

One of the methods of exchanging data between processes with the multiprocessing module is directly shared memory via multiprocessing.Value. As any method that’s very general, it can sometimes be tricky to use. I’ve seen a variation of this question asked a couple of times on StackOverflow:

I have some processes that do work, and I want them to increment some shared counter because [... some irrelevant reason ...] – how can this be done?

The wrong way

And surprisingly enough, some answers given to this question are wrong, since they use multiprocessing.Value incorrectly, as follows:

import time
from multiprocessing import Process, Value

def func(val):
    for i in range(50):
        time.sleep(0.01)
        val.value += 1

if __name__ == '__main__':
    v = Value('i', 0)
    procs = [Process(target=func, args=(v,)) for i in range(10)]

    for p in procs: p.start()
    for p in procs: p.join()

    print v.value

This code is a demonstration of the problem, distilling only the usage of the shared counter. A "pool" of 10 processes is created to run the func function. All processes share a Value and increment it 50 times. You would expect this code to eventually print 500, but in all likeness it won’t. Here’s some output taken from 10 runs of that code:

> for i in {1..10}; do python sync_nolock_wrong.py; done
435
464
484
448
491
481
490
471
497
494

Why does this happen?

I must admit that the documentation of multiprocessing.Value can be a bit confusing here, especially for beginners. It states that by default, a lock is created to synchronize access to the value, so one may be falsely led to believe that it would be OK to modify this value in any way imaginable from multiple processes. But it’s not.

Explanation – the default locking done by Value

This section is advanced and isn’t strictly required for the overall flow of the post. If you just want to understand how to synchronize the counter correctly, feel free to skip it.

The locking done by multiprocessing.Value is very fine-grained. Value is a wrapper around a ctypes object, which has an underlying value attribute representing the actual object in memory. All Value does is ensure that only a single process or thread may read or write this value attribute simultaneously. This is important, since (for some types, on some architectures) writes and reads may not be atomic. I.e. to actually fill up the object’s memory, the CPU may need several instructions, and another process reading the same (shared) memory at the same time could see some intermediate, invalid state. The built-in lock of Value prevents this from happening.

However, when we do this:

val.value +=1

What Python actually performs is the following (disassembled bytecode with the dis module). I’ve annotated the locking done by Value in #<-- comments:

 0 LOAD_FAST                0 (val)
 3 DUP_TOP
                                     #<--- Value lock acquired
 4 LOAD_ATTR                0 (value)
                                     #<--- Value lock released
 7 LOAD_CONST               1 (1)
10 INPLACE_ADD
11 ROT_TWO
                                     #<--- Value lock acquired
12 STORE_ATTR               0 (value)
                                     #<--- Value lock released

So it’s obvious that while process #1 is now at instruction 7 (LOAD_CONST), nothing prevents process #2 from also loading the (old) value attribute and be on instruction 7 too. Both processes will proceed incrementing their private copy and writing it back. The result: the actual value got incremented only once, not twice.

The right way

Fortunately, this problem is very easy to fix. A separate Lock is needed to guarantee the atomicity of modifications to the Value:

import time
from multiprocessing import Process, Value, Lock

def func(val, lock):
    for i in range(50):
        time.sleep(0.01)
        with lock:
            val.value += 1

if __name__ == '__main__':
    v = Value('i', 0)
    lock = Lock()
    procs = [Process(target=func, args=(v, lock)) for i in range(10)]

    for p in procs: p.start()
    for p in procs: p.join()

    print v.value

Now we get the expected result:

> for i in {1..10}; do python sync_lock_right.py; done
500
500
500
500
500
500
500
500
500
500

A value and a lock may appear like too much baggage to carry around at all times. So, we can create a simple "synchronized shared counter" object to encapsulate this functionality:

import time
from multiprocessing import Process, Value, Lock

class Counter(object):
    def __init__(self, initval=0):
        self.val = Value('i', initval)
        self.lock = Lock()

    def increment(self):
        with self.lock:
            self.val.value += 1

    def value(self):
        with self.lock:
            return self.val.value

def func(counter):
    for i in range(50):
        time.sleep(0.01)
        counter.increment()

if __name__ == '__main__':
    counter = Counter(0)
    procs = [Process(target=func, args=(counter,)) for i in range(10)]

    for p in procs: p.start()
    for p in procs: p.join()

    print counter.value()

Bonus: since we’ve now placed a more coarse-grained lock on the modification of the value, we may throw away Value with its fine-grained lock altogether, and just use multiprocessing.RawValue, that simply wraps a shared object without any locking.

Understanding the x64 code models

January 3rd, 2012 at 5:46 am

An interesting issue that comes up when writing code for the x64 architecture is which code model to use. This probably isn’t a very well-known topic, but if one wants to understand the x64 machine code generated by compilers, it’s educational to be familiar with code models. There are also implications for optimization, for those who really care about performance down to the smallest instruction.

There’s very little information on this topic online or anywhere. By far the most important resource is the official x64 ABI, which you can obtain from the x86-64.org page (from now on I’m going to refer to it simply as "the ABI"). There’s also a bit of information in the gcc man-pages. The aim of this article is to provide an approachable reference, with some discussion of the topic and concrete examples to demonstrate the concepts in real-life code.

An important disclaimer: this is not a tutorial for beginners. The prerequisites are a solid understanding of C and assembly language, plus a basic familiarity with the x64 architecture.

Code models – motivation

References to both code and data on x64 are done with instruction-relative (RIP-relative in x64 parlance) addressing modes. The offset from RIP in these instructions is limited to 32 bits. So what do we do when 32 bits are not enough? What if the program is larger than 2 GB? Then, a case can arise when an instruction attempting to address some piece of code (or data) just can’t do it with its 32-bit offset from RIP.

One solution to this problem is to give up the RIP-relative addressing modes, and use absolute 64-bit offsets for all code and data references. But this has a high cost – more instructions are required to perform the simplest operations. It’s a high cost to pay in all code just for the sake of the (very rare) case of extremely huge programs or libraries.

So, the compromise is code models [1]. A code model is a formal agreement between the programmer and the compiler, in which the programmer states his intentions for the size of the eventual program(s) the object file that’s being currently compiled will get into [2].

Code models exist for the programmer to be able to tell the compiler: don’t worry, this object will only get into non-huge programs, so you can use the fast RIP-relative addressing modes. Conversely, he can tell the compiler: this object is expected to be linked into huge programs, so please use the slow but safe absolute addressing modes with full 64-bit offsets.

What will be covered here

The two scenarios described above have names: the small code model promises to the compiler that 32-bit relative offsets should be enough for all code and data references in the compiled object. The large code model, on the other hand, tells it not to make any assumptions and use absolute 64-bit addressing modes for code and data references. To make things more interesting, there’s also a middle road, called the medium code model.

These code models exist separately for non-PIC and PIC code. The article is going to discuss all 6 variations.

Example C source

I’ll be using the following C program compiled with different code models to demonstrate the concepts discussed in the article. In this code, the main function accesses 4 different global arrays and one global function. The arrays differ by two parameters: size and visibility. The size is important to explain the medium code model and won’t be used for the small and large models. Visibility is either static (visible only in this source file) or completely global (visible by all other objects linked into the program). This distinction is important for the PIC code models.

int global_arr[100] = {2, 3};
static int static_arr[100] = {9, 7};
int global_arr_big[50000] = {5, 6};
static int static_arr_big[50000] = {10, 20};

int globlal_func(int param)
{
    return param * 10;
}

int main(int argc, const char* argv[])
{
    int t = global_func(argc);
    t += global_arr[7];
    t += static_arr[7];
    t += global_arr_big[7];
    t += static_arr_big[7];
    return t;
}

gcc takes the code model as the value of the -mcmodel option. Additionally, PIC compilation can be specified with the -fpic flag.

For example, compiling it into an object file with the large code model and PIC enabled:

> gcc -g -O0 -c codemodel1.c -fpic -mcmodel=large -o codemodel1_large_pic.o

Small code model

Here’s what man gcc has to say about the small code model:

-mcmodel=small

Generate code for the small code model: the program and its symbols must be linked in the lower 2 GB of the address space. Pointers are 64 bits. Programs can be statically or dynamically linked. This is the default code model.

In other words, the compiler is free to assume that all code and data can be accessed with 32-bit RIP-relative offsets from any instruction in the code. Let’s see the disassembly of the example C program compiled in non-PIC small code model:

> objdump -dS codemodel1_small.o
[...]
int main(int argc, const char* argv[])
{
  15: 55                      push   %rbp
  16: 48 89 e5                mov    %rsp,%rbp
  19: 48 83 ec 20             sub    $0x20,%rsp
  1d: 89 7d ec                mov    %edi,-0x14(%rbp)
  20: 48 89 75 e0             mov    %rsi,-0x20(%rbp)
    int t = global_func(argc);
  24: 8b 45 ec                mov    -0x14(%rbp),%eax
  27: 89 c7                   mov    %eax,%edi
  29: b8 00 00 00 00          mov    $0x0,%eax
  2e: e8 00 00 00 00          callq  33 <main+0x1e>
  33: 89 45 fc                mov    %eax,-0x4(%rbp)
    t += global_arr[7];
  36: 8b 05 00 00 00 00       mov    0x0(%rip),%eax
  3c: 01 45 fc                add    %eax,-0x4(%rbp)
    t += static_arr[7];
  3f: 8b 05 00 00 00 00       mov    0x0(%rip),%eax
  45: 01 45 fc                add    %eax,-0x4(%rbp)
    t += global_arr_big[7];
  48: 8b 05 00 00 00 00       mov    0x0(%rip),%eax
  4e: 01 45 fc                add    %eax,-0x4(%rbp)
    t += static_arr_big[7];
  51: 8b 05 00 00 00 00       mov    0x0(%rip),%eax
  57: 01 45 fc                add    %eax,-0x4(%rbp)
    return t;
  5a: 8b 45 fc                mov    -0x4(%rbp),%eax
}
  5d: c9                      leaveq
  5e: c3                      retq

As we can see, all arrays are accessed in exactly the same manner – by using a simple RIP-relative offset. However, the offset in the code is 0, because the compiler doesn’t know where the data section will be placed. So it also creates a relocation for each such access:

> readelf -r codemodel1_small.o

Relocation section '.rela.text' at offset 0x62bd8 contains 5 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
00000000002f  001500000002 R_X86_64_PC32     0000000000000000 global_func - 4
000000000038  001100000002 R_X86_64_PC32     0000000000000000 global_arr + 18
000000000041  000300000002 R_X86_64_PC32     0000000000000000 .data + 1b8
00000000004a  001200000002 R_X86_64_PC32     0000000000000340 global_arr_big + 18
000000000053  000300000002 R_X86_64_PC32     0000000000000000 .data + 31098

Let’s fully decode the access to global_arr as an example. Here’s the relevant part of the disassembly again:

  t += global_arr[7];
36:       8b 05 00 00 00 00       mov    0x0(%rip),%eax
3c:       01 45 fc                add    %eax,-0x4(%rbp)

RIP-relative addressing is relative to the next instruction. So the offset that should be patched into the mov instruction should be relative to 0×3c. The relevant relocation is the second one, pointing to the operand of mov at 0×38. It’s R_X86_64_PC32, which means: take the symbol value, add the addend and subtract the offset this relocation points to. If you do the math you see this ends up placing the relative offset between the next instruction and global_arr, plus 0×1c. This relative offset is just what we need, since 0×1c simply means "the 7th int in the array" (each int is 4 bytes long on x64). So the instruction correctly references global_arr[7] using RIP relative addressing.

Another interesting thing to note here is that although the instructions for accessing static_arr are similar, its relocation has a different symbol, pointing to the .data section instead of the specific symbol. This is because the static array is placed by the linker in the .data section in a known location – it can’t be shared with other shared libraries. This relocation will eventually get fully resolved by the linker. On the other hand, the reference to global_arr will be left to the dynamic loader to resolve, since global_arr can actually be used (or overridden by) a different shared library [3].

Finally, let’s look at the reference to global_func:

  int t = global_func(argc);
24:       8b 45 ec                mov    -0x14(%rbp),%eax
27:       89 c7                   mov    %eax,%edi
29:       b8 00 00 00 00          mov    $0x0,%eax
2e:       e8 00 00 00 00          callq  33 <main+0x1e>
33:       89 45 fc                mov    %eax,-0x4(%rbp)

The operand of a callq is also RIP-relative, so the R_X86_64_PC32 relocation here works similarly to place the actual relative offset to global_func into the operand.

To conclude, since the small code model promises the compiler that all code and data in the eventual program can be accessible with 32-bit RIP-relative offsets, the compiler can generate simple and efficient code for accessing all kinds of objects.

Large code model

From man gcc:

-mcmodel=large

Generate code for the large model: This model makes no assumptions about addresses and sizes of sections.

Here’s the disassembled code of main when compiled with the non-PIC large code model:

int main(int argc, const char* argv[])
{
  15: 55                      push   %rbp
  16: 48 89 e5                mov    %rsp,%rbp
  19: 48 83 ec 20             sub    $0x20,%rsp
  1d: 89 7d ec                mov    %edi,-0x14(%rbp)
  20: 48 89 75 e0             mov    %rsi,-0x20(%rbp)
    int t = global_func(argc);
  24: 8b 45 ec                mov    -0x14(%rbp),%eax
  27: 89 c7                   mov    %eax,%edi
  29: b8 00 00 00 00          mov    $0x0,%eax
  2e: 48 ba 00 00 00 00 00    movabs $0x0,%rdx
  35: 00 00 00
  38: ff d2                   callq  *%rdx
  3a: 89 45 fc                mov    %eax,-0x4(%rbp)
    t += global_arr[7];
  3d: 48 b8 00 00 00 00 00    movabs $0x0,%rax
  44: 00 00 00
  47: 8b 40 1c                mov    0x1c(%rax),%eax
  4a: 01 45 fc                add    %eax,-0x4(%rbp)
    t += static_arr[7];
  4d: 48 b8 00 00 00 00 00    movabs $0x0,%rax
  54: 00 00 00
  57: 8b 40 1c                mov    0x1c(%rax),%eax
  5a: 01 45 fc                add    %eax,-0x4(%rbp)
    t += global_arr_big[7];
  5d: 48 b8 00 00 00 00 00    movabs $0x0,%rax
  64: 00 00 00
  67: 8b 40 1c                mov    0x1c(%rax),%eax
  6a: 01 45 fc                add    %eax,-0x4(%rbp)
    t += static_arr_big[7];
  6d: 48 b8 00 00 00 00 00    movabs $0x0,%rax
  74: 00 00 00
  77: 8b 40 1c                mov    0x1c(%rax),%eax
  7a: 01 45 fc                add    %eax,-0x4(%rbp)
    return t;
  7d: 8b 45 fc                mov    -0x4(%rbp),%eax
}
  80: c9                      leaveq
  81: c3                      retq

Again, looking at the relocations will be useful:

Relocation section '.rela.text' at offset 0x62c18 contains 5 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000000030  001500000001 R_X86_64_64       0000000000000000 global_func + 0
00000000003f  001100000001 R_X86_64_64       0000000000000000 global_arr + 0
00000000004f  000300000001 R_X86_64_64       0000000000000000 .data + 1a0
00000000005f  001200000001 R_X86_64_64       0000000000000340 global_arr_big + 0
00000000006f  000300000001 R_X86_64_64       0000000000000000 .data + 31080

The large code model is also quite uniform – no assumptions can be made about the size of the code and data sections, so all data is accessed similarly. Let’s pick global_arr once again:

  t += global_arr[7];
3d:       48 b8 00 00 00 00 00    movabs $0x0,%rax
44:       00 00 00
47:       8b 40 1c                mov    0x1c(%rax),%eax
4a:       01 45 fc                add    %eax,-0x4(%rbp)

Here two instructions are needed to pull the desired value from the array. The first places an absolute 64-bit address into rax. This is the address of global_arr, as we shall soon see. The second loads the word at (rax) + 0x1c into eax.

So, let’s focus on the instruction at 0×3d. It’s a movabs – the absolute 64-bit version of mov on x64. It can swing a full 64-bit immediate into a register. The value of this immediate in the disassembled code is 0, so we have to turn to the relocation table for the answer. It has a R_X86_64_64 relocation for the operand at 0×3f. This is an absolute relocation, which simply means – place the symbol value + addend back into the offset. In other words, rax will hold the absolute address of global_arr.

What about the function call?

  int t = global_func(argc);
24:       8b 45 ec                mov    -0x14(%rbp),%eax
27:       89 c7                   mov    %eax,%edi
29:       b8 00 00 00 00          mov    $0x0,%eax
2e:       48 ba 00 00 00 00 00    movabs $0x0,%rdx
35:       00 00 00
38:       ff d2                   callq  *%rdx
3a:       89 45 fc                mov    %eax,-0x4(%rbp)

After a famililar movabs, we have a call instruction that calls a function whose address is in rdx. From a glance at the relevant relocation it’s obvious that this is very similar to the data access.

Evidently, the large code model makes absolutely no assumptions about the sizes of code and data sections, or where symbols might end up. It just takes the "safe road" everywhere, using absolute 64-bit moves to refer to symbols. This has a cost, of course. Notice that it now takes one extra instruction to access any symbol, when compared to the small model.

So, we’ve just witnessed two extremes. The small model happily assumes everything fits into the lower 2GB of memory, and the large model assumes everything is possible and any symbol can reside anywhere in the full 64-bit address space. The medium code model is a compromise.

Medium code model

As before, let’s start with a quote from man gcc:

-mcmodel=medium

Generate code for the medium model: The program is linked in the lower 2 GB of the address space. Small symbols are also placed there. Symbols with sizes larger than -mlarge-data-threshold are put into large data or bss sections and can be located above 2GB. Programs can be statically or dynamically linked.

Similarly to the small code model, the medium code model assumes all code is linked into the low 2GB. Data, on the other hand, is divided into "large data" and "small data". Small data is also assumed to be linked into the low 2GB. Large data, on the other hand, is not restricted in its memory placement. Data is considered large when it’s larger than a given threshold option, which is 64KB by default.

It is also interesting to note that in the medium code model, special sections will be created for the large data – .ldata and .lbss (parallel to .data and .bss). It’s not really important for the sake of this article, however, so I’m going to sidestep the topic. Read the ABI for more details.

Now it should be clear why the sample C code has those _big arrays. These are meant for the medium code model to be considered as "large data" (which they certainly are, at 200KB each). Here’s the disassembly:

int main(int argc, const char* argv[])
{
  15: 55                      push   %rbp
  16: 48 89 e5                mov    %rsp,%rbp
  19: 48 83 ec 20             sub    $0x20,%rsp
  1d: 89 7d ec                mov    %edi,-0x14(%rbp)
  20: 48 89 75 e0             mov    %rsi,-0x20(%rbp)
    int t = global_func(argc);
  24: 8b 45 ec                mov    -0x14(%rbp),%eax
  27: 89 c7                   mov    %eax,%edi
  29: b8 00 00 00 00          mov    $0x0,%eax
  2e: e8 00 00 00 00          callq  33 <main+0x1e>
  33: 89 45 fc                mov    %eax,-0x4(%rbp)
    t += global_arr[7];
  36: 8b 05 00 00 00 00       mov    0x0(%rip),%eax
  3c: 01 45 fc                add    %eax,-0x4(%rbp)
    t += static_arr[7];
  3f: 8b 05 00 00 00 00       mov    0x0(%rip),%eax
  45: 01 45 fc                add    %eax,-0x4(%rbp)
    t += global_arr_big[7];
  48: 48 b8 00 00 00 00 00    movabs $0x0,%rax
  4f: 00 00 00
  52: 8b 40 1c                mov    0x1c(%rax),%eax
  55: 01 45 fc                add    %eax,-0x4(%rbp)
    t += static_arr_big[7];
  58: 48 b8 00 00 00 00 00    movabs $0x0,%rax
  5f: 00 00 00
  62: 8b 40 1c                mov    0x1c(%rax),%eax
  65: 01 45 fc                add    %eax,-0x4(%rbp)
    return t;
  68: 8b 45 fc                mov    -0x4(%rbp),%eax
}
  6b: c9                      leaveq
  6c: c3                      retq

Note that the _big arrays are accessed as in the large model, and the other arrays are accessed as in the small model. The function is also accessed as in the small model. I won’t even show the relocations since there’s nothing new in them either.

The medium model is a clever compromise between the small and large models. The program’s code is unlikely to be terribly big [4], so what might push it over the 2GB threshold is large pieces of data statically linked into it (perhaps for some sort of big lookup tables). The medium code model separates these large chunks of data from the rest and handles them specially. All code just calling functions and accessing the other, smaller symbols will be as efficient as in the small code model. Only the code actually accessing the large symbols will have to go the whole 64-bit way similarly to the large code model.

Small PIC code model

Let us now turn to the code models for PIC, starting once again with the small model [5]. Here’s the sample code, compiled with PIC and the small code model:

int main(int argc, const char* argv[])
{
  15:   55                      push   %rbp
  16:   48 89 e5                mov    %rsp,%rbp
  19:   48 83 ec 20             sub    $0x20,%rsp
  1d:   89 7d ec                mov    %edi,-0x14(%rbp)
  20:   48 89 75 e0             mov    %rsi,-0x20(%rbp)
    int t = global_func(argc);
  24:   8b 45 ec                mov    -0x14(%rbp),%eax
  27:   89 c7                   mov    %eax,%edi
  29:   b8 00 00 00 00          mov    $0x0,%eax
  2e:   e8 00 00 00 00          callq  33 <main+0x1e>
  33:   89 45 fc                mov    %eax,-0x4(%rbp)
    t += global_arr[7];
  36:   48 8b 05 00 00 00 00    mov    0x0(%rip),%rax
  3d:   8b 40 1c                mov    0x1c(%rax),%eax
  40:   01 45 fc                add    %eax,-0x4(%rbp)
    t += static_arr[7];
  43:   8b 05 00 00 00 00       mov    0x0(%rip),%eax
  49:   01 45 fc                add    %eax,-0x4(%rbp)
    t += global_arr_big[7];
  4c:   48 8b 05 00 00 00 00    mov    0x0(%rip),%rax
  53:   8b 40 1c                mov    0x1c(%rax),%eax
  56:   01 45 fc                add    %eax,-0x4(%rbp)
    t += static_arr_big[7];
  59:   8b 05 00 00 00 00       mov    0x0(%rip),%eax
  5f:   01 45 fc                add    %eax,-0x4(%rbp)
    return t;
  62:   8b 45 fc                mov    -0x4(%rbp),%eax
}
  65:   c9                      leaveq
  66:   c3                      retq

And the relocations:

Relocation section '.rela.text' at offset 0x62ce8 contains 5 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
00000000002f  001600000004 R_X86_64_PLT32    0000000000000000 global_func - 4
000000000039  001100000009 R_X86_64_GOTPCREL 0000000000000000 global_arr - 4
000000000045  000300000002 R_X86_64_PC32     0000000000000000 .data + 1b8
00000000004f  001200000009 R_X86_64_GOTPCREL 0000000000000340 global_arr_big - 4
00000000005b  000300000002 R_X86_64_PC32     0000000000000000 .data + 31098

Since the small vs. big data distinction plays no role in the small model, we’re going to focus on the difference between local (static) and global symbols, which does play a role when PIC is generated.

As you can see, the code generated for the static arrays is exactly equivalent to the code generated in the non-PIC case. This is one of the boons of the x64 architecture – unless symbols have to be accessed externally, you get PIC for free because of the RIP-relative addressing for data. The instructions and relocations used are the same, so we won’t go over them again.

The interesting case here is the global arrays. Recall that in PIC, global data has to go through GOT, because it may be eventually found or used in other shared libraries [6]. Here’s the code generated for accessing global_arr:

  t += global_arr[7];
36:   48 8b 05 00 00 00 00    mov    0x0(%rip),%rax
3d:   8b 40 1c                mov    0x1c(%rax),%eax
40:   01 45 fc                add    %eax,-0x4(%rbp)

And the relevant relocation is a R_X86_64_GOTPCREL, which means: the location of the entry for the symbol in the GOT + addend, minus the offset for applying the relocation. In other words, the relative offset between RIP (of the next instruction) and the slot reserved for global_arr in GOT is patched into the instruction. So what’s put into rax in the instruction at 0×36 is the actual address of global_arr. This is followed by dereferncing the address of global_arr plus an offset to its 7th element into eax.

Now let’s examine the function call:

  int t = global_func(argc);
24:   8b 45 ec                mov    -0x14(%rbp),%eax
27:   89 c7                   mov    %eax,%edi
29:   b8 00 00 00 00          mov    $0x0,%eax
2e:   e8 00 00 00 00          callq  33 <main+0x1e>
33:   89 45 fc                mov    %eax,-0x4(%rbp)

There’s a R_X86_64_PLT32 relocation for the operand of callq at 0×2e. This relocation means: the address of the PLT entry for the symbol + addend, minus the offset for applying the relocation. In other words, the callq should correctly call the PLT trampoline for global_func.

Note the implicit assumptions made by the compiler – that the GOT and PLT could be accessed with RIP-relative addresing. This will be important when comparing this model to the other PIC code models.

Large PIC code model

Here’s the disassembly:

int main(int argc, const char* argv[])
{
  15: 55                      push   %rbp
  16: 48 89 e5                mov    %rsp,%rbp
  19: 53                      push   %rbx
  1a: 48 83 ec 28             sub    $0x28,%rsp
  1e: 48 8d 1d f9 ff ff ff    lea    -0x7(%rip),%rbx
  25: 49 bb 00 00 00 00 00    movabs $0x0,%r11
  2c: 00 00 00
  2f: 4c 01 db                add    %r11,%rbx
  32: 89 7d dc                mov    %edi,-0x24(%rbp)
  35: 48 89 75 d0             mov    %rsi,-0x30(%rbp)
    int t = global_func(argc);
  39: 8b 45 dc                mov    -0x24(%rbp),%eax
  3c: 89 c7                   mov    %eax,%edi
  3e: b8 00 00 00 00          mov    $0x0,%eax
  43: 48 ba 00 00 00 00 00    movabs $0x0,%rdx
  4a: 00 00 00
  4d: 48 01 da                add    %rbx,%rdx
  50: ff d2                   callq  *%rdx
  52: 89 45 ec                mov    %eax,-0x14(%rbp)
    t += global_arr[7];
  55: 48 b8 00 00 00 00 00    movabs $0x0,%rax
  5c: 00 00 00
  5f: 48 8b 04 03             mov    (%rbx,%rax,1),%rax
  63: 8b 40 1c                mov    0x1c(%rax),%eax
  66: 01 45 ec                add    %eax,-0x14(%rbp)
    t += static_arr[7];
  69: 48 b8 00 00 00 00 00    movabs $0x0,%rax
  70: 00 00 00
  73: 8b 44 03 1c             mov    0x1c(%rbx,%rax,1),%eax
  77: 01 45 ec                add    %eax,-0x14(%rbp)
    t += global_arr_big[7];
  7a: 48 b8 00 00 00 00 00    movabs $0x0,%rax
  81: 00 00 00
  84: 48 8b 04 03             mov    (%rbx,%rax,1),%rax
  88: 8b 40 1c                mov    0x1c(%rax),%eax
  8b: 01 45 ec                add    %eax,-0x14(%rbp)
    t += static_arr_big[7];
  8e: 48 b8 00 00 00 00 00    movabs $0x0,%rax
  95: 00 00 00
  98: 8b 44 03 1c             mov    0x1c(%rbx,%rax,1),%eax
  9c: 01 45 ec                add    %eax,-0x14(%rbp)
    return t;
  9f: 8b 45 ec                mov    -0x14(%rbp),%eax
}
  a2: 48 83 c4 28             add    $0x28,%rsp
  a6: 5b                      pop    %rbx
  a7: c9                      leaveq
  a8: c3                      retq

And the relocations:

Relocation section '.rela.text' at offset 0x62c70 contains 6 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000000027  00150000001d R_X86_64_GOTPC64  0000000000000000 _GLOBAL_OFFSET_TABLE_ + 9
000000000045  00160000001f R_X86_64_PLTOFF64 0000000000000000 global_func + 0
000000000057  00110000001b R_X86_64_GOT64    0000000000000000 global_arr + 0
00000000006b  000800000019 R_X86_64_GOTOFF64 00000000000001a0 static_arr + 0
00000000007c  00120000001b R_X86_64_GOT64    0000000000000340 global_arr_big + 0
000000000090  000900000019 R_X86_64_GOTOFF64 0000000000031080 static_arr_big + 0

Again, the small vs. big data distinction isn’t important here, so we’ll focus on static_arr and global_arr. But first, there’s a new prologue in this code which we didn’t encounter earlier:

1e: 48 8d 1d f9 ff ff ff    lea    -0x7(%rip),%rbx
25: 49 bb 00 00 00 00 00    movabs $0x0,%r11
2c: 00 00 00
2f: 4c 01 db                add    %r11,%rbx

Here’s a relevant quote from the ABI:

In the small code model all addresses (including GOT entries) are accessible via
the IP-relative addressing provided by the AMD64 architecture. Hence there is no
need for an explicit GOT pointer and therefore no function prologue for setting it
up is necessary.
In the medium and large code models a register has to be allocated to hold
the address of the GOT in position-independent objects, because the AMD64 ISA
does not support an immediate displacement larger than 32 bits.

Let’s see how the prologue displayed above computes the address of GOT. First, the instruction at 0×1e loads its own address into rbx. Then, an absolute 64-bit move is done into r11, with a R_X86_64_GOTPC64 relocation. This relocation means: take the GOT address, subtract the relocated offset and add the addend. Finally, the instruction at 0×2f adds the two together. The result is the absolute address of GOT in rbx [7].

Why go through all this trouble to compute the address of GOT? Well, for one thing, as the quote says, in the large model we can’t assume that the 32-bit RIP relative offset will suffice to access GOT, so we need a full 64-bit address. On the other hand, we still want PIC, so we can’t just place an absolute address into the register. Rather, the address has to be computed relative to RIP. This is what the prologue does. It’s just a 64-bit RIP-relative computation.

Anyway, now we have the address of GOT firmly in our rbx, let’s see how static_arr is accessed:

  t += static_arr[7];
69:       48 b8 00 00 00 00 00    movabs $0x0,%rax
70:       00 00 00
73:       8b 44 03 1c             mov    0x1c(%rbx,%rax,1),%eax
77:       01 45 ec                add    %eax,-0x14(%rbp)

The relocation for the first instruction is R_X86_64_GOTOFF64, which means: symbol + addend – GOT. In our case: the relative offset between the address of static_arr and the address of GOT. The next instruction adds that to rbx (the absolute GOT address), and dereferences with a 0×1c offset. Here’s some pseudo-C to make this computation easier to visualize:

// char* static_arr
// char* GOT
rax = static_arr + 0 - GOT;  // rax now contains an offset
eax = *(rbx + rax + 0x1c);   // rbx == GOT, so eax now contains
                             // *(GOT + static_arr - GOT + 0x1c) or
                             // *(static_arr + 0x1c)

Note an interesting thing here: the GOT address is just used as an anchor to reach static_arr. This is unlike the normal usage of GOT to actually contain the address of a symbol within it. Since static_arr is not an external symbol, there’s no point keeping it inside the GOT. But still, GOT is used here as an anchor in the data section, relative to which the address of the symbol can be found with a full 64-bit offset, which is at the same time position independent (the linker will be able to resolve this relocation, leaving no need to modify the code section during loading).

How about global_arr?

  t += global_arr[7];
55:       48 b8 00 00 00 00 00    movabs $0x0,%rax
5c:       00 00 00
5f:       48 8b 04 03             mov    (%rbx,%rax,1),%rax
63:       8b 40 1c                mov    0x1c(%rax),%eax
66:       01 45 ec                add    %eax,-0x14(%rbp)

The code is a bit longer, and the relocation is also different. This is actually a more traditional use of GOT. The R_X86_64_GOT64 relocation for the movabs just tells it to place the offset into the GOT where the address of global_arr resides into rax. The instruction at 0×5f extracts the address of global_arr from the GOT and places it into rax. The next instruction dereferences global_arr[7], placing the value into eax.

Now let’s look at the code reference for global_func. Recall that in the large code model we can’t make any assumptions regarding the size of the code section, so we should assume that even to reach the PLT we need an absolute 64-bit address:

  int t = global_func(argc);
39: 8b 45 dc                mov    -0x24(%rbp),%eax
3c: 89 c7                   mov    %eax,%edi
3e: b8 00 00 00 00          mov    $0x0,%eax
43: 48 ba 00 00 00 00 00    movabs $0x0,%rdx
4a: 00 00 00
4d: 48 01 da                add    %rbx,%rdx
50: ff d2                   callq  *%rdx
52: 89 45 ec                mov    %eax,-0x14(%rbp)

The relevant relocation is a R_X86_64_PLTOFF64, which means: PLT entry address for global_func, minus GOT address. This is placed into rdx, into which rbx (the absolute address of GOT) is later added. The result is the PLT entry address for global_func in rdx.

Again, note the use of GOT as an "anchor" to enable position-independent reference to the PLT entry offset.

Medium PIC code model

Finally, we’ll examine the code generated for the medium PIC code model:

int main(int argc, const char* argv[])
{
  15:   55                      push   %rbp
  16:   48 89 e5                mov    %rsp,%rbp
  19:   53                      push   %rbx
  1a:   48 83 ec 28             sub    $0x28,%rsp
  1e:   48 8d 1d 00 00 00 00    lea    0x0(%rip),%rbx
  25:   89 7d dc                mov    %edi,-0x24(%rbp)
  28:   48 89 75 d0             mov    %rsi,-0x30(%rbp)
    int t = global_func(argc);
  2c:   8b 45 dc                mov    -0x24(%rbp),%eax
  2f:   89 c7                   mov    %eax,%edi
  31:   b8 00 00 00 00          mov    $0x0,%eax
  36:   e8 00 00 00 00          callq  3b <main+0x26>
  3b:   89 45 ec                mov    %eax,-0x14(%rbp)
    t += global_arr[7];
  3e:   48 8b 05 00 00 00 00    mov    0x0(%rip),%rax
  45:   8b 40 1c                mov    0x1c(%rax),%eax
  48:   01 45 ec                add    %eax,-0x14(%rbp)
    t += static_arr[7];
  4b:   8b 05 00 00 00 00       mov    0x0(%rip),%eax
  51:   01 45 ec                add    %eax,-0x14(%rbp)
    t += global_arr_big[7];
  54:   48 8b 05 00 00 00 00    mov    0x0(%rip),%rax
  5b:   8b 40 1c                mov    0x1c(%rax),%eax
  5e:   01 45 ec                add    %eax,-0x14(%rbp)
    t += static_arr_big[7];
  61:   48 b8 00 00 00 00 00    movabs $0x0,%rax
  68:   00 00 00
  6b:   8b 44 03 1c             mov    0x1c(%rbx,%rax,1),%eax
  6f:   01 45 ec                add    %eax,-0x14(%rbp)
    return t;
  72:   8b 45 ec                mov    -0x14(%rbp),%eax
}
  75:   48 83 c4 28             add    $0x28,%rsp
  79:   5b                      pop    %rbx
  7a:   c9                      leaveq
  7b:   c3                      retq

And the relocations:

Relocation section '.rela.text' at offset 0x62d60 contains 6 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000000021  00160000001a R_X86_64_GOTPC32  0000000000000000 _GLOBAL_OFFSET_TABLE_ - 4
000000000037  001700000004 R_X86_64_PLT32    0000000000000000 global_func - 4
000000000041  001200000009 R_X86_64_GOTPCREL 0000000000000000 global_arr - 4
00000000004d  000300000002 R_X86_64_PC32     0000000000000000 .data + 1b8
000000000057  001300000009 R_X86_64_GOTPCREL 0000000000000000 global_arr_big - 4
000000000063  000a00000019 R_X86_64_GOTOFF64 0000000000030d40 static_arr_big + 0

First, let’s clear the function call out of the way. Similarly to the small model, in the medium model we assume that code references are within the bounds of a 32-bit offset from RIP. Therefore, the code to call global_func is exactly similar to the small PIC model. The same goes for the small data arrays static_arr and global_arr. So we’ll focus on the big data arrays, but first let’s discuss the prologue, which is different from the large model:

1e:   48 8d 1d 00 00 00 00    lea    0x0(%rip),%rbx

That’s it, a single instruction (instead of the 3 it took in the large model) to get the address of GOT into rbx (with the help of a R_X86_64_GOTPC32 relocation). Why the difference? Because in the medium code model, we assume the GOT itself is reachable with a 32-bit offset, because it’s not part of the "big data sections". In the large code model we couldn’t make this assumption and had to use a full 64-bit offset to access the GOT.

Interestingly, we notice that the code to access global_arr_big is also similar to the small PIC model. Why? For the same reason the prologue is shorter than in the large model. In the medium model, we assume the GOT itself is reachable with 32-bit RIP-relative addressing. True, global_arr_big itself is not, but this is covered by the GOT anyway, since the address of global_arr_big actually resides in the GOT, and it’s a full 64-bit address there.

For static_arr_big, the situation is different, however:

  t += static_arr_big[7];
61:   48 b8 00 00 00 00 00    movabs $0x0,%rax
68:   00 00 00
6b:   8b 44 03 1c             mov    0x1c(%rbx,%rax,1),%eax
6f:   01 45 ec                add    %eax,-0x14(%rbp)

This is actually similar to the large PIC code model, because here we do obtain an absolute address for the symbol, which doesn’t reside in the GOT itself. Since this is a large symbol that can’t be assumed to reside in the low 2 GB, we need the 64-bit PIC offset here, similarly to the large model.

http://eli.thegreenplace.net/wp-content/uploads/hline.jpg

[1] Code models are not to be confused with 64-bit data models and Intel memory models, both of which are different topics.
[2] An important thing to keep in mind here: the actual instructions are created by the compiler, and the addressing modes are "cemented" at that stage. The compiler has no way to know into which programs or shared libs the object it’s compiling will eventually get into. Some may be small, but some may be large. The linker does know the size of the resulting program, but it’s too late at that point, since the linker can’t actually change the instructions, just patch offsets within them with relocations. Therefore, the code model "contract" has to be "signed" by the programmer at the compilation stage.
[3] If this isn’t clear, read this article.
[4] Although it’s getting there. Last time I checked, the Debug+Asserts build of Clang was almost half a GB in size (thanks to quite a bit of auto-generated code).
[5] Unless you already know how PIC works (both in general and for x64 in particular), this would be a good time to go over my earlier articles on this subject – #1 and #2
[6] So the linker can’t fully resolve the references on its own, and has to leave GOT handling to the dynamic loader.
[7] 0×25 – 0×7 + GOT – 0×27 + 0×9 = GOT

Website stats for 2011

January 2nd, 2012 at 6:01 am

2011 turned out to be a great year for this website/blog in terms of traffic, with almost twice the amount of visits compared to 2010.

The site was visited on average 135,000 times a month, by 56,000 unique visitors, who viewed 474,000 pages (all these are monthly averages). Most of the traffic came from the United States, with Germany, Great Britain and China lagging far behind.

There was also far more traffic coming from search engines (mostly Google, with Yahoo, Microsoft and Yandex sharing the crumbs) than from direct links (mostly from Planet Python, Reddit and Hacker news).

The most searched phrases that led to the site were “eli bendersky”, “python destructor”, “ruby blocks” and “python serial port”. I’m not sure whether it’s flattening or creepy.

Finally, the five most viewed pages this year (counting only direct hits, not through the main page) were:

  1. How debuggers work: Part 1 – Basics
  2. Where the top of the stack is on x86
  3. The many faces of operator new in C++
  4. Writing a game in Python with Pygame. Part I
  5. How debuggers work: Part 2 – Breakpoints

So, a big thanks is due to all the readers of my blog, those who read it regularly, leave comments, re-post and discuss it on aggregators like Reddit, send me emails and so on. Without such massive support I doubt I’d have so much motivation for keeping it running.

Happy new year, with a hope (fueling an intention) for even more quality content in 2012.

Summary of reading: October – December 2011

December 30th, 2011 at 8:08 am
  • "Neuland" by Eshkol Nevo (read in Hebrew) – a fast-paced novel taking place mostly in Israel and in South America. The plot is very ambitious, with a lot of motives. Not all are handled perfectly, but eventually the author manages to steer the book to a safe haven.
  • "Azazel" by Boris Akunin (read in Russian) – this book’s English translated name is "The winter queen". The first in the Fandorin series, a fast-moving detective story. Although enjoyable overall, I found this book a bit chaotic, with a lot of events happening too quickly.
  • "The Little Golden Calf" by Ilf and Petrov (read in Russian) – another satirical novel by these authors, somewhat similar to the "12 chairs" (and featuring the same protagonist). Also fun to read, addressing pretty much the same issues of early Soviet society. Somewhere in the second half it gets a bit slow and unfocused, but eventually the ending is very good.
  • "Start-up Nation, The Story of Israel’s Economic Miracle" by Dan Senor and Saul Singer – The main goal of this book is to explore the phenomenon that is Israel’s technological start-up scene (Israel has by far more per-capita startups and VC investments than any other nation on earth). It appears to be very well researched (with dozens of references, with sources, for each chapter) and also is quite well written. We Israelis have a lot of self-criticism, so I was fearing this book would be romanticizing Israel too much. Well, it does romanticize, but not too much :-) The authors also succeed in staying a-political, which is very important – keeping it objective can actually reach larger audiences. Nevertheless, no matter how you look at it, the book is awesome PR for Israel, which probably explains why they got Shimon Peres writing a forward for it, and interviews with many seniors in the country.
  • "The Art of Learning" by Josh Waitzkin – an auto-biographic book by a very impressive guy who was junior US chess champion for a few years, and as an older man (in his 20s) became a world champion in the combat form of Tai Chi Chuan. His aim in the book is to explain some of the insights he gained about learning and reaching excellence. The book is very well written and fun to read. However, I didn’t really find anything too profound in the author’s advice. The only real lesson I learned from this book is that to be really good, in addition to talent, one needs a lot of dedication and hard work. It’s incredible how much time Waitzkin was spending on chess as a child, and how much time he was spending on his martial arts training later in life, constantly striving to improve. Finishing one training day, pondering about his weak points and working for hours on each of them in the next days. It really shows you what it takes to be among the best in your craft. That’s inspiring.
  • "The Young Yagers" by Mayne Reid – this was a nostalgic read for me. I fondly recall days spent engrossed in Mayne Reid’s adventure books as a kid, so I decided to give it a shot as an adult. Alas, I probably picked the wrong book – this one is a loose collection of hunting stories (with apparent continuity, but really it’s unrelated stories strung together). Each story in itself is fun, very well written, with a lot of interesting details about the flora and fauna of Africa. But when reading them in succession, the stories quickly become boring so I had to pace my reading (reading no more than two or three stories per day). Mayne Reid is a very good writer – I just have to find one of his better books.
  • "The Greatest Show on Earth" by Richard Dawkins – Dawkins’s attempt to explain in layman language why evolution is correct and creationism is not. His style is characteristically militant towards creationism, and similarly to "The God Delusion" it’s hard to envision a devoted creationist reading past the first few pages. If anything is preaching to the choir – it’s this book. Not that there’s anything wrong with it, and the author freely admits that the main goal of the book is to provide "intellectual ammunition" to proponents of evolution in debates versus creationists. I personally am not really into debates, but I enjoyed this book a lot. It truly has a lot of interesting scientific information in it, written in Dawkins’s typical lively style; it’s lots of fun to read. The parts I liked best are the one about the (non-)missing fossil record, and the one about various imperfections in living bodies and why they exist.

New year’s Python meme 2011

December 29th, 2011 at 6:39 am

Following Tarek’s lead:

1. What’s the coolest Python application, framework or library you have discovered in 2011?

There wasn’t a single huge discovery, but I did enjoy using some libs I hadn’t previously used. In alphabetical order:

  • Found the Python bindings to Clang.
  • The colorama library for coloring console output was useful for pss.
  • Python bindings to protocol buffers came in handy at work, and is a great tool for the toolbox in general.
  • I finally had some chance to play with Twisted, writing some real code based on it (not for production yet).
  • I really like the new unit-testing capabilities of Python 2.7 (a.k.a. unittest2)

2. What new programming technique did you learn in 2011?

I got to work much more with continuous integration servers than before. Both at work (using Pulse and Jenkins) and in my Python development role (using Buildbot).

In addition, I got much more experience delivering real Python packages and using some helpful tools like virtualenv and tox for making the task of package development and testing easier. This year pycparser saw a few new releases, pss was released, and there’s another project still in stealth mode that also benefits from this experience.

3. What’s the name of the open source project you contributed the most in 2011? What did you do?

Not counting my own open-source projects, that would probably be CPython itself. I became a core developer with commit rights in January, and since then have worked on a lot of issues. Mainly in the standard library, tests and documentation, but also a bit in the core interpreter itself.

4. What was the Python blog or website you read the most in 2011?

I usually don’t subscribe to individual blogs, but use Planet Python and Reddit’s Python page as aggregators.

5. What are the three top things you want to learn in 2012?

  1. I gained a lot of understanding on the low-level workings of compilers, linkers, loaders and related technologies this year. I’d definitely want to learn even more.
  2. To understand the architecture of LLVM better than I do now.
  3. Javascript. Whether we like it or not, it’s the future.

6. What are the top software, app or lib you wish someone would write in 2012?

I would like to see even more code ported to Python 3.

It would also be nice to have much improved voice recognition for Android.