An interesting issue that comes up when writing code for the x64 architecture is which code model to use. This probably isn't a very well-known topic, but if one wants to understand the x64 machine code generated by compilers, it's educational to be familiar with code models. There are also implications for optimization, for those who really care about performance down to the smallest instruction.
There's very little information on this topic online or anywhere. By far the most important resource is the official x64 ABI, which you can obtain from the uclibc page (from now on I'm going to refer to it simply as "the ABI"). There's also a bit of information in the gcc man-pages. The aim of this article is to provide an approachable reference, with some discussion of the topic and concrete examples to demonstrate the concepts in real-life code.
An important disclaimer: this is not a tutorial for beginners. The prerequisites are a solid understanding of C and assembly language, plus a basic familiarity with the x64 architecture.
Code models - motivation
References to both code and data on x64 are done with instruction-relative (RIP-relative in x64 parlance) addressing modes. The offset from RIP in these instructions is limited to 32 bits. So what do we do when 32 bits are not enough? What if the program is larger than 2 GB? Then, a case can arise when an instruction attempting to address some piece of code (or data) just can't do it with its 32-bit offset from RIP.
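To make the 32-bit limit concrete, here's a minimal sketch in C of the range check implied above. The helper name fits_rel32 and the addresses used in the note after it are made up for illustration; the only fact it relies on is that the displacement is a signed 32-bit value measured from the address of the next instruction.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: can an instruction whose *next* instruction sits at
 * next_rip reach target with a signed 32-bit RIP-relative displacement? */
bool fits_rel32(uint64_t next_rip, uint64_t target)
{
    /* Unsigned subtraction wraps around; reinterpreting the result as
     * signed gives the displacement (two's complement assumed, which is
     * what every x64 compiler provides). */
    int64_t disp = (int64_t)(target - next_rip);
    return disp >= INT32_MIN && disp <= INT32_MAX;
}
```

With code at a hypothetical 0x400000, a target 0x7fffffff bytes ahead is still reachable, while one 0x80000000 bytes ahead is not; that second case is the one the code models exist to handle.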
One solution to this problem is to give up the RIP-relative addressing modes, and use absolute 64-bit offsets for all code and data references. But this has a high cost - more instructions are required to perform the simplest operations. It's a high cost to pay in all code just for the sake of the (very rare) case of extremely huge programs or libraries.
So, the compromise is code models [1]. A code model is a formal agreement between the programmer and the compiler, in which the programmer states their intentions about the size of the eventual program(s) into which the object file currently being compiled will be linked [2].
Code models exist for the programmer to be able to tell the compiler: don't worry, this object will only get into non-huge programs, so you can use the fast RIP-relative addressing modes. Conversely, the programmer can tell the compiler: this object is expected to be linked into huge programs, so please use the slow but safe absolute addressing modes with full 64-bit offsets.
What will be covered here
The two scenarios described above have names: the small code model promises to the compiler that 32-bit relative offsets should be enough for all code and data references in the compiled object. The large code model, on the other hand, tells it not to make any assumptions and use absolute 64-bit addressing modes for code and data references. To make things more interesting, there's also a middle road, called the medium code model.
These code models exist separately for non-PIC and PIC code. The article is going to discuss all 6 variations.
Example C source
I'll be using the following C program compiled with different code models to demonstrate the concepts discussed in the article. In this code, the main function accesses 4 different global arrays and one global function. The arrays differ by two parameters: size and visibility. The size is important to explain the medium code model and won't be used for the small and large models. Visibility is either static (visible only in this source file) or completely global (visible by all other objects linked into the program). This distinction is important for the PIC code models.
int global_arr[100] = {2, 3};
static int static_arr[100] = {9, 7};
int global_arr_big[50000] = {5, 6};
static int static_arr_big[50000] = {10, 20};
int global_func(int param)
{
    return param * 10;
}
int main(int argc, const char* argv[])
{
    int t = global_func(argc);
    t += global_arr[7];
    t += static_arr[7];
    t += global_arr_big[7];
    t += static_arr_big[7];
    return t;
}
gcc takes the code model as the value of the -mcmodel option. Additionally, PIC compilation can be specified with the -fpic flag.
For example, compiling it into an object file with the large code model and PIC enabled:
> gcc -g -O0 -c codemodel1.c -fpic -mcmodel=large -o codemodel1_large_pic.o
Small code model
Here's what man gcc has to say about the small code model:
-mcmodel=small
Generate code for the small code model: the program and its symbols must be linked in the lower 2 GB of the address space. Pointers are 64 bits. Programs can be statically or dynamically linked. This is the default code model.
In other words, the compiler is free to assume that all code and data can be accessed with 32-bit RIP-relative offsets from any instruction in the code. Let's see the disassembly of the example C program compiled in non-PIC small code model:
> objdump -dS codemodel1_small.o
[...]
int main(int argc, const char* argv[])
{
15: 55 push %rbp
16: 48 89 e5 mov %rsp,%rbp
19: 48 83 ec 20 sub $0x20,%rsp
1d: 89 7d ec mov %edi,-0x14(%rbp)
20: 48 89 75 e0 mov %rsi,-0x20(%rbp)
int t = global_func(argc);
24: 8b 45 ec mov -0x14(%rbp),%eax
27: 89 c7 mov %eax,%edi
29: b8 00 00 00 00 mov $0x0,%eax
2e: e8 00 00 00 00 callq 33 <main+0x1e>
33: 89 45 fc mov %eax,-0x4(%rbp)
t += global_arr[7];
36: 8b 05 00 00 00 00 mov 0x0(%rip),%eax
3c: 01 45 fc add %eax,-0x4(%rbp)
t += static_arr[7];
3f: 8b 05 00 00 00 00 mov 0x0(%rip),%eax
45: 01 45 fc add %eax,-0x4(%rbp)
t += global_arr_big[7];
48: 8b 05 00 00 00 00 mov 0x0(%rip),%eax
4e: 01 45 fc add %eax,-0x4(%rbp)
t += static_arr_big[7];
51: 8b 05 00 00 00 00 mov 0x0(%rip),%eax
57: 01 45 fc add %eax,-0x4(%rbp)
return t;
5a: 8b 45 fc mov -0x4(%rbp),%eax
}
5d: c9 leaveq
5e: c3 retq
As we can see, all arrays are accessed in exactly the same manner - by using a simple RIP-relative offset. However, the offset in the code is 0, because the compiler doesn't know where the data section will be placed. So it also creates a relocation for each such access:
> readelf -r codemodel1_small.o
Relocation section '.rela.text' at offset 0x62bd8 contains 5 entries:
Offset Info Type Sym. Value Sym. Name + Addend
00000000002f 001500000002 R_X86_64_PC32 0000000000000000 global_func - 4
000000000038 001100000002 R_X86_64_PC32 0000000000000000 global_arr + 18
000000000041 000300000002 R_X86_64_PC32 0000000000000000 .data + 1b8
00000000004a 001200000002 R_X86_64_PC32 0000000000000340 global_arr_big + 18
000000000053 000300000002 R_X86_64_PC32 0000000000000000 .data + 31098
Let's fully decode the access to global_arr as an example. Here's the relevant part of the disassembly again:
t += global_arr[7];
36: 8b 05 00 00 00 00 mov 0x0(%rip),%eax
3c: 01 45 fc add %eax,-0x4(%rbp)
RIP-relative addressing is relative to the next instruction, so the offset patched into the mov instruction should be relative to 0x3c. The relevant relocation is the second one, pointing to the operand of mov at 0x38. It's R_X86_64_PC32, which means: take the symbol value, add the addend and subtract the offset this relocation points to. Doing the math: global_arr + 0x18 - 0x38 equals (global_arr + 0x1c) - 0x3c, that is, the relative offset between the next instruction and global_arr, plus 0x1c. This relative offset is just what we need, since 0x1c is the byte offset of the element at index 7 (each int is 4 bytes long on x64). So the instruction correctly references global_arr[7] using RIP-relative addressing.
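The arithmetic above can be replayed in a few lines of C. This is only a sketch: the function names are mine, and the load addresses in the note below (code at 0x400000, global_arr at 0x601040) are invented, since the real ones are only known at link time.

```c
#include <stdint.h>

/* R_X86_64_PC32: value = S + A - P, kept as a signed 32-bit field.
 * S = symbol address, A = addend, P = address of the field being patched. */
int32_t reloc_pc32(uint64_t S, int64_t A, uint64_t P)
{
    return (int32_t)(uint32_t)(S + (uint64_t)A - P);
}

/* What the CPU computes at run time: next RIP plus the displacement. */
uint64_t rip_relative_target(uint64_t next_rip, int32_t disp)
{
    return next_rip + (uint64_t)(int64_t)disp;
}
```

With those invented addresses, the field at 0x400038 gets 0x601040 + 0x18 - 0x400038 = 0x201020, and the next instruction at 0x40003c plus 0x201020 is 0x60105c, which is global_arr + 0x1c. The same two functions describe the callq: with addend -4 the displacement lands exactly on the symbol, because the field is the last 4 bytes of the instruction.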
Another interesting thing to note here is that although the instructions for accessing static_arr are similar, its relocation has a different symbol, pointing to the .data section instead of the specific symbol. This is because the static array is placed by the linker in the .data section at a known location - it isn't visible to other objects or shared libraries. This relocation will eventually get fully resolved by the linker. On the other hand, the reference to global_arr will be left to the dynamic loader to resolve, since global_arr can actually be used by (or overridden by) a different shared library [3].
Finally, let's look at the reference to global_func:
int t = global_func(argc);
24: 8b 45 ec mov -0x14(%rbp),%eax
27: 89 c7 mov %eax,%edi
29: b8 00 00 00 00 mov $0x0,%eax
2e: e8 00 00 00 00 callq 33 <main+0x1e>
33: 89 45 fc mov %eax,-0x4(%rbp)
The operand of a callq is also RIP-relative, so the R_X86_64_PC32 relocation here works similarly to place the actual relative offset to global_func into the operand.
To conclude, since the small code model promises the compiler that all code and data in the eventual program can be accessible with 32-bit RIP-relative offsets, the compiler can generate simple and efficient code for accessing all kinds of objects.
Large code model
From man gcc:
-mcmodel=large
Generate code for the large model: This model makes no assumptions about addresses and sizes of sections.
Here's the disassembled code of main when compiled with the non-PIC large code model:
int main(int argc, const char* argv[])
{
15: 55 push %rbp
16: 48 89 e5 mov %rsp,%rbp
19: 48 83 ec 20 sub $0x20,%rsp
1d: 89 7d ec mov %edi,-0x14(%rbp)
20: 48 89 75 e0 mov %rsi,-0x20(%rbp)
int t = global_func(argc);
24: 8b 45 ec mov -0x14(%rbp),%eax
27: 89 c7 mov %eax,%edi
29: b8 00 00 00 00 mov $0x0,%eax
2e: 48 ba 00 00 00 00 00 movabs $0x0,%rdx
35: 00 00 00
38: ff d2 callq *%rdx
3a: 89 45 fc mov %eax,-0x4(%rbp)
t += global_arr[7];
3d: 48 b8 00 00 00 00 00 movabs $0x0,%rax
44: 00 00 00
47: 8b 40 1c mov 0x1c(%rax),%eax
4a: 01 45 fc add %eax,-0x4(%rbp)
t += static_arr[7];
4d: 48 b8 00 00 00 00 00 movabs $0x0,%rax
54: 00 00 00
57: 8b 40 1c mov 0x1c(%rax),%eax
5a: 01 45 fc add %eax,-0x4(%rbp)
t += global_arr_big[7];
5d: 48 b8 00 00 00 00 00 movabs $0x0,%rax
64: 00 00 00
67: 8b 40 1c mov 0x1c(%rax),%eax
6a: 01 45 fc add %eax,-0x4(%rbp)
t += static_arr_big[7];
6d: 48 b8 00 00 00 00 00 movabs $0x0,%rax
74: 00 00 00
77: 8b 40 1c mov 0x1c(%rax),%eax
7a: 01 45 fc add %eax,-0x4(%rbp)
return t;
7d: 8b 45 fc mov -0x4(%rbp),%eax
}
80: c9 leaveq
81: c3 retq
Again, looking at the relocations will be useful:
Relocation section '.rela.text' at offset 0x62c18 contains 5 entries:
Offset Info Type Sym. Value Sym. Name + Addend
000000000030 001500000001 R_X86_64_64 0000000000000000 global_func + 0
00000000003f 001100000001 R_X86_64_64 0000000000000000 global_arr + 0
00000000004f 000300000001 R_X86_64_64 0000000000000000 .data + 1a0
00000000005f 001200000001 R_X86_64_64 0000000000000340 global_arr_big + 0
00000000006f 000300000001 R_X86_64_64 0000000000000000 .data + 31080
The large code model is also quite uniform - no assumptions can be made about the size of the code and data sections, so all data is accessed similarly. Let's pick global_arr once again:
t += global_arr[7];
3d: 48 b8 00 00 00 00 00 movabs $0x0,%rax
44: 00 00 00
47: 8b 40 1c mov 0x1c(%rax),%eax
4a: 01 45 fc add %eax,-0x4(%rbp)
Here two instructions are needed to pull the desired value from the array. The first places an absolute 64-bit address into rax. This is the address of global_arr, as we shall soon see. The second loads the word at (rax) + 0x1c into eax.
So, let's focus on the instruction at 0x3d. It's a movabs - the absolute 64-bit version of mov on x64. It can load a full 64-bit immediate into a register. The value of this immediate in the disassembled code is 0, so we have to turn to the relocation table for the answer. It has an R_X86_64_64 relocation for the operand at 0x3f. This is an absolute relocation, which simply means - place the symbol value + addend back into the offset. In other words, rax will hold the absolute address of global_arr.
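Here's a small sketch of what applying such a relocation amounts to, assuming the instruction bytes are available in a buffer. The function name and the symbol address used in the note are hypothetical; the real work is done by the linker.

```c
#include <stddef.h>
#include <stdint.h>

/* R_X86_64_64: value = S + A, stored as a 64-bit little-endian field
 * starting at field_offset inside the code buffer. */
void apply_abs64(uint8_t *code, size_t field_offset, uint64_t S, int64_t A)
{
    uint64_t value = S + (uint64_t)A;
    for (size_t i = 0; i < 8; i++)
        code[field_offset + i] = (uint8_t)(value >> (8 * i));
}
```

For the movabs at 0x3d the opcode bytes are 48 b8, so the field starts 2 bytes into the instruction (0x3f). Patching it with a made-up address such as 0x123456780 leaves the opcode untouched and fills the immediate with 80 67 45 23 01 00 00 00.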
What about the function call?
int t = global_func(argc);
24: 8b 45 ec mov -0x14(%rbp),%eax
27: 89 c7 mov %eax,%edi
29: b8 00 00 00 00 mov $0x0,%eax
2e: 48 ba 00 00 00 00 00 movabs $0x0,%rdx
35: 00 00 00
38: ff d2 callq *%rdx
3a: 89 45 fc mov %eax,-0x4(%rbp)
After a familiar movabs, we have a call instruction that calls a function whose address is in rdx. From a glance at the relevant relocation it's obvious that this is very similar to the data access.
Evidently, the large code model makes absolutely no assumptions about the sizes of code and data sections, or where symbols might end up. It just takes the "safe road" everywhere, using absolute 64-bit moves to refer to symbols. This has a cost, of course. Notice that it now takes one extra instruction to access any symbol, when compared to the small model.
So, we've just witnessed two extremes. The small model happily assumes everything fits into the lower 2GB of memory, and the large model assumes everything is possible and any symbol can reside anywhere in the full 64-bit address space. The medium code model is a compromise.
Medium code model
As before, let's start with a quote from man gcc:
-mcmodel=medium
Generate code for the medium model: The program is linked in the lower 2 GB of the address space. Small symbols are also placed there. Symbols with sizes larger than -mlarge-data-threshold are put into large data or bss sections and can be located above 2GB. Programs can be statically or dynamically linked.
Similarly to the small code model, the medium code model assumes all code is linked into the low 2GB. Data, on the other hand, is divided into "large data" and "small data". Small data is also assumed to be linked into the low 2GB. Large data, on the other hand, is not restricted in its memory placement. Data is considered large when it's larger than a given threshold option, which is 64KB by default.
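As a quick illustration of the small/large split, here's a sketch of the size test. The helper is hypothetical, and the default of 65535 bytes is my reading of the GCC documentation for -mlarge-data-threshold (it is the "64KB" mentioned above); treat it as an assumption and check your compiler's docs.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed default of -mlarge-data-threshold, in bytes (just under 64KB). */
#define LARGE_DATA_THRESHOLD_DEFAULT 65535u

/* Hypothetical helper mirroring the rule "larger than the threshold". */
bool is_large_data(size_t object_size, size_t threshold)
{
    return object_size > threshold;
}
```

Under this rule the 100-element arrays in the sample (400 bytes each) count as small data, while the 50000-element ones (200000 bytes each) count as large data.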
It is also interesting to note that in the medium code model, special sections will be created for the large data - .ldata and .lbss (parallel to .data and .bss). It's not really important for the sake of this article, however, so I'm going to sidestep the topic. Read the ABI for more details.
Now it should be clear why the sample C code has those _big arrays. These are meant for the medium code model to be considered as "large data" (which they certainly are, at 200KB each). Here's the disassembly:
int main(int argc, const char* argv[])
{
15: 55 push %rbp
16: 48 89 e5 mov %rsp,%rbp
19: 48 83 ec 20 sub $0x20,%rsp
1d: 89 7d ec mov %edi,-0x14(%rbp)
20: 48 89 75 e0 mov %rsi,-0x20(%rbp)
int t = global_func(argc);
24: 8b 45 ec mov -0x14(%rbp),%eax
27: 89 c7 mov %eax,%edi
29: b8 00 00 00 00 mov $0x0,%eax
2e: e8 00 00 00 00 callq 33 <main+0x1e>
33: 89 45 fc mov %eax,-0x4(%rbp)
t += global_arr[7];
36: 8b 05 00 00 00 00 mov 0x0(%rip),%eax
3c: 01 45 fc add %eax,-0x4(%rbp)
t += static_arr[7];
3f: 8b 05 00 00 00 00 mov 0x0(%rip),%eax
45: 01 45 fc add %eax,-0x4(%rbp)
t += global_arr_big[7];
48: 48 b8 00 00 00 00 00 movabs $0x0,%rax
4f: 00 00 00
52: 8b 40 1c mov 0x1c(%rax),%eax
55: 01 45 fc add %eax,-0x4(%rbp)
t += static_arr_big[7];
58: 48 b8 00 00 00 00 00 movabs $0x0,%rax
5f: 00 00 00
62: 8b 40 1c mov 0x1c(%rax),%eax
65: 01 45 fc add %eax,-0x4(%rbp)
return t;
68: 8b 45 fc mov -0x4(%rbp),%eax
}
6b: c9 leaveq
6c: c3 retq
Note that the _big arrays are accessed as in the large model, and the other arrays are accessed as in the small model. The function is also accessed as in the small model. I won't even show the relocations since there's nothing new in them either.
The medium model is a clever compromise between the small and large models. The program's code is unlikely to be terribly big [4], so what might push it over the 2GB threshold is large pieces of data statically linked into it (perhaps for some sort of big lookup tables). The medium code model separates these large chunks of data from the rest and handles them specially. All code just calling functions and accessing the other, smaller symbols will be as efficient as in the small code model. Only the code actually accessing the large symbols will have to go the whole 64-bit way similarly to the large code model.
Small PIC code model
Let us now turn to the code models for PIC, starting once again with the small model [5]. Here's the sample code, compiled with PIC and the small code model:
int main(int argc, const char* argv[])
{
15: 55 push %rbp
16: 48 89 e5 mov %rsp,%rbp
19: 48 83 ec 20 sub $0x20,%rsp
1d: 89 7d ec mov %edi,-0x14(%rbp)
20: 48 89 75 e0 mov %rsi,-0x20(%rbp)
int t = global_func(argc);
24: 8b 45 ec mov -0x14(%rbp),%eax
27: 89 c7 mov %eax,%edi
29: b8 00 00 00 00 mov $0x0,%eax
2e: e8 00 00 00 00 callq 33 <main+0x1e>
33: 89 45 fc mov %eax,-0x4(%rbp)
t += global_arr[7];
36: 48 8b 05 00 00 00 00 mov 0x0(%rip),%rax
3d: 8b 40 1c mov 0x1c(%rax),%eax
40: 01 45 fc add %eax,-0x4(%rbp)
t += static_arr[7];
43: 8b 05 00 00 00 00 mov 0x0(%rip),%eax
49: 01 45 fc add %eax,-0x4(%rbp)
t += global_arr_big[7];
4c: 48 8b 05 00 00 00 00 mov 0x0(%rip),%rax
53: 8b 40 1c mov 0x1c(%rax),%eax
56: 01 45 fc add %eax,-0x4(%rbp)
t += static_arr_big[7];
59: 8b 05 00 00 00 00 mov 0x0(%rip),%eax
5f: 01 45 fc add %eax,-0x4(%rbp)
return t;
62: 8b 45 fc mov -0x4(%rbp),%eax
}
65: c9 leaveq
66: c3 retq
And the relocations:
Relocation section '.rela.text' at offset 0x62ce8 contains 5 entries:
Offset Info Type Sym. Value Sym. Name + Addend
00000000002f 001600000004 R_X86_64_PLT32 0000000000000000 global_func - 4
000000000039 001100000009 R_X86_64_GOTPCREL 0000000000000000 global_arr - 4
000000000045 000300000002 R_X86_64_PC32 0000000000000000 .data + 1b8
00000000004f 001200000009 R_X86_64_GOTPCREL 0000000000000340 global_arr_big - 4
00000000005b 000300000002 R_X86_64_PC32 0000000000000000 .data + 31098
Since the small vs. big data distinction plays no role in the small model, we're going to focus on the difference between local (static) and global symbols, which does play a role when PIC is generated.
As you can see, the code generated for the static arrays is exactly equivalent to the code generated in the non-PIC case. This is one of the boons of the x64 architecture - unless symbols have to be accessed externally, you get PIC for free because of the RIP-relative addressing for data. The instructions and relocations used are the same, so we won't go over them again.
The interesting case here is the global arrays. Recall that in PIC, global data has to go through GOT, because it may be eventually found or used in other shared libraries [6]. Here's the code generated for accessing global_arr:
t += global_arr[7];
36: 48 8b 05 00 00 00 00 mov 0x0(%rip),%rax
3d: 8b 40 1c mov 0x1c(%rax),%eax
40: 01 45 fc add %eax,-0x4(%rbp)
And the relevant relocation is an R_X86_64_GOTPCREL, which means: the location of the entry for the symbol in the GOT + addend, minus the offset for applying the relocation. In other words, the relative offset between RIP (of the next instruction) and the slot reserved for global_arr in GOT is patched into the instruction. So what's put into rax in the instruction at 0x36 is the actual address of global_arr. This is followed by dereferencing the address of global_arr plus the offset of the element at index 7 into eax.
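The two-instruction sequence can be modeled in plain C by treating the GOT slot as a pointer variable. This is a sketch with a hypothetical function name; it assumes a 4-byte int, as on x64, and stands in for what the hardware does rather than for any real loader API.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Models: mov disp(%rip),%rax ; mov 0x1c(%rax),%eax
 * got_slot is the location the RIP-relative displacement points at. */
int32_t load_through_got_slot(const void *const *got_slot, size_t byte_offset)
{
    /* First mov: fetch the symbol's address out of its GOT slot. */
    const unsigned char *sym = *got_slot;

    /* Second mov: load a 32-bit value at the given byte offset. */
    int32_t value;
    memcpy(&value, sym + byte_offset, sizeof value);
    return value;
}
```

The extra memory access on the first line is the price of PIC for global data: the code never contains the symbol's address, only the distance to a slot that the dynamic loader fills in.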
Now let's examine the function call:
int t = global_func(argc);
24: 8b 45 ec mov -0x14(%rbp),%eax
27: 89 c7 mov %eax,%edi
29: b8 00 00 00 00 mov $0x0,%eax
2e: e8 00 00 00 00 callq 33 <main+0x1e>
33: 89 45 fc mov %eax,-0x4(%rbp)
There's a R_X86_64_PLT32 relocation for the operand of callq at 0x2e. This relocation means: the address of the PLT entry for the symbol + addend, minus the offset for applying the relocation. In other words, the callq should correctly call the PLT trampoline for global_func.
Note the implicit assumptions made by the compiler - that the GOT and PLT could be accessed with RIP-relative addressing. This will be important when comparing this model to the other PIC code models.
Large PIC code model
Here's the disassembly:
int main(int argc, const char* argv[])
{
15: 55 push %rbp
16: 48 89 e5 mov %rsp,%rbp
19: 53 push %rbx
1a: 48 83 ec 28 sub $0x28,%rsp
1e: 48 8d 1d f9 ff ff ff lea -0x7(%rip),%rbx
25: 49 bb 00 00 00 00 00 movabs $0x0,%r11
2c: 00 00 00
2f: 4c 01 db add %r11,%rbx
32: 89 7d dc mov %edi,-0x24(%rbp)
35: 48 89 75 d0 mov %rsi,-0x30(%rbp)
int t = global_func(argc);
39: 8b 45 dc mov -0x24(%rbp),%eax
3c: 89 c7 mov %eax,%edi
3e: b8 00 00 00 00 mov $0x0,%eax
43: 48 ba 00 00 00 00 00 movabs $0x0,%rdx
4a: 00 00 00
4d: 48 01 da add %rbx,%rdx
50: ff d2 callq *%rdx
52: 89 45 ec mov %eax,-0x14(%rbp)
t += global_arr[7];
55: 48 b8 00 00 00 00 00 movabs $0x0,%rax
5c: 00 00 00
5f: 48 8b 04 03 mov (%rbx,%rax,1),%rax
63: 8b 40 1c mov 0x1c(%rax),%eax
66: 01 45 ec add %eax,-0x14(%rbp)
t += static_arr[7];
69: 48 b8 00 00 00 00 00 movabs $0x0,%rax
70: 00 00 00
73: 8b 44 03 1c mov 0x1c(%rbx,%rax,1),%eax
77: 01 45 ec add %eax,-0x14(%rbp)
t += global_arr_big[7];
7a: 48 b8 00 00 00 00 00 movabs $0x0,%rax
81: 00 00 00
84: 48 8b 04 03 mov (%rbx,%rax,1),%rax
88: 8b 40 1c mov 0x1c(%rax),%eax
8b: 01 45 ec add %eax,-0x14(%rbp)
t += static_arr_big[7];
8e: 48 b8 00 00 00 00 00 movabs $0x0,%rax
95: 00 00 00
98: 8b 44 03 1c mov 0x1c(%rbx,%rax,1),%eax
9c: 01 45 ec add %eax,-0x14(%rbp)
return t;
9f: 8b 45 ec mov -0x14(%rbp),%eax
}
a2: 48 83 c4 28 add $0x28,%rsp
a6: 5b pop %rbx
a7: c9 leaveq
a8: c3 retq
And the relocations:
Relocation section '.rela.text' at offset 0x62c70 contains 6 entries:
Offset Info Type Sym. Value Sym. Name + Addend
000000000027 00150000001d R_X86_64_GOTPC64 0000000000000000 _GLOBAL_OFFSET_TABLE_ + 9
000000000045 00160000001f R_X86_64_PLTOFF64 0000000000000000 global_func + 0
000000000057 00110000001b R_X86_64_GOT64 0000000000000000 global_arr + 0
00000000006b 000800000019 R_X86_64_GOTOFF64 00000000000001a0 static_arr + 0
00000000007c 00120000001b R_X86_64_GOT64 0000000000000340 global_arr_big + 0
000000000090 000900000019 R_X86_64_GOTOFF64 0000000000031080 static_arr_big + 0
Again, the small vs. big data distinction isn't important here, so we'll focus on static_arr and global_arr. But first, there's a new prologue in this code which we didn't encounter earlier:
1e: 48 8d 1d f9 ff ff ff lea -0x7(%rip),%rbx
25: 49 bb 00 00 00 00 00 movabs $0x0,%r11
2c: 00 00 00
2f: 4c 01 db add %r11,%rbx
Here's a relevant quote from the ABI:
In the small code model all addresses (including GOT entries) are accessible via the IP-relative addressing provided by the AMD64 architecture. Hence there is no need for an explicit GOT pointer and therefore no function prologue for setting it up is necessary. In the medium and large code models a register has to be allocated to hold the address of the GOT in position-independent objects, because the AMD64 ISA does not support an immediate displacement larger than 32 bits.
Let's see how the prologue displayed above computes the address of GOT. First, the instruction at 0x1e loads its own address into rbx. Then, an absolute 64-bit move is done into r11, with a R_X86_64_GOTPC64 relocation. This relocation means: take the GOT address, subtract the relocated offset and add the addend. Finally, the instruction at 0x2f adds the two together. The result is the absolute address of GOT in rbx [7].
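To check that the three instructions really produce the GOT address wherever the code ends up, here's a sketch that replays them with ordinary integer arithmetic. The function names and the load_base parameter (the address at which offset 0 of this code lands) are assumptions for illustration; the offsets 0x25, 0x27 and the addend 9 come from the listing and relocation above.

```c
#include <stdint.h>

/* R_X86_64_GOTPC64: value = GOT + A - P, as a full 64-bit quantity. */
uint64_t reloc_gotpc64(uint64_t got, int64_t A, uint64_t P)
{
    return got + (uint64_t)A - P;
}

/* Replays the prologue for code loaded at load_base (hypothetical). */
uint64_t prologue_got_address(uint64_t load_base, uint64_t got)
{
    /* lea -0x7(%rip),%rbx : RIP is the next instruction, at 0x25 */
    uint64_t rbx = (load_base + 0x25) - 0x7;

    /* movabs $imm64,%r11 : the immediate at 0x27 holds the relocation value */
    uint64_t r11 = reloc_gotpc64(got, 0x9, load_base + 0x27);

    /* add %r11,%rbx */
    return rbx + r11;
}
```

Whatever load_base is, it cancels out: (load_base + 0x1e) + (GOT + 9 - load_base - 0x27) = GOT. That is the position independence the prologue is after, and it holds even when GOT is more than 2GB away from the code.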
Why go through all this trouble to compute the address of GOT? Well, for one thing, as the quote says, in the large model we can't assume that the 32-bit RIP relative offset will suffice to access GOT, so we need a full 64-bit address. On the other hand, we still want PIC, so we can't just place an absolute address into the register. Rather, the address has to be computed relative to RIP. This is what the prologue does. It's just a 64-bit RIP-relative computation.
Anyway, now that we have the address of GOT firmly in our rbx, let's see how static_arr is accessed:
t += static_arr[7];
69: 48 b8 00 00 00 00 00 movabs $0x0,%rax
70: 00 00 00
73: 8b 44 03 1c mov 0x1c(%rbx,%rax,1),%eax
77: 01 45 ec add %eax,-0x14(%rbp)
The relocation for the first instruction is R_X86_64_GOTOFF64, which means: symbol + addend - GOT. In our case: the relative offset between the address of static_arr and the address of GOT. The next instruction adds that to rbx (the absolute GOT address), and dereferences with a 0x1c offset. Here's some pseudo-C to make this computation easier to visualize:
// char* static_arr
// char* GOT
rax = static_arr + 0 - GOT; // rax now contains an offset
eax = *(rbx + rax + 0x1c); // rbx == GOT, so eax now contains
// *(GOT + static_arr - GOT + 0x1c) or
// *(static_arr + 0x1c)
Note an interesting thing here: the GOT address is just used as an anchor to reach static_arr. This is unlike the normal usage of GOT to actually contain the address of a symbol within it. Since static_arr is not an external symbol, there's no point keeping it inside the GOT. But still, GOT is used here as an anchor in the data section, relative to which the address of the symbol can be found with a full 64-bit offset, which is at the same time position independent (the linker will be able to resolve this relocation, leaving no need to modify the code section during loading).
How about global_arr?
t += global_arr[7];
55: 48 b8 00 00 00 00 00 movabs $0x0,%rax
5c: 00 00 00
5f: 48 8b 04 03 mov (%rbx,%rax,1),%rax
63: 8b 40 1c mov 0x1c(%rax),%eax
66: 01 45 ec add %eax,-0x14(%rbp)
The code is a bit longer, and the relocation is also different. This is actually a more traditional use of GOT. The R_X86_64_GOT64 relocation for the movabs just tells it to place the offset into the GOT where the address of global_arr resides into rax. The instruction at 0x5f extracts the address of global_arr from the GOT and places it into rax. The next instruction dereferences global_arr[7], placing the value into eax.
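The two access patterns in the large PIC model can be put side by side in runnable C, using an ordinary array as a stand-in for the GOT. This is only a model: the function names are hypothetical, rbx_got plays the role of rbx after the prologue, and a 4-byte int is assumed, as on x64.

```c
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value from an address held in an integer "register". */
static int32_t load32(uintptr_t addr)
{
    int32_t v;
    memcpy(&v, (const void *)addr, sizeof v);
    return v;
}

/* Static symbol: movabs $gotoff,%rax ; mov 0x1c(%rbx,%rax,1),%eax
 * gotoff is what R_X86_64_GOTOFF64 yields: symbol + addend - GOT. */
int32_t load_static_large_pic(uintptr_t rbx_got, uintptr_t gotoff, uintptr_t disp)
{
    return load32(rbx_got + gotoff + disp);
}

/* Global symbol: movabs $slot,%rax ; mov (%rbx,%rax,1),%rax ; mov 0x1c(%rax),%eax
 * slot is what R_X86_64_GOT64 yields: the slot's offset from the GOT base. */
int32_t load_global_large_pic(uintptr_t rbx_got, uintptr_t slot, uintptr_t disp)
{
    uintptr_t sym_addr;
    memcpy(&sym_addr, (const void *)(rbx_got + slot), sizeof sym_addr);
    return load32(sym_addr + disp);
}
```

The static case uses the GOT only as an anchor and needs one memory access; the global case reads the symbol's address out of the GOT first, which is why its code is one instruction longer.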
Now let's look at the code reference for global_func. Recall that in the large code model we can't make any assumptions regarding the size of the code section, so we should assume that even to reach the PLT we need an absolute 64-bit address:
int t = global_func(argc);
39: 8b 45 dc mov -0x24(%rbp),%eax
3c: 89 c7 mov %eax,%edi
3e: b8 00 00 00 00 mov $0x0,%eax
43: 48 ba 00 00 00 00 00 movabs $0x0,%rdx
4a: 00 00 00
4d: 48 01 da add %rbx,%rdx
50: ff d2 callq *%rdx
52: 89 45 ec mov %eax,-0x14(%rbp)
The relevant relocation is an R_X86_64_PLTOFF64, which means: PLT entry address for global_func, minus GOT address. This is placed into rdx, to which rbx (the absolute address of GOT) is then added. The result is the PLT entry address for global_func in rdx.
Again, note the use of GOT as an "anchor" to enable position-independent reference to the PLT entry offset.
Medium PIC code model
Finally, we'll examine the code generated for the medium PIC code model:
int main(int argc, const char* argv[])
{
15: 55 push %rbp
16: 48 89 e5 mov %rsp,%rbp
19: 53 push %rbx
1a: 48 83 ec 28 sub $0x28,%rsp
1e: 48 8d 1d 00 00 00 00 lea 0x0(%rip),%rbx
25: 89 7d dc mov %edi,-0x24(%rbp)
28: 48 89 75 d0 mov %rsi,-0x30(%rbp)
int t = global_func(argc);
2c: 8b 45 dc mov -0x24(%rbp),%eax
2f: 89 c7 mov %eax,%edi
31: b8 00 00 00 00 mov $0x0,%eax
36: e8 00 00 00 00 callq 3b <main+0x26>
3b: 89 45 ec mov %eax,-0x14(%rbp)
t += global_arr[7];
3e: 48 8b 05 00 00 00 00 mov 0x0(%rip),%rax
45: 8b 40 1c mov 0x1c(%rax),%eax
48: 01 45 ec add %eax,-0x14(%rbp)
t += static_arr[7];
4b: 8b 05 00 00 00 00 mov 0x0(%rip),%eax
51: 01 45 ec add %eax,-0x14(%rbp)
t += global_arr_big[7];
54: 48 8b 05 00 00 00 00 mov 0x0(%rip),%rax
5b: 8b 40 1c mov 0x1c(%rax),%eax
5e: 01 45 ec add %eax,-0x14(%rbp)
t += static_arr_big[7];
61: 48 b8 00 00 00 00 00 movabs $0x0,%rax
68: 00 00 00
6b: 8b 44 03 1c mov 0x1c(%rbx,%rax,1),%eax
6f: 01 45 ec add %eax,-0x14(%rbp)
return t;
72: 8b 45 ec mov -0x14(%rbp),%eax
}
75: 48 83 c4 28 add $0x28,%rsp
79: 5b pop %rbx
7a: c9 leaveq
7b: c3 retq
And the relocations:
Relocation section '.rela.text' at offset 0x62d60 contains 6 entries:
Offset Info Type Sym. Value Sym. Name + Addend
000000000021 00160000001a R_X86_64_GOTPC32 0000000000000000 _GLOBAL_OFFSET_TABLE_ - 4
000000000037 001700000004 R_X86_64_PLT32 0000000000000000 global_func - 4
000000000041 001200000009 R_X86_64_GOTPCREL 0000000000000000 global_arr - 4
00000000004d 000300000002 R_X86_64_PC32 0000000000000000 .data + 1b8
000000000057 001300000009 R_X86_64_GOTPCREL 0000000000000000 global_arr_big - 4
000000000063 000a00000019 R_X86_64_GOTOFF64 0000000000030d40 static_arr_big + 0
First, let's get the function call out of the way. Similarly to the small model, in the medium model we assume that code references are within the bounds of a 32-bit offset from RIP. Therefore, the code to call global_func is identical to that of the small PIC model. The same goes for the small data arrays static_arr and global_arr. So we'll focus on the big data arrays, but first let's discuss the prologue, which is different from the large model:
1e: 48 8d 1d 00 00 00 00 lea 0x0(%rip),%rbx
That's it, a single instruction (instead of the 3 it took in the large model) to get the address of GOT into rbx (with the help of a R_X86_64_GOTPC32 relocation). Why the difference? Because in the medium code model, we assume the GOT itself is reachable with a 32-bit offset, because it's not part of the "big data sections". In the large code model we couldn't make this assumption and had to use a full 64-bit offset to access the GOT.
Interestingly, we notice that the code to access global_arr_big is also similar to the small PIC model. Why? For the same reason the prologue is shorter than in the large model. In the medium model, we assume the GOT itself is reachable with 32-bit RIP-relative addressing. True, global_arr_big itself is not, but this is covered by the GOT anyway, since the address of global_arr_big actually resides in the GOT, and it's a full 64-bit address there.
For static_arr_big, the situation is different, however:
t += static_arr_big[7];
61: 48 b8 00 00 00 00 00 movabs $0x0,%rax
68: 00 00 00
6b: 8b 44 03 1c mov 0x1c(%rbx,%rax,1),%eax
6f: 01 45 ec add %eax,-0x14(%rbp)
This is actually similar to the large PIC code model, because here we do obtain an absolute address for the symbol, which doesn't reside in the GOT itself. Since this is a large symbol that can't be assumed to reside in the low 2 GB, we need the 64-bit PIC offset here, similarly to the large model.
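The data-access choices observed in the medium PIC listing can be summarized as a small decision function. This is a sketch that encodes only what the relocations above show for this one example; the enum and function names are mine, not anything gcc exposes.

```c
#include <stdbool.h>

enum access_kind {
    ACCESS_PC32,      /* direct RIP-relative load, as for static_arr */
    ACCESS_GOTPCREL,  /* fetch the address from the GOT, then dereference */
    ACCESS_GOTOFF64   /* movabs with a 64-bit offset from GOT, added to rbx */
};

/* Which sequence the medium PIC model used for a data symbol in the sample. */
enum access_kind medium_pic_data_access(bool is_static, bool is_large)
{
    if (!is_static)
        return ACCESS_GOTPCREL;  /* globals go through the GOT at any size */
    return is_large ? ACCESS_GOTOFF64 : ACCESS_PC32;
}
```

Only one of the four combinations, a static array above the large-data threshold, pays the 64-bit cost; everything else is accessed exactly as in the small PIC model.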
[1] Code models are not to be confused with 64-bit data models and Intel memory models, both of which are different topics.
[2] An important thing to keep in mind here: the actual instructions are created by the compiler, and the addressing modes are "cemented" at that stage. The compiler has no way to know which programs or shared libs the object it's compiling will eventually get into. Some may be small, but some may be large. The linker does know the size of the resulting program, but it's too late at that point, since the linker can't actually change the instructions, just patch offsets within them with relocations. Therefore, the code model "contract" has to be "signed" by the programmer at the compilation stage.
[3] If this isn't clear, read this article.
[4] Although it's getting there. Last time I checked, the Debug+Asserts build of Clang was almost half a GB in size (thanks to quite a bit of auto-generated code).
[5] Unless you already know how PIC works (both in general and for x64 in particular), this would be a good time to go over my earlier articles on this subject - #1 and #2.
[6] So the linker can't fully resolve the references on its own, and has to leave GOT handling to the dynamic loader.
[7] 0x25 - 0x7 + GOT - 0x27 + 0x9 = GOT