Stack frame layout on x86-64

September 6th, 2011 at 8:13 pm

A few months ago I wrote an article named Where the top of the stack is on x86, which aimed to clear up some misunderstandings regarding stack usage on the x86 architecture. The article concluded with a useful diagram presenting the stack frame layout of a typical function call.

In this article I will examine the stack frame layout of the newer 64-bit version of the x86 architecture, x64 [1]. The focus will be on Linux and other OSes following the official System V AMD64 ABI (available from here). Windows uses a somewhat different ABI, and I will mention it briefly at the end.

I have no intention of detailing the complete x64 calling convention here. For that, you will literally have to read the whole AMD64 ABI.

Registers galore

x86 has just 8 general-purpose registers available (eax, ebx, ecx, edx, ebp, esp, esi, edi). x64 extended them to 64 bits (prefix "r" instead of "e") and added another 8 (r8, r9, r10, r11, r12, r13, r14, r15). Since some of x86’s registers have special implicit meanings and aren’t really used as general-purpose (most notably ebp and esp), the effective increase is even larger than it seems.

There’s a reason I’m mentioning this in an article focused on stack frames. The relatively large number of available registers influenced some important design decisions for the ABI, such as passing many arguments in registers, thus rendering the stack less useful than before [2].

Argument passing

I’m going to simplify the discussion here on purpose and focus on integer/pointer arguments [3]. According to the ABI, the first 6 integer or pointer arguments to a function are passed in registers. The first is placed in rdi, the second in rsi, the third in rdx, and then rcx, r8 and r9. Only the 7th argument and onwards are passed on the stack.
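The rule above can be sketched as a small lookup. This is only an illustration for integer/pointer arguments; the enum and the helper name sysv_arg_location are made up for this example and are not part of the ABI or of any library.

```c
#include <assert.h>

/* Where an integer/pointer argument travels under the System V AMD64 ABI. */
enum arg_location {
    ARG_RDI, ARG_RSI, ARG_RDX, ARG_RCX, ARG_R8, ARG_R9, ARG_STACK
};

/* Hypothetical helper: location of the n-th (1-based) integer/pointer
   argument. Arguments 1..6 use registers; 7 and later go on the stack. */
enum arg_location sysv_arg_location(int n)
{
    static const enum arg_location regs[6] = {
        ARG_RDI, ARG_RSI, ARG_RDX, ARG_RCX, ARG_R8, ARG_R9
    };

    assert(n >= 1);
    return n <= 6 ? regs[n - 1] : ARG_STACK;
}
```

For a function taking eight long arguments, the first six map to rdi through r9 and the last two map to ARG_STACK.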

The stack frame

With the above in mind, let’s see how the stack frame for this C function looks:

long myfunc(long a, long b, long c, long d,
            long e, long f, long g, long h)
{
    long xx = a * b * c * d * e * f * g * h;
    long yy = a + b + c + d + e + f + g + h;
    long zz = utilfunc(xx, yy, xx % yy);
    return zz + 20;
}

This is the stack frame:

http://eli.thegreenplace.net/wp-content/uploads/2011/08/x64_frame_nonleaf.png
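To put numbers on such a frame, here is a minimal sketch of the rbp-relative offsets, assuming the conventional prologue (push rbp; mov rbp, rsp) and 8-byte integer/pointer arguments only. The constant names and the helper stack_arg_offset are hypothetical, chosen just for this illustration.

```c
#include <assert.h>

/* Offsets relative to rbp once the conventional prologue
   (push rbp; mov rbp, rsp) has run. */
enum {
    SAVED_RBP_OFFSET       = 0,   /* the caller's rbp */
    RETURN_ADDR_OFFSET     = 8,   /* pushed by the call instruction */
    FIRST_STACK_ARG_OFFSET = 16   /* the 7th integer/pointer argument */
};

/* Hypothetical helper: rbp-relative offset of the n-th (1-based)
   integer/pointer argument, for n >= 7. Each stack slot is 8 bytes. */
long stack_arg_offset(int n)
{
    assert(n >= 7);
    return FIRST_STACK_ARG_OFFSET + 8L * (n - 7);
}
```

Under these assumptions, the arguments g and h of myfunc sit at rbp+16 and rbp+24, while locals live at negative offsets from rbp.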

So the first 6 arguments are passed via registers. But other than that, this doesn’t look very different from what happens on x86 [4], except this strange "red zone". What is that all about?

The red zone

First I’ll quote the formal definition from the AMD64 ABI:

The 128-byte area beyond the location pointed to by %rsp is considered to be reserved and shall not be modified by signal or interrupt handlers. Therefore, functions may use this area for temporary data that is not needed across function calls. In particular, leaf functions may use this area for their entire stack frame, rather than adjusting the stack pointer in the prologue and epilogue. This area is known as the red zone.

Put simply, the red zone is an optimization. Code can assume that the 128 bytes below rsp will not be asynchronously clobbered by signals or interrupt handlers, and thus can use it for scratch data, without explicitly moving the stack pointer. That last part is where the optimization lies: decrementing rsp and restoring it are two instructions that can be saved when using the red zone for data.

However, keep in mind that the red zone will be clobbered by function calls, so it’s usually most useful in leaf functions (functions that call no other functions).
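The definition can be modelled in a few lines of C. This is a sketch of the rule only, not of what any compiler actually does; all names here (RED_ZONE_SIZE, in_red_zone, red_zone_address, whole_frame_fits_red_zone) are invented for the example, and stack addresses are represented as plain integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RED_ZONE_SIZE 128u  /* bytes, per the System V AMD64 ABI */

/* Is addr inside the red zone of a stack whose stack pointer holds rsp?
   The zone covers the 128 bytes just below rsp: [rsp - 128, rsp). */
bool in_red_zone(uintptr_t rsp, uintptr_t addr)
{
    return addr < rsp && rsp - addr <= RED_ZONE_SIZE;
}

/* Address of the byte `offset` bytes below rsp; valid offsets are 1..128. */
uintptr_t red_zone_address(uintptr_t rsp, unsigned offset)
{
    assert(offset >= 1 && offset <= RED_ZONE_SIZE);
    return rsp - offset;
}

/* Simplified rule: a function can keep its whole frame in the red zone,
   without moving rsp, only if it is a leaf and the frame is small enough. */
bool whole_frame_fits_red_zone(bool is_leaf, unsigned frame_bytes)
{
    return is_leaf && frame_bytes <= RED_ZONE_SIZE;
}
```

A non-leaf function fails the last check because a call pushes a return address just below rsp, and the callee is then free to use that same area. gcc also has an -mno-red-zone option that turns the optimization off altogether.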

Recall how myfunc in the code sample above calls another function named utilfunc. This was done on purpose, to make myfunc non-leaf and thus prevent the compiler from applying the red zone optimization. Looking at the code of utilfunc:

long utilfunc(long a, long b, long c)
{
    long xx = a + 2;
    long yy = b + 3;
    long zz = c + 4;
    long sum = xx + yy + zz;

    return xx * yy * zz + sum;
}

This is indeed a leaf function. Let’s see how its stack frame looks when compiled with gcc:

http://eli.thegreenplace.net/wp-content/uploads/2011/08/x64_frame_leaf.png

Since utilfunc has only 3 arguments, calling it requires no stack usage: all the arguments fit into registers. In addition, since it’s a leaf function, gcc chooses to use the red zone for all its local variables. Thus, rsp need not be decremented (and later restored) to allocate space for this data.

Preserving the base pointer

The base pointer rbp (and its predecessor ebp on x86), being a stable "anchor" to the beginning of the stack frame throughout the execution of a function, is very convenient for manual assembly coding and for debugging [5]. However, some time ago it was noticed that compiler-generated code doesn’t really need it (the compiler can easily keep track of offsets from rsp), and the DWARF debugging format provides means (CFI) to access stack frames without the base pointer.
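The "anchor" role of rbp, and the chain of saved base pointers that footnote [5] describes, can be modelled with an ordinary linked list. The sketch below walks simulated frames rather than the real machine stack; struct frame_record and walk_frames are hypothetical names, and marking the outermost frame with a NULL saved rbp is an assumption of this model.

```c
#include <assert.h>
#include <stddef.h>

/* Model of what sits at the address held in rbp when every function keeps
   a frame pointer: the caller's saved rbp, then the return address. */
struct frame_record {
    const struct frame_record *saved_rbp;  /* caller's frame; NULL at the outermost frame (assumption) */
    const void *return_address;
};

/* Walk the chain of saved base pointers, storing up to `max` return
   addresses in `out`. Returns the number of frames visited. */
size_t walk_frames(const struct frame_record *rbp, const void **out, size_t max)
{
    size_t n = 0;

    assert(out != NULL || max == 0);
    while (rbp != NULL && n < max) {
        out[n++] = rbp->return_address;
        rbp = rbp->saved_rbp;
    }
    return n;
}
```

A debugger does much the same with the real saved rbp values, which is why the walk stops working once frame pointers are omitted and other unwind information is needed.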

This is why some compilers started omitting the base pointer for aggressive optimizations, thus shortening the function prologue and epilogue, and providing an additional register for general-purpose use (which, recall, is quite useful on x86 with its limited set of GPRs).

gcc keeps the base pointer by default on x86, but allows the optimization with the -fomit-frame-pointer compilation flag. How recommended it is to use this flag is a debated issue – you may do some googling if this interests you.

Anyhow, one other "novelty" the AMD64 ABI introduced is making the base pointer explicitly optional, stating:

The conventional use of %rbp as a frame pointer for the stack frame may be avoided by using %rsp (the stack pointer) to index into the stack frame. This technique saves two instructions in the prologue and epilogue and makes one additional general-purpose register (%rbp) available.

gcc adheres to this recommendation and by default omits the frame pointer on x64, when compiling with optimizations. It gives an option to preserve it by providing the -fno-omit-frame-pointer flag. For clarity’s sake, the stack frames shown above were produced without omitting the frame pointer.

The Windows x64 ABI

Windows on x64 implements an ABI of its own, which is somewhat different from the AMD64 ABI. I will only discuss the Windows x64 ABI briefly, mentioning how its stack frame layout differs from AMD64. These are the main differences:

  1. Only 4 integer/pointer arguments are passed in registers (rcx, rdx, r8, r9).
  2. There is no concept of "red zone" whatsoever. In fact, the ABI explicitly states that the area beyond rsp is considered volatile and unsafe to use. The OS, debuggers or interrupt handlers may overwrite this area.
  3. Instead, a "register parameter area" [6] is provided by the caller in each stack frame. When a function is called, the last thing allocated on the stack before the return address is space for at least 4 registers (8 bytes each). This area is available for the callee’s use without explicitly allocating it. It’s useful for variable argument functions as well as for debugging (providing known locations for parameters, while registers may be reused for other purposes). Although the area was originally conceived for spilling the 4 arguments passed in registers, these days the compiler uses it for other optimization purposes as well (for example, if the function needs less than 32 bytes of stack space for its local variables, this area may be used without touching rsp).
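For comparison with the System V rule, the Windows differences can be sketched the same way. This again covers integer/pointer arguments only, and the names (enum win64_loc, win64_arg_loc, win64_entry_offset, WIN64_HOME_SPACE_BYTES) are invented for this example. The offsets are relative to rsp at function entry, before any prologue has run.

```c
#include <assert.h>

/* Where an integer/pointer argument travels under the Windows x64 ABI. */
enum win64_loc { WIN64_RCX, WIN64_RDX, WIN64_R8, WIN64_R9, WIN64_STACK };

#define WIN64_HOME_SPACE_BYTES 32  /* 4 register slots of 8 bytes each */

/* Hypothetical helper: location of the n-th (1-based) argument. */
enum win64_loc win64_arg_loc(int n)
{
    static const enum win64_loc regs[4] = {
        WIN64_RCX, WIN64_RDX, WIN64_R8, WIN64_R9
    };

    assert(n >= 1);
    return n <= 4 ? regs[n - 1] : WIN64_STACK;
}

/* rsp-relative offset, at function entry, of the slot belonging to the
   n-th argument: its home slot for n <= 4, its stack slot otherwise.
   The return address occupies offset 0, so slot n starts at 8 * n. */
long win64_entry_offset(int n)
{
    assert(n >= 1);
    return 8L * n;
}
```

In this model the 5th argument lands at rsp+40, directly above the 32 bytes of home space, which is why the caller reserves that space even when the callee takes fewer than 4 arguments.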

Another important change that was made in the Windows x64 ABI is the cleanup of calling conventions. No more cdecl/stdcall/fastcall/thiscall/register/safecall madness – just a single "x64 calling convention". Cheers to that!

For more information on this and other aspects of the Windows x64 ABI, here are some good links:


[1] This architecture goes by many names. Originated by AMD and dubbed AMD64, it was later implemented by Intel, which called it IA-32e, then EM64T and finally Intel 64. It’s also being called x86-64. But I like the name x64 – it’s nice and short.
[2] There are calling conventions for x86 that also dictate passing some of the arguments in registers. The best known is probably fastcall. Unfortunately, it’s not consistent across platforms.
[3] The ABI also defines passing floating-point arguments via the xmm registers. The idea is pretty much the same as for integers, however, and IMHO including floating-point arguments in the article will needlessly complicate it.
[4] I’m cheating a bit here. Any compiler worth its salt (and certainly gcc) will use registers for local variables as well, especially on x64 where registers are plentiful. But if there are a lot of local variables (or they’re large, like arrays or structs), they will go on the stack anyway.
[5] Since inside a function rbp always points at the previous stack frame, it forms a kind of linked list of stack frames which the debugger can use to access the execution stack trace at any given time (in core dumps as well).
[6] Also called "home space" sometimes.

24 Responses to “Stack frame layout on x86-64”

  1. CoreAn_Crack3rZ Says:

    Informative! I am studying computer organization and digital design lately (just started 2 days ago). Great post! :D

  2. Luis Says:

    You have a very interesting blog, thanks :D. Adding entries to RSS…

  3. Pragma Says:

    “However, keep in mind that the red zone will be clobbered by function calls, so it’s usually most useful in leaf functions (functions that call no other functions).”

    Interesting, but I’m struggling to see the advantage in the addition of the ‘red zone’ to the ABI: is it useful for anything else? What has me stumped is: what is a leaf function but something that ought to be inlined instead?

  4. Gary Stachelski Says:

    Hi Eli,

    Thanks for an enlightening article. However, it appears that compilers and optimization have been using the stack for more than just data storage. Return-oriented programming (ROP) has popped up as an unstoppable method of attacking a system (once the attacker has control over the stack). It appears that compilers have been placing short executable sequences of instructions onto the stack frame just before the return instruction. This allows an attacker to safely replace the instructions with their own and there is no way to determine that the processing has been compromised since the stack area is in memory that is marked executable and none of the program areas have been touched. I might have this wrong, as this does not match with my general understanding of the use of the stack (for storage of state and variables, not for instructions). Can you comment on this?

  5. Matt Giuca Says:

    Excellent overview. One thing, if you can be bothered fixing it: the stack diagrams use “EBP” and “ESP” — shouldn’t that be “RBP” and “RSP” on x86-64?

  6. eliben Says:

    Pragma,

    I would guess that inlining is something that happens before ASM code generation, so some functions still have to be leaf, right? Besides, some function calls can’t be inlined (such as calls through function pointers or virtual method calls).

    Gary,

    I’m not familiar with this issue, really. If you have some good references, I’d be glad to read them.

    Matt,

    Whoops, will fix! Thanks for noticing.

  7. jesse Says:

    I’m not completely sold on the “red zone” idea, but one reason not to inline leaf functions is to keep the code footprint from getting too big. Depending on the size of the leaf function and how the code that’s calling it is structured, inlining can actually cause worse performance because of additional cache misses.

  8. RossC Says:

    Very interesting. Thanks for posting.

  9. dahtah Says:

    @Gary
    You are mixing up a few notions here.
    Indeed, to detect a buffer overflow (overwrite of the return address) a common technique is to use something called a canary. That is a random sequence of bytes inserted at compile time between return address and saved rbp aligned on stack boundary, that is checked at function return (unwinding of the stack). If the canary value has changed, then the function has been overflowed, and the program exits since the return address can be assumed to be controlled by the attacker.
    ROP is a different notion, it is the general case of ret2libc attacks, and used when stack is not executable. Since you can jump anywhere in an x86 instruction (they are not fixed length), you can use this property to use meaningful bytes in memory, and access them by address. This of course supposes that memory layout is not randomized (no ASLR). For example if the mapping of .text section in memory is constant, then you can search for instructions in that section, that you can access by memory addresses.
    Luckily for everyone, most modern operating systems implement W^X and ASLR in userland, meaning that stack and heap and other sections are tagged as non-executable but writable (or the other way round), and address space is randomized (memory mappings will change between executions). Unfortunately, what is true in user land is not in kernel land… (Except for the Linux kernel as of 2.6.39)

  10. zorg Says:

    > “This is indeed a leaf function. Let’s see how its stack frame looks when compiled with gcc:”
    How do you know/see how the stack frame looks ? Do you inspect it with gdb, how ?
    And
    > “…all its local variables. Thus, esp needs not be decremented (and later restored) to allocate space for this data.”
    Do you mean rsp instead of esp ?

  11. eliben Says:

    zorg,

    Yep, thanks for noticing. Hopefully the intention is clear though :-)

  12. Fabio Pozzi Says:

    @dahtah
    You’re missing something too.
    Return Oriented Programming can also be applied with W^X and ASLR, and even on platforms where instructions have fixed length (like ARM or SPARC).
    Check out something like http://ivanlef0u.fr/repo/expl0it/Surgically%20returning%20to%20randomized%20lib(c).pdf
    for research on a technique to use ROP to circumvent ASLR and W^X.
    You can also find a practical example of ROP (without ASLR) on the ARM platform here http://blog.zynamics.com/2010/04/16/rop-and-iphone/

  13. Joseph Garvin Says:

    Why is the red zone per stack frame? If it’s clobbered by function calls anyway, why not just set 128 bytes aside at the start of the stack and let all frames write to that one location? I’d guess a threading problem but each thread would have its own stack with its own 128 bytes anyway, so that can’t be it. If the stack is relocatable then I guess you’d have to dedicate a register to store the start of it, maybe that’s the issue?

  14. eliben Says:

    Joseph,

    I don’t understand what you mean there. How do all frames know where “the start of the stack” is?

  15. Joseph Garvin Says:

    @eliben: Basically my question is what’s the difference between having the red zone in every frame and just having a single 128 byte buffer in a fixed location in memory? I think the answer is the latter couldn’t work for multiple threads, though it seems you could still make that work by dedicating a register to point to each thread’s 128 byte buffer.

  16. eliben Says:

    Joseph,

    I see. As you yourself say, that would be a problem with multiple threads. In a way, the current red zone is an elegant way to give each thread its sandbox, and a register pointing to it: rbp.

  17. Karen Mgebrova Says:

    Translation of the publication into Armenian http://www.fatcow.com/edu/stack-frame-hy/

  18. Torkel Bjørnson Says:

    Regarding the red zone:

    The red zone is not allocated/reserved per stack frame. Below the stack pointer there must always be some unused memory (to allow the stack to grow). The memory is already there, so why not put it to good use? The red zone is just a convention to allow functions to use 128 bytes of that space as a scratch area by mandating signal and interrupt handlers not to clobber it.

    So it’s just one red zone per stack (although its address varies). It can always be located by the %rsp register, and it works with multithreading since each thread already has its own private stack.

    It’s a quite elegant solution indeed.

  19. candido Says:

    Excellent explanation, thanks.

    I have written a C program, int main(void) { return 0; }, and compiled it on an x86-64 platform (GNU/Linux, Core i5) with the gcc front-end. There is no problem on execution.
    But if I write a gas assembly source:

    .text
    .globl _start
    _start:
    pushq %rbp
    movq %rsp, %rbp
    movl $0, %eax
    popq %rbp
    ret
    .end

    the execution result is a SEGMENTATION FAULT

    The RET instruction doesn’t return to the system. Debugging with gdb I can see that the return address IS NOT on the stack. Before the first instruction, pushq %rbp, the %rsp stack pointer references the value 0x00000001, which is not the return address, and this causes the SEGMENTATION FAULT.

    I did not have this problem on an old 32-bit platform.

    Any ideas?
    Thanks in advance

  20. John Leuner Says:

    _start is different from main, _start shouldn’t return normally. Try calling sys_exit instead.

  21. Mohsen Says:

    or simply use:

    movl $1,%eax
    int $0x80

  22. Jim Cook Says:

    I disagree that RBP can be freed for use as a general register if RSP is used to reference all variables.

    Consider the intrinsic function _alloca: I believe it allocates space by simply incrementing RSP by the requested amount, then returning the old value of RSP.

    Suppose _alloca is called conditionally. The compiler cannot know at compile time whether a hunk of storage was allocated on the stack at compile time, therefore the offset of variables relative to RSP is indeterminate at compile time. RBP must be used to reference all variables.

    But wait! Writing this note has caused me to be further confused. Suppose a function pushes variables xx, yy, zz on the stack. Their offsets are known to be at RBP-8, RBP-16, and RBP-24. So far, so good.

    Now suppose the code conditionally performs a _alloca(40). Then the code (unconditionally) pushes a variable xyzzy on the stack. Is the address of that variable RBP-32 or RBP-72? Ewww.

    Or do I have the definition of how _alloca works incorrect?

  23. eliben Says:

    @Jim,

    Yes, the case of alloca is interesting. I’m not sure what the compiler does about functions that use it – it’s possible that for such functions the frame pointer cannot be eliminated. I haven’t researched this in depth, though.

  24. Jim Cook Says:

    A co-worker suggested that stack frames are more static than I made them out to be. All variables (declared and internal) are allocated on function entry. The only dynamic allocations, aside from _alloca, are variables set up for a call to another function, which are freed upon return.

    So, no problem with referencing variables via RBP so long as _alloca is not invoked as a parameter to a function call.
