Tags Assembly

I ran into an old post of mine from 2004 wondering how 64-bit types (like __int64 on Windows) are implemented on 32-bit CPUs. I don't really know why I was so lazy and "wondered" instead of actually trying. So here is the answer.

Compiler emulation

Consider the following code (compiled with Visual C++, __int64 is a Microsoft-specific extension type).

__int64 foo = 0x200;
__int64 bar = 0x223344556677U;
__int64 and = foo & bar;
__int64 sum = foo + bar;
__int64 shift = foo << 33;

A 32-bit CPU has no idea how to handle 64-bit integer types - its general-purpose registers are all 32-bit. So to make this work, the compiler emulates 64-bit types and operations on them, meaning that it translates the C code dealing with them into series of 32-bit operations [1]. Let's examine this code's disassembly listing to understand what's happening under the hood.


    __int64 foo = 0x200;
004114FE  mov         dword ptr [foo],200h
00411505  mov         dword ptr [ebp-8],0
    __int64 bar = 0x223344556677U;
0041150C  mov         dword ptr [bar],44556677h
00411513  mov         dword ptr [ebp-18h],2233h

Local variables in C++ are allocated by the compiler on the stack. Note how two stack dword (Microsoft's jargon for a 32-bit entity) locations are needed for each of these locals. To understand exactly why the ebp offsets are as they are we'd have to get into the stack allocation policy of Visual C++. It's a fascinating topic, but I'll leave it for another time.

Examining the memory at [ebp-18h], the 8 bytes from it onwards are:

77 66 55 44 33 22 00 00

These store the value of bar in little-endian order.

Binary AND

Bit-by-bit operations like and are the easiest to implement:

    __int64 and = foo & bar;
0041152C  mov         eax,dword ptr [foo]
0041152F  and         eax,dword ptr [bar]
00411532  mov         ecx,dword ptr [ebp-8]
00411535  and         ecx,dword ptr [ebp-18h]
00411538  mov         dword ptr [and],eax
0041153B  mov         dword ptr [ebp-38h],ecx

Since the 64-bit values consist of two 32-bit words, each word is ANDed separately in a 32-bit register, which gives the correct answer.


Adding is not much trickier, thanks to the wonderful "cascadability" property of addition [2]:

    __int64 sum = foo + bar;
0041151A  mov         eax,dword ptr [foo]
0041151D  add         eax,dword ptr [bar]
00411520  mov         ecx,dword ptr [ebp-8]
00411523  adc         ecx,dword ptr [ebp-18h]
00411526  mov         dword ptr [sum],eax
00411529  mov         dword ptr [ebp-28h],ecx

First, the low words of the variables are added with the add instruction. If add overflows its target register, the carry flag CF is turned on. Next, the high words are added with adc - the add with carry instruction, which adds its operands together with the carry flag, thus completing a full 64-bit addition.


The left-shift operation turns out to be the trickiest of all:

    __int64 shift = foo << 33;
0041153E  mov         eax,dword ptr [foo]
00411541  mov         edx,dword ptr [ebp-8]
00411544  mov         cl,21h
00411546  call        @ILT+70(__allshl) (41104Bh)
0041154B  mov         dword ptr [shift],eax
0041154E  mov         dword ptr [ebp-48h],edx

Visual C++ generates a procedure call to implement it, after preparing the arguments in registers - the input 64-bit value is split in eax and edx, and the shift size is passed in the low byte of ecx (0x21 is 33). After the __allshl procedure returns, the result is taken from eax and edx. But what does __allshl do? Stepping into it with the debugger shows its code, but for learning purposes it's easier to just find a commented version online (reformatted slightly for this blog):

; llshl.asm - long shift left
_TEXT           segment use32 para public 'CODE'
                public  __allshl

; llshl - long shift left
; Purpose:
;       Does a Long Shift Left (signed and unsigned
;       are identical)
;       Shifts a long left any number of bits.
; Entry:
;       EDX:EAX - long value to be shifted
;       CL      - number of bits to shift by
; Exit:
;       EDX:EAX - shifted value
; Uses:
;       CL is destroyed.

__allshl        proc    near
                assume  cs:_TEXT

; Handle shifts of 64 or more bits (all get 0)
        cmp     cl, 64
        jae     short RETZERO

; Handle shifts of between 0 and 31 bits
        cmp     cl, 32
        jae     short MORE32
        shld    edx,eax,cl
        shl     eax,cl

; Handle shifts of between 32 and 63 bits
        mov     edx,eax
        xor     eax,eax
        and     cl,31
        shl     edx,cl

; return 0 in edx:eax
        xor     eax,eax
        xor     edx,edx

__allshl        endp

_TEXT           ends

The comments are pretty good and it's easy to understand what's going on. The most interesting instruction used in this code is shld, which shifts its target register by the amount specified in the cl register (which is the lowest 8 bits of ecx), while shifting in bits from its second argument, which allows to implement cascaded shifts of multi-word quantities.

What about real 64-bit support?

Luckily I have a laptop with a 64-bit CPU and a 64-bit OS and a 64-bit version of Visual Studio on it. Here's the disassembly of the same few lines shining in 64-bit glory:

    __int64 foo = 0x200;
00000001400010EA  mov         qword ptr [foo],200h
    __int64 bar = 0x223344556677U;
00000001400010F3  mov         rax,223344556677h
00000001400010FD  mov         qword ptr [bar],rax
    __int64 sum = foo + bar;
0000000140001102  mov         rcx,qword ptr [bar]
0000000140001107  mov         rax,qword ptr [foo]
000000014000110C  add         rax,rcx
000000014000110F  mov         qword ptr [sum],rax
    __int64 and = foo & bar;
0000000140001114  mov         rcx,qword ptr [bar]
0000000140001119  mov         rax,qword ptr [foo]
000000014000111E  and         rax,rcx
0000000140001121  mov         qword ptr [and],rax
    __int64 shift = foo << 33;
0000000140001126  mov         rax,qword ptr [foo]
000000014000112B  shl         rax,21h
000000014000112F  mov         qword ptr [shift],rax

It uses the new 64-bit R<name> registers and the CPU's opcodes know how to operate on them, so this code doesn't look much different from similar code dealing with 32-bit integers on a 32-bit CPU, bar the different register names. Needless to say, this works much faster than the emulated versions presented above, so for arithmetic-intensive code full usage of the 64-bit capabilities of a CPU may bring nice performance benefits.

[1]To be fair to my desktop computer, it does have a 64-bit CPU (Core 2 Duo E4600), but the OS (Windows XP) is 32-bit and so is the compiler, which means that the CPU is actually treated as 32-bit. Since the x86-64 architecture is fully backwards compatible, the computer just works with its 32-bit OS and applications, which makes this post possible (I don't think people are going to have 32-bit OSes or applications on their desktops for much longer).
[2]OK, this is a word I just made up but addition is very easy to cascade from sub-entities to full entities using arithmetic carry. Once again, this is an interesting topic but out of the scope of this post.