I ran into an old post of mine from 2004 wondering how 64-bit types (like __int64 on Windows) are implemented on 32-bit CPUs. I don't really know why I was so lazy and "wondered" instead of actually trying. So here is the answer.
Compiler emulation
Consider the following code (compiled with Visual C++, __int64 is a Microsoft-specific extension type).
__int64 foo = 0x200;
__int64 bar = 0x223344556677U;
__int64 and = foo & bar;
__int64 sum = foo + bar;
__int64 shift = foo << 33;
A 32-bit CPU has no idea how to handle 64-bit integer types - its general-purpose registers are all 32-bit. So to make this work, the compiler emulates 64-bit types and operations on them, meaning that it translates the C code dealing with them into series of 32-bit operations [1]. Let's examine this code's disassembly listing to understand what's happening under the hood.
Initialization
__int64 foo = 0x200;
004114FE mov dword ptr [foo],200h
00411505 mov dword ptr [ebp-8],0
__int64 bar = 0x223344556677U;
0041150C mov dword ptr [bar],44556677h
00411513 mov dword ptr [ebp-18h],2233h
Local variables in C++ are allocated by the compiler on the stack. Note how two stack dword (Microsoft's jargon for a 32-bit entity) locations are needed for each of these locals. To understand exactly why the ebp offsets are as they are we'd have to get into the stack allocation policy of Visual C++. It's a fascinating topic, but I'll leave it for another time.
Examining the memory at [ebp-18h], the 8 bytes from it onwards are:
77 66 55 44 33 22 00 00
These store the value of bar in little-endian order.
Binary AND
Bit-by-bit operations like and are the easiest to implement:
__int64 and = foo & bar;
0041152C mov eax,dword ptr [foo]
0041152F and eax,dword ptr [bar]
00411532 mov ecx,dword ptr [ebp-8]
00411535 and ecx,dword ptr [ebp-18h]
00411538 mov dword ptr [and],eax
0041153B mov dword ptr [ebp-38h],ecx
Since the 64-bit values consist of two 32-bit words, each word is ANDed separately in a 32-bit register, which gives the correct answer.
Addition
Adding is not much trickier, thanks to the wonderful "cascadability" property of addition [2]:
__int64 sum = foo + bar;
0041151A mov eax,dword ptr [foo]
0041151D add eax,dword ptr [bar]
00411520 mov ecx,dword ptr [ebp-8]
00411523 adc ecx,dword ptr [ebp-18h]
00411526 mov dword ptr [sum],eax
00411529 mov dword ptr [ebp-28h],ecx
First, the low words of the variables are added with the add instruction. If add overflows its target register, the carry flag CF is turned on. Next, the high words are added with adc - the add with carry instruction, which adds its operands together with the carry flag, thus completing a full 64-bit addition.
Shifting
The left-shift operation turns out to be the trickiest of all:
__int64 shift = foo << 33;
0041153E mov eax,dword ptr [foo]
00411541 mov edx,dword ptr [ebp-8]
00411544 mov cl,21h
00411546 call @ILT+70(__allshl) (41104Bh)
0041154B mov dword ptr [shift],eax
0041154E mov dword ptr [ebp-48h],edx
Visual C++ generates a procedure call to implement it, after preparing the arguments in registers - the input 64-bit value is split in eax and edx, and the shift size is passed in the low byte of ecx (0x21 is 33). After the __allshl procedure returns, the result is taken from eax and edx. But what does __allshl do? Stepping into it with the debugger shows its code, but for learning purposes it's easier to just find a commented version online (reformatted slightly for this blog):
;------------------------------------------------
; llshl.asm - long shift left
;------------------------------------------------
.386
_TEXT segment use32 para public 'CODE'
public __allshl
;
; llshl - long shift left
;
; Purpose:
; Does a Long Shift Left (signed and unsigned
; are identical)
; Shifts a long left any number of bits.
;
; Entry:
; EDX:EAX - long value to be shifted
; CL - number of bits to shift by
;
; Exit:
; EDX:EAX - shifted value
;
; Uses:
; CL is destroyed.
;
__allshl proc near
assume cs:_TEXT
;
; Handle shifts of 64 or more bits (all get 0)
;
cmp cl, 64
jae short RETZERO
;
; Handle shifts of between 0 and 31 bits
;
cmp cl, 32
jae short MORE32
shld edx,eax,cl
shl eax,cl
ret
;
; Handle shifts of between 32 and 63 bits
;
MORE32:
mov edx,eax
xor eax,eax
and cl,31
shl edx,cl
ret
;
; return 0 in edx:eax
;
RETZERO:
xor eax,eax
xor edx,edx
ret
__allshl endp
_TEXT ends
end
The comments are pretty good and it's easy to understand what's going on. The most interesting instruction used in this code is shld, which shifts its target register by the amount specified in the cl register (which is the lowest 8 bits of ecx), while shifting in bits from its second argument, which allows to implement cascaded shifts of multi-word quantities.
What about real 64-bit support?
Luckily I have a laptop with a 64-bit CPU and a 64-bit OS and a 64-bit version of Visual Studio on it. Here's the disassembly of the same few lines shining in 64-bit glory:
__int64 foo = 0x200;
00000001400010EA mov qword ptr [foo],200h
__int64 bar = 0x223344556677U;
00000001400010F3 mov rax,223344556677h
00000001400010FD mov qword ptr [bar],rax
__int64 sum = foo + bar;
0000000140001102 mov rcx,qword ptr [bar]
0000000140001107 mov rax,qword ptr [foo]
000000014000110C add rax,rcx
000000014000110F mov qword ptr [sum],rax
__int64 and = foo & bar;
0000000140001114 mov rcx,qword ptr [bar]
0000000140001119 mov rax,qword ptr [foo]
000000014000111E and rax,rcx
0000000140001121 mov qword ptr [and],rax
__int64 shift = foo << 33;
0000000140001126 mov rax,qword ptr [foo]
000000014000112B shl rax,21h
000000014000112F mov qword ptr [shift],rax
It uses the new 64-bit R<name> registers and the CPU's opcodes know how to operate on them, so this code doesn't look much different from similar code dealing with 32-bit integers on a 32-bit CPU, bar the different register names. Needless to say, this works much faster than the emulated versions presented above, so for arithmetic-intensive code full usage of the 64-bit capabilities of a CPU may bring nice performance benefits.
[1] | To be fair to my desktop computer, it does have a 64-bit CPU (Core 2 Duo E4600), but the OS (Windows XP) is 32-bit and so is the compiler, which means that the CPU is actually treated as 32-bit. Since the x86-64 architecture is fully backwards compatible, the computer just works with its 32-bit OS and applications, which makes this post possible (I don't think people are going to have 32-bit OSes or applications on their desktops for much longer). |
[2] | OK, this is a word I just made up but addition is very easy to cascade from sub-entities to full entities using arithmetic carry. Once again, this is an interesting topic but out of the scope of this post. |