Intel i7 loop performance anomaly

December 3rd, 2013 at 7:18 am

Recently I’ve been doing some benchmarking and came upon a very surprising behavior from a number of different Intel i7 CPUs (it manifests on Sandy Bridge and Haswell desktop-class CPUs as well as Sandy Bridge-EP Xeon CPUs).

The benchmark is very simple and the result is… bizarre. Perhaps one of the readers of my blog knows what’s going on here. Here’s the C code for the benchmark (full code with a makefile is available in this Gist):

const unsigned N = 400 * 1000 * 1000;

volatile unsigned long long counter = 0;

// Don't inline the benchmarking code into main
void __attribute__((noinline)) tightloop();
void __attribute__((noinline)) loop_with_extra_call();

void tightloop() {
  unsigned j;
  for (j = 0; j < N; ++j) {
    counter += j;
  }
}

void foo() {
}

void loop_with_extra_call() {
  unsigned j;
  for (j = 0; j < N; ++j) {
    __asm__("call foo");
    counter += j;
  }
}

We’re benchmarking tightloop vs. loop_with_extra_call, which does exactly the same thing (increment a volatile counter) but has a dummy call to a do-nothing function in the middle. I don’t think anyone has doubts about how this should behave, right? How much slower do you think the extra call will make this loop? Twice as slow? 10% slower?

Here’s the driving main function:

int main(int argc, char** argv) {
  if (argc <= 1) {
    return 1;
  }

  if (argv[1][0] == 't') {
    tightloop();
  } else if (argv[1][0] == 'c') {
    loop_with_extra_call();
  }

  return 0;
}

Building the code with gcc version 4.8 (same output code is produced by 4.6, as well as when replacing -O2 by -O3):

$ gcc -O2 loop-call-weirdness.c -o build/loop-call-weirdness

Now I’ll run it on my Intel i7-4771 (Haswell) CPU. First run the version with tightloop:

$ perf stat -r 10 -e cycles,instructions  build/loop-call-weirdness t

 Performance counter stats for 'build/loop-call-weirdness t' (10 runs):

     2,659,506,002 cycles       #    0.000 GHz              ( +-  0.19% )
     2,401,144,539 instructions #    0.90  insns per cycle  ( +-  0.00% )

       0.685642994 seconds time elapsed                     ( +-  0.24% )

… and with the extra call:

$ perf stat -r 10 -e cycles,instructions  build/loop-call-weirdness c

 Performance counter stats for 'build/loop-call-weirdness c' (10 runs):

     2,336,765,798 cycles       #    0.000 GHz              ( +-  0.34% )
     3,201,055,823 instructions #    1.37  insns per cycle  ( +-  0.00% )

       0.602387097 seconds time elapsed                     ( +-  0.39% )

Yes, the extra call makes the code faster! You didn’t expect that, did you?

Looking at the disassembly, the compiler is doing fine here, producing exactly the code you’d expect:

0000000000400530 <tightloop>:
  400530:     xor    %eax,%eax
  400532:     nopw   0x0(%rax,%rax,1)
  400538:     mov    0x200b01(%rip),%rdx        # 601040 <counter>
  40053f:     add    %rax,%rdx
  400542:     add    $0x1,%rax
  400546:     cmp    $0x17d78400,%rax
  40054c:     mov    %rdx,0x200aed(%rip)        # 601040 <counter>
  400553:     jne    400538 <tightloop+0x8>
  400555:     repz retq
  400557:     nopw   0x0(%rax,%rax,1)

0000000000400560 <foo>:
  400560:     repz retq

0000000000400570 <loop_with_extra_call>:
  400570:     xor    %eax,%eax
  400572:     nopw   0x0(%rax,%rax,1)
  400578:     callq  400560 <foo>
  40057d:     mov    0x200abc(%rip),%rdx        # 601040 <counter>
  400584:     add    %rax,%rdx
  400587:     add    $0x1,%rax
  40058b:     cmp    $0x17d78400,%rax
  400591:     mov    %rdx,0x200aa8(%rip)        # 601040 <counter>
  400598:     jne    400578 <loop_with_extra_call+0x8>
  40059a:     repz retq
  40059c:     nopl   0x0(%rax)

Note that the volatile is key here, since it forces the compiler to produce a load from and a store to the global on each iteration. Without volatile, the benchmark behaves normally (the extra call makes it significantly slower).
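
For comparison, this is roughly what a non-volatile variant would look like; it is not part of the benchmark, and without volatile the compiler is free to keep the accumulator in a register (or even fold the whole sum into a constant), so the per-iteration load/store pattern discussed here disappears:

// Hypothetical variant, not in the benchmark: with volatile removed, gcc -O2
// no longer has to reload and re-store the global on every iteration.
unsigned long long plain_counter = 0;

void __attribute__((noinline)) tightloop_no_volatile() {
  unsigned j;
  for (j = 0; j < N; ++j) {
    plain_counter += j;
  }
}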

It’s easy to see that tightloop runs 6 instructions per iteration, which is consistent with the numbers reported by perf (400 million iterations times 6 instructions is 2.4 billion instructions). loop_with_extra_call adds two more instructions per iteration (the call to foo and the ret from it), which also matches the perf numbers: 400 million times 8 is 3.2 billion instructions.

That’s right, even though the version with the extra call executes 33% more instructions, it manages to do it quicker.

Unfortunately, my fast Haswell CPU (or the Linux kernel coming with Ubuntu 13.10) doesn’t support the whole range of perf stat counters, but running on an older CPU (where the anomaly also exists, though the performance difference is smaller), I see that the tightloop benchmark has a lot of frontend and backend stalls (mostly frontend), for a total of 0.92 stalled cycles per instruction. The version with the extra call has just 0.25 stalled cycles per instruction.
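
(If you want to look at the stall counters yourself, perf exposes generic stalled-cycles-frontend and stalled-cycles-backend events on CPUs and kernels that support them, e.g.:)

$ perf stat -r 10 -e cycles,instructions,stalled-cycles-frontend,stalled-cycles-backend build/loop-call-weirdness t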

So would it be right to assume that the tight loop stalls on loading from counter, because the rest of the instructions in the loop depend on its value? And how do the call and ret help here? By providing non-data-dependent instructions that can run in parallel while the others are stalled? Still, whatever it is, I find this result astonishing.

Let me know if you have any insights.


54 Responses to “Intel i7 loop performance anomaly”

  1. Jacques-Henri Jourdan Says:

    I do not understand everything about modern processor performance. However, here is my guess: tight loops are optimized to not be too small in memory. When you replace the call by some nops, you get similar performance.

  2. eliben Says:

    @Jacques-Henri,

    I actually tried nops and that doesn’t give the same characteristics. Something about the call. Also I don’t know why super-tight loops wouldn’t be optimized, that doesn’t make too much sense to me.
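
    For reference, the nop experiment was a variant along these lines (a sketch, not necessarily the exact padding I used):

    void loop_with_extra_nops() {
      unsigned j;
      for (j = 0; j < N; ++j) {
        __asm__("nop; nop; nop; nop");  /* dummy nops in place of the call */
        counter += j;
      }
    }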

  3. Jerome Says:

    eliben, when you said you tried to insert nops, did you modify the loop_with_extra_call assembly code you present here?

    I don’t know much either on x86 architecture, but here is my guess:

    The loop_with_extra_call loop starts and ends on 32-bit-aligned addresses; the tightloop loop ends on a completely unaligned address.

    AND/OR

    The loop_with_extra_call loop is 32 bytes long, the tightloop loop is 27 bytes long. I don’t know how instruction fetch works, but it may be annoyed by a loop whose length is not a multiple of 8 bytes, or more specifically, it may enjoy a loop that is a multiple of 8 bytes long and turn it into a hardware loop, for example.

    These are just guesses…

  4. Kyle Says:

    With gcc I get the same results on my i5 sandy bridge.

    I tried it with clang 3.4.

    At -O2 clang emits a single add in the loop:

    add     qword ptr [rip + counter], rax

    ~/test/7770377$ perf stat -r 10 -e cycles,instructions build/loop-call-weirdness c

    Performance counter stats for ‘build/loop-call-weirdness c’ (10 runs):

    2,825,727,550 cycles # 0.000 GHz ( +- 0.19% )
    2,401,536,624 instructions # 0.85 insns per cycle ( +- 0.00% )

    0.913222771 seconds time elapsed ( +- 0.41% )

    ~/test/7770377$ perf stat -r 10 -e cycles,instructions build/loop-call-weirdness t

    Performance counter stats for ‘build/loop-call-weirdness t’ (10 runs):

    2,676,779,538 cycles # 0.000 GHz ( +- 2.13% )
    1,601,461,992 instructions # 0.60 insns per cycle ( +- 0.00% )

    0.864274867 seconds time elapsed ( +- 2.31% )

  5. eliben Says:

    @Jerome,

    I tried aligning both loops to 64-byte boundaries – makes no difference.

    The code is easy to compile and play with on your own – there’s a limited amount of experiments I can do :)

  6. eliben Says:

    @Kyle,

    What are the full disassemblies for the loops? [post it to some Gist / pastebin]

  7. just passing Says:

    Could it be that the load is actually being tried optimistically, failing because the store isn’t done yet, and then being repeated when the store it depends on does complete, rather than simply stalling – whereas the call/ret leaves it long enough for the store to complete before the load is tried, which means it doesn’t have to be repeated? After all, that store/load would be the bottleneck here, and the timing (store/load vs store/load-fails/load is 3/2, and so, give or take a bit, is 1.37/0.90) is suggestive.

  8. Kyle Says:

    generated with

    clang -O2 -S -mllvm --x86-asm-syntax=intel loop-call-weirdness.c

    http://pastebin.com/vbkn1jqs

  9. just passing Says:

    Further to my other comment, looks like Sandy Bridge introduced the idea of two load/store units per memory channel; my second guess would be that the load and store are getting paired, and the store is invalidating the load and forcing a rescheduling, whereas the call ensures that the load and store go to the same unit and get properly pipelined/forwarded?

  10. Nadav Says:

    The slowdown is NOT due to the alignment of instructions in the loop. All of the instructions in the loop are decoded and are stored in a uOP buffer. The size of the uOp buffer is like 20k, so both loops fit.

    I think that the extra call interferes with the load-store forwarding mechanism on X86. The Call would generate a store uOP (because we need to save the current IP on the stack). Maybe this extra store delays the forwarding of the store value to the load of the next iteration.

  11. eliben Says:

    @Nadav,

    It’s not clear how this interference with load-store forwarding helps make it run faster, though.

    Do you know someone at Intel who can run this on a full simulator and see what’s actually going on?

  12. Song Gao Says:

    I tried this on my Intel Xeon W3680 (which is at least two years old) but got the opposite result (see below). Is this something that only shows up in newer CPUs?

    gcc -O2 -o main main.c
    perf stat -r 10 -e cycles,instructions ./main t

    Performance counter stats for ‘./main t’ (10 runs):

    2,404,044,302 cycles # 0.000 GHz ( +- 0.02% )
    2,401,349,643 instructions # 1.00 insns per cycle ( +- 0.00% )

    0.722454506 seconds time elapsed ( +- 0.09% )

    perf stat -r 10 -e cycles,instructions ./main c

    Performance counter stats for ‘./main c’ (10 runs):

    2,802,313,253 cycles # 0.000 GHz ( +- 0.00% )
    3,201,498,466 instructions # 1.14 insns per cycle ( +- 0.00% )

    0.841712913 seconds time elapsed ( +- 0.04% )

  13. Nadav Says:

    Tal Uliel.

  14. peter Says:

    This item on Stack Exchange might be relevant:
    http://stackoverflow.com/questions/17896714/why-would-introducing-useless-mov-instructions-speed-up-a-tight-loop-in-x86-64-a

  15. ssssam Says:

    Some more discussion is at http://www.reddit.com/r/programming/comments/1s066i/intel_i7_loop_performance_anomaly/

  16. Stefan Beller Says:

    This observation will not stand!
    http://pastebin.com/0YkiR5Rg

    Intel(R) Core(TM) i7-3740QM CPU @ 2.70GHz using Ubuntu 13.10

  17. Calvin Says:

    Is there a way to make the compiler produce code that doesn’t use RIP relative addressing? Maybe just hand-code it that way? I doubt it would matter, but it might be worth eliminating that variable…

    I’d also be curious to see what the disassembly of main() looks like.

  18. Antti-Ville Tuunainen Says:

    64-bit alignment is insufficient to remove all alignment issues in Sandy Bridge and Haswell. I doubt they are the cause, but to be sure, align the functions starting from the first mov to 256 bits.

    @Nadav: While the uOp buffer is 20k instructions long, it has internal alignment restrictions. Instructions in it are stored up to 6 per line, each line being part of a 32-byte block, and you can have up to 3 lines per block. As the 32-byte boundary is crossed at different points in the two versions (in tightloop, the mov straddles it; in the extra-call version, the add straddles it), if the code was running from the uOp cache, alignment could be an issue.

    However, it isn’t. Because after the uOp cache there is yet another level of cache — the loop buffer, which has a varying length, typically slightly over 30 uops, and both these functions should fit in it. Also, the low throughput screams that the problem is not in the frontend. Even with pessimistic alignment, Haswell should exceed an issue rate of 2 per clock on that code.

    I agree with Nadav that the problem is probably in the store path. This code would hit the store to load forwarding path on every iteration, and it is not publicly perfectly well understood.

    As for how it causes the CPU to be faster — the store-to-load forwarding path might possibly use some form of optimistic optimizations. That is, assume no hit, fix up after. Since this code always hits, it could lead to excessive pipeline flushes. Adding another store could delay the load enough that the conflict can be detected and handled faster.

  19. Kyle Says:

    Do you have vtune?

    There is a resource stall in the tight loop that doesn’t exist in the call version.

    Resource_stalls.RS which I believe is a stall in the reservation station.

    I don’t seem to be able to measure this event with perf.

    tight loop

    Hardware Event Type | Hardware Event Count | Hardware Event Sample Count | Events Per Sample

    RESOURCE_STALLS.ANY 1,610,000,000 35 2000000
    RESOURCE_STALLS.LB 0 0 2000000
    RESOURCE_STALLS.ROB 0 0 2000000
    RESOURCE_STALLS.RS 1,656,000,000 36 2000000
    RESOURCE_STALLS.SB 0 0 2000000
    RESOURCE_STALLS2.ALL_FL_EMPTY 0 0 2000000
    RESOURCE_STALLS2.ALL_PRF_CONTROL 0 0 2000000
    RESOURCE_STALLS2.BOB_FULL 0 0 2000000
    RESOURCE_STALLS2.OOO_RSRC 92,000,000 2 2000000

    call loop

    RESOURCE_STALLS.ANY 0 0 2000000
    RESOURCE_STALLS.LB 0 0 2000000
    RESOURCE_STALLS.ROB 0 0 2000000
    RESOURCE_STALLS.RS 0 0 2000000
    RESOURCE_STALLS.SB 0 0 2000000
    RESOURCE_STALLS2.ALL_FL_EMPTY 0 0 2000000
    RESOURCE_STALLS2.ALL_PRF_CONTROL 0 0 2000000
    RESOURCE_STALLS2.BOB_FULL 0 0 2000000
    RESOURCE_STALLS2.OOO_RSRC 0 0 2000000

  20. Danno Says:

    You’re not the first to observe this issue. It has to do with branch prediction. Here is a Stack Overflow link with a good explanation:

    http://stackoverflow.com/questions/17896714/why-would-introducing-useless-mov-instructions-speed-up-a-tight-loop-in-x86-64-a

  21. Antti-Ville Tuunainen Says:

    @Danno: this loop should be perfectly predicted on almost every iteration, regardless of past history. Branch prediction should therefore have no effect.

  22. eliben Says:

    @Stefan Beller,

    Interesting, though on Reddit people have actually managed to reproduce it with AMD x86s and ARM CPUs as well!

    @Antti-Ville Tuunainen,

    To reiterate, I tried 64-byte (.p2align 6) and it’s the same performance.

    @Danno,

    Interesting, though if you read carefully at least the leading answer there does not apply here, IMHO.

  23. Semi Essessi Says:

    > The loop_with_extra_call loop starts and ends on 32-bit-aligned addresses; the tightloop loop ends on a completely unaligned address.

    I spotted the same thing – but then padding with nops could confirm this – and actually this is the sort of thing that hardware guys will probably push for compiler guys to leverage. Did you try the Intel compiler? (cc in the makefile is gcc.)

    The usual answers already pointed at, about branch prediction, might be the answer, except that the tight loop is probably a dream situation for the branch predictor…

    I’d be curious whether it actually has to do with the nature of the call vs. alignment. Some nops could measure this, or at least rule it out.

  24. T Says:

    I don’t think it is faster.

    The cycles/second is within 0.008% between the two. Yes, one has more instructions, but those probably get eliminated in decoding or by microcode.

  25. kena Says:

    (disclaimer: micro-architecture is my field of research)

    I think it’s quite simple really, also the reason why it can be reproduced with ARM or AMD, as long as the processor has a branch predictor.

    What happens in the tight loop is as follows: the store enters the pipeline, then two cycles later the load enters the pipeline (branch properly predicted), before the store actually reaches the memory unit. Then a couple cycles later, the store reaches the load-store unit (LSU) and the cache is locked for the store. Chances are L1 is write-through (most are nowadays) so the store needs to keep the cache busy for just a little while, perhaps two cycles. But the load follows immediately in the pipeline, and comes against a structural hazard on the cache. A stall ensues, or more probably with out-of-order execution even a reschedule: at least a couple cycles wasted, maybe more depending on the schedule policy.

    With the empty call, the load follows the store a couple cycles delay, just enough that the store is completed by the time the load hits the LSU. And then everything runs smoothly.

  26. eliben Says:

    @T,

    > The cycles/second is within 0.008% between the two

    Of course. Both were run on the same CPU.

    @Kena,

    Similar explanations have been proposed. However, since the longer version runs strictly more instructions (all the instructions of the short version and then some more), it’s hard to make a case that the CPU doesn’t have a bug here – after all, what you say could perhaps explain why the long version is not slower than the short one. But faster?

  27. Kyle Says:

    I found a relevant section in one of the Intel optimization manuals that points to store-to-load forwarding as well.

    3.6.5.2 Store-forwarding Restriction on Data Availability

    The value to be stored must be available before the load operation can be completed. If this restriction is violated, the execution of the load will be delayed until the data is available. This delay causes some execution resources to be used unnecessarily, and that can lead to sizable but non-deterministic delays. However, the overall impact of this problem is much smaller than that from violating size and alignment requirements.

    In modern microarchitectures, hardware predicts when loads are dependent on and get their data forwarded from preceding stores. These predictions can significantly improve performance. However, if a load is scheduled too soon after the store it depends on or if the generation of the data to be stored is delayed, there can be a significant penalty.

    There are several cases in which data is passed through memory, and the store may need to be separated from the load:
    • Spills, save and restore registers in a stack frame
    • Parameter passing
    • Global and volatile variables
    • Type conversion between integer and floating-point
    • When compilers do not analyze code that is inlined, forcing variables that are involved in the interface with inlined code to be in memory, creating more memory variables and preventing the elimination of redundant loads

    http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

  28. Sanjoy Das Says:

    Turns out that replacing "call foo" with "jmp foo" (with foo jmp'ing back to right after the jump, instead of returning) makes this even faster than the call.

  29. MaciejS Says:

    Fascinating! From quick tests I made locally, it seems that the dummy call disables the loop stream detector (which should actually make things faster… that’s probably also why NOPs do not make any difference). The other difference, as noticed by Kyle (probably related), is that tightloop generates way more resource stalls, specifically RS_FULL events (see Intel’s site for more info http://software.intel.com/sites/products/documentation/doclib/iss/2013/amplifier/lin/ug_docs/reference/pmn/events/resource_stalls.html). There are no “long latency” operations in the pipe, though…

  30. just passing again Says:

    @kena – this is the same explanation that ‘just passing’ gave (although his seems to have been ignored) and yes it seems reasonable.

  31. eliben Says:

    @Kyle – that’s very interesting!

  32. Nathan Kurz Says:

    Hi Eli –

    It gets worse. At least on Sandy Bridge, it’s inefficient to issue the “call” on every loop iteration, and much better to call it only every second time: if (j % 2) __asm__("call foo");
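
    Spelled out, the variant is roughly this (assumed name; same pattern as the original functions):

    void __attribute__((noinline)) loop_with_mod2_call() {
      unsigned j;
      for (j = 0; j < N; ++j) {
        if (j % 2)
          __asm__("call foo");  /* dummy call on every second iteration only */
        counter += j;
      }
    }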

    Intel(R) Xeon(R) CPU E5-1620 0 @ 3.60GHz, GCC 4.7.2, x86_64-linux-gnu
    Normal: 3,001,963,820 cycles, 2,401,562,120 instructions
    With Call: 2,802,524,795 cycles, 4,801,383,275 instructions
    Call % 2: 2,403,555,648 cycles, 4,801,278,754 instructions

    | Num Of |              Ports pressure in cycles               |    |
    |  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |    |
    ---------------------------------------------------------------------
    |   0*   |           |     |           |           |     |     |    | xor eax, eax
    |   0*   |           |     |           |           |     |     |    | nop dword ptr [rax], eax
    |   1    |           |     |           |           |     | 1.0 | CP | test al, 0x1
    |   0F   |           |     |           |           |     |     |    | jnz 0x7
    |   3^   |           |     | 0.5       | 0.5       | 1.0 | 1.0 | CP | call 0x5
    |   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |    | mov rdx, qword ptr [rip]
    |   1    | 0.4       | 0.6 |           |           |     |     |    | add rdx, rax
    |   1    | 0.6       | 0.4 |           |           |     |     |    | add rax, 0x1
    |   1    | 0.4       | 0.5 |           |           |     | 0.1 | CP | cmp rax, 0x17d78400
    |   2^   |           |     | 0.5       | 0.5       | 1.0 |     |    | mov qword ptr [rip], rdx
    |   1    |           |     |           |           |     | 1.0 | CP | jnz 0xffffffffffffffdc

    None of it actually makes sense to me yet, but this might lend credence to the idea that the Loop Stream Detector (which is defeated by the ‘call’) is somehow slower than the more modern Decoded Instruction Cache.

  33. just passing Says:

    @eliben,

    Doesn’t matter that strictly more instructions are executed, if that mix of instructions fits in better with the execution units available. Think Tetris.

  34. Nathan Kurz Says:

    I think my proposed distinction between Loop Stream and Decoded ICache is wrong. I’ve been able to come up with equally fast solutions that use both. But what the fast solutions all have in common is that they add pressure to Ports 2 and 3. I think Kyle is correct that the issue is store-to-load-forwarding being defeated by loads that are “too close” to the stores. In particular, I think “too close” means that they try to execute in the same cycle. Prefetches work to break these up, but I’ve had the best success with adding an unrelated read between the store and the reload. Here’s my fastest solution so far for Sandy Bridge (specs in previous post):

    volatile unsigned dummy = 0;
    void loop_dummy_read() {
        IACA_START;
        unsigned j;
        unsigned dummy_read;
        for (j = 0; j < N; ++j) {
    	dummy_read = dummy;
            counter += j;
        }
        IACA_END;
    }
    | Num Of |              Ports pressure in cycles               |    |
    |  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |    |
    ---------------------------------------------------------------------
    |   0*   |           |     |           |           |     |     |    | xor eax, eax
    |   0*   |           |     |           |           |     |     |    | nop dword ptr [rax], eax
    |   1    |           |     | 1.0   1.0 |           |     |     |    | mov rdx, qword ptr [rip]
    |   1    |           |     |           | 1.0   1.0 |     |     |    | mov rdx, qword ptr [rip]
    |   1    | 0.4       | 0.6 |           |           |     | 0.1 |    | add rdx, rax
    |   1    | 0.4       | 0.2 |           |           |     | 0.3 |    | add rax, 0x1
    |   1    | 0.3       | 0.4 |           |           |     | 0.3 |    | cmp rax, 0x17d78400
    |   2^   |           |     | 0.5       | 0.5       | 1.0 |     |    | mov qword ptr [rip], rdx
    |   1    |           |     |           |           |     | 1.0 |    | jnz 0xffffffffffffffde

    It’s exactly the same as the ‘tight’ loop, but with an extra read. Here’s the original ‘tight’ for comparison:

    | Num Of |              Ports pressure in cycles               |    |
    |  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |    |
    ---------------------------------------------------------------------
    |   0*   |           |     |           |           |     |     |    | xor eax, eax
    |   0*   |           |     |           |           |     |     |    | nop dword ptr [rax], eax
    |   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |    | mov rdx, qword ptr [rip]
    |   1    | 0.6       | 0.3 |           |           |     |     |    | add rdx, rax
    |   1    | 0.6       | 0.3 |           |           |     | 0.1 |    | add rax, 0x1
    |   1    | 0.1       | 0.6 |           |           |     | 0.3 |    | cmp rax, 0x17d78400
    |   2^   |           |     | 0.5       | 0.5       | 1.0 |     |    | mov qword ptr [rip], rdx
    |   1    |           |     |           |           |     | 1.0 |    | jnz 0xffffffffffffffe5

    Here’s what I see for performance, improving from Tight to Call to Prefetch, and then just a smidgen faster than that for Dummy:

    nate@fastpfor:~/tmp$ perf stat -r  6 -e cycles,instructions,r4a2  loop-call-weirdness t
    
     Performance counter stats for 'loop-call-weirdness t' (6 runs):
    
         3,001,611,648 cycles                    #    0.000 GHz                      ( +-  0.00% )
         2,401,527,805 instructions              #    0.80  insns per cycle          ( +-  0.00% )
         2,199,485,167 raw 0x4a2                                                     ( +-  0.01% )
    
           0.835984171 seconds time elapsed                                          ( +-  0.00% )
    
    nate@fastpfor:~/tmp$ perf stat -r  6 -e cycles,instructions,r4a2  loop-call-weirdness c
    
     Performance counter stats for 'loop-call-weirdness c' (6 runs):
    
         2,802,111,340 cycles                    #    0.000 GHz                      ( +-  0.00% )
         4,801,501,952 instructions              #    1.71  insns per cycle          ( +-  0.00% )
                87,343 raw 0x4a2                                                     ( +- 17.94% )
    
           0.780426256 seconds time elapsed                                          ( +-  0.00% )
    
    nate@fastpfor:~/tmp$ perf stat -r  6 -e cycles,instructions,r4a2  loop-call-weirdness p
    
     Performance counter stats for 'loop-call-weirdness p' (6 runs):
    
         2,462,457,921 cycles                    #    0.000 GHz                      ( +-  0.40% )
         2,801,321,356 instructions              #    1.14  insns per cycle          ( +-  0.00% )
         1,660,383,614 raw 0x4a2                                                     ( +-  0.59% )
    
           0.685858861 seconds time elapsed                                          ( +-  0.40% )
    
    nate@fastpfor:~/tmp$ perf stat -r  6 -e cycles,instructions,r4a2  loop-call-weirdness d
    
     Performance counter stats for 'loop-call-weirdness d' (6 runs):
    
         2,419,697,328 cycles                    #    0.000 GHz                      ( +-  0.24% )
         2,801,365,416 instructions              #    1.16  insns per cycle          ( +-  0.00% )
         1,617,649,166 raw 0x4a2                                                     ( +-  0.35% )
    
           0.673950006 seconds time elapsed                                          ( +-  0.24% )

    I think that the "r4a2" event allows measuring RESOURCE_STALLS.RS with perf. I include it to show the lack of correlation with total cycles and run time. I have not yet found a counter that does correlate with runtime. I thought I would have luck with LOAD_BLOCKS/LD_BLOCKS, but didn’t find anything that lines up. I would be very interested to find which measure does.
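
    For what it’s worth, the prefetch ('p') variant is essentially the tight loop with a prefetch of counter inserted ahead of the update. A sketch (assuming __builtin_prefetch is used; the exact code is in my test file):

    void loop_with_prefetch() {
      unsigned j;
      for (j = 0; j < N; ++j) {
        __builtin_prefetch((const void *)&counter, 1);  /* 1 = prepare for a write */
        counter += j;
      }
    }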

  35. nedbrek Says:

    The “faster” version is not truly measurably faster. The overhead in measuring cancels the apparent difference (the tolerance of your measurement is looser than the real difference).

    A modern processor can execute a lot of instructions in parallel. This is limited only by the branch predictor (which, as you noticed, is going to do really well). There is some critical path to your program (a whole lot of loads and stores with adds in between) – as long as you don’t impact that critical path, you can do a lot of stuff in parallel.

  36. eliben Says:

    @Nathan Kurz,

    What tool/simulator are you using for those listings with port pressures?

    @nedbrek,

    I beg to differ. A 10% difference is definitely a measurable difference, especially when confirmed by multiple others on different machines (and architectures!)

  37. Nadav Says:

    @Nathan Kurz, IACA uses instruction latencies and port assignment and can’t model the complexity of the load-store forwarding mechanism.

  38. Nathan Kurz Says:

    @eliben: http://software.intel.com/en-us/articles/intel-architecture-code-analyzer
    Far from perfect, but much better than nothing.

  39. Svilen Kanev Says:

    I looked into this a little. I have a suspicion that it has to do with the Sandy Bridge implementation of register renaming, but nothing completely conclusive.
    Notes are on my blog: http://blog.skanev.org/2013/12/call-performance-weirdness.html

  40. Nathan Kurz Says:

    I think Svilen is on to something. In addition to moves and prefetches, an extra XCHG can produce a speedup even greater than a CALL. Playing around with different operands, I found that exchanging ‘counter’ with itself produced the fastest result:

    void loop_exchange() {
        IACA_START;
        unsigned j;
        for (j = 0; j < N; ++j) {
            __asm__("xchg %rdx, %rdx"); // assumes counter is %rdx
            counter += j;
        }
        IACA_END;
    }

    With the same baseline speeds and Sandy Bridge processor as above, I just barely (but consistently) edge out the ‘dummy read’ version:

    nate@fastpfor:~/tmp$ perf stat -r  6 -e cycles,instructions loop-call-weirdness e
         2,406,843,856 cycles                    #    0.000 GHz                      ( +-  0.01% )
         2,801,328,713 instructions              #    1.16  insns per cycle          ( +-  0.00% )
           0.670357920 seconds time elapsed                                          ( +-  0.01% )

  41. caf Says:

    Linus suggests ( http://www.realworldtech.com/forum/?threadid=138097&curpostid=138183 ) that this is due to the alias detection failing on the PC-relative load and store to the same address, so it tries to hoist the load above the store, which then fails (obviously).

    This implies that passing the address of counter should improve things:

    void tightloop(volatile unsigned long long *pcounter)
    {
      unsigned j;
      for (j = 0; j < N; ++j) {
        *pcounter += j;
      }
    }

  42. eliben Says:

    @caf,

    Yes, I think this is one of the more popular emerging explanations, and @Kyle’s quote from the Intel optimization guide seems related.

  43. nedbrek Says:

    Sorry for the delay. I wanted to check a couple of machines.

    It does appear something is odd. I modified the benchmark to account for the error I mentioned:
    @@ -1,4 +1,4 @@
    -const unsigned N = 400 * 1000 * 1000;
    +const unsigned N = 4000u * 1000 * 1000;

    This increases the runtime 10x, and reduces the vulnerability to jitter.

    I am getting 7.711528942 on tight, and 7.084437664 with the call (about 10%, which is consistent with your findings).

    I will try and dig into this more…

  44. Nathan Kurz Says:

    Per CAF, I just tried passing the counter as a variable, and found no difference from the base “tight”:

    tight, 3001715771, cycles, 0.01%
    variable, 3001286809, cycles, 0.00%
    call, 2802080562, cycles, 0.00%
    write, 2794980288, cycles, 0.47%
    prefetch, 2532314708, cycles, 0.58%
    selfdummy, 2424865808, cycles, 0.20%
    jump, 2416657598, cycles, 0.18%
    dummyread, 2422786648, cycles, 0.26%
    2mod, 2404006892, cycles, 0.03%
    exchange, 2406538561, cycles, 0.01%

    I’m gone for the weekend, but I put my current tests up at https://github.com/nkurz/loop-call-weirdness

    Prefetches and extra reads work as well or better than adding a call. I can come up with plausible explanations of this, but nothing that compelling. The strange part is that the register involved in the read makes more difference than the target address. A double read into the counter register (‘dummyread’ above) is the fastest of these options.

    I think the most interesting question is why adding an ‘xchg’ makes things fastest, and why the choice of registers used in the ‘xchg’ makes a difference. ‘xchg same, same’ should be a NOP, but ‘xchg %rdx, %rdx’ (with counter in %rdx) is fastest and ‘xchg %rcx, %rcx’ (otherwise unused) is worse. (Need to modify the source to see.)

    I’m also confused why a selective call (‘2mod’ above) is faster than a straight call. A poorly predicted branch is predictably poor, but here the best case seems to be a 50/50 perfectly predicted branch that either calls or does not call. A branch that is always true or always false is worse. (Need to modify the source to see.) How can this be?

  45. nedbrek Says:

    Ok, I think I have some good data…

    It does seem to be some issue with ld/st collisions. It would take someone with knowledge of the implementation to say exactly what is going on. My guess is that you have some occasional replays when the dependent load is issued too close to the store it depends on.

    The critical path (in uops) is:
    .L7:
    1 upush nextip
    2 ujmp foo
    3 uload rdx = [rip+imm]; depends on (7) store(n-1)
    4 uadd rdx += rax; depends on (5) add(n-1)
    5 uadd rax += 1; depends on (5) add(n-1)
    6 ucmp rax - rcx; depends on (5)(n)
    7 ustore rdx, [rip+imm]; depends on (4)(n)
    8 ujne .L7; depends on (6)(n)

    The time is 6 cycles per iteration in the fast case, and 6.5 in the slow case.

    The critical path is 3,4,7 (ld,add,st) – with 4 cycles for the load, 1 for add, and 1 (ideally) for store.

    Changing the code slightly will erase the difference:
    @@ -1,6 +1,7 @@
    -const unsigned N = 400 * 1000 * 1000;
    +const unsigned N = 4000u * 1000 * 1000;

    -volatile unsigned long long counter = 0;
    +volatile unsigned long long counter1 = 0;
    +volatile unsigned long long counter2 = 0;

    // Don’t inline the benchmarking code into main
    void __attribute__((noinline)) tightloop();
    @@ -8,8 +9,9 @@ void __attribute__((noinline)) loop_with_extra_call();

    void tightloop() {
    unsigned j;
    - for (j = 0; j < N; ++j) {
    - counter += j;
    + for (j = 0; j < N; j+= 2) {
    + counter1 += j;
    + counter2 += j + 1;
    }
    }

    @@ -18,9 +20,10 @@ void foo() {

    void loop_with_extra_call() {
    unsigned j;
    - for (j = 0; j < N; ++j) {
    + for (j = 0; j < N; j += 2) {
    __asm__("call foo");
    - counter += j;
    + counter1 += j;
    + counter2 += j + 1;
    }
    }

    That unrolls the loop by a factor of 2, and splits the dependency chain into two independent chains that can run in parallel.

    The results are:
    funroll tight
    16,046,759,667 cycles # 0.000 GHz
    18,006,775,147 instructions # 1.12 insns per cycle

    4.570298141 seconds time elapsed

    funroll call
    16,057,416,003 cycles # 0.000 GHz
    22,008,290,298 instructions # 1.37 insns per cycle

    4.604639464 seconds time elapsed

    Which is within the margin of error.

  46. eliben Says:

    @nedbrek,

    Where is the µop breakdown from?

  47. nedbrek Says:

    You can get the mapping from the optimization guide. I left out the “return” in foo, but I don’t think the front-end is a bottleneck here – it’s got 6 (or 6.5) cycles to deliver the uops.

  48. caf Says:

    Nathan Kurz: You need to put the loop functions into a separate source file – as you have it now, the loop functions are being inlined into main(). In particular this means that the indirection through the pointer in loop_var() does not actually happen, because in the inlined version the compiler can resolve the indirection at compile-time so it ends up being exactly the same code as tightloop().

  49. Nathan Kurz Says:

    caf: I agree this could be happening, but it doesn’t seem to be the case. I’ve changed the Makefile so the functions are not inlined, confirmed with objdump, and checked in the output of ‘objdump -d loop-call-weirdness > objdump-d.txt’: https://github.com/nkurz/loop-call-weirdness/blob/master/objdump-d.txt

    0000000000400590 <tightloop>:
      400590:        31 c0                        xor    %eax,%eax
      400592:        66 0f 1f 44 00 00            nopw   0x0(%rax,%rax,1)
      400598:        48 8b 15 a1 0a 20 00         mov    0x200aa1(%rip),%rdx        # 601040 <counter>
      40059f:        48 01 c2                     add    %rax,%rdx
      4005a2:        48 83 c0 01                  add    $0x1,%rax
      4005a6:        48 3d 00 84 d7 17            cmp    $0x17d78400,%rax
      4005ac:        48 89 15 8d 0a 20 00         mov    %rdx,0x200a8d(%rip)        # 601040 <counter>
      4005b3:        75 e3                        jne    400598 <tightloop+0x8>
      4005b5:        f3 c3                        repz retq
      4005b7:        66 0f 1f 84 00 00 00         nopw   0x0(%rax,%rax,1)
      4005be:        00 00 
    
    00000000004005c0 <loop_var>:
      4005c0:        31 c0                        xor    %eax,%eax
      4005c2:        66 0f 1f 44 00 00            nopw   0x0(%rax,%rax,1)
      4005c8:        48 8b 17                     mov    (%rdi),%rdx
      4005cb:        48 01 c2                     add    %rax,%rdx
      4005ce:        48 83 c0 01                  add    $0x1,%rax
      4005d2:        48 3d 00 84 d7 17            cmp    $0x17d78400,%rax
      4005d8:        48 89 17                     mov    %rdx,(%rdi)
      4005db:        75 eb                        jne    4005c8 <loop_var+0x8>
      4005dd:        f3 c3                        repz retq
      4005df:        90                           nop

    I think they are working as you suggest (RIP relative vs counter in RDI), but I’m seeing no difference in performance:

    tight, 3001173584, cycles, 0.01%
    variable, 3001496186, cycles, 0.00%

  50. Nathan Kurz Says:

    I presume almost everyone has moved on, but I’m still confused by this. I think the increase in speed due to ‘call’ is a red herring — there are lots of things that change the speed of the loop. The one I’d most like to understand is the use of “XCHG same, same”, and why the choice of register makes a large difference.

    0000000000400640 <loop_exchange>:
      400640:       31 c0                   xor    %eax,%eax
      400642:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
      400648:       48 87 d2                xchg   %rdx,%rdx
  40064b:       48 8b 15 ee 09 20 00    mov    0x2009ee(%rip),%rdx        # 601040 <counter>
      400652:       48 01 c2                add    %rax,%rdx
      400655:       48 83 c0 01             add    $0x1,%rax
      400659:       48 3d 00 84 d7 17       cmp    $0x17d78400,%rax
  40065f:       48 89 15 da 09 20 00    mov    %rdx,0x2009da(%rip)        # 601040 <counter>
      400666:       75 e0                   jne    400648 <loop_exchange+0x8>
      400668:       f3 c3                   repz retq
      40066a:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
    0000000000400670 <loop_exchange_other>:
      400670:       31 c0                   xor    %eax,%eax
      400672:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
      400678:       48 87 c9                xchg   %rcx,%rcx
      40067b:       48 8b 15 be 09 20 00    mov    0x2009be(%rip),%rdx        # 601040 <counter>
      400682:       48 01 c2                add    %rax,%rdx
      400685:       48 83 c0 01             add    $0x1,%rax
      400689:       48 3d 00 84 d7 17       cmp    $0x17d78400,%rax
      40068f:       48 89 15 aa 09 20 00    mov    %rdx,0x2009aa(%rip)        # 601040 <counter>
      400696:       75 e0                   jne    400678 <loop_exchange_other+0x8>
      400698:       f3 c3                   repz retq
      40069a:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)

    The only difference between these two is that “xchg %rdx, %rdx” has been changed to “xchg %rcx, %rcx”, and the only difference between them and ‘tight’ is that the ‘xchg’ has been added. And yet we see large differences in performance:

    tight, 3002184598, cycles, 0.04%
    exchange, 2406490282, cycles, 0.01%
    other, 2766645894, cycles, 0.66%

    Are any of the theories presented so far able to explain the large difference in execution speed between a useless “xchg %rcx, %rcx” vs a useless “xchg %rdx, %rdx”?

  51. eliben Says:

    @Nathan,

    I’m still reading all your explorations with great interest. In the latest, one obvious difference is that %rdx is actually used within the loop, unlike %rcx. Why this makes an actual performance difference is a mystery, similar to the original problem. I guess that what I really hoped would happen is for someone from Intel to come and conclusively say – here is the problem. But alas no one stepped up (this post is widely popular, I’m 100% sure multiple Intel engineers ran into it by now).

  52. Ants Aasma Says:

    The problem is the store followed almost immediately by a load from the same address. For Intel CPUs with 2 load ports this seems to cause replays when the load is scheduled too close to the store. Speculative load-store reordering is supposed to have a predictor that disables reordering when a load instruction is shown to alias, but for some reason it doesn’t seem to work in this example. Try adding 7 loads from unrelated addresses before the load from the counter variable. This fills the load/address-generation ports with enough work to prevent the replay hazard, resulting in the performance you would expect from execution bandwidth and instruction latencies.

  53. Nathan Kurz Says:

    Hi Ants — You are definitely on the right track. Adding the correct sequence of dummy loads dramatically improves performance. The best I’ve found is 3 dummy loads, the real load and add, 6 dummy loads, then the store.

    void loop_load() {
        IACA_START;
        register uint64_t c = 0;
        register uint64_t d = 0;
        for (uint64_t i = LOOP_COUNT; i; i--) {
            ASM_LOAD(dummy, d);
            ASM_LOAD(dummy, d);
            ASM_LOAD(dummy, d);
            ASM_ADD(i, c);
            ASM_LOAD(dummy, d);
            ASM_LOAD(volatile_global_counter, c);
            ASM_LOAD(dummy, d);
            ASM_LOAD(dummy, d);
            ASM_LOAD(dummy, d);
            ASM_LOAD(dummy, d);
            ASM_STORE(c, volatile_global_counter);
        }
        IACA_END;
    }
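
    (The ASM_* macros, dummy and volatile_global_counter are defined in the repo linked earlier; IACA_START/IACA_END are the usual IACA marker macros. Plausible stand-ins, if you just want to follow the disassembly below, would be something like this using gcc extended asm. These are guesses, not the exact definitions:)

    #include <stdint.h>
    #define LOOP_COUNT 100000000ULL
    volatile uint64_t dummy = 0;
    volatile uint64_t volatile_global_counter = 0;
    /* load var (memory) into reg, add var into reg, store reg back to var */
    #define ASM_LOAD(var, reg)  __asm__ volatile ("mov %1, %0" : "=r"(reg) : "m"(var))
    #define ASM_ADD(var, reg)   __asm__ volatile ("add %1, %0" : "+r"(reg) : "r"(var))
    #define ASM_STORE(reg, var) __asm__ volatile ("mov %1, %0" : "=m"(var) : "r"(reg))
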
    0000000000400a00 <loop_load>:
      400a00:       xor    %edx,%edx
      400a02:       mov    $0x5f5e100,%eax
      400a07:       nopw   0x0(%rax,%rax,1)
      400a10:       mov    0x200c59(%rip),%rcx        # 601670 <dummy>
      400a17:       mov    0x200c52(%rip),%rsi        # 601670 <dummy>
      400a1e:       mov    0x200c4b(%rip),%rdi        # 601670 <dummy>
      400a25:       add    %rax,%rdx
      400a28:       mov    0x200c41(%rip),%rdx        # 601670 <dummy>
      400a2f:       mov    0x200c32(%rip),%rdx        # 601668 <volatile_global_counter>
      400a36:       mov    0x200c33(%rip),%r8        # 601670 <dummy>
      400a3d:       mov    0x200c2c(%rip),%r9        # 601670 <dummy>
      400a44:       mov    0x200c25(%rip),%r10        # 601670 <dummy>
      400a4b:       mov    0x200c1e(%rip),%r11        # 601670 <dummy>
      400a52:       mov    %rdx,0x200c0f(%rip)        # 601668 <volatile_global_counter>
      400a59:       dec    %rax
      400a5c:       jne    400a10 <loop_load+0x10>
      400a5e:       retq
      400a5f:       nop
    likwid-perfctr -g INSTR_RETIRED_ANY:FIXC0,CPU_CLK_UNHALTED_CORE:FIXC1,CPU_CLK_UNHALTED_REF:FIXC2,LOAD_BLOCKS_DATA_UNKNOWN:PMC0,MEM_UOP_RETIRED_STORES:PMC1,UOPS_DISPATCHED_PORT_PORT_4:PMC2 -C 0 ./intel-loop l
    
    |      INSTR_RETIRED_ANY      | 1.30037e+09 |
    |    CPU_CLK_UNHALTED_CORE    | 5.00899e+08 |
    |    CPU_CLK_UNHALTED_REF     | 5.00899e+08 |
    |  LOAD_BLOCKS_DATA_UNKNOWN   |    37053    |
    |   MEM_UOP_RETIRED_STORES    | 1.00035e+08 |
    | UOPS_DISPATCHED_PORT_PORT_4 | 1.00086e+08 |

    The important metrics to note are that there are few “LOAD_BLOCKS_DATA_UNKNOWN” events, and that the number of stores retired is almost the same as the number dispatched. This is for 100,000,000 iterations, thus we are at 5.0 cycles per iteration. By contrast, the ‘final’ C loop is around 6.5 cycles.

  54. Shami Says:

    Check out this link on StackOverFlow, it might help http://stackoverflow.com/questions/17896714/why-would-introducing-useless-mov-instructions-speed-up-a-tight-loop-in-x86-64-a
