The cost of dynamic (virtual calls) vs. static (CRTP) dispatch in C++

December 5th, 2013 at 6:13 am

A couple of years ago I wrote an article about the Curiously Recurring Template Pattern in C++, focusing on the motivation behind it and how to implement it.

That article mentioned runtime performance as the main reason for employing CRTP instead of the more traditional runtime polymorphism (dispatch via virtual functions). While some rationale for the cost of virtual calls was given, I didn’t go too deep into it. Today I want to fix that by carefully analyzing the performance of virtual calls as opposed to the static calls made possible by CRTP.

Mandatory precaution about benchmarks

Benchmarking in 2013 is really hard. Today’s CPUs are super-pipelined, branch-predicting out-of-order executing beasts. The memory hierarchy is very deep and the caches have complex behavior. All of this makes detailed performance analysis devilishly complex, and the results are sometimes baffling. We’re clearly long past counting MIPS. Add to that overly-clever optimizing compilers that occasionally produce not quite the code you expected, and it’s apparent why so many online resources and articles provide bad benchmarks.

So any benchmark needs to be taken with a large grain of salt, including the ones posted here. Personally, I’m trying to validate the benchmarks I’m running by attacking them with the scientific method:

  1. First, create a hypothesis about the relative speed of two approaches.
  2. Take a detailed look at the code generated by the compiler to verify the hypothesis w.r.t. code generation – is this the machine code you expected to see?
  3. Run the benchmark and compare the runtime to the initial hypothesis, as well as to (2) – while not perfect, performance is easier to correlate to machine code than to original source code.
  4. If anything doesn’t feel right, or just to make (3) more careful, use low-level counters to make sure that the number of instructions executed and other such details make sense given (2).
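
For step 4, Linux perf is a convenient way to read those counters; the perf stat listings shown later in this article come from invocations of this form:

$ perf stat build/vcall-benchmark d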

Hypothesis – what makes virtual calls slower

The previous article listed the following components in the runtime cost of virtual calls:

  • Extra indirection (pointer dereference) for each call to a virtual method.
  • Virtual methods usually can’t be inlined, which may be a significant cost hit for some small methods.
  • Additional pointer per object. On 64-bit systems which are prevalent these days, this is 8 bytes per object. For small objects that carry little data this may be a serious overhead.

While the third component can definitely play a role in some scenarios (e.g. a lot of small objects, where the additional memory means fewer of them fit into the L1 data cache), I’ll focus on the first two in this article, because they are easier to expose in a simple synthetic benchmark.
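
By the way, the third component is easy to observe directly. Here’s a minimal sketch (the struct names are invented for this demonstration) showing the per-object vptr overhead under a typical 64-bit Itanium-ABI layout; we’ll see the same numbers in a Clang record-layout dump later on:

#include <cstdint>

struct Plain {
  uint64_t counter;
};

struct WithVirtual {
  virtual void tick(uint64_t n) { counter += n; }
  uint64_t counter;
};

// The vptr adds 8 bytes per object on a typical 64-bit target.
static_assert(sizeof(Plain) == 8, "expected no vptr");
static_assert(sizeof(WithVirtual) == sizeof(Plain) + 8, "expected one vptr");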

The source code – what are we comparing?

There’s a plethora of uses for polymorphism in C++. Here I’ll focus on a basic one that will let me expose the performance characteristics of virtual calls. I’ll define a simple interface with a couple of methods and one implementation of it:

class DynamicInterface {
public:
  virtual void tick(uint64_t n) = 0;
  virtual uint64_t getvalue() = 0;
};

class DynamicImplementation : public DynamicInterface {
  uint64_t counter;

public:
  DynamicImplementation()
    : counter(0) {
  }

  virtual void tick(uint64_t n) {
    counter += n;
  }

  virtual uint64_t getvalue() {
    return counter;
  }
};

The following code runs the actual benchmark:

const unsigned N = 40000;

void run_dynamic(DynamicInterface* obj) {
  for (unsigned i = 0; i < N; ++i) {
    for (unsigned j = 0; j < i; ++j) {
      obj->tick(j);
    }
  }
}

What this does is simply invoke the virtual method tick on the base pointer obj, on the order of N^2 times.

The alternative statically-polymorphic implementation is this [1]:

template <typename Implementation>
class CRTPInterface {
public:
  void tick(uint64_t n) {
    impl().tick(n);
  }

  uint64_t getvalue() {
    return impl().getvalue();
  }
private:
  Implementation& impl() {
    return *static_cast<Implementation*>(this);
  }
};

class CRTPImplementation : public CRTPInterface<CRTPImplementation> {
  uint64_t counter;
public:
  CRTPImplementation()
    : counter(0) {
  }

  void tick(uint64_t n) {
    counter += n;
  }

  uint64_t getvalue() {
    return counter;
  }
};

template <typename Implementation>
void run_crtp(CRTPInterface<Implementation>* obj) {
  for (unsigned i = 0; i < N; ++i) {
    for (unsigned j = 0; j < i; ++j) {
      obj->tick(j);
    }
  }
}
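
The actual benchmark harness isn’t shown in this article, so take the following driver as an illustrative sketch only; the real binary selects the variant with a command-line argument (‘d’ or ‘c’), as the perf output later on suggests:

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstring>

int main(int argc, char** argv) {
  bool dynamic = (argc > 1 && strcmp(argv[1], "d") == 0);
  uint64_t result = 0;
  auto t1 = std::chrono::high_resolution_clock::now();
  if (dynamic) {
    DynamicImplementation obj;
    run_dynamic(&obj);
    result = obj.getvalue();
  } else {
    CRTPImplementation obj;
    run_crtp(&obj);
    result = obj.getvalue();
  }
  auto t2 = std::chrono::high_resolution_clock::now();
  std::chrono::duration<double> elapsed = t2 - t1;
  // Use the result so the compiler can't discard the whole computation.
  printf("result=%llu elapsed=%f sec\n", (unsigned long long)result,
         elapsed.count());
  return 0;
}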

Generated code – how virtual calls look under the hood

Now let’s spend some time studying the machine code generated by gcc -O2 (version 4.8) from the code above. The code for DynamicImplementation::tick is very compact:

0000000000400cf0 <_ZN21DynamicImplementation4tickEm>:
  400cf0:       add    %rsi,0x8(%rdi)
  400cf4:       retq

To understand what this means, some familiarity with the Itanium C++ ABI is required. The ABI in this case mandates both the name mangling that produces the weird symbol name, and the layout of the object in memory, which mandates how its fields are accessed. Here’s a short description for the code above:

Since DynamicInterface has virtual methods, the class hierarchy it begets comes with a virtual method table, a pointer to which resides in each object. This is the way the compiler arranges for the runtime code to call the correct method when an actual object is used. The address of the virtual method table (vptr) is in the beginning of the object, and the actual class members come afterwards. So counter lives at offset 8 in DynamicImplementation objects.

add    %rsi,0x8(%rdi)

%rdi is the first argument to tick, which is the hidden this pointer – the address of the object. Hence 0x8(%rdi) is the address of this->counter. The instruction, then, adds n (passed in %rsi according to the calling convention) to this->counter.

By the way, whenever you’re curious about object layouts and want to verify your understanding of the ABI, I find Clang’s ability to dump the class record layouts very helpful.
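If I recall the invocation correctly, something along these lines produces the dump (the exact flag spelling may vary between Clang versions, and the file name here is just a placeholder):

$ clang -Xclang -fdump-record-layouts -fsyntax-only benchmark.cpp

For our two implementations, it shows: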

*** Dumping AST Record Layout
   0 | class DynamicImplementation
   0 |   class DynamicInterface (primary base)
   0 |     (DynamicInterface vtable pointer)
   8 |   uint64_t counter
     | [sizeof=16, dsize=16, align=8
     |  nvsize=16, nvalign=8]

*** Dumping AST Record Layout
   0 | class CRTPImplementation
   0 |   class CRTPInterface<class CRTPImplementation> (base) (empty)
   0 |   uint64_t counter
     | [sizeof=8, dsize=8, align=8
     |  nvsize=8, nvalign=8]

On to the invocation of tick now. This is the disassembly for run_dynamic, annotated with comments:

0000000000400c10 <_Z11run_dynamicP16DynamicInterface>:
  400c10:       push   %r13
  400c12:       mov    $0x1,%r13d
  400c18:       push   %r12
        // r12d holds i, initialized to 0
  400c1a:       xor    %r12d,%r12d
  400c1d:       push   %rbp
        // Place obj in %rbp
  400c1e:       mov    %rdi,%rbp
  400c21:       push   %rbx
  400c22:       sub    $0x8,%rsp
  400c26:       nopw   %cs:0x0(%rax,%rax,1)
  400c30:       test   %r12d,%r12d
        // when i is 0, the body of the loop won't run, so increment
        // both i and j and try again.
  400c33:       je     400c5e
        // rbx holds j, initialized to 0
  400c35:       xor    %ebx,%ebx
  400c37:       nopw   0x0(%rax,%rax,1)
        // Place the address of obj's vtable in rax
  400c40:       mov    0x0(%rbp),%rax
        // j is the second argument of tick
  400c44:       mov    %rbx,%rsi
        // j++
  400c47:       add    $0x1,%rbx
        // obj is the first argument of tick ('this' pointer)
  400c4b:       mov    %rbp,%rdi
        // tick is the first entry in the vtable.
        // This calls obj->tick(obj, j)
  400c4e:       callq  *(%rax)
        // Compare j < i and perform inner loop
  400c50:       cmp    %ebx,%r12d
  400c53:       ja     400c40
        // Compare i == 40000 and perform outer loop
  400c55:       cmp    $0x9c40,%r13d
  400c5c:       je     400c68
  400c5e:       add    $0x1,%r13d
  400c62:       add    $0x1,%r12d
  400c66:       jmp    400c30
  400c68:       add    $0x8,%rsp
  400c6c:       pop    %rbx
  400c6d:       pop    %rbp
  400c6e:       pop    %r12
  400c70:       pop    %r13
  400c72:       retq
  400c73:       data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)

The interesting parts here are:

  1. How obj->tick is actually invoked. Since tick is the first method in DynamicInterface, it sits in the first slot in the vtable. So to actually call it, we have a double indirection from obj – one to get to the vtable, the other to get to the method in the vtable (there’s a conceptual sketch of this just after the list).
  2. The constituents of the inner loop – the part that the program spends the vast majority of its time executing. We’ll get back to it for a more careful analysis later.
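
To make that double indirection concrete, here’s a rough C++ rendering of what the compiled code at 400c40..400c4e does. This is purely an illustration of the mechanism (the names are invented, and real code should never poke at vtables like this); it assumes the Itanium-ABI layout described above, where the vptr sits at offset 0 and tick occupies the first vtable slot:

// Hypothetical illustration only: what 'obj->tick(j)' lowers to.
using TickFn = void (*)(DynamicInterface*, uint64_t);

void emulated_virtual_call(DynamicInterface* obj, uint64_t j) {
  // First indirection: read the vtable pointer from the start of the object.
  TickFn* vtable = *reinterpret_cast<TickFn**>(obj);
  // Second indirection: read the function pointer from slot 0 of the vtable.
  TickFn tick = vtable[0];
  // The object address is passed as the hidden 'this' argument.
  tick(obj, j);
}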

How CRTP calls look under the hood

Now it’s time to disassemble the equivalent code that uses CRTP for static polymorphism. Again, we’ll want to start with CRTPImplementation::tick, but we won’t find it in the disassembly because it was fully inlined into run_crtp. The compiler was able to inline it because it could know statically (at compile time) which method is called. Such inlining is an important tenet of the "zero-cost abstractions" philosophy of modern C++.
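
A quick way to convince yourself of the inlining is to search the binary for the symbol; something along these lines (reusing the binary name from the perf runs) should come up empty for the CRTP tick, while the analogous search for DynamicImplementation::tick finds a match:

$ objdump -d build/vcall-benchmark | c++filt | grep 'CRTPImplementation::tick'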

Let’s go straight to run_crtp, then:

0000000000400d00 <_Z8run_crtpI18CRTPImplementationEvP13CRTPInterfaceIT_E>:
        // Place obj->counter into rdx
  400d00:       mov    (%rdi),%rdx
  400d03:       mov    $0x1,%esi
        // rcx holds i, initialized to 0
  400d08:       xor    %ecx,%ecx
  400d0a:       nopw   0x0(%rax,%rax,1)
  400d10:       test   %ecx,%ecx
  400d12:       je     400d36
        // rax holds j, initialized to 0
  400d14:       xor    %eax,%eax
  400d16:       nopw   %cs:0x0(%rax,%rax,1)
        // counter += j
  400d20:       add    %rax,%rdx
        // j++ and perform inner loop
  400d23:       add    $0x1,%rax
  400d27:       cmp    %eax,%ecx
  400d29:       ja     400d20
  400d2b:       cmp    $0x9c40,%esi
        // when we're done, put the final value back into obj->counter
  400d31:       mov    %rdx,(%rdi)
  400d34:       je     400d3e
  400d36:       add    $0x1,%esi
  400d39:       add    $0x1,%ecx
  400d3c:       jmp    400d10
  400d3e:       repz retq

It’s not hard to see why we’d expect this code to run much faster, for two main reasons:

  1. Since the tick dispatch was inlined, the compiler was free to see that all it does is a simple member increment. The member is kept in rdx for the duration of the loop, so the loop can simply bump a register instead of making a call on each iteration.
  2. As there’s no call involved, the inner loop is shorter.

Performance numbers

As expected, the CRTP approach is much faster. The benchmark above takes 1.25 seconds on my i7-4771 CPU for run_dynamic, and 0.21 seconds for run_crtp. This is a huge difference, and it’s much larger than I expected; I was looking for a 2x boost, not 6x [2]. So here comes the 4th bullet of the benchmarking methodology I outlined above. Let’s look more carefully at the numbers.

I’ll start with producing a trace of the inner loop for both cases, to see the sequence of instructions executed. Since the loop is short, this can easily be done by reading the disassembly, and also verified with gdb by stepping through the execution for a few iterations.
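
If you haven’t done this before, a gdb session along these lines does the job (names as in this article’s build):

$ gdb build/vcall-benchmark
(gdb) break run_dynamic
(gdb) run d
(gdb) x/10i $pc
(gdb) stepi

x/10i $pc shows the next few instructions at the current location, and stepi advances one machine instruction at a time, so the executed sequence can be read off directly.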

Here is the inner loop for run_dynamic:

400c40:     mov    0x0(%rbp),%rax
400c44:     mov    %rbx,%rsi
400c47:     add    $0x1,%rbx
400c4b:     mov    %rbp,%rdi
400c4e:     callq  *(%rax) ... calls tick
    400cf0: add    %rsi,0x8(%rdi)
    400cf4: retq
400c50:     cmp    %ebx,%r12d
400c53:     ja     400c40

How many times would we expect it to run? The double loop has a simple summing pattern, so the total is 0 + 1 + ... + (N-1) = N(N-1)/2, roughly N^2/2; for N = 40000 that’s in the vicinity of 800e6 (800 million times).

Since the loop above is 9 instructions long, that means about 7.2e9 instructions in total. Let’s look at detailed perf stat numbers for this run:

Performance counter stats for 'build/vcall-benchmark d':

      1253.807247 task-clock                #    0.999 CPUs utilized
              107 context-switches          #    0.085 K/sec
                0 cpu-migrations            #    0.000 K/sec
              318 page-faults               #    0.254 K/sec
    4,807,848,980 cycles                    #    3.835 GHz
  <not supported> stalled-cycles-frontend
  <not supported> stalled-cycles-backend
    7,203,771,146 instructions              #    1.50  insns per cycle
    2,400,716,784 branches                  # 1914.742 M/sec
           58,358 branch-misses             #    0.00% of all branches

      1.255560284 seconds time elapsed

Indeed, the number of instructions fits our expectation.

Now, let’s turn to run_crtp. Its inner loop is this:

400d20:     add    %rax,%rdx
400d23:     add    $0x1,%rax
400d27:     cmp    %eax,%ecx
400d29:     ja     400d20

So only 4 instructions. In other words, we’d expect the total number of instructions executed to be in the vicinity of 3.2e9. Let’s see:

Performance counter stats for 'build/vcall-benchmark c':

       215.919352 task-clock                #    0.997 CPUs utilized
               18 context-switches          #    0.083 K/sec
                0 cpu-migrations            #    0.000 K/sec
              318 page-faults               #    0.001 M/sec
      809,355,502 cycles                    #    3.748 GHz
  <not supported> stalled-cycles-frontend
  <not supported> stalled-cycles-backend
    3,202,645,106 instructions              #    3.96  insns per cycle
      800,522,521 branches                  # 3707.507 M/sec
           53,684 branch-misses             #    0.01% of all branches

      0.216596060 seconds time elapsed

Bingo!

But wait, a 2.25x difference in the number of instructions shouldn’t translate to a 6x difference in runtime, right? Note the number of branches, though. While the CRTP run has one branch per inner loop iteration, the numbers for the dynamic run show 3 branches per inner loop iteration (for a total of 2.4e9). What gives?

The CPU considers indirect calls and returns as branches for this purpose, and if you think about it, this makes sense. An indirect branch or return transfers control to a location the CPU cannot determine statically (unlike a direct call, for instance) – it depends on the contents of registers and the stack. So the CPU doesn’t know where to fetch instructions ahead of time in order to satisfy its eternally hungry super-pipeline. True, the branch predictor alleviates most of that cost, but such instructions are still more expensive for the CPU than, say, simple adds, because they cannot pump through the pipeline as quickly.

Moreover, the call and ret instructions push and pop data on the stack, which resides in memory. The stack top is almost certainly in L1 cache, but that’s still more expensive to access than registers.

Variation: -O3 compilation

Vigilant readers might have noticed that I did not set the highest optimization level of gcc for this benchmark. This was done on purpose, to make the results simpler to explain.

When compiled with -O3, the dynamic version runs as before (and the code produced for it is the same), but the CRTP version runs even faster and finishes within 0.17 seconds, which is 7.2x faster than the dynamic version.

The extra boost comes from auto-vectorization. Looking at the code the compiler produced for run_crtp, one can see SIMD instructions there: the inner loop was unrolled 4x and the operations are performed on whole quad-words, combining several inner-loop iterations at a time.

So this is an example where previous optimizations (inlining) enabled the compiler to apply even more advanced optimizations such as vectorization to make the code faster yet.

Variation: disabling inlining

It’s also interesting to build the benchmark with -fno-inline and compare the results. Curiously, in this case the CRTP approach runs 1.5x slower than virtual calls. Before you read on, can you guess why?

The reason is quite simple. Note that for proper CRTP, the interface class implements the interface methods and calls through to the implementation. So to actually invoke tick, run_crtp calls:

  • CRTPInterface<CRTPImplementation>::tick, which calls
  • CRTPInterface<CRTPImplementation>::impl, and then calls
  • CRTPImplementation::tick

This is a lot of calls, which all have to be executed when the inliner is turned off. When it’s turned on, all of these calls get inlined and the actual instructions of the leaf call are embedded into run_crtp.
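
Incidentally, if you want to experiment with this without disabling inlining for the whole build, GCC’s noinline attribute can pin down individual links in the chain. A sketch (NoinlineCRTPInterface is a made-up name for this experiment):

#include <cstdint>

// GCC-specific sketch: keep just the forwarding call out-of-line, to
// selectively simulate what -fno-inline does to the whole chain.
template <typename Implementation>
class NoinlineCRTPInterface {
public:
  __attribute__((noinline)) void tick(uint64_t n) {
    static_cast<Implementation*>(this)->tick(n);
  }
};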

There are two lessons here:

  1. Be careful with benchmarking, but you already knew that ;-)
  2. When implementing inlining in a compiler, it’s super important to make the inliner iterative – doing multiple passes on the code and discovering new inlining opportunities in each iteration.

Devirtualization

A brand new optimization that I recently heard about is devirtualization. The idea is to find cases of dynamic dispatch where the actual type at a given call site can be proven to be known at compile time, and specialize those call sites to dispatch statically. This carries the promise of making virtual calls as fast as static dispatch in some special cases.

While this definitely sounds interesting, at the time of writing this article devirtualization is still experimental (support in gcc started trickling in with version 4.7). In any case, the example examined in this article is probably simple enough to trigger the optimization, but as you can see it didn’t happen, even though the -fdevirtualize flag should be on in gcc at optimization levels -O2 and -O3. It will be interesting to follow the development of this optimization and see what cases of virtual calls it can detect and optimize in the future.
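
To illustrate the kind of call site devirtualization is after, consider a function where the dynamic type is provable at compile time. This is only a sketch of the concept, not a claim about what gcc 4.8 actually manages to optimize:

void local_dispatch() {
  DynamicImplementation d;
  DynamicInterface* obj = &d;
  // The compiler can prove that obj points at a DynamicImplementation here,
  // so this call is a candidate for static dispatch (and then inlining).
  obj->tick(1);
}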

Conclusions

There are a lot of lessons to be learned here, so I’ll just list them in an arbitrary order:

  • Benchmarking is an art – if everything is too easy, you’re either doing something trivial or wrong. Always cross-verify your assumptions and results with hard data like disassembly listings and detailed performance numbers.
  • Beware of different compilers and different targets. The above discusses gcc 4.8 for x86-64. Elsewhere, you may expect slightly or considerably different results. Ah, if only programming were easy. But then I guess programmers wouldn’t get paid a lot for clicking in front of computers all day.
  • Compiler optimizations are, by definition, a multi-layered affair. Each is simple but they enable each other. Inlining enables some additional optimizations (such as moving hot code out of inner loops). Other optimizations may enable inlining (by making the leaf methods smaller).
  • CRTP, when implemented correctly, is recognized by the compiler as static dispatch and optimized accordingly.
  • CRTP can thus be significantly more efficient than virtual calls, mostly due to inlining. This also means that inlining is crucial to its performance (as it is to many performance features of C++).


[1] This is a degenerate use of CRTP, for sure. It’s not here to be realistic – just to demonstrate the same mechanism used in a simple scenario. See the previous article for a more use-focused discussion of CRTP.
[2] These numbers depend on the CPU, of course. When I tried the same benchmark on a Xeon E5-2690 (Sandy Bridge) with gcc 4.6.3 (same code generated) the speed difference is just 3x (0.46 vs 1.39 sec).

18 Responses to “The cost of dynamic (virtual calls) vs. static (CRTP) dispatch in C++”

  1. Anon Says:

    Here is what a recent gcc 4.9 snapshot produces for the dynamic version at -O2, filtered through c++filt (and in Intel syntax because I loathe AT&T):

    run_dynamic(DynamicInterface*):
            push    r13
            push    r12
            mov     r13d, 1
            push    rbp
            push    rbx
            mov     rbp, rdi
            xor     r12d, r12d
            sub     rsp, 8
            .p2align 4,,10
            .p2align 3
    .L3:
            test    r12d, r12d
            je      .L4
            xor     ebx, ebx
            jmp     .L8
            .p2align 4,,10
            .p2align 3
    .L15:
            add     QWORD PTR [rbp+8], rbx
            add     rbx, 1
            cmp     r12d, ebx
            jbe     .L14
    .L8:
            mov     rax, QWORD PTR [rbp+0]
            mov     rax, QWORD PTR [rax]
            cmp     rax, OFFSET FLAT:DynamicImplementation::tick(unsigned long)
            je      .L15
            mov     rsi, rbx
            add     rbx, 1
            mov     rdi, rbp
            call    rax
            cmp     r12d, ebx
            ja      .L8
    .L14:
            cmp     r13d, 40000
            je      .L16
    .L4:
            add     r13d, 1
            add     r12d, 1
            jmp     .L3
    .L16:
            add     rsp, 8
            pop     rbx
            pop     rbp
            pop     r12
            pop     r13
            ret

    As you can see there’s a neat approach to devirtualization going on here — it includes an inlined copy of DynamicImplementation::tick(), and a check to see if the vtable slot is in fact pointing to that implementation. If it is, the call is skipped and the inlined version runs, otherwise the indirect call goes ahead as usual.

  2. eliben Says:

    @Anon,

    That’s very interesting, thanks for sharing! I wonder what the performance is (comparing the pointer on each iteration also has a cost, although this should be faster than always calling through the vptr).

    The case presented here is very simple (it’s trivial for the compiler to infer the actual runtime type) and realistic cases will be more difficult (and spread over multiple compilation units and perhaps even shared objects); but I do expect compilers will keep getting better with time.

  3. Anon Says:

    On a Sandy Bridge machine with gcc 4.8 -O2, I see about a 3.25x difference, just as you reported. With gcc 4.9 they are dead even. Actually the dynamic version even comes out a smidge faster for some reason. The margin is tiny, like half a percent, but it’s repeatable, although it could be just gremlins. (I would have to write a more in-depth test harness that runs each test multiple times before drawing any conclusions, though.) And the unrolling and vectorization at -O3 helps both versions equally as well.

    So this speaks well of the upcoming 4.9 release. The gcc trunk recently moved to stage 3 which means a feature freeze in preparation for a stabilization push. If all goes as usual, we should see a release around late March.

  4. Kevin Butler Says:

    “If anything doesn’t feel right, or just to make (3) more careful, use low-level counters to make sure that the amount of instructions executed and other such details makes sense given (2).”

    This is explicit support for confirmation bias.

    See Feynman’s discussion of measuring the charge of the electron in Cargo Cult Science:

    “Why didn’t they discover the new number was higher right away? It’s a thing that scientists are ashamed of—this history—because it’s apparent that people did things like this: When they got a number that was too high above Millikan’s, they thought something must be wrong—and they would look for and find a reason why something might be wrong. When they got a number close to Millikan’s value they didn’t look so hard. And so they eliminated the numbers that were too far off, and did other things like that…”

    http://neurotheory.columbia.edu/~ken/cargo_cult.html

  5. Siva Says:

    Great post

  6. Frederik Aalund Says:

    That was a great read. Cheers for sharing!

    Regarding the recent de-virtualization optimizations found in GCC 4.9…

    Quote from eliben:
    “I wonder what the performance is (comparing the pointer on each iteration also has a cost, although this should be faster than always calling through the vptr).”

    Quote from Anon:
    “With gcc 4.9 they are dead even. Actually the dynamic version even comes out a smidge faster for some reason”

    This was very surprising to me. I always assumed that having a branch in the innermost body of a loop leads to horrible performance. I guess the branch prediction unit really has a field day because the vptr always compares equal to %rax. Still, I don’t see how this can lead to better performance than the corresponding CRTP version. Can somebody provide some insight?
    I guess this is a special case because there is only one implementation of the virtual function, and hence only one result for the comparison. If the loop was over a container of different polymorphic objects, then the branch prediction unit would have a much harder time. Then again, CRTP would rarely apply to such a case, so a comparison is contrived at best. I guess it doesn’t hurt to use CRTP. Especially if you find yourself in one of the cases where it actually does lead to better performance… Or what? Are the recent de-virtualization optimizations deprecating the use of CRTP?

  7. Tom Leys Says:

    In your list of reasons virtual functions are slower, you’ve missed reason number 1.

    Like function pointers, virtual functions require the CPU to vector to an unpredictable RAM address specified by a memory location (the vtable). Because this address cannot be predicted, it causes the CPU’s pipeline to be cleared every time.

    Reason number 2, which you don’t seem to mention either, is that code, like data, has to be loaded into memory and can suffer cache misses too.

    To help counter this, the CPU has a table of code locations that vector to such function pointers and for each one stores the previous location that was hit. This helps somewhat when the same method call is always handled by the same overload.

    On the CPUs I’ve looked at this table holds 16 entries, probably slightly more now but still not many.

    Because your synthetic test always calls the same function and doesn’t call ANY OTHER virtual functions, this table allows the CPU to keep the relevant code in cache, and the prediction is 100% correct. If you were to do a comparison with enough different virtual functions and virtual objects that the CPU could not predict the target (think any game with virtual methods), you would do far worse in your benchmark.

  8. eliben Says:

    @Tom Leys,

    I still posit that reason number one is lack of inlining.

  9. Tony Delroy Says:

    You’re not correcting for the loop overheads, so you’re talking in ratios of (overhead+inline)/(overhead+virtual dispatch). If you manually edit NOP operations over the incrementing code, you’d be able to measure the overhead of looping independently of the operation the loop performs, then subtract that from both benchmarks to get the time for an inline increment versus the time for an indirect call to an out-of-line increment. Recalculate the ratio – you’ll see virtual dispatch is massively more than 6 or 7 times slower. The real-world implications of that result are tricky, though: given that an increment takes a cycle, that’s the upper limit for a real-world performance ratio, but it needs to be considered with knowledge of the inevitable overheads.

    Separately, it’s worth pointing out briefly that sometimes custom implementation functions return constants, and with CRTP inlining the constants can allow dead-code elimination, static array sizing and other optimisations that aren’t possible when using run-time polymorphism.

  10. Samuel Williams Says:

    Hi – nice article. I did a comparison of C, C++ and Objective-C because I was interested in how much slower Objective-C might be: http://www.codeotaku.com/journal/2009-05/run-time-polymorphism/index – you might find the results interesting.

  11. Nils Says:

    I was wondering why you were using gcc and not clang to do the tests. Anyway, I did it with clang on my MacBook Air (Haswell CPU).

    With O1:
    Execution time dynamic dispatch: 1.585649
    Execution time crtp: 4.402200

    With O2:
    Execution time dynamic dispatch: 1.625478
    Execution time crtp: 0.000000

    Looks like it doesn’t do anything anymore with O2 in the CRTP version.

  12. Maxim Yegorushkin Says:

    By default, the OS may modulate the CPU frequency on demand. I wonder if the CPU frequency was locked during this benchmark?

  13. Elazar Leibovich Says:

    The Java JIT has been doing de-virtualization for years – nothing too new in this area.

  14. Rob Grainger Says:

    The language Self introduced many of the techniques used by modern JIT compilers (some of the team went on to work on the HotSpot VM, amongst others) getting a prototype-based fully dynamic language to 1/2 the speed of native C code. The advantage dynamic languages have here is that they can take advantage of “hotspot” analysis to identify portions of code to optimise and use caches (of most recently called versions) and inlining to aggressively optimise based on known behaviour.

    As Samuel Williams illustrates above, Objective-C manages to pick a sour spot here – using a static language combined with hash-based lookup means that the calls cannot be JIT-optimised, and have to assume worst-case performance. A truly awful design decision.

  15. Ron Jones Says:

    Nils: I’m fairly curious as to what’s happening with Clang there. Is it able to actually determine that getvalue() is never called in the CRTP case and that it needn’t even execute tick() and the loop itself? If so: great optimizer!

  16. Nils Says:

    @Ron I am not familiar with LLVM code, but it seems that it eliminated the calls to tick when I compile with -O2 or -O3. Pasted it here: http://codepad.org/nAeQjVQ3

  17. Peter Says:

    I don’t see the point in comparing static and dynamic dispatch, because they resolve different things. Dynamic dispatch provides polymorphism, and that has its overheads.

  18. Johan Boulé Says:

    Do you have any idea why the compiler is not retrieving the virtual function address just once, before the loops? Why would the vtable ever change during the loop?
