Xeon E5 v3 Haswell-EP Performance — Linpack

The new Intel Xeon E5 v3 Haswell-EP processors are here and they are fantastic! Lots of cores, AVX2 (SIMD plus FMA3) operations, lots of PCIe lanes, DDR4 memory support… nice!

I’ve been anxious for the E5 v3 Haswell processors to come out since my first testing on the desktop Core i7 and E3 v3 Haswell processors. I was really impressed with the numerical performance potential of those processors, but they are limited to 16 PCIe lanes, 32GB of system memory, and only 4 cores. The E5 v3 Haswell-EP removes all of those drawbacks. (The new Haswell-E desktop processors remove these drawbacks too!) These are really great processors!

In this post we’ll look at my favorite parallel numerical performance benchmark, Linpack. The Intel optimized Linpack benchmark using the MKL numeric libraries gives near theoretical peak double precision performance on Intel hardware. It’s highly tuned to take advantage of all of the features of the processors. This makes it a bit artificial as an indicator of “real world” application performance, but it clearly shows off the capabilities of the processors and gives developers something to aspire to 🙂

The processor feature that has the most impact on numerical performance on Haswell is the AVX2 instruction set. The SIMD vector length is the same as for Ivy Bridge, i.e. 256-bit, but there is a little bit of new secret sauce on Haswell from the FMA3 instructions (that’s a 3 operand Fused Multiply Add that executes in a single clock tick). This has the potential to nearly double floating point performance for multiply-add operations, which are the most common operations in numerical matrix calculations.
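To make the "nearly double" claim a bit more concrete, here is a minimal sketch of the multiply-add pattern at the heart of a dot product (and so of matrix multiplication). It is plain Python, so nothing here actually issues an FMA instruction; the function name and the instruction counts are illustrative assumptions that just tally how many arithmetic instructions the loop would need with and without a fused multiply-add.

```python
def dot_with_op_counts(x, y):
    """Dot product plus a tally of the arithmetic it implies.

    Hypothetical helper for illustration only: the counts assume one
    instruction per multiply and per add without FMA, and one fused
    instruction per element with FMA.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")

    acc = 0.0
    for a, b in zip(x, y):
        # One multiply and one add per element: the pattern FMA3 fuses.
        acc = a * b + acc

    n = len(x)
    counts = {
        "flops": 2 * n,                     # n multiplies + n adds
        "instructions_without_fma": 2 * n,  # separate multiply and add
        "instructions_with_fma": n,         # one fused multiply-add each
    }
    return acc, counts


value, counts = dot_with_op_counts([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(value, counts)
```

The same number of floating point operations gets done either way; the fused version just needs half as many instructions, which is where the factor of 2 in the peak estimate comes from.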

Theoretical Peak

A good approximation of theoretical peak (in GFLOPS) for Ivy Bridge and Haswell looks like this:

 CPU GHz * number of cores * SIMD vector ops (AVX) * special instructions effect (FMA3)

For the dual Xeon E5-2687W v3 @ 3.10GHz system (2 x 10 cores), theoretical peak would be

 3.1 * 20 * 8 * 2 = 992 GFLOPS

What did I get?

788 GFLOPS, approx. 80% of theoretical peak
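The arithmetic above is easy to check in a few lines of Python. This is only a sketch of the approximation given in this post; the function and parameter names are made up for illustration, and the defaults (8 SIMD ops, factor of 2 for FMA3) are simply the values used in the calculation above.

```python
def theoretical_peak_gflops(ghz, cores, simd_ops=8, fma_factor=2):
    """Approximate peak: GHz * cores * SIMD vector ops * FMA3 effect."""
    return ghz * cores * simd_ops * fma_factor


# Dual Xeon E5-2687W v3: 2 x 10 cores at 3.1 GHz
peak = theoretical_peak_gflops(3.1, 20)
measured = 788.0  # Linpack result reported above, in GFLOPS
efficiency = measured / peak

print(f"peak = {peak:.0f} GFLOPS, efficiency = {efficiency:.1%}")
# prints: peak = 992 GFLOPS, efficiency = 79.4%
```

For a processor without FMA3 (Ivy Bridge, for example) the same helper would be called with `fma_factor=1`.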

That is an incredible amount of compute capability for a “standard” dual CPU machine! I would like to see a number closer to theoretical peak for Linpack, but I’m not complaining; it’s really very good. The chart and table below have Linpack performance for various systems I’ve tested over the past year or so. The compiler version used, OS, etc. is not the same for every result, but it’s still a good general comparison. I’ll keep expanding this with new CPUs and hopefully clean it up a bit, adding job run notes for each entry. For now just enjoy the numbers! (Notice that I put a Xeon Phi number in there too 🙂)

New E5 v3 Test System

The test system was a Puget Peak Dual Xeon Tower.

Note: there is a typo in the top line of the chart! E5 2697v3 should be E5 2687v3.

Linpack benchmark using the Intel MKL optimizations


Processor                   Brief Spec                      Linpack (GFLOPS)
Dual Xeon E5 2687v3         20 cores @ 3.1GHz AVX2          788
Xeon Phi 3120A              57 cores @ 1.1GHz 512-bit SIMD  710
Quad Xeon E5 4624Lv2        40 cores @ 1.9GHz AVX           581
Dual Xeon 2695v2            24 cores @ 2.4GHz AVX           441
Core i7 5960X (Haswell E)   8 cores @ 3.0GHz AVX2           354
Dual Xeon E5 2687W          16 cores @ 3.2GHz AVX           345
Core i7 5930K (Haswell E)   6 cores @ 3.5GHz AVX2           289
Dual Xeon E5 2650           16 cores @ 2.0GHz AVX           262
Core i7 4770K (Haswell)     4 cores @ 3.5GHz AVX2           182
Xeon E3 1245v3 (Haswell)    4 cores @ 3.4GHz AVX2           170
Core i7 4960X (Ivy Bridge)  6 cores @ 3.6GHz AVX            165
Core i5 3570 (Ivy Bridge)   4 cores @ 3.4GHz AVX            105
Core i7 920                 4 cores @ 2.66GHz SSE4.2        40
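
If you want to compare the table entries against the theoretical peak approximation from earlier in the post, a rough sketch like the following would do it. Several things here are assumptions rather than anything measured: rows labeled AVX2 are treated as having the FMA3 factor of 2, rows labeled AVX are treated like Ivy Bridge (factor of 1), and the clock is the nominal one listed in the table. The Xeon Phi and Core i7 920 rows are left out because the approximation was only stated for Ivy Bridge and Haswell. The names `RESULTS`, `nominal_peak_gflops` and `efficiency_table` are made up for this example.

```python
# (name, cores, nominal GHz, has FMA3 (AVX2)?, measured GFLOPS) from the table
RESULTS = [
    ("Dual Xeon E5 2687v3", 20, 3.1, True, 788),
    ("Quad Xeon E5 4624Lv2", 40, 1.9, False, 581),
    ("Dual Xeon 2695v2", 24, 2.4, False, 441),
    ("Core i7 5960X", 8, 3.0, True, 354),
    ("Dual Xeon E5 2687W", 16, 3.2, False, 345),
    ("Core i7 5930K", 6, 3.5, True, 289),
    ("Dual Xeon E5 2650", 16, 2.0, False, 262),
    ("Core i7 4770K", 4, 3.5, True, 182),
    ("Xeon E3 1245v3", 4, 3.4, True, 170),
    ("Core i7 4960X", 6, 3.6, False, 165),
    ("Core i5 3570", 4, 3.4, False, 105),
]


def nominal_peak_gflops(cores, ghz, has_fma):
    """GHz * cores * 8 SIMD ops * (2 with FMA3, 1 without)."""
    return ghz * cores * 8 * (2 if has_fma else 1)


def efficiency_table(results):
    """Return (name, nominal peak, measured / nominal peak) for each row."""
    rows = []
    for name, cores, ghz, has_fma, measured in results:
        peak = nominal_peak_gflops(cores, ghz, has_fma)
        rows.append((name, peak, measured / peak))
    return rows


for name, peak, ratio in efficiency_table(RESULTS):
    print(f"{name:<22} peak {peak:7.1f}  measured/peak {ratio:.2f}")
```

Treat the ratios as ballpark figures only. The clock a processor actually sustains during a Linpack run (turbo, AVX clocks) is usually not the nominal clock in the table, so a ratio can land surprisingly high or low without anything being wrong with the measurement.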

Happy computing! –dbk