Xeon E5 v3 Haswell-EP Performance -- Linpack

Posted on September 8, 2014 by Dr. Donald Kinghorn | Last updated: September 8, 2014

The new Intel Xeon E5 v3 Haswell-EP processors are here and they are fantastic! Lots of cores, AVX2 (SIMD plus FMA3) operations, lots of PCIe lanes, DDR4 memory support… nice!

I’ve been anxious for the E5 v3 Haswell processors to come out since my first testing on the desktop Core i7 and E3 v3 Haswell processors. I was really impressed with the numerical performance potential of those processors, but they are limited to 16 PCIe lanes, 32GB of system memory, and only 4 cores. The E5 v3 Haswell-EP removes all of those drawbacks. (The new Haswell-E desktop processors remove these drawbacks too!) These are really great processors!

In this post we’ll look at my favorite parallel numerical performance benchmark, Linpack. The Intel optimized Linpack benchmark using the MKL numeric libraries gives near theoretical peak double precision performance on Intel hardware. It’s highly tuned to take advantage of all of the features of the processors. This makes it a bit artificial as an indicator of “real world” application performance, but it clearly shows off the capabilities of the processors and gives developers something to aspire to 🙂

The processor feature that has the most impact on numerical performance on Haswell is the AVX2 instruction set. The SIMD vector length is the same as for Ivy Bridge, i.e. 256-bit, but there is a little bit of new secret sauce on Haswell from the FMA3 instructions (that’s a 3 operand Fused Multiply Add that executes in a single clock tick). This has the potential to nearly double floating point performance for this type of operation, and this is the most common operation in numerical matrix calculations.

Theoretical Peak

A good approximation of theoretical peak for Ivy Bridge and Haswell looks like this:

 CPU GHz * number of cores * SIMD vector ops (AVX) * special instructions effect (FMA3)

For the dual Xeon E5-2687W v3 @ 3.10GHz system, theoretical peak would be

 3.1 * 20 * 8 * 2 = 992 GFLOPS
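
The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the formula as written in this post; the function name and its default arguments are my own choices for the example, not part of any benchmark tool.

```python
def theoretical_peak_gflops(ghz, cores, simd_ops=8, fma_factor=2):
    """Approximate double precision peak in GFLOPS.

    ghz        -- CPU clock in GHz
    cores      -- total cores across all sockets
    simd_ops   -- SIMD vector ops per clock with AVX (8 in the formula above)
    fma_factor -- 2 when FMA3 is available (Haswell), 1 otherwise
    """
    return ghz * cores * simd_ops * fma_factor


# Dual Xeon E5-2687W v3: 2 sockets x 10 cores at 3.1GHz
peak = theoretical_peak_gflops(3.1, 20)
print(f"{peak:.0f} GFLOPS")
```

Setting `fma_factor=1` gives the same estimate for a processor without FMA3, such as Ivy Bridge.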

What did I get?

788 GFLOPS, approximately 80% of theoretical peak

That is an incredible amount of compute capability for a “standard” dual CPU machine! I would like to see a number closer to theoretical peak for Linpack, but I’m not complaining; it’s really very good. The chart and table below have Linpack performance for various systems I’ve tested over the past year or so. The compiler version, OS, etc. are not the same for every result, but it’s still a good general comparison. I’ll keep expanding this with new CPUs and hopefully clean it up a bit by adding job run notes for each entry. For now just enjoy the numbers! (Notice that I put a Xeon Phi number in there too 🙂)

New E5 v3 Test System

The test system was a Puget Peak Dual Xeon Tower:

Puget Systems Peak Dual Xeon:
- 2 x Intel Xeon E5-2687W v3 @3.1GHz 10-core
- 64GB DDR4 2133MHz Reg ECC
- …
- CentOS 6.5
- Intel Parallel Studio XE 2015

Note: typo in top line of chart! E5 2697v3 should be E5 2687v3

Linpack benchmark using the Intel MKL optimizations


Processor	Brief Spec	Linpack (GFLOPS)
Dual Xeon E5 2687v3	20 cores @ 3.1GHz AVX2	788
Xeon Phi 3120A	57 cores @ 1.1GHz 512-bit SIMD	710
Quad Xeon E5 4624Lv2	40 cores @ 1.9GHz AVX	581
Dual Xeon 2695v2	24 cores @ 2.4GHz AVX	441
Core i7 5960X (Haswell E)	8 cores @ 3.0GHz AVX2	354
Dual Xeon E5 2687W	16 cores @ 3.2GHz AVX	345
Core i7 5930K (Haswell E)	6 cores @ 3.5GHz AVX2	289
Dual Xeon E5 2650	16 cores @ 2.0GHz AVX	262
Core i7 4770K (Haswell)	4 cores @ 3.5GHz AVX2	182
Xeon E3 1245v3 (Haswell)	4 cores @ 3.4GHz AVX2	170
Core i7 4960X (Ivy Bridge)	6 cores @ 3.6GHz AVX	165
Core i5 3570 (Ivy Bridge)	4 cores @ 3.4GHz AVX	105
Core i7 920	4 cores @ 2.66GHz SSE4.2	40
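
For anyone who wants to compare these results against the theoretical peak formula from earlier in the post, here is a rough sketch for a few rows of the table. It assumes 8 double precision flops per core per clock for AVX and 16 for AVX2 with FMA3 (the 8 * 2 from the formula), and it uses the clock speeds listed in the table, so turbo or AVX clock behavior is ignored. The names `FLOPS_PER_CYCLE`, `RESULTS`, and `efficiency` exist only for this example.

```python
# Assumed double precision flops per core per clock for each instruction set.
FLOPS_PER_CYCLE = {"AVX2": 16, "AVX": 8}

# (system, cores, GHz, instruction set, measured Linpack GFLOPS) from the table
RESULTS = [
    ("Dual Xeon E5 2687v3", 20, 3.1, "AVX2", 788),
    ("Core i7 5960X", 8, 3.0, "AVX2", 354),
    ("Core i7 4770K", 4, 3.5, "AVX2", 182),
    ("Core i7 4960X", 6, 3.6, "AVX", 165),
]


def efficiency(cores, ghz, isa, measured):
    """Return (theoretical peak in GFLOPS, measured / peak)."""
    peak = cores * ghz * FLOPS_PER_CYCLE[isa]
    return peak, measured / peak


for name, cores, ghz, isa, measured in RESULTS:
    peak, eff = efficiency(cores, ghz, isa, measured)
    print(f"{name:22s} peak {peak:6.1f} GFLOPS  measured {measured:4d}  {eff:.0%}")
```

Treat the percentages as a rough guide only, since the listed clock may not be the clock the processor actually ran at during the benchmark.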

Happy computing! –dbk

Tags: benchmark, Haswell EP, HPC, linpack