Dr Donald Kinghorn (HPC and Scientific Computing)

Xeon E5 v3 Haswell-EP Performance -- Linpack

Written on September 8, 2014 by Dr Donald Kinghorn

The new Intel Xeon E5 v3 Haswell-EP processors are here and they are fantastic! Lots of cores, AVX2 (SIMD plus FMA3) operations, lots of PCIe lanes, DDR4 memory support… nice!

I’ve been anxious for the E5 v3 Haswell processors to come out since my first testing on the desktop Core i7 and E3 v3 Haswell processors. I was really impressed with the numerical performance potential of those processors, but they are limited to 16 PCIe lanes, 32GB of system memory, and only 4 cores. The E5 v3 Haswell-EP removes all of those drawbacks. (The new Haswell-E desktop processors remove these drawbacks too!) These are really great processors!

In this post we’ll look at my favorite parallel numerical performance benchmark, Linpack. The Intel optimized Linpack benchmark using the MKL numeric libraries gives near theoretical peak double precision performance on Intel hardware. It’s highly tuned to take advantage of all of the features of the processors. This makes it a bit artificial as an indicator of “real world” application performance, but it clearly shows off the capabilities of the processors and gives developers something to aspire to :-)

The processor feature that has the most impact on numerical performance on Haswell is the AVX2 instruction set. The SIMD vector length is the same as for Ivy Bridge, i.e. 256-bit, but there is a little bit of new secret sauce on Haswell from the FMA3 instructions (that’s a 3 operand Fused Multiply Add that executes in a single clock tick). This has the potential to nearly double floating point performance for this type of operation, and this is the most common operation in numerical matrix calculations.

Theoretical Peak

A good approximation of theoretical peak (in double precision GFLOPS) for Ivy Bridge and Haswell looks like this:

 CPU GHz * number of cores * SIMD vector ops (AVX) * special instructions effect (FMA3)

For the dual Xeon E5-2687W v3 @ 3.10GHz system (2 CPUs with 10 cores each), theoretical peak would be

 3.1 * 20 * 8 * 2 = 992 GFLOPS

What did I get?

788 GFLOPS, approx. 80% of theoretical peak
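The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the approximation used in this post, not a tool that was used for the testing; the function and parameter names (`peak_gflops`, `simd_flops_per_cycle`, `fma_factor`, `efficiency`) are made up for the example.

```python
def peak_gflops(ghz, cores, simd_flops_per_cycle=8, fma_factor=2):
    """Approximate theoretical peak double precision GFLOPS.

    Mirrors the approximation in the text:
    CPU GHz * number of cores * SIMD vector ops (AVX) * FMA3 effect.
    The defaults (8 and 2) are the Haswell values used above.
    """
    return ghz * cores * simd_flops_per_cycle * fma_factor


def efficiency(measured_gflops, peak):
    """Fraction of theoretical peak that the benchmark achieved."""
    return measured_gflops / peak


# Dual Xeon E5-2687W v3: 2 CPUs x 10 cores at 3.1 GHz
peak = peak_gflops(3.1, 20)
print(f"peak       = {peak:.0f} GFLOPS")              # peak       = 992 GFLOPS
print(f"efficiency = {efficiency(788, peak):.1%}")    # efficiency = 79.4%
```

For a processor without FMA3, such as Ivy Bridge, the same sketch applies with `fma_factor=1`.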

That is an incredible amount of compute capability for a “standard” dual CPU machine! I would like to see a number closer to theoretical peak for Linpack, but I’m not complaining; it’s really very good. The chart and table below have Linpack performance for various systems I’ve tested over the past year or so. The compiler version, OS, etc. are not the same for every result, but it’s still a good general comparison. I’ll keep expanding this with new CPUs and hopefully clean it up a bit by adding job run notes for each entry. For now just enjoy the numbers! (Notice that I put a Xeon Phi number in there too :-)

New E5 v3 Test System

The test system was a Puget Peak Dual Xeon Tower;

Note: typo in top line of chart!  E5 2697v3 should be E5 2687v3

Linpack benchmark using the Intel MKL optimizations

Processor                  | Brief Spec                     | Linpack (GFLOPS)
Dual Xeon E5 2687v3        | 20 cores @ 3.1GHz AVX2         | 788
Xeon Phi 3120A             | 57 cores @ 1.1GHz 512-bit SIMD | 710
Quad Xeon E5 4624Lv2       | 40 cores @ 1.9GHz AVX          | 581
Dual Xeon 2695v2           | 24 cores @ 2.4GHz AVX          | 441
Core i7 5960X (Haswell E)  | 8 cores @ 3.0GHz AVX2          | 354
Dual Xeon E5 2687W         | 16 cores @ 3.2GHz AVX          | 345
Core i7 5930K (Haswell E)  | 6 cores @ 3.5GHz AVX2          | 289
Dual Xeon E5 2650          | 16 cores @ 2.0GHz AVX          | 262
Core i7 4770K (Haswell)    | 4 cores @ 3.5GHz AVX2          | 182
Xeon E3 1245v3 (Haswell)   | 4 cores @ 3.4GHz AVX2          | 170
Core i7 4960X (Ivy Bridge) | 6 cores @ 3.6GHz AVX           | 165
Core i5 3570 (Ivy Bridge)  | 4 cores @ 3.4GHz AVX           | 105
Core i7 920                | 4 cores @ 2.66GHz SSE4.2       | 40
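
One way to read the table is to divide each Linpack result by (cores * GHz), which gives the achieved floating point operations per core per nominal clock cycle. The Python sketch below does that for a few rows copied from the table. The helper name `flops_per_core_cycle` is made up for this example, and the clock speeds are the nominal ones listed in the table, so any Turbo Boost effect is ignored.

```python
# (processor, cores, nominal GHz, measured Linpack GFLOPS), copied from the table
results = [
    ("Dual Xeon E5 2687v3 (AVX2)", 20, 3.1, 788),
    ("Xeon Phi 3120A (512-bit SIMD)", 57, 1.1, 710),
    ("Dual Xeon E5 2687W (AVX)", 16, 3.2, 345),
    ("Dual Xeon E5 2650 (AVX)", 16, 2.0, 262),
    ("Core i7 4770K (AVX2)", 4, 3.5, 182),
    ("Core i7 4960X (AVX)", 6, 3.6, 165),
    ("Core i7 920 (SSE4.2)", 4, 2.66, 40),
]


def flops_per_core_cycle(cores, ghz, gflops):
    """Measured GFLOPS divided by (cores * GHz).

    The result is FLOPs per core per nominal clock cycle, a rough
    measure of how much work each core gets done per tick.
    """
    return gflops / (cores * ghz)


for name, cores, ghz, gflops in results:
    print(f"{name:32s} {flops_per_core_cycle(cores, ghz, gflops):5.2f}")
```

With these numbers the AVX2 parts land at roughly 13 operations per core per cycle, the AVX parts at roughly 7 to 8, and the SSE4.2 Core i7 920 at a bit under 4. A value slightly above 8 for an AVX part (the Dual Xeon E5 2650 row) most likely means the cores were running above their nominal clock during the run, though that was not measured here.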

Happy computing! --dbk

Tags: Haswell EP, Linpack, HPC, benchmark
Georg Boman

You have some great information here, Donald! There is not much information on the net about the real world performance of the new E5 v3 processors. I'm currently doing some research on which CPU is optimal for a CFD cluster running Fluent, CFX, and StarCCM+, and this page was a good start.

Let me point out one error in the above list, though: the Dual E5-2697v3 configuration is 28 cores @ 2.6 GHz AVX2, not 20 cores @ 3.1 GHz.

I'm looking for information about the CPU behaviour when fully loaded. For the processor to maintain levels below TDP it will lower frequency below base frequency if necessary, in the case of 2697v3 down to 2.2 GHz. Understanding how this works in a real world scenario with different solvers would be great. If anyone here has some information on this it would be greatly appreciated.

Georg Boman

Posted on 2014-10-26 11:02:53

The Processor's Spec. Page ( http://ark.intel.com/products/... ) provides a bit of info about "Intel® Turbo Boost Technology" and "Thermal Monitoring Technologies".

Perhaps this is helpful: http://www.intel.com/content/w...

If you like that Chip Number Series have you considered the E5 2687W v3 ( http://ark.intel.com/products/... ), fewer Cores but a higher Base Freq. (and TDP); which means that when you get kicked you still run at a higher speed.

It also depends on what you mean by "optimal", TCO, TDP, or MHz, the presence (need) of ECC memory and 'Enterprise Features' (expensive Features) vPro and TXT, etc.

Would a Cluster of i7s on Mini-ITX MBs work for you? It would reduce the cost of your Research.

Posted on 2014-11-09 20:43:48
Tristan Leflier

The problem is this.

FMA3 can be used in only a small fraction of all multiplies and adds in a real program (like CFD). I'm just testing Haswells... not a big difference on those jobs. I get.. let's see .... a very small 209:197 speedup avx2:avx (20 threads on 10-core E5-2660, dual CPU, devel. on a trial node of the SciNet supercomputer in Toronto). So I wouldn't praise the new instructions too much. Linpack is really, really simple operations like matrix mult. In the real world we don't rotate matrices.
I mean, some people do, but I for one live in a nonlinear world (of compressible explicit

We do things like a = a+4*b[i]*c[j]/d[k] +e[j]/2- 1.5*f[i-1,j]*f[i+1,j], or so. In that example you can squeeze one or two FMA3 but the rest are normal additions or multiplications. I would have expected some speedup of index calculation in multi-d arrays but as you see... no real speedup,
only +6% on the whole 3D code with large arrays exceeding 2GB.

Posted on 2014-11-07 06:13:22
Tristan Leflier

I am struggling to squeeze the expected performance from dual Haswells, as opposed to a single Haswell, but I haven't finished my tests yet so I won't prematurely complain. The first trials weren't showing enough speedup from the dual setup. Playing with the env. variable KMP_AFFINITY now... people from SciNet pointed out the affinity issue to me, and I'm very grateful :-)

Posted on 2014-11-07 06:19:50
Tristan Leflier

One more comment: notice the disappointing performance of the flagship supercomputer part Xeon Phi 3120A, as compared with the nominally weaker processors (57 cores at 1.1GHz and 512bit avx (should be great!) - vs. 20 core dual Xeon setup at 3.1 with only 256 bit simd)
So again - something doesn't add up. The Phi should use its 512 (if that's correct) bits to crush the Xeon, yet the CPU wins. Actually, I'm a GPU guy too, and the newest Xeons match a GTX Titan,
while being more versatile in programming. CPUs don't require simple tasks repeated on massive amount of data, for instance; can use my old Fortran codes easily - for high level language programming; CUDA C, and C in general, is a glorified assembler.

Physicists have recently published their benchmarks in an HPC journal (HOC Magazine? don't remember) a few days ago. Ivy Bridge or even Sandy Bridge against the slightly older Xeon Phi.
Same conclusion, CPU beats Xeon Phi.

Which is tragic, because I bought two Phi's and they're not that useful, I think they perform like an O/C'd i7-5960X :-|

Posted on 2014-11-07 06:29:34

Well, Xeon Phi performance for generic massively multithreaded algorithms/applications is far worse than what a dual Haswell CPU system can provide; it is not by chance that Intel does not publish SPECint2006Rate results for Xeon Phi... A dual E5-2699 v3 scores around 1400, while the Phi probably would not get above 700 (likely it's far lower). The SPECfp2006Rate benchmark may show a less devastating result, yet it is likely still well below what dual Haswell can provide. However, SIMD-style computation does allow for cases where the Phi looks not so bad and even competitive; SIMD computations are suited mostly to brute-force number crunching algorithms and not well suited to adaptive algorithms, which may change the code path of each thread based on intermediate state specific to that particular thread (adaptive ray tracing and adaptive volume ray casting are the best known examples).

Posted on 2014-11-23 22:54:48

I work with a Sandy Bridge, but I'm having some trouble with your result of 262 GFLOP/s. If my calculations are right, the maximum performance is 256 for double precision, which is the one used by LINPACK, according to their FAQ.

Posted on 2016-01-12 17:35:49