Read this article at https://www.pugetsystems.com/guides/595
Dr Donald Kinghorn (Scientific Computing Advisor )

Xeon E5 v3 Haswell-EP Performance -- Linpack

Written on September 8, 2014 by Dr Donald Kinghorn

The new Intel Xeon E5 v3 Haswell-EP processors are here and they are fantastic! Lots of cores, AVX2 (SIMD plus FMA3) operations, lots of PCIe lanes, DDR4 memory support… nice!

I’ve been eager for the E5 v3 Haswell processors to come out since my first testing on the desktop Core i7 and E3 v3 Haswell processors. I was really impressed with the numerical performance potential of those chips, but they are limited to 16 PCIe lanes, 32GB of system memory, and only 4 cores. The E5 v3 Haswell-EP removes all of those drawbacks. (The new Haswell-E desktop processors remove them too!) These are really great processors!

In this post we’ll look at my favorite parallel numerical performance benchmark, Linpack. The Intel-optimized Linpack benchmark using the MKL numerical libraries gives near theoretical peak double precision performance on Intel hardware. It’s highly tuned to take advantage of all of the features of the processors. This makes it a bit artificial as an indicator of “real world” application performance, but it clearly shows off the capabilities of the processors and gives developers something to aspire to :-)
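If you want a quick feel for this kind of number on your own machine, a double precision matrix-multiply timing is easy to script. This is just a sketch, not the actual Intel Linpack binary, and it assumes a numpy build linked against an optimized BLAS such as MKL:

```python
# Rough DGEMM-based GFLOPS estimate. With an MKL-linked numpy this lands
# in the same ballpark as an optimized BLAS run; it is NOT the official
# Intel Linpack benchmark.
import time
import numpy as np

def dgemm_gflops(n=4000, trials=3):
    a = np.random.rand(n, n)          # double precision by default
    b = np.random.rand(n, n)
    best = float("inf")
    for _ in range(trials):
        t0 = time.perf_counter()
        a @ b                         # n x n double precision matrix multiply
        best = min(best, time.perf_counter() - t0)
    flops = 2.0 * n**3                # n^3 multiplies + n^3 adds
    return flops / best / 1e9

print(f"DGEMM: {dgemm_gflops():.1f} GFLOPS")
```

Expect this to come in well under the tuned Linpack numbers below; it's only meant to show where the FLOP count comes from.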

The processor feature that has the most impact on numerical performance on Haswell is the AVX2 instruction set. The SIMD vector length is the same as for Ivy Bridge, i.e. 256-bit, but there is a little new secret sauce on Haswell from the FMA3 instructions (a 3-operand Fused Multiply-Add that executes in a single clock tick). This has the potential to nearly double floating point performance for this type of operation, which is the most common operation in numerical matrix calculations.

Theoretical Peak

A good approximation of theoretical peak for Ivy Bridge and Haswell looks like this:

 CPU GHz * number of cores * SIMD vector ops (AVX) * special instructions effect (FMA3)

For the dual Xeon E5-2687W v3 @ 3.10GHz system, theoretical peak would be

 3.1 * 20 * 8 * 2 = 992 GFLOPS

What did I get?

788 GFLOPS, approximately 80% of theoretical peak
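As a sanity check, that arithmetic can be scripted. This is a sketch using the factors from the text as written; the AVX factor of 8 (4 doubles per 256-bit register times two vector ports) is discussed further in the comments and should be treated as a rough guide:

```python
# Back-of-the-envelope theoretical peak: GHz * cores * SIMD ops * FMA factor.
# simd_ops=8 is the AVX factor used in the text; fma=2 is the FMA3 doubling.
def peak_gflops(ghz, cores, simd_ops=8, fma=2):
    return ghz * cores * simd_ops * fma

peak = peak_gflops(3.1, 20)               # dual Xeon E5-2687W v3, 20 cores
print(f"peak: {peak:.0f} GFLOPS")         # peak: 992 GFLOPS
print(f"efficiency: {788 / peak:.0%}")    # measured Linpack vs. peak
```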

That is an incredible amount of compute capability for a “standard” dual CPU machine! I would like to see a number closer to theoretical peak for Linpack but I’m not complaining, it’s really very good. The chart and table below have Linpack performance for various systems I’ve tested over the past year or so. The compiler version used, OS, etc. is not the same for every result, but it’s still a good general comparison. I’ll keep expanding this with new CPUs and hopefully clean it up a bit, adding job run notes for each entry. For now just enjoy the numbers! (Notice that I put a Xeon Phi number in there too :-)

New E5 v3 Test System

The test system was a Puget Peak Dual Xeon Tower.

Note: typo in top line of chart!  E5 2697v3 should be E5 2687v3

Linpack benchmark using the Intel MKL optimizations

Processor | Brief Spec | Linpack (GFLOPS)
Dual Xeon E5 2687v3 | 20 cores @ 3.1GHz, AVX2 | 788
Xeon Phi 3120A | 57 cores @ 1.1GHz, 512-bit SIMD | 710
Quad Xeon E5 4624Lv2 | 40 cores @ 1.9GHz, AVX | 581
Dual Xeon 2695v2 | 24 cores @ 2.4GHz, AVX | 441
Core i7 5960X (Haswell-E) | 8 cores @ 3.0GHz, AVX2 | 354
Dual Xeon E5 2687W | 16 cores @ 3.2GHz, AVX | 345
Core i7 5930K (Haswell-E) | 6 cores @ 3.5GHz, AVX2 | 289
Dual Xeon E5 2650 | 16 cores @ 2.0GHz, AVX | 262
Core i7 4770K (Haswell) | 4 cores @ 3.5GHz, AVX2 | 182
Xeon E3 1245v3 (Haswell) | 4 cores @ 3.4GHz, AVX2 | 170
Core i7 4960X (Ivy Bridge) | 6 cores @ 3.6GHz, AVX | 165
Core i5 3570 (Ivy Bridge) | 4 cores @ 3.4GHz, AVX | 105
Core i7 920 | 4 cores @ 2.66GHz, SSE4.2 | 40

Happy computing! --dbk

Tags: Haswell EP, Linpack, HPC, benchmark

Really impressive numbers! Thanks for the survey Donald. Was kinda hoping you would include an E5-1650(or 1660) v3 next time, as it's the one I am getting for my next desktop machine.

Posted on 2014-09-25 19:30:22

From a performance standpoint, the E5-1650 is actually almost the same thing as the Core i7 5930K except it has a .1GHz higher Turbo frequency. The E5-1660, on the other hand, is exactly the same as the Core i7 5960X. Really the only major difference between the Core i7 and E5 Xeon CPUs is that the Core i7s can be overclocked while the E5s support ECC/Reg. ECC memory and things like vPro and TXT.

So the numbers for the 5930K might be a hair lower than you would see with an E5-1650, but it should be really, really close. And the 5960X should be almost exactly what you would see with an E5-1660.

Posted on 2014-09-25 19:40:58

Those are pretty much identical to the Core i7 5930K and 5960X, respectively. Exact same architecture, core count, clock speed, etc... just on the server / workstation "Xeon" side of the Intel branding. The numbers Don listed for those processors should give you exactly the info you want :)

Posted on 2014-09-25 19:42:35

D'oh, Matt ninja'd me :)

Posted on 2014-09-25 19:44:18

Yeah, it's true that the 1650 is sorta the "serious man's 5930K" and the 1660 is the 5960X, but from a computational standpoint I wouldn't use the 59xx for computations as they don't support ECC.
But yes, I forgot that they are almost the same and have almost the same price.

Posted on 2014-09-25 20:22:10
Georg Boman

You have some great information here Donald! There is not much information on the net about the new E5 v3 processors' real-world performance. I'm currently doing some research on which CPU is optimal for a CFD cluster running Fluent, CFX, and StarCCM+, and this page was a good start.

Let me point out one error though in the above list, the Dual E5-2697v3 configuration is 28 cores @ 2.6 GHz AVX2 not 20 cores @ 3.1 GHz.

I'm looking for information about the CPU behaviour when fully loaded. For the processor to maintain levels below TDP it will lower frequency below base frequency if necessary, in the case of 2697v3 down to 2.2 GHz. Understanding how this works in a real world scenario with different solvers would be great. If anyone here has some information on this it would be greatly appreciated.

Georg Boman

Posted on 2014-10-26 11:02:53

I believe that Intel CPUs will only down-clock if cooling is insufficient - if the processor gets too hot it will first stop turbo-boosting above the base clock speed, and then if needed drop below it (eventually turning the computer off altogether, as in the case of a major cooling failure, to prevent damage). If you keep the CPU properly cooled, though, it will always run at the base clock or higher - with how much higher depending on the turbo boost settings and number of active cores.

Posted on 2014-10-27 05:52:22
Donald Kinghorn

Hi Georg, Thanks for the catch on the 2697 typo! That was an E5 2687v3 that I tested on. Best regards --Don

Posted on 2014-10-27 17:55:27
XeviousDeathStar ✓ ˢᵐᵃʳᵗ ᵍᵘʸ

The Processor's Spec. Page ( http://ark.intel.com/produc... ) provides a bit of info about "Intel® Turbo Boost Technology" and "Thermal Monitoring Technologies".

Perhaps this is helpful: http://www.intel.com/conten...

If you like that Chip Number Series have you considered the E5 2687W v3 ( http://ark.intel.com/produc... ), fewer Cores but a higher Base Freq. (and TDP); which means that when you get kicked you still run at a higher speed.

It also depends on what you mean by "optimal", TCO, TDP, or MHz, the presence (need) of ECC memory and 'Enterprise Features' (expensive Features) vPro and TXT, etc.

Would a cluster of i7s on Mini-ITX motherboards work for you? It would reduce the cost of your research.

Posted on 2014-11-09 20:43:48
Tristan Leflier

The problem is this: Linpack is misleading.

FMA3 can be used in only a small fraction of all multiplies and adds in a real program (like CFD). I'm just testing Haswells... not a big difference on those jobs. I get... let's see... a very small 209:197 speedup avx2:avx (20 threads on 10-core E5-2660, dual CPU, developing on a trial node of the SciNet supercomputer in Toronto). So I wouldn't praise the new instructions too much. Linpack is really really simple operations like matrix multiplication. In the real world we don't rotate matrices.
I mean, some people do, but I for one live in a nonlinear world (of compressible explicit

We do things like a = a+4*b[i]*c[j]/d[k] +e[j]/2- 1.5*f[i-1,j]*f[i+1,j], or so. In that example you can squeeze in one or two FMA3s but the rest are normal additions or multiplications. I would have expected some speedup of index calculation in multi-d arrays but as you see... no real speedup,
only +6% on the whole 3D code with large arrays exceeding 2GB.

PS. a good guide to performance of dual and single setups

Posted on 2014-11-07 06:13:22
Tristan Leflier

I am struggling to squeeze the expected performance from dual Haswells, as opposed to a single Haswell - but I haven't finished my tests yet so I won't prematurely complain. The first trials weren't showing enough speedup from the dual setup. Playing with the env. variable KMP_AFFINITY now... people from SciNet pointed the affinity issue out to me - & I'm very grateful :-)

Posted on 2014-11-07 06:19:50
Tristan Leflier

One more comment: notice the disappointing performance of the flagship supercomputer part, the Xeon Phi 3120A, as compared with the nominally weaker processors (57 cores at 1.1GHz and 512-bit SIMD (should be great!) - vs. a 20-core dual Xeon setup at 3.1GHz with only 256-bit SIMD).
So again - something doesn't add up. The Phi should use its 512 bits (if that's correct) to crush the Xeon, yet the CPU wins. Actually, I'm a GPU guy too, and the newest Xeons match a GTX Titan, while being more versatile in programming. CPUs don't require simple tasks repeated on massive amounts of data, for instance; I can use my old Fortran codes easily - for high-level language programming, CUDA C, and C in general, is a glorified assembler.

Physicists have recently published their benchmarks in an HPC journal (HOC Magazine? don't remember) a few days ago: Ivy Bridge or even Sandy Bridge against the slightly older Xeon Phi.
Same conclusion: CPU beats Xeon Phi.

Which is tragic, because I bought two Phi's and they're not that useful, I think they perform like an O/C'd i7-5960X :-|

Posted on 2014-11-07 06:29:34

Well, Xeon Phi performance for generic massively multithreaded algorithms/applications is way worse than what a dual Haswell CPU can provide; it is not by chance that Intel does not publish SPECint2006 Rate results for the Xeon Phi... a dual E5-2699 v3 scores around 1400 while the Phi probably cannot get above 700 (likely it's way lower). A SPECfp2006 Rate bench may exhibit a less devastating result, yet likely it is still well below what a dual Haswell can provide. However, SIMD-style computation does allow one to present cases where the Phi may look not so bad and even competitive; SIMD computations are suitable mostly for brute-force number-crunching algorithms and not well suited to adaptive algorithms which may change the code path of each thread based on intermediate results specific to that thread (adaptive ray tracing and/or adaptive volume ray casting are the most well known examples).

Posted on 2014-11-23 22:54:48

I work with a Sandy Bridge, but I'm having some trouble with your result of 262 GFLOP/s. If my calculations are right, the maximum performance is 256 for double precision, which is the one used by LINPACK, according to their FAQ.

Posted on 2016-01-12 17:35:49
Donald Kinghorn

Hey Raphael, The reason you are having trouble with my 262 GFLOP/s is because it's wrong! That's a typo; it should be 256 as you point out. Thanks!
I'm working on a post that may go up on our HPC blog later today. It will have a nice chart with "effective" performance numbers on it based on all-core-turbo and Amdahl's law. It's really interesting! Newer Intel processors run at all-core-turbo under load unless there is a cooling or power problem, so it's a much better estimate of performance... and of course real-world parallelism usually follows an Amdahl's law curve. I would appreciate your comments on it and would especially appreciate spotting any errors.
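For reference, the Amdahl's law part is just a one-liner. A minimal sketch (the parallel fractions below are made-up illustration values, not measurements):

```python
# Amdahl's law: speedup on n cores when a fraction p of the work is
# parallel. The p values below are illustrative, not measured.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

for p in (0.90, 0.99):
    print(f"p={p}: {amdahl_speedup(p, 20):.1f}x on 20 cores")
```

With those numbers a 20-core box gets about 6.9x at p=0.90 and 16.8x at p=0.99, which is why the serial fraction matters so much more than raw core count.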

Thanks --Don

Posted on 2016-01-13 16:36:18
Antonio Carlos Pereira de Azam

Nice job. Thank you for the information you publish to the open public.

I have just one question: why is the double precision theoretical peak performance 992 GFLOPS? In the main text, it says the peak performance is

3.1 * 20 * 8 * 2 = 992 GFLOPS. But isn't AVX2 built to perform 4 double precision operations simultaneously? Not 8?

Edit: Now I see: there is a 256-bit vector engine for add and another one for multiply.

Posted on 2017-02-20 21:05:52
Donald Kinghorn

Hi Antonio, You are correct, it's 256-bit so it can load 4 doubles. When I wrote this I was using some guidance from some Intel docs (I don't have a reference); it was an overall performance approximation. But it is misleading, and I probably should not have listed it as a factor of 8 from AVX. I have resorted to just using a global approximation for "special ops", like I do in the post on Xeon v4.

This is going to get much more complicated when the Xeon v5 comes out. There are a lot of performance improvements (and a big one will be AVX512). [This is going to be the most exciting Intel processor to come out in years!]

Years ago you could actually make a good guess at a theoretical peak, but now all bets are off since so much is tied to dynamic power management and the overall complexity of the hardware. I believe it is still useful to have an approximation, but hardware-feature-to-performance attribution is very complicated. --Don

Posted on 2017-02-21 19:18:59
Antonio Carlos Pereira de Azam

Thank you very much for your explanation.

Posted on 2017-03-03 13:23:39