The new Intel Xeon E5 v3 Haswell-EP processors are here and they are fantastic! Lots of cores, AVX2 (SIMD plus FMA3) operations, lots of PCIe lanes, DDR4 memory support… nice!
I’ve been anxious for the E5 v3 Haswell processors to come out since my first testing on the desktop Core i7 and E3 v3 Haswell processors. I was really impressed with the numerical performance potential of those processors, but they are limited to 16 PCIe lanes, 32GB of system memory, and 4 cores. The E5 v3 Haswell-EP removes all of those drawbacks. (The new Haswell-E desktop processors remove these drawbacks too!) These are really great processors!
In this post we’ll look at my favorite parallel numerical performance benchmark, Linpack. The Intel optimized Linpack benchmark using the MKL numeric libraries gives near theoretical peak double precision performance on Intel hardware. It’s highly tuned to take advantage of all of the features of the processors. This makes it a bit artificial as an indicator of “real world” application performance, but it clearly shows off the capabilities of the processors and gives developers something to aspire to.
The processor feature that has the most impact on numerical performance on Haswell is the AVX2 instruction set. The SIMD vector length is the same as for Ivy Bridge, i.e. 256-bit, but there is a little bit of new secret sauce on Haswell from the FMA3 instructions (that’s a 3-operand Fused Multiply Add that executes in a single clock tick). This has the potential to nearly double floating point performance for this type of operation, and this is the most common operation in numerical matrix calculations.
Theoretical Peak
A good approximation of theoretical peak for Ivy Bridge and Haswell looks like this:
CPU GHz * number of cores * SIMD vector ops (AVX) * special instructions effect (FMA3)
For the dual Xeon E5-2687W v3 @ 3.10GHz system, theoretical peak would be
3.1 * 20 * 8 * 2 = 992 GFLOPS
What did I get?
788 GFLOPS, approx. 80% of theoretical peak
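The arithmetic above is easy to reproduce. Here is a minimal Python sketch of the same approximation; the function names `theoretical_peak_gflops` and `efficiency` are made up for illustration, and the default factors 8 and 2 are simply the AVX and FMA3 factors used in the formula above.

```python
def theoretical_peak_gflops(ghz, cores, simd_ops=8, fma_factor=2):
    """Approximate peak double precision GFLOPS.

    Follows the formula in the text:
    CPU GHz * number of cores * SIMD vector ops * FMA3 effect.
    The defaults (8 and 2) are the factors used in the text for
    an AVX2 + FMA3 processor; pass fma_factor=1 for a part without FMA3.
    """
    return ghz * cores * simd_ops * fma_factor


def efficiency(measured_gflops, peak_gflops):
    """Fraction of theoretical peak actually achieved."""
    return measured_gflops / peak_gflops


# Dual Xeon E5-2687W v3: 2 x 10 cores at 3.1GHz, measured 788 GFLOPS
peak = theoretical_peak_gflops(3.1, 20)
print(f"peak: {peak:.0f} GFLOPS")
print(f"efficiency: {efficiency(788, peak):.1%}")
```

For the test system this gives a peak of 992 GFLOPS and an efficiency of about 79%, matching the numbers above.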
That is an incredible amount of compute capability for a “standard” dual CPU machine! I would like to see a number closer to theoretical peak for Linpack, but I’m not complaining; it’s really very good. The chart and table below have Linpack performance for various systems I’ve tested over the past year or so. The compiler version, OS, etc. are not the same for every result, but it’s still a good general comparison. I’ll keep expanding this with new CPUs and hopefully clean it up a bit by adding job run notes for each entry. For now, just enjoy the numbers! (Notice that I put a Xeon Phi number in there too.)
New E5 v3 Test System
The test system was a Puget Peak Dual Xeon Tower:
- Puget Systems Peak Dual Xeon:
- 2 x Intel Xeon E5-2687W v3 @3.1GHz 10-core
- 64GB DDR4 2133MHz Reg ECC
- …
- CentOS 6.5
- Intel Parallel Studio XE 2015
Note: typo in top line of chart! E5 2697v3 should be E5 2687v3
Linpack benchmark using the Intel MKL optimizations
Processor | Brief Spec | Linpack (GFLOPS) |
---|---|---|
Dual Xeon E5 2687v3 | 20 cores @ 3.1GHz AVX2 | 788 |
Xeon Phi 3120A | 57 cores @ 1.1GHz 512-bit SIMD | 710 |
Quad Xeon E5 4624Lv2 | 40 cores @ 1.9GHz AVX | 581 |
Dual Xeon E5 2695v2 | 24 cores @ 2.4GHz AVX | 441 |
Core i7 5960X (Haswell E) | 8 cores @ 3.0GHz AVX2 | 354 |
Dual Xeon E5 2687W | 16 cores @ 3.2GHz AVX | 345 |
Core i7 5930K (Haswell E) | 6 cores @ 3.5GHz AVX2 | 289 |
Dual Xeon E5 2650 | 16 cores @ 2.0GHz AVX | 262 |
Core i7 4770K (Haswell) | 4 cores @ 3.5GHz AVX2 | 182 |
Xeon E3 1245v3 (Haswell) | 4 cores @ 3.4GHz AVX2 | 170 |
Core i7 4960X (Ivy Bridge) | 6 cores @ 3.6GHz AVX | 165 |
Core i5 3570 (Ivy Bridge) | 4 cores @ 3.4GHz AVX | 105 |
Core i7 920 | 4 cores @ 2.66GHz SSE4.2 | 40 |
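
One rough way to see the FMA3 effect in these results is to divide each Linpack number by cores times GHz, which gives achieved floating point operations per core per clock. The sketch below does that for a handful of rows copied from the table; the helper name `flops_per_core_clock` is hypothetical, and the clocks are the nominal ones listed in the table, so treat the ratios as approximate.

```python
# (processor, cores, GHz, instruction set, Linpack GFLOPS), copied from the table
rows = [
    ("Dual Xeon E5 2687v3", 20, 3.1, "AVX2", 788),
    ("Core i7 4770K (Haswell)", 4, 3.5, "AVX2", 182),
    ("Dual Xeon E5 2687W", 16, 3.2, "AVX", 345),
    ("Core i7 4960X (Ivy Bridge)", 6, 3.6, "AVX", 165),
    ("Core i7 920", 4, 2.66, "SSE4.2", 40),
]


def flops_per_core_clock(cores, ghz, gflops):
    """Achieved FLOPs per core per clock cycle, using the nominal clock."""
    return gflops / (cores * ghz)


for name, cores, ghz, isa, gflops in rows:
    ratio = flops_per_core_clock(cores, ghz, gflops)
    print(f"{name:28s} {isa:7s} {ratio:5.1f}")
```

The AVX2 systems land at around 13 operations per core per clock against roughly 7 to 8 for the AVX systems, which is consistent with FMA3 nearly doubling throughput.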
Happy computing! –dbk