Intel Broadwell Xeon E5 2600v4 performance test

The Intel Xeon E5-2600 v4 Broadwell processors are available now and I’ve got a dual 2687W v4 system on my desk! Time to see what performance is like.

Broadwell is the “Tick” to Haswell’s “Tock”. It’s a die shrink from 22nm to 14nm, with increases in core count, memory bandwidth, and power handling. There are many new features and improvements:

  • AVX Optimizations (power and clock improvements)
  • Reduced cycle times for several floating point ops and improved division ops
  • TLB and Gather op improvements
  • TSX support
  • Better cache management
  • Higher maximum memory clocks
  • … lots of good small changes

If you want details on the many subtle differences, a Google search will be fruitful 🙂 Let’s see how these changes affect performance.

Test Hardware and Software

Test System: Peak Tower Dual
CPU: Intel Xeon E5 2687W v4 12-core @ 3.0/3.2/3.5GHz
Memory: 256GB DDR4 2133MHz
OS: CentOS 7.2
** CentOS 7.2 will not recognize the new Broadwell processor during install and will report “Unknown Hardware”. It does install OK and after updates the CPU is detected properly.
Test programs:
Linpack benchmark from Intel MKL version 11.3
NAMD version 2.10
STMV benchmark [ stmv.namd ]
Satellite Tobacco Mosaic Virus
1,066,628 atoms, periodic, PME, CPU only

Note: NAMD is a molecular dynamics program developed and maintained by the Theoretical and Computational Biophysics Group at the University of Illinois at Urbana-Champaign.

Note: Base clock is 3.0GHz, All-Core-Turbo is 3.2GHz and Max-Turbo is 3.5GHz. For the Haswell 2687W v3 these clocks were 3.1GHz, 3.2GHz, and 3.5GHz. The number that really matters is All-Core-Turbo, since that is what the processors run at under full load with proper cooling.

Linpack benchmark on Xeon E5 v4 Broadwell

We’re running the Intel optimized Linpack binary contained in the MKL benchmarks directory from MKL version 11.3. This is the latest version of MKL as of this writing, and the code is highly optimized for Intel processors. I feel it is a good measure of double precision floating point performance.

Broadwell Xeon E5 2687W v4 Linpack 1078 GFLOP/s!

That is outstanding performance from a single node dual socket system! By comparison, the testing I did with its Haswell predecessor, the 10-core 2687W v3, gave 788 GFLOP/s. Keep in mind that the v3 version had 2 fewer cores per socket, that I ran that test with a smaller (16GB) problem size, and that an older version of MKL was used. Also keep in mind that the Broadwell v4 version is the same price as the Haswell v3 version of this processor!
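
To put the 1078 GFLOP/s figure in context, it can be compared with a rough theoretical peak. The sketch below assumes 16 double precision FLOPs per cycle per core (two 256-bit FMA units), which is the usual figure quoted for Haswell and Broadwell cores, and it brackets the clock between the 3.0GHz base and the 3.2GHz All-Core-Turbo. The clock the cores actually sustain under heavy AVX load was not measured here, so treat the efficiency numbers as estimates. The function names are only for illustration.

```python
# Rough theoretical peak vs. measured Linpack for the dual E5-2687W v4 system.
# Assumption: 16 double precision FLOPs per cycle per core
# (2 FMA units x 4 doubles per 256-bit register x 2 ops per FMA).
FLOPS_PER_CYCLE = 16


def peak_gflops(sockets, cores_per_socket, clock_ghz):
    """Theoretical peak in GFLOP/s at a given (assumed) sustained clock."""
    return sockets * cores_per_socket * clock_ghz * FLOPS_PER_CYCLE


def efficiency(measured_gflops, peak):
    """Fraction of the theoretical peak that was actually achieved."""
    return measured_gflops / peak


measured = 1077.8674  # largest-size Linpack result reported in this post

# The sustained clock is an assumption, so bracket it with base and all-core turbo.
for label, clock in (("base 3.0GHz", 3.0), ("all-core turbo 3.2GHz", 3.2)):
    peak = peak_gflops(2, 12, clock)
    eff = efficiency(measured, peak)
    print(f"{label}: peak {peak:.1f} GFLOP/s, efficiency {eff:.1%}")
```

Under these assumptions the measured result lands somewhere between roughly 88% and 94% of peak, depending on which clock you believe the cores held during the run.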

Here’s some output from the Linpack testing:

“Standard” problem sizes (up to ~ 16GB)

Mon May 16 18:09:08 EDT 2016
Intel(R) Optimized LINPACK Benchmark data
...
Number of CPUs: 2
Number of cores: 24
Number of threads: 24
...
Maximum memory requested that can be used=16200901024, at the size=45000

Performance Summary (GFlops)
...
Size   LDA    Align.  Average  Maximal
1000   1000   4       93.2748  115.6553
2000   2000   4       257.3304 268.6970
5000   5008   4       598.5752 612.6550
10000  10000  4       786.9355 812.9519
15000  15000  4       790.3809 799.4511
18000  18008  4       902.1919 903.0113
20000  20016  4       926.9572 927.2577
22000  22008  4       929.5465 930.6058
25000  25000  4       923.1123 924.7006
26000  26000  4       925.6606 928.4090
27000  27000  4       917.0465 917.0465
30000  30000  1       924.4760 924.4760
35000  35000  1       937.2601 937.2601
40000  40000  1       947.9258 947.9258
45000  45000  1       946.1127 946.1127

Residual checks PASSED

End of tests

Large problem size output (up to ~200GB):

Maximum memory requested that can be used=204803201024, at the size=160000

Performance Summary (GFlops)

Size   LDA    Align.  Average  Maximal
50000  50000  1       947.9553 947.9553
55000  55000  1       951.8988 951.8988
60000  60000  1       963.6150 963.6150
70000  70000  1       1032.6451 1032.6451
80000  80000  1       1042.3514 1042.3514
120000 120000 1       1069.4436 1069.4436
160000 160000 1       1077.8674 1077.8674

Residual checks PASSED

End of tests
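
The “Maximum memory requested” lines follow from the problem size: Linpack solves a dense N x N system in double precision, so the matrix alone takes about 8 x N x N bytes. The small sketch below (the helper names are mine, not part of the benchmark) reproduces the two reported figures to within the small amount of extra workspace the benchmark also requests, and estimates the largest N for a given memory budget. The 240 GB budget is a hypothetical value chosen for illustration.

```python
import math

BYTES_PER_DOUBLE = 8


def matrix_bytes(n):
    """Storage for a dense n x n matrix of doubles (ignores extra workspace)."""
    return BYTES_PER_DOUBLE * n * n


def max_problem_size(mem_bytes):
    """Largest n whose n x n double matrix fits in mem_bytes."""
    return math.isqrt(mem_bytes // BYTES_PER_DOUBLE)


# Memory figures reported in the benchmark output for the two largest sizes.
reported = {45000: 16200901024, 160000: 204803201024}
for n, mem in reported.items():
    print(f"N={n}: matrix {matrix_bytes(n) / 1e9:.1f} GB, reported {mem / 1e9:.1f} GB")

# Hypothetical budget: 240 GB of the 256GB in the test system used for the matrix.
print("largest N for 240 GB:", max_problem_size(240 * 10**9))
```

This is why the run tops out at N=160000: the next round sizes would start crowding the 256GB of installed memory once the OS and workspace are accounted for.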

NAMD Molecular Dynamics test (no GPU acceleration)

NAMD is a good CPU test code in my opinion. The parallel scaling of NAMD with threads or message passing is excellent. I have used the “standard” binary build for x86_64 multicore CPU version 2.10. There are more recent versions but I have some test data on Haswell for this version.

Note: If you are running NAMD without GPU acceleration, you should probably reconsider, since NAMD has excellent GPU acceleration!

One of the things I like about this program for testing is that there is not that much advantage in recompiling using Intel compilers (see my blog post about NAMD). That means we get a test that may better represent how existing programs will perform on the new Broadwell Xeon E5s.

The gains with the Broadwell Xeon over the Haswell Xeon are much more modest in this case.

NAMD stmv simulation, 500 time steps (CPU only): Intel Xeon E5-2687W v3 (Haswell) vs E5-2687W v4 (Broadwell)

            Haswell E5-2687W v3      Broadwell E5-2687W v4
CPU cores   wall time (s)   day/ns   wall time (s)   day/ns
1           4220.0          96.29    3747            85.3
2           2167.0          48.66    1919            43.8
4           1150.1          26.91    1001            22.8
8           612.9           13.59    547             12.3
10          494.6           11.65    440             9.79
12          n/a             n/a      369             8.18
16          313.9           6.93     281             6.17
20          268.3           5.51     231             4.97
24          n/a             n/a      195             4.16
40 (HT)     228.0           4.80     n/a             n/a
48 (HT)     n/a             n/a      175             3.62

Notes:
These processors run at 3.5GHz Max-Turbo for the 1, 2, and 4 core jobs and then at 3.2GHz All-Core-Turbo for the rest.
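
The scaling in the table can be summarized as speedup and parallel efficiency relative to the 1-core run. The sketch below does that for the Broadwell wall times and also backs out the parallel fraction that Amdahl's Law would imply at 24 cores. This is only a rough estimate: the 1-core run gets the higher Max-Turbo clock, and the wall times may include startup, so the fitted fraction is an approximation. The function names are illustrative.

```python
# Broadwell E5-2687W v4 wall times in seconds from the table (physical cores only).
broadwell = {1: 3747, 2: 1919, 4: 1001, 8: 547, 10: 440,
             12: 369, 16: 281, 20: 231, 24: 195}


def speedup(t1, tn):
    """Speedup of an n-core run relative to the 1-core run."""
    return t1 / tn


def parallel_fraction(s, n):
    """Invert Amdahl's Law, S = 1 / ((1 - p) + p / n), to solve for p."""
    return (1.0 - 1.0 / s) / (1.0 - 1.0 / n)


def amdahl_speedup(p, n):
    """Speedup predicted by Amdahl's Law for parallel fraction p on n cores."""
    return 1.0 / ((1.0 - p) + p / n)


t1 = broadwell[1]
for n, tn in broadwell.items():
    s = speedup(t1, tn)
    print(f"{n:2d} cores: speedup {s:5.2f}, efficiency {s / n:.0%}")

# Parallel fraction implied by the 24-core result (rough, see caveats above).
p = parallel_fraction(speedup(t1, broadwell[24]), 24)
print(f"implied parallel fraction at 24 cores: {p:.3f}")
```

The implied parallel fraction comes out at around 0.99, which is consistent with NAMD scaling well across all 24 cores.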

The speedup using the v4 Broadwell Xeon is not nearly as dramatic with NAMD, but it is still a nice speedup, and there are 2 extra cores per socket!
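
To put numbers on that comparison, the ratio of Haswell to Broadwell wall time can be computed at each core count the two runs share, and for all physical cores on each system (20 vs 24). This is a small sketch using only the table values; the dictionary and function names are mine.

```python
# Wall times in seconds from the NAMD stmv table (physical cores only).
haswell = {1: 4220.0, 2: 2167.0, 4: 1150.1, 8: 612.9,
           10: 494.6, 16: 313.9, 20: 268.3}
broadwell = {1: 3747, 2: 1919, 4: 1001, 8: 547, 10: 440,
             12: 369, 16: 281, 20: 231, 24: 195}


def gain(old_time, new_time):
    """How many times faster the new run is than the old one."""
    return old_time / new_time


# Like-for-like comparison at core counts present in both runs.
common = sorted(set(haswell) & set(broadwell))
for n in common:
    print(f"{n:2d} cores: Broadwell is {gain(haswell[n], broadwell[n]):.2f}x faster")

# Whole-system comparison: all 20 Haswell cores vs all 24 Broadwell cores.
full = gain(haswell[20], broadwell[24])
print(f"20 Haswell cores vs 24 Broadwell cores: {full:.2f}x")
```

The like-for-like rows isolate the per-core improvement, while the last line also includes the benefit of the extra cores.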

I’ll have another post up soon as a buyer’s guide for the new Broadwell processors that will show price, theoretical performance, and Amdahl’s Law scaling. That will be an update to the post that shows this information for the Haswell Xeons.


Happy computing! –dbk