Memory Performance for Intel Xeon Haswell-EP DDR4

Getting data into and out of memory is a major performance factor for many compute-intensive applications. “Memory bandwidth” is the metric most often used to gauge this performance, and one of the standard benchmarks for measuring it is the STREAM benchmark. We’ll take a look at STREAM results for the new Intel Xeon E5 v3 Haswell processors with DDR4 memory and compare them with a system running Xeon E5 v2 Ivy Bridge processors.

I want to add a cautionary note right up front. My results varied significantly from run to run. Precision was very poor, and thus accuracy is likely to be very poor as well! I’m not sure of the reason for this, but it could be that the internal counters on these modern architectures are no longer useful for this benchmark. I’ll try to resolve this and add comments later.

**********

NOTE: My bad! I forgot to set process affinity. After doing so, precision (repeatability) was very good! You need to set the KMP_AFFINITY environment variable:

export KMP_AFFINITY=granularity=core,compact

**********
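As a rough sketch of how the affinity setting fits into a run, the helper below builds the environment for one STREAM job. The KMP_AFFINITY value is the one from the note above. Choosing the thread count with OMP_NUM_THREADS and the ./stream_c.exe path are assumptions, since this post does not say how the thread count was set for each run, and the function names are only for illustration.

```python
import os
import subprocess

# Affinity string taken from the note above.
KMP_AFFINITY = "granularity=core,compact"


def stream_env(threads):
    """Return a copy of the current environment set up for one STREAM run.

    Assumption: the thread count is chosen with OMP_NUM_THREADS, the
    standard OpenMP environment variable.
    """
    env = dict(os.environ)
    env["KMP_AFFINITY"] = KMP_AFFINITY
    env["OMP_NUM_THREADS"] = str(threads)
    return env


def run_stream(threads, exe="./stream_c.exe"):
    """Run the benchmark binary (the path is an assumption) and return its output."""
    result = subprocess.run([exe], env=stream_env(threads),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

With this in place, something like run_stream(1) followed by run_stream(24) would give a single-core and an all-core run under the same affinity setting.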

Test Systems

The test systems were a Puget Peak Dual Xeon Tower and the Peak Dual Xeon Stacker.

Puget Systems Peak Dual Xeon Stacker:

  • 2 x Intel Xeon E5-2695 v2 @2.4GHz 12-core
  • 64GB DDR3 1600MHz Reg ECC

I compiled the STREAM source using the new Intel compilers from Parallel Studio XE 2015 and used the following build flags:

CC = icc
CFLAGS = -O3 -xHost -openmp -DSTREAM_ARRAY_SIZE=64000000 -opt-streaming-cache-evict=0 -opt-streaming-stores always -opt-prefetch-distance=64,8
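To put the STREAM_ARRAY_SIZE flag in context, here is a small sketch of the memory footprint it implies. The arrays need to be much larger than the processor caches so that the benchmark measures main memory rather than cache. The sketch assumes 8-byte elements and three arrays, which is what the job output further down reports; the function name is just for illustration.

```python
BYTES_PER_ELEMENT = 8            # "This system uses 8 bytes per array element."
STREAM_ARRAY_SIZE = 64_000_000   # from -DSTREAM_ARRAY_SIZE in CFLAGS
NUM_ARRAYS = 3                   # the job output validates "all three arrays"

MIB = 1024 ** 2                  # bytes per MiB


def array_mib(elements, bytes_per_element=BYTES_PER_ELEMENT):
    """Size of one STREAM array in MiB."""
    return elements * bytes_per_element / MIB


per_array = array_mib(STREAM_ARRAY_SIZE)
total = NUM_ARRAYS * per_array
print(f"Memory per array: {per_array:.1f} MiB")
print(f"Total memory:     {total:.1f} MiB")
```

These figures should line up with the "Memory per array" and "Total memory required" lines that STREAM prints at startup.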

Results

The bandwidth numbers most often reported from the STREAM benchmark are the “TRIAD” results. Here they are:

Haswell Xeon E5-2687W v3, 20 cores: 102 GB/second

Haswell Xeon E5-2687W v3, 1 core: 20 GB/second

Ivy Bridge Xeon E5-2695 v2, 24 cores: 75 GB/second

Ivy Bridge Xeon E5-2695 v2, 1 core: 8.7 GB/second

The following table summarizes the results and following that is the job run output.

STREAM benchmark: Xeon E5 Haswell DDR4 vs. Ivy Bridge DDR3 (MB/sec)

Processor, cores, memory                  Copy   Scale     Add   Triad
Dual Xeon E5-2687W v3, 20 cores, DDR4    98919   99147  101802  102386
Dual Xeon E5-2687W v3, 1 core, DDR4      18544   18066   19265   19565
Dual Xeon E5-2695 v2, 24 cores, DDR3     69169   69301   75032   75264
Dual Xeon E5-2695 v2, 1 core, DDR3        6232    6280    8570    8693
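For a quick comparison, the table can be turned into ratios. This is only a sketch: the numbers are copied from the table, while the dictionary keys and the helper name are made up for illustration.

```python
# Best-rate figures (MB/s) copied from the summary table.
results = {
    "haswell_20": {"Copy": 98919, "Scale": 99147, "Add": 101802, "Triad": 102386},
    "haswell_1":  {"Copy": 18544, "Scale": 18066, "Add": 19265,  "Triad": 19565},
    "ivy_24":     {"Copy": 69169, "Scale": 69301, "Add": 75032,  "Triad": 75264},
    "ivy_1":      {"Copy": 6232,  "Scale": 6280,  "Add": 8570,   "Triad": 8693},
}


def ratio(a, b, kernel="Triad"):
    """Bandwidth of run `a` divided by bandwidth of run `b` for one kernel."""
    return results[a][kernel] / results[b][kernel]


print(f"All cores, Haswell vs Ivy Bridge: {ratio('haswell_20', 'ivy_24'):.2f}x")
print(f"One core, Haswell vs Ivy Bridge:  {ratio('haswell_1', 'ivy_1'):.2f}x")
print(f"Haswell, 20 cores vs 1 core:      {ratio('haswell_20', 'haswell_1'):.2f}x")
print(f"Ivy Bridge, 24 cores vs 1 core:   {ratio('ivy_24', 'ivy_1'):.2f}x")
```

On these numbers the Triad bandwidth of the Haswell system is roughly 1.36 times that of the Ivy Bridge system with all cores in use, and about 2.25 times on a single core.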

STREAM 1 core Intel Xeon E5-2695 v2 @2.4GHz

[kinghorn@tbench stream]$ ./stream_c.exe 
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 64000000 (elements), Offset = 0 (elements)
Memory per array = 488.3 MiB (= 0.5 GiB).
Total memory required = 1464.8 MiB (= 1.4 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 1
Number of Threads counted = 1
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 107601 microseconds.
   (= 107601 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            6232.3     0.164468     0.164305     0.164868
Scale:           6280.1     0.163229     0.163056     0.163578
Add:             8570.9     0.183882     0.179212     0.184838
Triad:           8693.3     0.183788     0.176688     0.185958
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

STREAM 24 cores Intel Xeon E5-2695 v2 @2.4GHz

[kinghorn@tbench stream]$ ./stream_c.exe 
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 64000000 (elements), Offset = 0 (elements)
Memory per array = 488.3 MiB (= 0.5 GiB).
Total memory required = 1464.8 MiB (= 1.4 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 24
Number of Threads counted = 24
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 14868 microseconds.
   (= 14868 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           69169.9     0.014902     0.014804     0.014958
Scale:          69301.6     0.014910     0.014776     0.015047
Add:            75032.6     0.020578     0.020471     0.020641
Triad:          75264.9     0.020632     0.020408     0.020865
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

STREAM 1 core Intel Xeon E5-2687W v3 @3.1GHz

[kinghorn@tbench stream]$ ./stream_c.exe 
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 64000000 (elements), Offset = 0 (elements)
Memory per array = 488.3 MiB (= 0.5 GiB).
Total memory required = 1464.8 MiB (= 1.4 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 1
Number of Threads counted = 1
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 69363 microseconds.
   (= 69363 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           18544.4     0.055288     0.055219     0.055335
Scale:          18066.7     0.056706     0.056679     0.056728
Add:            19265.5     0.080222     0.079728     0.080301
Triad:          19565.9     0.080177     0.078504     0.080420
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

STREAM 20 cores Intel Xeon E5-2687W v3 @3.1GHz

[kinghorn@tbench stream]$ ./stream_c.exe 
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 64000000 (elements), Offset = 0 (elements)
Memory per array = 488.3 MiB (= 0.5 GiB).
Total memory required = 1464.8 MiB (= 1.4 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 20
Number of Threads counted = 20
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 15702 microseconds.
   (= 15702 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           98919.1     0.010408     0.010352     0.010436
Scale:          99147.4     0.010399     0.010328     0.010465
Add:           101802.2     0.015115     0.015088     0.015133
Triad:         102386.3     0.015031     0.015002     0.015054
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
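The "Best Rate" column follows directly from the "Min time" column. As a sketch, the snippet below recomputes the rates for the 20-core run above, assuming STREAM's usual accounting: Copy and Scale touch two arrays per element, Add and Triad touch three, and 1 MB is 10^6 bytes. The variable and function names are illustrative.

```python
ARRAY_SIZE = 64_000_000      # elements, from the "Array size" line of the output
BYTES_PER_ELEMENT = 8        # "This system uses 8 bytes per array element."

# Arrays read or written per element by each kernel, as counted by STREAM.
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}

# "Min time" (seconds) and printed best rate (MB/s) from the 20-core run above.
MIN_TIME = {"Copy": 0.010352, "Scale": 0.010328, "Add": 0.015088, "Triad": 0.015002}
PRINTED = {"Copy": 98919.1, "Scale": 99147.4, "Add": 101802.2, "Triad": 102386.3}


def best_rate_mb_s(kernel, min_time_s):
    """Bandwidth in MB/s (1 MB = 10**6 bytes) from a kernel's minimum time."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_ELEMENT * ARRAY_SIZE
    return bytes_moved / min_time_s / 1e6


for kernel, t in MIN_TIME.items():
    computed = best_rate_mb_s(kernel, t)
    print(f"{kernel:6s} computed {computed:9.1f}   printed {PRINTED[kernel]:9.1f}")
```

The computed values should agree with the printed ones to within a few MB/s; differences of that size are consistent with the min times being printed to only six decimal places.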

Happy computing! –dbk