Why quad Xeon? 95% of peak LINPACK on 40 cores!

I’ve been doing application performance testing on our quad socket systems and I am especially liking the quad Xeon box on our test bench. I realized that I haven’t published any LINPACK performance numbers for this system (it’s my favorite benchmark). I’ll show results for the Intel-optimized multi-threaded binary that is included with Intel MKL, and for a build from source using OpenMPI. It turns out that both OpenMP threads and MPI processes give outstanding, near-theoretical-peak performance. Building from source hopefully shows that it’s not just Intel “magic” that leads to this performance … although I guess it really is.

Why I like the Quad Socket Xeon

One of the biggest concerns about using a quad socket “many-core” system is parallel performance scaling and memory contention for large numbers of threads. The early quad socket Xeon systems were not so good! (awful in my not so humble opinion) They suffered from poor memory handling, improper CPU core and memory process binding, and the tools needed for programming with larger numbers of threads were immature and troublesome. This is no longer the case today. Modern hardware like the Intel Xeon E5 4xxx processors is really good! OpenMP has become very strong and has many excellent implementations, and there are very good specialized tools from Intel and PGI (and Microsoft too!). And modern MPI implementations have kept pace with the new platforms and are well adapted to systems with many SMP cores.

These days I like the quad socket systems as an alternative to small clusters. They are compact and easy to configure and maintain since there is only one install and configuration to take care of. I’ve personally built and configured hundreds of small clusters (and some large ones), but the more testing I do with the quads the more I like them as an alternative. A modern quad socket machine is a good platform for applications that gave good performance running on 4 to 8 node clusters a few years back. They also give Windows users a nice platform for running well-written modern multi-threaded applications.

Another advantage of the quad socket systems is that they have twice the number of memory controllers and memory sockets. It’s not a problem to configure a quad socket box with 1TB of memory. When you have a problem that requires a single very large memory space for at least part of the job run, a quad socket box can make the difference between being able to solve your problem or not! [If you take quad socket systems to the extreme and go with a quad Intel E7 setup you can get 4 x 15-core processors and 3TB of memory! (People advertise specs that say 6TB, but that requires 64GB LRDIMM’s and they are, aaahhh, hard to get a hold of 🙂 … we’re trying actually.)]

LINPACK Performance on the Quad Xeon

The test system was a Puget Peak Quad Xeon Tower.

Note: this is a test-bench system using quad socket Ivy Bridge engineering sample CPUs. We normally configure our Quad Xeon systems with higher-clocked CPUs.

LINPACK using OpenMP threads with Intel MKL on 40 cores

The following output is from the optimized Intel SMP LINPACK in the MKL benchmark directory. This is an Intel-provided binary.

Notes:

  1. The CPU frequency of 2.499GHz is an artifact of “turbo-boost”. The CPU clock rapidly drops back to the 1.9GHz base clock once the all-core calculation starts.
  2. The problem size of 83328 was chosen to utilize 90% of the system memory (a rough sketch of that sizing follows these notes).
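
For reference, here’s a back-of-the-envelope sketch of how a problem size like that can be estimated (this is my own approximation, not the exact logic inside Intel’s runme script): the double precision matrix needs N^2 * 8 bytes, so pick N such that this is about 90% of RAM and round down to a multiple of the block size.

# Rough problem-size estimate (an approximation, not Intel's exact script logic)
MEM_BYTES=$(awk '/MemTotal/ {print $2 * 1024}' /proc/meminfo)
NB=168   # block size; N is typically rounded to a multiple of this
N=$(awk -v m="$MEM_BYTES" -v nb="$NB" \
    'BEGIN { n = int(sqrt(0.9 * m / 8)); print int(n / nb) * nb }')
echo "Suggested problem size N = $N"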
 [kinghorn@tbench linpack]$ ./runme_xeon64
This is a SAMPLE run script for SMP LINPACK. Change it to reflect
the correct number of CPUs/threads, problem input files, etc..
Wed Jul 23 11:50:49 PDT 2014
Intel(R) Optimized LINPACK Benchmark data


Current date/time: Wed Jul 23 11:50:49 2014


CPU frequency:    2.499 GHz
Number of CPUs: 4
Number of cores: 40
Number of threads: 40


Parameters are set to:


Number of tests: 1
Number of equations to solve (problem size) : 83328
Leading dimension of array                  : 83328
Number of trials to run                     : 1    
Data alignment value (in Kbytes)            : 1    


Maximum memory requested that can be used=55550112256, at the size=83328


=================== Timing linear equation system solver ===================


Size   LDA    Align. Time(s)    GFlops   Residual     Residual(norm) Check
83328  83328  1      671.733    574.2497 5.776853e-09 2.973327e-02   pass


Performance Summary (GFlops)


Size   LDA    Align.  Average  Maximal
83328  83328  1       574.2497 574.2497


Residual checks PASSED


End of tests


Done: Wed Jul 23 12:09:48 PDT 2014

Theoretical peak double precision floating point performance for this system, when utilizing the CPU vector units (these cores can retire a 4-wide double precision AVX add and a 4-wide AVX multiply each cycle, i.e. 8 FLOPS/cycle per core), is

1.9 GHz * 40 cores * 8 FLOPS/cycle = 608 GFLOPS

This LINPACK run using multi-threaded MKL gives 574.25/608 * 100 = 94.4% of the theoretical system peak performance!

LINPACK using 40 MPI processes with OpenMPI (compiled from source)

I’m compiling with the Intel compilers, so to link the LINPACK code against OpenMPI I needed to build OpenMPI itself from source with the Intel compilers. I grabbed a copy of the current OpenMPI distribution, version 1.8.1, and used the following simple configuration for the build:

[root@tbench openmpi-1.8.1]# ./configure --enable-static --prefix=/opt/mpi/ompi-1.8.1-intel  CC=icc CXX=icpc F77=ifort FC=ifort
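
From there the rest is the standard OpenMPI build and install procedure; roughly (the paths here just match the prefix above):

make -j 40 && make install
export PATH=/opt/mpi/ompi-1.8.1-intel/bin:$PATH
export LD_LIBRARY_PATH=/opt/mpi/ompi-1.8.1-intel/lib:$LD_LIBRARY_PATH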

For the from-source LINPACK build I used the Make.intel64 makefile in the mp_linpack directory from MKL with the following changes:

MPdir        = /opt/mpi/ompi-1.8.1-intel
CC      = mpicc
#  CC      = mpiicc


CCFLAGS = $(HPL_DEFS) -O3 -axAVX …

I’ve changed the MPI base directory (MPdir) to point to my OpenMPI install, switched CC to OpenMPI’s mpicc wrapper instead of Intel’s mpiicc, and explicitly added the -axAVX compiler flag to let the compiler know that I really want AVX vectorization rather than something more generic like SSE3. The build itself is sketched just below.
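
With those changes in place, building the HPL binary is the usual make from the mp_linpack directory; something like this (assuming the makefile keeps the name Make.intel64, so the arch is intel64):

cd mp_linpack
make arch=intel64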

For the job run I used the following environment variable for thread/process pinning. (I could also have used options to mpiexec for this; a sketch of that alternative follows below.)

export KMP_AFFINITY=nowarnings,compact
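
For reference, the mpiexec alternative using OpenMPI 1.8’s own placement and binding options would look something like the following sketch (assuming the stock HPL names, i.e. an xhpl binary with HPL.dat in the working directory):

mpirun -np 40 --map-by core --bind-to core ./xhpl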

Here’s a run of the LINPACK benchmark using the OpenMPI build compiled with the Intel compilers. Performance is even better than with the Intel-optimized SMP LINPACK binary!

================================================================================
HPLinpack 2.1  --  High-Performance Linpack benchmark  --   October 26, 2012
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================


An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.


The following parameter values will be used:


N        :   83328 
NB       :     168 
PMAP     : Row-major process mapping
P        :       4 
Q        :      10 
PFACT    :   Right 
NBMIN    :       4 
NDIV     :       2 
RFACT    :   Crout 
BCAST    :  1ringM 
DEPTH    :       0 
SWAP     : Mix (threshold = 64)
L1       : transposed form
U        : transposed form
EQUIL    : yes
ALIGN    :    8 double precision words


--------------------------------------------------------------------------------


- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0


Column=000504 Fraction=0.005 Mflops=593418.38
Column=000840 Fraction=0.010 Mflops=598103.98
Column=001344 Fraction=0.015 Mflops=600040.58
Column=001680 Fraction=0.020 Mflops=602494.41
Column=002184 Fraction=0.025 Mflops=598044.80
Column=002520 Fraction=0.030 Mflops=599223.28
Column=003024 Fraction=0.035 Mflops=599302.16
Column=003360 Fraction=0.040 Mflops=599212.44


… 


Column=041328 Fraction=0.495 Mflops=588568.85
Column=043008 Fraction=0.515 Mflops=588242.62
Column=044688 Fraction=0.535 Mflops=587721.01
Column=046368 Fraction=0.555 Mflops=587483.43
Column=048048 Fraction=0.575 Mflops=587017.56
Column=049728 Fraction=0.595 Mflops=586653.46
Column=051408 Fraction=0.615 Mflops=586447.70
Column=052920 Fraction=0.635 Mflops=586110.26
Column=054600 Fraction=0.655 Mflops=585684.31
Column=056280 Fraction=0.675 Mflops=585372.05
Column=057960 Fraction=0.695 Mflops=585088.92
Column=066360 Fraction=0.795 Mflops=583676.23
Column=074592 Fraction=0.895 Mflops=582583.52
Column=082992 Fraction=0.995 Mflops=582116.16
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR01C2R4       83328   168     4    10             663.63            5.81256e+02
HPL_pdgesv() start time Wed Jul 23 14:15:32 2014


HPL_pdgesv() end time   Wed Jul 23 14:26:35 2014


--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0019790 ...... PASSED
================================================================================


Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------


End of Tests.
================================================================================
Done: Wed Jul 23 14:26:42 PDT 2014

That’s 581.26/608 * 100 = 95.6% of theoretical peak!

This is really very good. On a small cluster you would be doing well to get 60-70% of peak over a gigabit ethernet connection, and the Quad Xeon really doesn’t cost any more than a small cluster with the same core count and total memory … I like it!

Happy computing –dbk