
https://www.pugetsystems.com

Read this article at https://www.pugetsystems.com/guides/1291
Dr Donald Kinghorn (Scientific Computing Advisor)

How to Run an Optimized HPL Linpack Benchmark on AMD Ryzen Threadripper -- 2990WX 32-core Performance

Written on November 30, 2018 by Dr Donald Kinghorn


I've had several people ask me about AMD Threadripper (TR) performance recently. I've replied with my best guesses on how it would do for their varied applications based on my intuition and understanding of the overall design. But, I've had to qualify everything with the comment that "I haven't done my own testing yet". Well, I got to spend a week with a test platform using the TR 2990WX 32-core processor. Of course, the first thing I always want to know about a new processor is how well it performs with the industry standard HPC benchmark, the "High Performance Linpack" (HPL) benchmark.

The Linpack benchmark is a parallel implementation of a large-scale linear-system-of-equations "solver". It gives a good measure of a machine's maximum floating point performance for numerical computing. This is the benchmark that ranks the fastest Supercomputers in the world. The fastest computer in the world as of November 2018 on the Top500 list is the Summit Supercomputer at Oak Ridge National Laboratory. It has a theoretical peak ("Rpeak") of over 200,000 TFLOPS (Trillions of Floating Point Operations per Second). That's over 200 peta-FLOPS or 200,000,000 GFLOPS (gigaFLOPS). GFLOPS is what we typically rate individual processors with. Summit gets its performance from the (wonderful) IBM Power 9 CPU architecture along with NVIDIA Volta GPUs (most of the performance comes from the GPUs). The Supercomputer that Summit is replacing was "Titan", which has been in service since 2012 and is ranked number 9 on Top500 with a peak performance of 27,000 TFLOPS. It utilizes 16-core AMD Opteron 6274 processors with a total core count of 560,640 (along with NVIDIA K20 GPUs). In this post we'll look at the Linpack benchmark for a processor that you could run in your desktop, the 32-core AMD Ryzen Threadripper 2990WX CPU.


Strategy for testing Threadripper with HPL Linpack

The first thing to note is that this is not a simple task! We will need to build some libraries and jump through some optimization hoops in order to get the best performance. I spent several days working on putting together everything needed to "properly" do a Linpack benchmark for Threadripper. There were lots of failures along the way. The difficulties I encountered are part of what has motivated me to write up this guide. There is no convenient executable to download and run that will give you a good Linpack benchmark result for Threadripper.

The all-important BLAS Library

A highly optimized BLAS library is fundamentally important for performance with the Linpack benchmark. We will have to compile HPL Linpack using an optimal library for AMD "Zen".

Intel, to their credit, does include a pre-compiled set of binaries for doing Linpack testing on their CPUs. It's included with the freely available high performance Math Kernel Library (MKL). The MKL library does not work with AMD processors! (...there is actually a long and interesting story behind that but I won't go into it...) MKL is highly optimized specifically for Intel processors and provides excellent performance.

AMD does provide optimized numerical compute libraries for the Zen architecture processors. The core "BLAS" library is called BLIS. That is what we will use as the optimized library to build HPL Linpack. This is the library for optimal matrix-vector and matrix-matrix operations on all of the "Zen-core" processors, i.e. Ryzen desktop processors and EPYC "server" processors.

Most processor manufacturers will create an optimized BLAS library for their hardware. BLAS is the "Basic Linear Algebra Subprograms". Level 3 BLAS contains the DGEMM routine. That is the "Double precision GEneral Matrix-Matrix" product. It is generally the most highly optimized piece of code for a processor architecture. The Linpack benchmark makes heavy, parallel use of it.
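
To make the DGEMM operation concrete, here is a minimal pure-Python sketch of what the routine computes, C = alpha*A*B + beta*C, together with the usual 2*m*n*k floating point operation count. The names dgemm_reference and dgemm_flops are made up for this illustration. A real BLAS such as BLIS does the same arithmetic with blocked, vectorized, multi-threaded kernels, so treat this only as a statement of the math, not of how the library is implemented.

```python
def dgemm_reference(alpha, a, b, beta, c):
    """Return alpha*A*B + beta*C for matrices stored as lists of rows.

    Illustrative only: a is m x k, b is k x n, c is m x n.
    """
    m, k, n = len(a), len(b), len(b[0])
    result = []
    for i in range(m):
        row = []
        for j in range(n):
            dot = 0.0
            for p in range(k):
                dot += a[i][p] * b[p][j]  # one multiply and one add
            row.append(alpha * dot + beta * c[i][j])
        result.append(row)
    return result


def dgemm_flops(m, n, k):
    """Conventional operation count for the matrix product: 2*m*n*k."""
    return 2 * m * n * k


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    c = [[1.0, 1.0], [1.0, 1.0]]
    print(dgemm_reference(1.0, a, b, 1.0, c))
    print(dgemm_flops(2, 2, 2))
```

The cubic growth of that operation count is why a fast DGEMM dominates the Linpack result.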

Note: BLAS is fundamental for numerical computing. NVIDIA has cuBLAS for use with CUDA on their GPUs. It's highly optimized and a significant factor in the "stunningly good" compute performance possible on their GPUs. Many of the Top500 supercomputers get the bulk of their performance from (lots of) NVIDIA GPUs.

A look at the Threadripper architecture

It will help to understand the testing strategy if we take a quick look at how the Threadripper processor is laid out. The following image was graciously provided by Dr Ian Cutress over at AnandTech. Be sure to read his thorough and informative review, The AMD Threadripper 2990WX 32-Core and 2950X 16-Core Review. There is a lot of good information in there!

TR layout

I really like this image for how well it shows the relation between the core clusters and the L3 cache layout. This immediately suggests a strategy for running Linpack. It looks like getting instructions and data mapped into the L3 caches and then using 4 threads for each of those should be pretty efficient. We'll use MPI processes to map to the caches and then OpenMP threads for each of those. That's 8 MPI "ranks" with 4 OpenMP threads each. This indeed gave the best parallel Linpack performance that I was able to achieve.
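
As a rough picture of that layout, the following sketch assigns one MPI rank per L3 cache and lists the cores each rank's threads would occupy. The function name rank_to_cores is hypothetical, and the contiguous core numbering is an assumption made for illustration; the real placement is decided by Open MPI and hwloc at run time.

```python
def rank_to_cores(total_cores, l3_caches):
    """Map each MPI rank to the block of cores behind one L3 cache.

    Assumption: cores that share an L3 cache are numbered contiguously.
    """
    if total_cores % l3_caches != 0:
        raise ValueError("cores must divide evenly among the L3 caches")
    threads_per_rank = total_cores // l3_caches
    return {
        rank: list(range(rank * threads_per_rank, (rank + 1) * threads_per_rank))
        for rank in range(l3_caches)
    }


if __name__ == "__main__":
    # 2990WX: 32 cores behind 8 L3 caches gives 8 ranks with 4 threads each.
    for rank, cores in rank_to_cores(32, 8).items():
        print("rank", rank, "-> cores", cores)
```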


The Threadripper Test Platform

The system I used was a test-bed build with the following main components,

Hardware

  • AMD Ryzen Threadripper 2990WX 32-Core @ 3.00GHz
  • Gigabyte X399 AORUS XTREME-CF Motherboard
  • 128GB DDR4 2666 MHz memory
  • Samsung 970 PRO 512GB M.2 SSD
  • NVIDIA Titan V GPU (Yes, I put that in there for my display during the testing because I had it available. That's definite "over-kill" and, no, I didn't test Linpack on it ... should probably do that sometime...)

Software

Step by Step guide to building and running the HPL Linpack benchmark on AMD Threadripper

The basic outline for the build and testing (to repeat what I have done) is,

  • Install Ubuntu 18.04 and build dependencies
  • Compile and install Open MPI
  • Download and setup AMD BLIS libraries
  • Download and compile HPL
  • Setup job configuration files and run the benchmark

I did a lot of experimenting (and failing) when trying to get all of this stuff working. I had installs in various places. For this How-To I'm going to try to keep everything in a single directory in the users home directory to make this as portable as possible and to make it easier to try on an already installed system without making a mess.

Let's do it!


Step 1) -- Ubuntu 18.04 install and dependencies

I did my testing on a fresh Ubuntu 18.04.1 install. I'm not going to go through the details of the install. It went without any difficulty on the test hardware. I used my "normal" method for the install as outlined in The Best Way To Install Ubuntu 18.04 with NVIDIA Drivers and any Desktop Flavor.

There were some extra packages I installed, (including emacs of course).

Needed dependencies and extra packages

sudo apt-get install build-essential emacs
sudo apt-get install hwloc libhwloc-dev libevent-dev
sudo apt-get install autoconf automake gfortran

The second line above contains important utilities to enable MPI to localize processes to hardware locations.
"The Portable Hardware Locality (hwloc) software package provides a portable abstraction (across OS, versions, architectures, ...) of the hierarchical topology of modern architectures, including NUMA memory nodes, sockets, shared caches, cores and simultaneous multithreading." That's how we get processes mapped to the L3 caches on the Threadripper core clusters.

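If you want to see which logical CPUs share each L3 cache before mapping ranks to them, Linux exposes the cache topology under /sys/devices/system/cpu. The sketch below reads it directly; the function name l3_groups is made up for this example, and it assumes the standard sysfs layout with "level" and "shared_cpu_list" files in each cache directory. The lstopo tool from the hwloc package reports the same information in more detail.

```python
import glob
import os


def l3_groups(sysfs_cpu_root="/sys/devices/system/cpu"):
    """Return the distinct sets of logical CPUs that share a level-3 cache.

    Each entry is the kernel's own range string, e.g. "0-3,32-35".
    Returns an empty list if the sysfs files are not present.
    """
    groups = set()
    pattern = os.path.join(sysfs_cpu_root, "cpu[0-9]*", "cache", "index*")
    for cache_dir in glob.glob(pattern):
        level_file = os.path.join(cache_dir, "level")
        list_file = os.path.join(cache_dir, "shared_cpu_list")
        if not (os.path.isfile(level_file) and os.path.isfile(list_file)):
            continue
        with open(level_file) as handle:
            if handle.read().strip() != "3":
                continue  # skip L1 and L2 entries
        with open(list_file) as handle:
            groups.add(handle.read().strip())
    return sorted(groups)  # sorted as text, not numerically


if __name__ == "__main__":
    for group in l3_groups():
        print("L3 cache shared by logical CPUs:", group)
```
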
Directory for the software build and install

I'll try to keep things tidy in this post (my build was a bit messy because I tried lots of different approaches). We will have to set environment variables pointing to the right places in the main install directory.

I created a build directory for the downloaded source files,

mkdir -p ~/projects/hpl-build

and, a directory target for the installs,

mkdir ~/AMD-HPL

Step 2) Build and install Open MPI

This is something you are probably not used to unless you have worked with Linux clusters. MPI means "Message Passing Interface". This is a fundamental tool for modern Supercomputing (distributed parallel cluster computing). It provides a communication library with message passing functions that can be included in parallel code to allow execution across separate compute nodes (servers) on a network fabric. It also works well across individual cores on multi-core processors and can often provide a, perhaps surprisingly, good alternative to multithreading. It can also be used in a hybrid way together with threads to take advantage of the hardware architecture. That is what we will be doing, MPI + OpenMP threads.

Note: I tried the "just released" version 4.0 of Open MPI. That built and tested OK but failed to work with the AMD BLIS library. Version 3.1.3 is also a current version and was just updated on Oct 20, 2018.

Expand the tar file

tar xf openmpi-3.1.3.tar.gz
cd openmpi-3.1.3/

Configure -- Make -- Install

CFLAGS="-Ofast -march=native" ./configure --prefix=$HOME/AMD-HPL/openmpi-3.1
make -j 32
make install

There will be a lot of messages going by while you do this. There are LOTS of options for MPI builds. Keep in mind that it is one of the main ways that programs run in parallel on large supercomputer clusters. Our configuration is just a simple setup for running HPL Linpack installed in your home directory.

Create an MPI environment script

Since we just installed MPI in a directory that is not on any system paths it will be convenient to make a script that will set a couple of environment variables so we can use it.

Create the file mpi-env.sh in AMD-HPL,

export MPI_HOME=$HOME/AMD-HPL/openmpi-3.1
export PATH=$PATH:$MPI_HOME/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$MPI_HOME/lib

Do a simple MPI test

"Source" the environment script that you created,

source ~/AMD-HPL/mpi-env.sh

Now, create the following program simple-mpi-test.c,

#include <mpi.h>    /* PROVIDES THE BASIC MPI DEFINITION AND TYPES */
#include <stdio.h>

int main(int argc, char **argv) {

  MPI_Init(&argc, &argv); /*START MPI */

  // Get the rank of the process
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  printf("Hello world from MPI RANK %d\n", rank);

  MPI_Finalize();  /* EXIT MPI */
}

Compile that program,

mpicc -o simple-mpi-test simple-mpi-test.c

and run it,

mpirun -np 4 ./simple-mpi-test

Hello world from MPI RANK 0
Hello world from MPI RANK 2
Hello world from MPI RANK 3
Hello world from MPI RANK 1

You can set the number of processes up to the number of cores you have available. On this Threadripper CPU we are testing that could be -np 32.

Note: MPI processes don't necessarily return in order.

Congratulations, you just coded up and ran a parallel program on your system!


Step 3) Setup the AMD BLIS library

To get good performance on AMD Threadripper you will need a "BLAS" library that is optimized for the hardware. For AMD processors based on the Zen core the best library will likely be their custom library called "BLIS".

  • Go to https://developer.amd.com/amd-cpu-libraries/blas-library/ and get the file AMD-BLIS-MT-Linux-1.2.tar.gz. Put that file in the projects/hpl-build directory. This is AMD's binary build of the library. The source is available too, and you could compile it yourself but you really don't need to do that.

Notes: That is the multi-threaded library that we will use in a hybrid approach with OpenMP threads together with MPI processes (ranks). I found this to give a very good result. There is also a single-threaded version of the library available that works well using only MPI processes for parallelism. It will give a very good result but not quite as good as our hybrid approach. There is also a tempting file available called install_run_hpl-blis.tar.gz. I tried using that but had difficulty with it. I was able to use it for a single-threaded build and run, but it failed with a multi-threaded configuration.

Expand the tar file

tar xf AMD-BLIS-MT-Linux-1.2.tar.gz

Install the BLIS library

This is simple since we are using AMD's pre-compiled binaries. Just move the extracted directory to the AMD-HPL directory and rename it.

mv amd-blis-mt-1.2 ~/AMD-HPL/blis-mt

When we build the hpl benchmark it will need to be linked to the libraries in $HOME/AMD-HPL/blis-mt/lib.


Step 4) Compile the HPL benchmark

Now on to the actual benchmark code.

Expand the tar file

tar xf hpl-2.2.tar.gz
cd hpl-2.2/

Create a Makefile for Arch=Linux_AMD_BLIS

You need to have a "make" file to setup paths for the libraries we want to link into the hpl executable. I will give you a complete Make.Linux_AMD_BLIS that you can copy into a file with that name. It is a modification of one of the architecture make files included with the hpl source.

I only made a few modifications in this file to set the paths for the location where we have installed everything. I've included the entire file for your convenience, sorry it is a little long.

Put the following text in the file in projects/hpl-build/hpl-2.2/Make.Linux_AMD_BLIS

#  
#  -- High Performance Computing Linpack Benchmark (HPL)                
#     HPL - 2.2 - February 24, 2016                          
#     Antoine P. Petitet                                                
#     University of Tennessee, Knoxville                                
#     Innovative Computing Laboratory                                 
#     (C) Copyright 2000-2008 All Rights Reserved                       
#                                                                       
#  -- Copyright notice and Licensing terms:                             
#                                                                       
#  Redistribution  and  use in  source and binary forms, with or without
#  modification, are  permitted provided  that the following  conditions
#  are met:                                                             
#                                                                       
#  1. Redistributions  of  source  code  must retain the above copyright
#  notice, this list of conditions and the following disclaimer.        
#                                                                       
#  2. Redistributions in binary form must reproduce  the above copyright
#  notice, this list of conditions,  and the following disclaimer in the
#  documentation and/or other materials provided with the distribution.
#                                                                       
#  3. All  advertising  materials  mentioning  features  or  use of this
#  software must display the following acknowledgement:                 
#  This  product  includes  software  developed  at  the  University  of
#  Tennessee, Knoxville, Innovative Computing Laboratory.             
#                                                                       
#  4. The name of the  University,  the name of the  Laboratory,  or the
#  names  of  its  contributors  may  not  be used to endorse or promote
#  products  derived   from   this  software  without  specific  written
#  permission.                                                          
#                                                                       
#  -- Disclaimer:                                                       
#                                                                       
#  THIS  SOFTWARE  IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
#  ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES,  INCLUDING,  BUT NOT
#  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
#  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE UNIVERSITY
#  OR  CONTRIBUTORS  BE  LIABLE FOR ANY  DIRECT,  INDIRECT,  INCIDENTAL,
#  SPECIAL,  EXEMPLARY,  OR  CONSEQUENTIAL DAMAGES  (INCLUDING,  BUT NOT
#  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
#  DATA OR PROFITS; OR BUSINESS INTERRUPTION)  HOWEVER CAUSED AND ON ANY
#  THEORY OF LIABILITY, WHETHER IN CONTRACT,  STRICT LIABILITY,  OR TORT
#  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
#  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# ######################################################################
#  
# ----------------------------------------------------------------------
# - shell --------------------------------------------------------------
# ----------------------------------------------------------------------
#
SHELL        = /bin/sh
#
CD           = cd
CP           = cp
LN_S         = ln -s
MKDIR        = mkdir
RM           = /bin/rm -f
TOUCH        = touch
#
# ----------------------------------------------------------------------
# - Platform identifier ------------------------------------------------
# ----------------------------------------------------------------------
#
ARCH         = Linux_AMD_BLIS
#
# ----------------------------------------------------------------------
# - HPL Directory Structure / HPL library ------------------------------
# ----------------------------------------------------------------------
#
TOPdir       = $(HOME)/projects/hpl-build/hpl-2.2
INCdir       = $(TOPdir)/include
BINdir       = $(TOPdir)/bin/$(ARCH)
LIBdir       = $(TOPdir)/lib/$(ARCH)
#
HPLlib       = $(LIBdir)/libhpl.a
#
# ----------------------------------------------------------------------
# - MPI directories - library ------------------------------------------
# ----------------------------------------------------------------------
# MPinc tells the  C  compiler where to find the Message Passing library
# header files,  MPlib  is defined  to be the name of  the library to be
# used. The variable MPdir is only used for defining MPinc and MPlib.
#
MPdir        = $(HOME)/AMD-HPL/openmpi-3.1
MPinc        = -I$(MPdir)/include
MPlib        = $(MPdir)/lib/libmpi.so
#
# ----------------------------------------------------------------------
# - Linear Algebra library (BLAS or VSIPL) -----------------------------
# ----------------------------------------------------------------------
# LAinc tells the  C  compiler where to find the Linear Algebra  library
# header files,  LAlib  is defined  to be the name of  the library to be
# used. The variable LAdir is only used for defining LAinc and LAlib.
#
LAdir        = $(HOME)/AMD-HPL/blis-mt
LAinc        =
LAlib        = $(LAdir)/lib/libblis-mt.a
#
# ----------------------------------------------------------------------
# - F77 / C interface --------------------------------------------------
# ----------------------------------------------------------------------
# You can skip this section  if and only if  you are not planning to use
# a  BLAS  library featuring a Fortran 77 interface.  Otherwise,  it  is
# necessary  to  fill out the  F2CDEFS  variable  with  the  appropriate
# options.  **One and only one**  option should be chosen in **each** of
# the 3 following categories:
#
# 1) name space (How C calls a Fortran 77 routine)
#
# -DAdd_              : all lower case and a suffixed underscore  (Suns,
#                       Intel, ...),                           [default]
# -DNoChange          : all lower case (IBM RS6000),
# -DUpCase            : all upper case (Cray),
# -DAdd__             : the FORTRAN compiler in use is f2c.
#
# 2) C and Fortran 77 integer mapping
#
# -DF77_INTEGER=int   : Fortran 77 INTEGER is a C int,         [default]
# -DF77_INTEGER=long  : Fortran 77 INTEGER is a C long,
# -DF77_INTEGER=short : Fortran 77 INTEGER is a C short.
#
# 3) Fortran 77 string handling
#
# -DStringSunStyle    : The string address is passed at the string loca-
#                       tion on the stack, and the string length is then
#                       passed as  an  F77_INTEGER  after  all  explicit
#                       stack arguments,                       [default]
# -DStringStructPtr   : The address  of  a  structure  is  passed  by  a
#                       Fortran 77  string,  and the structure is of the
#                       form: struct {char *cp; F77_INTEGER len;},
# -DStringStructVal   : A structure is passed by value for each  Fortran
#                       77 string,  and  the  structure is  of the form:
#                       struct {char *cp; F77_INTEGER len;},
# -DStringCrayStyle   : Special option for  Cray  machines,  which  uses
#                       Cray  fcd  (fortran  character  descriptor)  for
#                       interoperation.
#
F2CDEFS      = -DAdd__ -DF77_INTEGER=int -DStringSunStyle
#
# ----------------------------------------------------------------------
# - HPL includes / libraries / specifics -------------------------------
# ----------------------------------------------------------------------
#
HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) $(LAinc) $(MPinc)
HPL_LIBS     = $(HPLlib) $(LAlib) $(MPlib) -lm
#
# - Compile time options -----------------------------------------------
#
# -DHPL_COPY_L           force the copy of the panel L before bcast;
# -DHPL_CALL_CBLAS       call the cblas interface;
# -DHPL_CALL_VSIPL       call the vsip  library;
# -DHPL_DETAILED_TIMING  enable detailed timers;
#
# By default HPL will:
#    *) not copy L before broadcast,
#    *) call the Fortran 77 BLAS interface
#    *) not display detailed timing information.
#
HPL_OPTS     = -DHPL_CALL_CBLAS
#
# ----------------------------------------------------------------------
#
HPL_DEFS     = $(F2CDEFS) $(HPL_OPTS) $(HPL_INCLUDES)
#
# ----------------------------------------------------------------------
# - Compilers / linkers - Optimization flags ---------------------------
# ----------------------------------------------------------------------
#
CC           = /usr/bin/gcc
CCNOOPT      = $(HPL_DEFS)
CCFLAGS      = $(HPL_DEFS) -std=c99 -march=native -fomit-frame-pointer -O3 -funroll-loops -W -Wall -fopenmp
#
LINKER       = /usr/bin/gcc
LINKFLAGS    = $(CCFLAGS)
#
ARCHIVER     = ar
ARFLAGS      = r
RANLIB       = echo
#
# ----------------------------------------------------------------------

With that file in place do the build,

make arch=Linux_AMD_BLIS

That should create the xhpl executable and an example HPL.dat file in bin/Linux_AMD_BLIS/.

Now copy those files into the install directory,

mkdir ~/AMD-HPL/hpl
cd bin/Linux_AMD_BLIS/
cp -a xhpl HPL.dat  ~/AMD-HPL/hpl/

The directory AMD-HPL now contains Open MPI 3.1, the optimized AMD BLIS libraries, and the HPL Linpack executable. We just need to set up the HPL.dat file and start benchmarking!


Step 5) Setup the HPL.dat configuration file for benchmarking

The HPL.dat file contains a bunch of "tuning" and configuration parameters for setting up job runs. If you were benchmarking a supercomputer cluster you might spend weeks messing with this. I'm going to give you one (reasonably good) example and explain a little about the most important parameters. There is a good discussion of the parameters and how to set up a batched grid to sweep through some of them here: http://www.netlib.org/benchmark/hpl/tuning.html

I did several experiments but had the best result from the hybrid scheme that was described earlier in the post. I used 1 MPI process (rank) for each L3 cache, and 4 OpenMP threads for each of those processes (1 for each Zen core being fed by the cache). Here's the configuration file HPL.dat.

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
84480	     Ns
1            # of NBs
232	     NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
2            Ps
4            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
2            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
2            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
1            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)

The most important parameters are N, P and Q, and to a lesser extent NB.

  • N -- The size of the input matrix (i.e. the number of linear equations being solved). This number usually gives the best results when it is set large enough to use up 80-90% of the available memory. (I tell you how to compute a good N later.)
  • NB -- The block size for the algorithm. Common values range from 96 to 256 in steps of 8. Good choices to start with are usually 232 or 240.
  • P and Q -- It is important that P x Q = the number of MPI processes being used. This is the process grid size. It's usually best to have P < Q. I was using 8 MPI ranks so I had P x Q = 8. I could have used (1 8) or (2 4).
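
To list the process grids that are allowed for a given number of MPI ranks, a few lines of Python are enough. This is a small sketch with a made-up function name (process_grids); it only applies the two rules stated above, P x Q equal to the rank count and P no larger than Q.

```python
def process_grids(ranks):
    """Return every (P, Q) with P * Q == ranks and P <= Q."""
    grids = []
    p = 1
    while p * p <= ranks:
        if ranks % p == 0:
            grids.append((p, ranks // p))
        p += 1
    return grids


if __name__ == "__main__":
    print(process_grids(8))
```

For 8 ranks this gives the two choices mentioned above, (1, 8) and (2, 4).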

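For a good value of N you want the N x N matrix of 8-byte double precision numbers to take up some large percentage of your total memory, with N a multiple of the block size NB. With 128GB the largest matrix that fits has N = sqrt(128 * 1024^3 / 8) = 128 * 1024 = 131072. Here's a little python one-liner to compute N with NB = 232 at 86% of that maximum. Since memory use grows as N squared, that fills about 74% of the 128GB. (The 128 * 1024 shortcut only works for 128GB.)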

python -c 'print( int( (128 * 1024 * 0.86 // 232) * 232 ) )'
112520
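
The one-liner leans on the fact that, for 128GB, the largest N that fits is exactly 128 * 1024. For other memory sizes it is safer to work from bytes. The sketch below does that, assuming the memory size is given in GiB and that the matrix is the only large allocation; the function names hpl_problem_size and matrix_gib are invented for this example.

```python
import math


def hpl_problem_size(mem_gib, fraction, nb):
    """Largest multiple of nb whose N x N matrix of 8-byte doubles
    fits in the given fraction of memory."""
    budget_bytes = fraction * mem_gib * 1024**3
    n_max = math.isqrt(int(budget_bytes // 8))
    return (n_max // nb) * nb


def matrix_gib(n):
    """Memory taken by an N x N matrix of doubles, in GiB."""
    return n * n * 8 / 1024**3


if __name__ == "__main__":
    for mem in (16, 32, 64, 128, 256):
        n = hpl_problem_size(mem, 0.90, 232)
        print(mem, "GB -> N =", n, "using", round(matrix_gib(n), 1), "GiB")
```

With this formula, matrix_gib(84480) shows that the N used in the HPL.dat listing above occupies only about 53GB of the 128GB.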

Note that I used a "bad" value for N in the HPL.dat file that I listed above! I had forgotten to change it after an experiment but it ended up giving a good result so I kept it! (I could have used another day or two for testing.)


Step 6) Run the benchmark!

From the ~/AMD-HPL/hpl directory with the xhpl executable and the HPL.dat file available, the following should start up a Linpack job run for you. Note: Use a value of N that makes sense for the amount of memory that you have in your system! (I have some suggestions near the end of the post.) Be warned that it will take several minutes for the job to run.

First source the MPI environment in the shell that you are running in,

source ~/AMD-HPL/mpi-env.sh

The following could be put in a script and run (or you can just enter it at the command line).

export OMP_PROC_BIND=TRUE
export OMP_PLACES=cores
export OMP_NUM_THREADS=4
mpirun -np 8 --map-by l3cache --mca btl self,vader xhpl

  • OMP_PROC_BIND=TRUE keeps each OpenMP thread bound to the place it starts on. It is useful if you have not disabled SMT (i.e. Hyper-Threading) in the BIOS.
  • OMP_PLACES=cores binds the OpenMP threads to Zen cores.
  • OMP_NUM_THREADS=4 sets the number of threads to use for each executable process, i.e. each MPI rank.
  • mpirun -np 8 starts MPI with 8 ranks (there will be a total CPU load of 32 threads, 8 MPI ranks with 4 OpenMP threads each)
  • --map-by l3cache tells Open MPI to map its ranks to hardware L3 caches
  • --mca btl self,vader limits Open MPI to its loopback ("self") and shared-memory ("vader") transports, which is all a single machine needs
  • xhpl is the executable that will run. It reads HPL.dat for input
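
If you would rather drive the run from a script, the same settings can be assembled in Python. This is only a sketch: the helper name hpl_launch is hypothetical, it simply mirrors the environment variables and mpirun arguments listed above, and the actual launch is left commented out so that nothing runs by accident.

```python
import os
import shlex


def hpl_launch(ranks, threads_per_rank, executable="xhpl"):
    """Return (env, argv) mirroring the shell commands above."""
    env = dict(os.environ)
    env.update({
        "OMP_PROC_BIND": "TRUE",
        "OMP_PLACES": "cores",
        "OMP_NUM_THREADS": str(threads_per_rank),
    })
    argv = [
        "mpirun", "-np", str(ranks),
        "--map-by", "l3cache",
        "--mca", "btl", "self,vader",
        executable,
    ]
    return env, argv


if __name__ == "__main__":
    env, argv = hpl_launch(ranks=8, threads_per_rank=4)
    print(shlex.join(argv))
    # To really start the benchmark (from ~/AMD-HPL/hpl, after sourcing
    # the MPI environment script), uncomment the next two lines:
    # import subprocess
    # subprocess.run(argv, env=env, check=True)
```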

While running that job this is the output from top on the Threadripper 2990WX system I was using.

top output

That shows the 8 MPI processes and each of them is using 400% cpu (it was changing while I grabbed the screen shot).

The output from that run is,

AMD Threadripper 2990WX HPL Linpack result -- 596.5 GFLOP/s

mpirun -np 8 --map-by l3cache --mca btl self,vader xhpl
================================================================================
HPLinpack 2.2  --  High-Performance Linpack benchmark  --   February 24, 2016
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   84480
NB     :     232
PMAP   : Row-major process mapping
P      :       2
Q      :       4
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Right
BCAST  :   2ring
DEPTH  :       1
SWAP   : Spread-roll (long)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR12R2R4       84480   232     2     4             673.87              5.965e+02
HPL_pdgesv() start time Mon Nov 19 15:20:17 2018

HPL_pdgesv() end time   Mon Nov 19 15:31:31 2018

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0028942 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================
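
The Gflops figure in that output can be cross-checked from N and the run time. HPL counts 2/3 N^3 + 3/2 N^2 floating point operations for the solve, so dividing that by the wall time should reproduce the reported rate. The sketch below parses the result line and redoes the arithmetic; the function names and the regular expression are my own and assume the column layout shown in the listing above.

```python
import re

# Matches a result line such as:
# WR12R2R4       84480   232     2     4             673.87              5.965e+02
RESULT_LINE = re.compile(
    r"^(W[RC]\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+([\d.]+)\s+([\d.eE+-]+)$"
)


def parse_hpl_result(line):
    """Return the fields of an HPL result line as a dict, or None."""
    match = RESULT_LINE.match(line.strip())
    if match is None:
        return None
    variant, n, nb, p, q, seconds, gflops = match.groups()
    return {
        "variant": variant, "N": int(n), "NB": int(nb),
        "P": int(p), "Q": int(q),
        "time": float(seconds), "gflops": float(gflops),
    }


def hpl_gflops(n, seconds):
    """Operation count 2/3 N^3 + 3/2 N^2 divided by the wall time."""
    flops = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return flops / seconds / 1e9


if __name__ == "__main__":
    line = "WR12R2R4       84480   232     2     4             673.87              5.965e+02"
    result = parse_hpl_result(line)
    print(result["gflops"], round(hpl_gflops(result["N"], result["time"]), 1))
```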

Comments, thoughts and recommendations

There you have it! 597 GFLOP/s for Threadripper 2990WX. That is the best result that I have seen reported for this processor. I have no doubt that it could be improved! My guess is that it will top out around 640 GFLOP/s. I spent most of the time I had with the processor working on trying to get a good setup to do the benchmarking. I actually spent very little time on tuning the benchmark runs. If you want to do your own benchmark runs then, hopefully, this guide will save you most of that time I spent trying to work out a good setup.

Ideas for testing other 2nd Gen Threadripper processors (2990WX, 2970WX, 2950X, 2920X)

I'll give you a few hints on starting values for testing other processors in this Zen+ family. These are based on my hybrid scheme with MPI ranks mapped to L3 caches and OpenMP threads under each rank.

General:

  • NB = 232
  • about 90% of memory for various amounts of total memory (N x N matrix of 8-byte doubles, N rounded down to a multiple of NB)
    • 16GB -- N = 43848
    • 32GB -- N = 61944
    • 64GB -- N = 87696
    • 128GB -- N = 124120
    • 256GB -- N = 175624
  • export OMP_PROC_BIND=TRUE (assuming SMT is enabled in BIOS)
  • export OMP_PLACES=cores

Specific:

  • 2990WX [32-cores, 8 L3 caches]
    • (P,Q) -- (1,8) or (2,4)
    • export OMP_NUM_THREADS=4
    • mpirun -np 8 --map-by l3cache --mca btl self,vader xhpl
  • 2970WX [24-cores, 8 L3 caches]
    • (P,Q) -- (1,8) or (2,4)
    • export OMP_NUM_THREADS=3
    • mpirun -np 8 --map-by l3cache --mca btl self,vader xhpl
  • 2950X [16-cores, 4 L3 caches]
    • (P,Q) -- (1,4) or (2,2)
    • export OMP_NUM_THREADS=4
    • mpirun -np 4 --map-by l3cache --mca btl self,vader xhpl
  • 2920X [12-cores, 4 L3 caches]
    • (P,Q) -- (1,4) or (2,2)
    • export OMP_NUM_THREADS=3
    • mpirun -np 4 --map-by l3cache --mca btl self,vader xhpl
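
The pattern behind these settings is one MPI rank per L3 cache, one OpenMP thread per core under that cache, and a process grid where P x Q equals the number of ranks (HPL leaves any ranks beyond P x Q idle). Here is a small sketch that derives the settings from the core and L3 cache counts. The helper names are hypothetical, and the mpirun line is just the one used above with the rank count filled in.

```python
import math


def process_grids(ranks):
    """All (P, Q) pairs with P * Q == ranks and P <= Q."""
    return [(p, ranks // p)
            for p in range(1, math.isqrt(ranks) + 1)
            if ranks % p == 0]


def hybrid_layout(cores, l3_caches):
    """One MPI rank per L3 cache, one OpenMP thread per core in that cache."""
    if cores % l3_caches != 0:
        raise ValueError("cores must divide evenly across the L3 caches")
    return {
        "ranks": l3_caches,
        "omp_num_threads": cores // l3_caches,
        "grids": process_grids(l3_caches),
        "command": f"mpirun -np {l3_caches} --map-by l3cache "
                   "--mca btl self,vader xhpl",
    }


if __name__ == "__main__":
    # (cores, L3 caches) as listed above
    for name, cores, caches in [("2990WX", 32, 8), ("2970WX", 24, 8),
                                ("2950X", 16, 4), ("2920X", 12, 4)]:
        print(name, hybrid_layout(cores, caches))
```

A quick sanity check for any configuration is that ranks x OMP_NUM_THREADS gives the core count and that the (P,Q) you put in HPL.dat multiplies out to the -np value.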

OK, maybe I should spend a day or 2 at Puget Labs swapping out processors and do this!

Other tuning notes

Two things that may improve results that I didn't try (both are recommended by AMD):

  • Disable SMT in the BIOS. The OMP environment I gave you should do the right thing even if SMT is enabled but it might improve results to turn SMT off. This is the kind of heavy CPU load that completely maxes out "real" cores. "Hyper-threads" usually slow things down for this.
  • Lower the memory clock to 2400MHz. Yes, you read that right! AMD says that this will make more power available to the processor cores, which helps maintain higher boost clocks and improves performance.

How does AMD Threadripper 2990WX compare to Intel "X-Series" for HPL Linpack?

Well, not so good! Without any special effort I can run the Linpack benchmark executable included with Intel MKL and get 838 GFLOP/s on my Xeon-W 2175 14-core machine. The Core i9-7940X will do about the same and the new Core i9-9980XE 18-core should do much better (but I haven't tested it yet).

Here's why the Intel "X-series" and Xeon -W or -SP are so much faster for this kind of intense compute workload.

  • AMD Threadripper 2990WX -- 32-cores, each core has 1 AVX2 (256bit) vector unit == 597 GFLOPS
  • Intel Xeon-W 2175 -- 14-cores, each core has 2 AVX512 (512bit) vector units == 838 GFLOPS

For programs that are well optimized with good vectorization that use hardware specific optimized BLAS libraries the CPU vector units make a BIG difference! The Intel AVX512 vector units are really impressive!
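
To put rough numbers on that, here is a sketch of the per-core arithmetic using the vector unit counts listed above. It assumes every vector unit can issue one fused multiply-add (2 floating point operations) per 64-bit lane per cycle. The clock speed under heavy vector load is left as a parameter because I am not asserting a value for either chip, and the function names are just for illustration.

```python
def dp_flops_per_cycle(vector_units, vector_bits, fma=True):
    """Double precision FLOPs per core per cycle.
    Assumes each unit handles vector_bits / 64 doubles per cycle and
    that an FMA counts as 2 operations."""
    lanes = vector_bits // 64
    return vector_units * lanes * (2 if fma else 1)


def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak; clock_ghz is whatever clock the chip sustains
    under this load (an input you have to supply, not a spec value)."""
    return cores * clock_ghz * flops_per_cycle


if __name__ == "__main__":
    tr = dp_flops_per_cycle(vector_units=1, vector_bits=256)
    xeon = dp_flops_per_cycle(vector_units=2, vector_bits=512)
    print("2990WX FLOPs/cycle/core:", tr)
    print("Xeon-W 2175 FLOPs/cycle/core:", xeon)
    # Measured results from this post, divided by core count
    print("2990WX GFLOPS per core:", round(597 / 32, 1))
    print("Xeon-W 2175 GFLOPS per core:", round(838 / 14, 1))
```

Under these assumptions each Intel core can do 4 times the floating point work per cycle, which is how 14 cores end up ahead of 32 on this benchmark.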

Does any of this matter?

Yes, and No. Like I said above, it matters if you have well optimized (vectorized) code. Being able to link into Intel MKL and use their highly optimized BLAS library on modern Intel hardware is about as good as it gets for floating point performance on CPU. However, there are lots of programs that don't do that! Even programs that make calls to Intel MKL may not benefit if the code itself is not well optimized or is just inherently non-vectorizable. Programs that have good parallel scaling across cores (not necessarily vectorized) will likely do well on Threadripper's many cores. In fact the 2990WX reminds me a lot of a quad-Opteron of a few years back. Programs that would run well on small clusters or multi-socket single board systems should do well on Threadripper. I did a little testing with a couple of examples that I thought would do well on the 2990WX and they did. Compiling a large C project and a multi-core CPU run of a molecular dynamics program both did really well. I will probably write up another post with some other testing like that ... maybe.

What about using GPU's and forgetting about CPU's?

There are times when you can't use GPU's for compute. If you need large amounts of memory, GPU's can be very difficult to work with. However, if you can use GPU accelerated code then that is almost surely going to give much better performance than running on CPU's. My personal favorite compute device is NVIDIA's Titan V. That card is several times faster than my Xeon-W even when using fp64 (double precision). If you can use single precision, fp32, then the new NVIDIA RTX 2080Ti is great. There are more and more programs that use GPU acceleration and in fact over 90% of the performance of the fastest supercomputer in the world right now (Summit at Oak Ridge) comes from NVIDIA GPU's!

My recommendation

For me personally I prefer Intel CPU's with AVX512 and 1 or 2 good NVIDIA GPU's. However, the Threadripper series processors are really interesting. AMD did a nice job with the design and Intel has good reason to be concerned about it. In fact it looks like Intel might even try their hand at a multi-core-cluster design. There are places where the AMD processors are going to give better performance. We are already recommending them for some video post production applications in certain use cases. I would also recommend them for codes that just need lots of CPU cores. An application like a software build platform would be a good use case. Most software is not highly optimized code built on matrix-vector algebra, and virtually ALL of the modern processors are amazing from an historical perspective. I would certainly not discourage anyone from trying the new AMD processors. I really enjoyed messing around with the 2990WX! I felt like I had a small Linux cluster on a chip and it was fun thinking about it that way.

... but my general recommendation for a personal Workstation CPU is still Intel ("X-series" or Xeon). It's just a safer bet for a wide range of applications. Having all of the cores and cache on a single chip with no extra "transport layer" memory hops is less likely to cause problems for applications that are memory bound or communication bound to PCIe bus devices like GPU's. I don't even like to recommend multi-socket workstations any more because of the potential for unexpected memory layer problems.

I'm looking forward to getting back to working with PyTorch and the Titan V for scientific computing! I think that combination (or something like it) is going to change the future of how programs are written for scientific computing.

Happy computing --dbk

Tags: Threadripper, Ryzen, 2990WX, Linpack, HPL, HPC, Linux
lemans24

New Titan RTX coming out: full TU102 chip, 24GB VRAM for $2500
What do you guys think??

Posted on 2018-12-03 19:02:57

Looks to be a great card, if you only need one (or maybe two) GPUs in your system. They kept the dual-fan cooling design from the GeForce RTX 2000-series Founders Edition cards, which means they will *not* work well when stacked next to each other under load:

https://www.pugetsystems.co...

It will also be interesting to see if there are any other major differences between the Titan RTX and the Quadro RTX 6000, which otherwise has very similar specs. The Quadro does have ECC memory, and a single-fan cooler (making it much better for multi-GPU configurations)... but it costs 2.5 times what the Titan does :/

Posted on 2018-12-03 19:36:01
Donald Kinghorn

Hey this is supposed to be about Threadripper here :-) ... but yea, it's kind of a strong point that GPU's are wonderful for compute when you can use them. I'm really looking forward to getting the new RTX Titan under test. As William pointed out it looks similar to Quadro RTX 6000 for specs. My biggest question is about fp64 performance (because I still need it) ??? but, even if the double precision is not great it won't matter for lots of applications, so, 24GB mem and $2500 price tag is pretty welcome!

Posted on 2018-12-03 20:19:40
Donald Kinghorn

just read this note in the TU102 white paper ...
"
Note: The TU102 GPU also features 144 FP64 units (two per SM), which are not depicted in this diagram.
The FP64 TFLOP rate is 1/32nd the TFLOP rate of FP32 operations. The small number of FP64 hardware
units are included to ensure any programs with FP64 code operates correctly.
"
bummer! but it is what I expected...

Posted on 2018-12-03 20:32:10
lemans24

Sorry Don...did not mean to hijack thread but I guess I just like to comment on all of your HPC articles and maximizing GPU performance!!! Excellent article though as always...
Kind of disappointed with dual fan instead of blower style as Titan RTX multi-gpu setup does not seem in the cards for now. Probably will be aftermarket water cooling though...

Posted on 2018-12-03 21:11:39
Donald Kinghorn

:-) that is perfectly fine ... the section "What about using GPU's and forgetting about CPU's?" is mostly how I'm thinking these days. Anyone reading this should probably be thinking about GPU's ... if they can take advantage of them! [I'll get some testing posted on the RTX Titan as soon as I get my hands on one]

I thought of a use that may be really nice for the 2990WX together with GPU's. I did a little testing with NAMD (molecular dynamics program) while I had the processor. The Threadripper did really well with that code on CPU. The thing is NAMD has good GPU acceleration too. One of the limiting factors in performance with NAMD is getting enough CPU performance (balance) to keep up with the GPU's. I'm thinking a 2990WX together with 2-4 2080Ti's could be great for that code. I should test that! I'll put up another short post on the 2990WX this week with some of the other job runs that I did other than Linpack and if I can get my hands on the processor and a few 2080Ti's I'll do a full NAMD performance test a bit later.

Posted on 2018-12-03 23:37:51