Read this article at https://www.pugetsystems.com/guides/1617
Dr Donald Kinghorn (Scientific Computing Advisor)

AMD Ryzen 3950x Compute Performance Linpack and NAMD

Written on November 14, 2019 by Dr Donald Kinghorn

Introduction

The much anticipated AMD Ryzen 3950x 16-core processor is out! As always, the first thing I wanted to know was the double precision floating point performance. My two favorite applications for a "first look" at a new CPU are Linpack and NAMD.

At the end of July I did a little testing with the Ryzen 3900x 12-core: "AMD 3900X (Brief) Compute Performance Linpack and NAMD". I was pretty impressed with the performance but wished there was a more optimal BLAS library for the Zen2 architecture. There is now a newer version 2.0 of the AMD "BLIS" library, and it gives significantly better performance with Linpack than the v1.3 used in the older post.

This post revisits the July post and adds new results for the Ryzen 3950x, along with updated results for the 3900x using BLIS v2.0. I'm also including NAMD molecular dynamics results for my usual test molecule, STMV, as well as a smaller molecular system, ApoA1. ApoA1 seems to be a popular system for CPU benchmarking with NAMD.

System Configuration

Hardware:

  • AMD Ryzen 3900x and 3950x
  • Motherboard Gigabyte X570 AORUS ULTRA
  • Memory 4x DDR4-2933 16GB (64GB total)
  • 2TB Intel 660p NVMe M.2
  • NVIDIA RTX 2080Ti GPU

Software:

  • Ubuntu 19.10 (kernel 5.3)
  • AMD pre-built multi-threaded HPL Linpack binary
  • AMD BLIS v2.0 (AMD's BLAS library)
  • NAMD (CPU build, plus a CUDA-enabled build for the GPU-accelerated ApoA1 run)
  • Intel MKL Linpack binary (for the Intel CPU results)

Notes:

  • I used the most recent Ubuntu 19.10 for this testing in order to have up-to-date system libraries and a kernel (5.3) with better support for Zen2.
  • New results in this post are for Ryzen 3900x and 3950x only. The other results are from previous testing.

Linpack

Notes:

  • The pre-built multi-threaded HPL binary provided by AMD worked well, so I didn't bother rebuilding from source. This is the "MT" build, but it still looks for MPI header files on start-up and uses the HPL.dat file for job run configuration.
  • AMD BLIS (a.k.a. AMD's BLAS library) has been updated to version 2.0 with specific support for Zen2.
  • Several combinations of MPI ranks and OMP threads were tried. The best results were obtained using only OMP threads with the pre-built binary and no MPI: 1 OMP thread per "real" core, i.e. 16 threads for the 3950x and 12 for the 3900x.
  • There is a detailed description of HPL Linpack testing for the Threadripper 2990WX in the post "How to Run an Optimized HPL Linpack Benchmark on AMD Ryzen Threadripper -- 2990WX 32-core Performance". The 2990WX testing and results presented in that post could probably be improved with the new BLIS lib.
  • The Intel CPUs were tested with the (highly) optimized Linpack benchmark program included with the Intel MKL performance library.
  • A large problem size, using approximately 90% of available memory (64GB), was used in order to maximize performance: Ns=90000. (See the sizing sketch just after this list.)
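
If you want to estimate Ns for a different amount of memory, here is a minimal sizing sketch (my own helper, not part of the AMD HPL package). It just solves 8*Ns^2 = fraction*memory for Ns and rounds down to a multiple of the block size NB:

# HPL problem-size sizing sketch: pick Ns so the Ns x Ns double precision
# matrix fills roughly 90% of system memory, rounded down to a multiple of NB.
import math

def hpl_problem_size(mem_gb, mem_fraction=0.90, nb=256):
    mem_bytes = mem_gb * 1e9
    n = math.sqrt(mem_fraction * mem_bytes / 8)  # 8 bytes per double
    return int(n // nb) * nb                     # round down to a multiple of NB

print(hpl_problem_size(64))   # 84736 with these assumptions; Ns=90000 also fits in 64GB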

Here is the HPL.dat file used. (This file automates runs over 3 problem sizes (Ns) and 3 block sizes (NBs); also note that P and Q are set to 1, i.e. 1 MPI rank. Parallelism came from OMP threads.)

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
3            # of problems sizes (N)
40000 80000 90000  Ns
3            # of NBs
240 256 512	 NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
1	         Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
2            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
2            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
1            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)

The following environment variables were set for the Ryzen Linpack runs:

export OMP_PROC_BIND=TRUE
export OMP_PLACES=cores
export OMP_NUM_THREADS=12    # 16 for the 3950x

Now the good part ...

The following plot shows HPL Linpack results (in GFLOPS) for the Ryzen 3950x and 3900x plus a few other CPUs that I have recently tested. The best Ryzen results were with Ns=90000 and NB=256.

[Plot: Ryzen 3950X Linpack performance (GFLOPS)]

The Ryzen 3950x and 3900x results are very good for processors with AVX2! Notice that the use of BLIS v2.0 improved the 3900x result by 15% over v1.3.

The Intel processors with AVX-512 vector units have a big advantage for Linpack. Also, the Linpack used for the Intel processors is built with the BLAS library from Intel's excellent MKL (Math Kernel Library).
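
To put that AVX2 vs AVX-512 gap in rough perspective, here is a back-of-the-envelope theoretical peak calculation. This is my own sketch, not something from the testing above: it assumes two 256-bit FMA units per Zen2 core, two 512-bit FMA units per Skylake-class AVX-512 core, and nominal base clocks rather than the clocks actually sustained under heavy AVX load.

# Theoretical double precision peak: cores * GHz * FLOPs per cycle per core.
# Zen2 (AVX2 + FMA): 2 x 256-bit FMA units -> 16 DP FLOPs/cycle/core.
# AVX-512 with 2 FMA units:                -> 32 DP FLOPs/cycle/core.
def peak_gflops(cores, ghz, flops_per_cycle):
    return cores * ghz * flops_per_cycle

print(peak_gflops(16, 3.5, 16))   # Ryzen 3950x  ~896 GFLOPS at base clock
print(peak_gflops(12, 3.8, 16))   # Ryzen 3900x  ~730 GFLOPS at base clock
print(peak_gflops(14, 2.5, 32))   # Xeon W-2175  ~1120 GFLOPS at base clock (assuming 2 AVX-512 FMA units)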

NAMD

Now on to the real world! ... sort of ... NAMD is one of my favorite programs to use for benchmarking because it has great parallel scaling across cores (and cluster nodes). It does not significantly benefit from linking with the Intel MKL library and it runs on a wide variety of hardware and OS platforms. It's also a very important Molecular Dynamics research program.

When I said "sort of" above, I'm referring to the fact that NAMD also has very good GPU acceleration. Adding CUDA-capable GPUs will increase throughput by an order of magnitude. However, with NAMD and other codes like it, only some of the heavy compute can be offloaded to the GPU. A good CPU is necessary to achieve balanced performance. I like NAMD as a CPU benchmark because I believe it is an excellent representative of scientific applications and reflects the performance characteristics of many other programs in this domain.

This plot shows the performance of a molecular dynamics simulation of the million-atom "stmv" (satellite tobacco mosaic virus) system. These job runs are CPU only. Performance is in "day/ns" (days to compute a nanosecond of simulation time); this is the standard output metric for NAMD. If you prefer ns/day then just take the reciprocal.

[Plot: NAMD STMV performance (day/ns), Ryzen 3950X and other CPUs]
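
If ns/day is the number you care about, the conversion is just the reciprocal; a trivial helper, only to make the unit explicit (the 0.5 day/ns value below is made up for illustration):

def ns_per_day(day_per_ns):
    # NAMD reports "day/ns"; throughput is usually quoted as the reciprocal.
    return 1.0 / day_per_ns

print(ns_per_day(0.5))   # 0.5 day/ns  ->  2.0 ns/day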

The Ryzen CPUs gave excellent performance! NAMD also benefits greatly from the 32 cores of the 2990WX. I'm looking forward to seeing results with the new Threadripper!

This last set of results uses the smaller ApoA1 problem (it's still pretty big at 97,000 atoms!).

I ran this job for two reasons: 1) to show how well the 3950x does compared to the Xeon W-2175 14-core, and 2) to provide a reality check on how much adding a GPU can increase performance for programs that have good GPU acceleration. Adding the NVIDIA 2080Ti GPU increased performance by over a factor of ten!

[Plot: NAMD ApoA1 performance (day/ns), Ryzen 3950x + RTX 2080Ti]

Conclusion

In the earlier post for the Ryzen 3900x I had reservations about the stability of the platform. In the few months that have passed since that testing, motherboard and BIOS issues have settled down and the systems are looking solid. Welcome back AMD!

I will be doing more CPU testing in a few weeks, after all of this year's new processors from Intel and AMD are released. So, expect another post with LOTS of new CPUs in it!

Happy computing! --dbk @dbkinghorn



Tags: AMD, HPL, linpack, NAMD, Ryzen
MagicWax

Nice! I have recently seen some people recommending using MKL on AMD, with the MKL_DEBUG_CPU_TYPE environment variable set to 5, as in:

export MKL_DEBUG_CPU_TYPE=5

This overrides the CPU dispatching in MKL, and forces the AVX2 codepath (the one MKL naturally uses on Intel parts without AVX512), otherwise MKL chooses an unoptimized SSE path with abysmal performance. But with the AVX2 path, MKL performs very well on Zen2, usually even outperforming BLIS and OpenBLAS!
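
For anyone who wants to check the effect on their own machine, here is a minimal timing sketch (my own example, not from the comment above). It assumes numpy is linked against MKL, and the environment variable has to be set before Python/MKL starts, e.g. MKL_DEBUG_CPU_TYPE=5 python3 mkl_check.py:

# Time a large double precision matmul with numpy; run once with and once
# without MKL_DEBUG_CPU_TYPE=5 exported and compare the GFLOPS.
import time
import numpy as np

n = 8000
a = np.random.rand(n, n)
b = np.random.rand(n, n)

t0 = time.time()
c = a @ b
elapsed = time.time() - t0

gflops = 2.0 * n**3 / elapsed / 1e9   # a matmul is ~2*n^3 floating point ops
print(f"{elapsed:.2f} s, {gflops:.1f} GFLOPS")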

Posted on 2019-11-15 12:50:50
Ned Flanders

This is very interesting info! Would that be possible in Matlab?

Posted on 2019-11-15 16:19:23
MagicWax

No idea, I am not a Matlab user, and I run everything that requires a lot of number crunching under Linux.
I mean if Matlab is using MKL and has a Linux version, then I am quite sure it would work. If Matlab does not use MKL it will do nothing.

Posted on 2019-11-17 21:54:57
Ned Flanders

Thanks again for mentioning this. Based on this, I found a solution that works like a charm!

https://www.reddit.com/r/ma...

Posted on 2019-11-18 01:45:37
Misha Engel

https://www.reddit.com/r/matlab/comments/dxn38s/howto_force_matlab_to_use_a_fast_codepath_on_amd/

How-To force Matlab to use a fast codepath on AMD Ryzen/TR CPUs - up to 250% performance gains

Tips

Hello everyone.

I wanted to briefly present my tweak here, as I think it might be of interest for many in this community. Applying the tweak takes less than a minute.

What is it?

Matlab runs notoriously slow on AMD CPUs for operations that use the Intel Math Kernel Library (MKL). This is because the Intel MKL uses a discriminative CPU dispatcher that does not choose an efficient codepath according to the SIMD support of the CPU, but based on the result of a vendor string query. If the CPU is from AMD, the MKL does not use SSE3-SSE4 or AVX1/2 extensions but falls back to SSE1, no matter whether the AMD CPU supports more efficient SIMD extensions like AVX2 or not.

The method provided here does enforce AVX2 support by the MKL, independent of the vendor string result.

Posted on 2019-11-23 23:45:06
Ned Flanders

What a coincidence, the redditor and I share the same nick ;-)

Posted on 2019-11-29 20:25:50
Misha Engel

How many atoms fit in a GByte of VRAM? Just trying to figure out when the 48 GB RTX 8000 starts to make sense.

Posted on 2019-11-15 12:54:43
MagicWax

If you use 3 Cartesian coordinates for every atom, and use single precision (32-bit) floating point, then each atom takes 96 bits. But you also need to store a velocity vector for every atom if you want to run dynamics, so that's another 3 floats. You probably also need some more memory for other arrays beyond the bare position and velocity vectors.
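
As a rough back-of-the-envelope based on that comment (my own arithmetic, counting only the single precision position and velocity arrays and ignoring the force, topology, pair-list and other storage a real MD code like NAMD needs):

# 3 position floats + 3 velocity floats per atom, 4 bytes each = 24 bytes/atom.
bytes_per_atom = (3 + 3) * 4
atoms_per_gb = 1e9 / bytes_per_atom
print(f"{atoms_per_gb / 1e6:.0f} million atoms per GB (positions + velocities only)")
# ~42 million atoms/GB; real per-atom memory use is considerably higher.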

Posted on 2019-11-15 14:10:08
Reuven Meir

I would really like to see HPCG added to your list of go-to CPU benchmarks for scientific applications. A ton of science codes use iterative sparse solvers like the one in HPCG, and it is a great representative test for memory-bound algorithms. It is also used for an alternative list on the Top500 to rate supercomputers. Any chance you could benchmark HPCG? http://www.hpcg-benchmark.org/

Posted on 2019-11-15 19:12:22
Donald Kinghorn

Yes, I have spent some time with HPCG but have not incorporated it yet into my normal testing. I do agree that this would be good to include ... soon I promise :-) I'll be setting up benchmark environments as docker containers to make test setup easier and more consistent

Posted on 2019-11-20 04:23:57
Nathan Zechar

Thank you Dr. Kinghorn!

Posted on 2019-11-16 08:28:11

As always thanks for the reviews. One particular note I personally was looking for, you managed to address directly: "In the earlier post for the Ryzen 3900x I had reservations about the stability of the platform. In the few months that have passed since that testing, motherboard and BIOS issues have settled down and the systems are looking solid."

These vital details are some of the reasons I consider Puget’s reviews and articles credible, serious and professional. Thank you.

Posted on 2019-11-18 17:07:50
Donald Kinghorn

Just to let everyone know ... I will be testing with the MKL_DEBUG_CPU_TYPE environment variable (that was suggested in the Reddit thread mentioned above) ... soon, using a new AMD CPU :-) (but I'll say now that the performance with BLIS v2 is impressive!)
If things go well I'll redo the BLAS lib testing with Python numpy too.

Let's all hope this works well ... but if it doesn't, I have to say the new AMD BLIS lib is pretty good!

Posted on 2019-11-24 18:35:04