Read this article at https://www.pugetsystems.com/guides/1617
Dr Donald Kinghorn (Scientific Computing Advisor)

AMD Ryzen 3950x Compute Performance Linpack and NAMD

Written on November 14, 2019 by Dr Donald Kinghorn

Introduction

The much anticipated AMD Ryzen 3950x 16-core processor is out! As always, the first thing I wanted to know was the double precision floating point performance. My two favorite applications for a "first look" at a new CPU are Linpack and NAMD.

At the end of July I did a little testing with the Ryzen 3900x 12-core: "AMD 3900X (Brief) Compute Performance Linpack and NAMD". I was pretty impressed with the performance but wished there was a more optimal BLAS library for the Zen2 architecture. There is now a newer version 2.0 of the AMD "BLIS" library, and it gives significantly better performance with Linpack than the v1.3 used in the older post.

This post revisits the July post and adds new results for the Ryzen 3950x, along with updated results for the 3900x using BLIS v2.0. I'm also including NAMD molecular dynamics results for my usual test molecule, STMV, as well as a smaller molecular system, ApoA1. ApoA1 seems to be a popular system for CPU benchmarking with NAMD.

System Configuration

Hardware:

  • AMD Ryzen 3900x and 3950x
  • Motherboard Gigabyte X570 AORUS ULTRA
  • Memory 4x DDR4-2933 16GB (64GB total)
  • 2TB Intel 660p NVMe M.2
  • NVIDIA RTX 2080Ti GPU

Software:

  • Ubuntu 19.10 (kernel 5.3)
  • AMD pre-built multi-threaded HPL Linpack binary
  • AMD BLIS v2.0 (AMD's BLAS library)
  • NAMD (CPU build, plus a CUDA-enabled build for the GPU-accelerated ApoA1 run)
  • Intel MKL Linpack binary (for the Intel CPU results)

Notes:

  • I used the most recent Ubuntu 19.10 for this testing in order to have up-to-date system libraries and a kernel (5.3) with better support for Zen2.
  • New results in this post are for Ryzen 3900x and 3950x only. The other results are from previous testing.

Linpack

Notes:

  • The pre-built multi-threaded HPL binary provided by AMD worked well, so I didn't bother rebuilding from source. This is the "MT" build, but it still looks for MPI header files on start-up and uses the HPL.dat file for job run configuration.
  • AMD BLIS (a.k.a. AMD's BLAS library) has been updated to version 2.0 with specific support for Zen2.
  • Several combinations of MPI ranks and OMP threads were tried. The best results were obtained using only OMP threads with the pre-built binary and no MPI: 1 OMP thread per "real" core, i.e. 16 threads for the 3950x and 12 for the 3900x.
  • There is a detailed description of HPL Linpack testing for the Threadripper 2990WX in the post "How to Run an Optimized HPL Linpack Benchmark on AMD Ryzen Threadripper -- 2990WX 32-core Performance". The 2990WX testing and results presented in that post could probably be improved with the new BLIS lib.
  • The Intel CPUs were tested with the (highly) optimized Linpack benchmark program included with the Intel MKL performance library.
  • A large problem size, using approximately 90% of available memory (64GB), was used in order to maximize performance: Ns=90000. (See the sizing sketch just after this list.)
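
If you want to estimate Ns for a different amount of memory, here is a minimal sizing sketch (my own helper, not part of the AMD HPL package). It just solves 8*Ns^2 = fraction*memory for Ns and rounds down to a multiple of the block size NB:

# HPL problem-size sizing sketch: pick Ns so the Ns x Ns double precision
# matrix fills roughly 90% of system memory, rounded down to a multiple of NB.
import math

def hpl_problem_size(mem_gb, mem_fraction=0.90, nb=256):
    mem_bytes = mem_gb * 1e9
    n = math.sqrt(mem_fraction * mem_bytes / 8)  # 8 bytes per double
    return int(n // nb) * nb                     # round down to a multiple of NB

print(hpl_problem_size(64))   # 84736 with these assumptions; Ns=90000 also fits in 64GB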

Here is the HPL.dat file used. (This file automates runs over 3 problem sizes (Ns) and 3 block sizes (NBs); also note that P and Q are set to 1, i.e. 1 MPI rank. Parallelism came from OMP threads.)

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
3            # of problems sizes (N)
40000 80000 90000  Ns
3            # of NBs
240 256 512	 NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
1	         Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
2            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
2            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
1            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)

The following environment variables were set for the Ryzen Linpack runs:

export OMP_PROC_BIND=TRUE
export OMP_PLACES=cores
export OMP_NUM_THREADS=12    # 16 for the 3950x

Now the good part ...

The following plot shows HPL Linpack results (in GFLOPS) for the Ryzen 3950x and 3900x plus a few other CPUs that I have recently tested. The best Ryzen results were with Ns=90000 and NB=256.

[Plot: Ryzen 3950X Linpack performance (GFLOPS)]

The Ryzen 3950x and 3900x results are very good for processors with AVX2! Notice that the use of BLIS v2.0 improved the 3900x result by 15% over v1.3.

The Intel processors with AVX-512 vector units have a big advantage for Linpack. Also, the Linpack used for the Intel processors is built with the BLAS library from Intel's excellent MKL (Math Kernel Library).
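
To put that AVX2 vs AVX-512 gap in rough perspective, here is a back-of-the-envelope theoretical peak calculation. This is my own sketch, not something from the testing above: it assumes two 256-bit FMA units per Zen2 core, two 512-bit FMA units per Skylake-class AVX-512 core, and nominal base clocks rather than the clocks actually sustained under heavy AVX load.

# Theoretical double precision peak: cores * GHz * FLOPs per cycle per core.
# Zen2 (AVX2 + FMA): 2 x 256-bit FMA units -> 16 DP FLOPs/cycle/core.
# AVX-512 with 2 FMA units:                -> 32 DP FLOPs/cycle/core.
def peak_gflops(cores, ghz, flops_per_cycle):
    return cores * ghz * flops_per_cycle

print(peak_gflops(16, 3.5, 16))   # Ryzen 3950x  ~896 GFLOPS at base clock
print(peak_gflops(12, 3.8, 16))   # Ryzen 3900x  ~730 GFLOPS at base clock
print(peak_gflops(14, 2.5, 32))   # Xeon W-2175  ~1120 GFLOPS at base clock (assuming 2 AVX-512 FMA units)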

NAMD

Now on to the real world! ... sort of ... NAMD is one of my favorite programs to use for benchmarking because it has great parallel scaling across cores (and cluster nodes). It does not significantly benefit from linking with the Intel MKL library and it runs on a wide variety of hardware and OS platforms. It's also a very important Molecular Dynamics research program.

When I said "sort of" above, I'm referring to the fact that NAMD also has very good GPU acceleration. Adding CUDA-capable GPUs will increase throughput by an order of magnitude. However, with NAMD and other codes like it, only some of the heavy compute can be offloaded to the GPU. A good CPU is necessary to achieve balanced performance. I like NAMD as a CPU benchmark because I believe it is an excellent representative of scientific applications and reflects the performance characteristics of many other programs in this domain.

This plot shows the performance of a molecular dynamics simulation of the million-atom "stmv" (satellite tobacco mosaic virus) system. These job runs are CPU only. Performance is in "day/ns" (days to compute a nanosecond of simulation time); this is the standard output metric for NAMD. If you prefer ns/day then just take the reciprocal.

[Plot: NAMD STMV performance (day/ns), Ryzen 3950X and other CPUs]
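
If ns/day is the number you care about, the conversion is just the reciprocal; a trivial helper, only to make the unit explicit (the 0.5 day/ns value below is made up for illustration):

def ns_per_day(day_per_ns):
    # NAMD reports "day/ns"; throughput is usually quoted as the reciprocal.
    return 1.0 / day_per_ns

print(ns_per_day(0.5))   # 0.5 day/ns  ->  2.0 ns/day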

The Ryzen CPUs gave excellent performance! NAMD also benefits greatly from the 32 cores of the 2990WX. I'm looking forward to seeing results with the new Threadripper!

This last set of results uses the smaller ApoA1 problem (it's still pretty big at 97,000 atoms!).

I ran this job for two reasons: 1) to show how well the 3950x does compared to the Xeon W-2175 14-core, and 2) to provide a reality check on how much adding a GPU can increase performance for programs that have good GPU acceleration. Adding the NVIDIA 2080Ti GPU increased performance by over a factor of ten!

[Plot: NAMD ApoA1 performance (day/ns), Ryzen 3950x + RTX 2080Ti]

Conclusion

In the earlier post for the Ryzen 3900x I had reservations about the stability of the platform. In the few months that have passed since that testing, motherboard and BIOS issues have settled down and the systems are looking solid. Welcome back AMD!

I will be doing more CPU testing in a few weeks, after all of this year's new processors from Intel and AMD are released. So, expect another post with LOTS of new CPUs in it!

Happy computing! --dbk @dbkinghorn



Tags: AMD, HPL, linpack, NAMD, Ryzen
MagicWax

Nice! I have recently seen some people recommending using MKL on AMD, with the MKL_DEBUG_CPU_TYPE environment variable set to 5, as in:

export MKL_DEBUG_CPU_TYPE=5

This overrides the CPU dispatching in MKL, and forces the AVX2 codepath (the one MKL naturally uses on Intel parts without AVX512), otherwise MKL chooses an unoptimized SSE path with abysmal performance. But with the AVX2 path, MKL performs very well on Zen2, usually even outperforming BLIS and OpenBLAS!
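
For anyone who wants to check the effect on their own machine, here is a minimal timing sketch (my own example, not from the comment above). It assumes numpy is linked against MKL, and the environment variable has to be set before Python/MKL starts, e.g. MKL_DEBUG_CPU_TYPE=5 python3 mkl_check.py:

# Time a large double precision matmul with numpy; run once with and once
# without MKL_DEBUG_CPU_TYPE=5 exported and compare the GFLOPS.
import time
import numpy as np

n = 8000
a = np.random.rand(n, n)
b = np.random.rand(n, n)

t0 = time.time()
c = a @ b
elapsed = time.time() - t0

gflops = 2.0 * n**3 / elapsed / 1e9   # a matmul is ~2*n^3 floating point ops
print(f"{elapsed:.2f} s, {gflops:.1f} GFLOPS")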

Posted on 2019-11-15 12:50:50
Ned Flanders

This is very interesting info! Would that be possible in Matlab?

Posted on 2019-11-15 16:19:23
MagicWax

No idea, I am not a Matlab user, and I run everything that requires a lot of number crunching under Linux.
I mean if Matlab is using MKL and has a Linux version, then I am quite sure it would work. If Matlab does not use MKL it will do nothing.

Posted on 2019-11-17 21:54:57
Ned Flanders

Thanks again for mentioning this. Based on this, I found a solution that works like a charm!

https://www.reddit.com/r/ma...

Posted on 2019-11-18 01:45:37
Misha Engel

https://www.reddit.com/r/matlab/comments/dxn38s/howto_force_matlab_to_use_a_fast_codepath_on_amd/

How-To force Matlab to use a fast codepath on AMD Ryzen/TR CPUs - up to 250% performance gains

Tips

Hello everyone.

I wanted to briefly present my tweak here, as I think it might be of interest for many in this community. Applying the tweak takes less than a minute.

What is it?

Matlab runs notoriously slow on AMD CPUs for operations that use the Intel Math Kernel Library (MKL). This is because the Intel MKL uses a discriminative CPU dispatcher that does not choose an efficient codepath according to the SIMD support of the CPU, but based on the result of a vendor string query. If the CPU is from AMD, the MKL does not use SSE3-SSE4 or AVX1/2 extensions but falls back to SSE1, no matter whether the AMD CPU supports more efficient SIMD extensions like AVX2 or not.

The method provided here does enforce AVX2 support by the MKL, independent of the vendor string result.

Posted on 2019-11-23 23:45:06
Ned Flanders

What a coincidence, the redditor and I share the same nick ;-)

Posted on 2019-11-29 20:25:50
Misha Engel

How many atoms fit in a GByte of VRAM? Just trying to figure out when the 48 GB RTX 8000 starts to make sense.

Posted on 2019-11-15 12:54:43
MagicWax

If you use 3 Cartesian coordinates for every atom, and use single precision (32-bit) floating point, then each atom takes 96 bits. But you also need to store a velocity vector for every atom if you want to run dynamics, so that's another 3 floats. You probably also need some more memory for other arrays beyond the bare position and velocity vectors.
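
As a rough back-of-the-envelope based on that comment (my own arithmetic, counting only the single precision position and velocity arrays and ignoring the force, topology, pair-list and other storage a real MD code like NAMD needs):

# 3 position floats + 3 velocity floats per atom, 4 bytes each = 24 bytes/atom.
bytes_per_atom = (3 + 3) * 4
atoms_per_gb = 1e9 / bytes_per_atom
print(f"{atoms_per_gb / 1e6:.0f} million atoms per GB (positions + velocities only)")
# ~42 million atoms/GB; real per-atom memory use is considerably higher.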

Posted on 2019-11-15 14:10:08
Reuven Meir

I would really like to see HPCG added to your list of go-to CPU benchmarks for scientific applications. A ton of science codes use iterative sparse solvers like the one in HPCG, and it is a great representative test for memory-bound algorithms. It is also used for an alternative list on the Top500 to rate supercomputers. Any chance you could benchmark HPCG? http://www.hpcg-benchmark.org/

Posted on 2019-11-15 19:12:22
Donald Kinghorn

Yes, I have spent some time with HPCG but have not incorporated it yet into my normal testing. I do agree that this would be good to include ... soon I promise :-) I'll be setting up benchmark environments as docker containers to make test setup easier and more consistent

Posted on 2019-11-20 04:23:57
Nathan Zechar

Thank you Dr. Kinghorn!

Posted on 2019-11-16 08:28:11

As always thanks for the reviews. One particular note I personally was looking for, you managed to address directly: "In the earlier post for the Ryzen 3900x I had reservations about the stability of the platform. In the few months that have passed since that testing, motherboard and BIOS issues have settled down and the systems are looking solid."

These vital details are some of the reasons I consider Puget’s reviews and articles credible, serious and professional. Thank you.

Posted on 2019-11-18 17:07:50
Donald Kinghorn

Just to let everyone know ... I will be testing with the MKL_DEBUG_CPU_TYPE environment variable (that was suggested in the Reddit thread mentioned above) ... soon, using a new AMD CPU :-) (but I'll say now that the performance with BLIS v2 is impressive!)
If things go well I'll redo the BLAS lib testing with Python numpy too.

Let's all hope this works well ... but if it doesn't, I have to say the new AMD BLIS lib is pretty good!

Posted on 2019-11-24 18:35:04