Read this article at https://www.pugetsystems.com/guides/1666
Dr Donald Kinghorn (Scientific Computing Advisor )

AMD Threadripper 3990x 64-core Linpack and NAMD Performance (Linux)

Written on February 7, 2020 by Dr Donald Kinghorn

Introduction

64 cores! The latest AMD Threadripper is out, the 3990x 64-core. I've spent the last couple of days running benchmarks and have some results showing raw numerical compute performance using my standard CPU testing applications, HPL Linpack and the molecular dynamics program NAMD. The 3990x is a great processor. However, there were difficulties and some disappointments during the testing.

It is nice having AMD making exceptionally great processors again. This 64-core Threadripper 3990x is the pinnacle of the "consumer" Zen2 core processors. (EPYC Rome is the server line based on the Zen2 core.)

THESE RESULTS ARE PRELIMINARY!

Version 2.0 of the AMD "BLIS" library was used which gives very good performance with Linpack. I did have some scaling anomalies with the 3990x that I have not resolved yet (but still achieved very good results).

This post revisits the recent Ryzen and Threadripper posts and adds in new results for the Threadripper 3990x. I'm including NAMD Molecular Dynamics results for my usual test molecule, STMV, as well as a smaller molecular system, ApoA1. ApoA1 seems to be a popular system for benchmarking on CPU with NAMD. GPU acceleration results are reported for the STMV and ApoA1 job runs.

Other recent posts related to this testing are: "AMD Threadripper 3970x Compute Performance Linpack and NAMD", "AMD Ryzen 3950x Compute Performance Linpack and NAMD" and "AMD 3900X (Brief) Compute Performance Linpack and NAMD".

Difficulties

I'll start with some of the problems I encountered during the testing that prompted me to remark that these "results are preliminary". This should temper the results that follow. There is room for improvement!

Install issues

  • Ubuntu with any kernel newer than 5.0.0 would hang during install (on the hardware I was using).
  • Ubuntu 18.04 with HWE kernel would boot but would hang after update
  • Ubuntu 19.10 would hang during install
  • I had to drop back to Ubuntu 18.04 with 4.15 kernel for a stable install. That is too old to be fully "Zen2 aware".

I expect this to be a better platform using the final release of Ubuntu 20.04 in April.

HPL Linpack anomalies

  • HPL Linpack did not achieve expected performance based on comparison with 32-core 3970x
  • Performance was better with 3990x than 3970x but I could drop 16 cores from the 3990x with only minimally lower performance.

I did experiments with OpenMP threads alone and with hybrid parallelism combining OpenMP threads and MPI ranks. Results are approximately 25% lower than expected.

NAMD performance was very good but I can't help but think results could be better based on what I saw with Linpack.

System Configuration

Hardware:

(see the posts linked in the Introduction for older test configurations)

  • AMD Threadripper 3990x
  • Motherboard Gigabyte TRX40 AORUS
  • Memory 8x DDR4-2933 16GB (128GB total)
  • 1TB Samsung 960 EVO M.2
  • 2 x NVIDIA RTX Titan GPU's

Software:

Notes:

  • The Ryzen 3900x and 3950x worked well on Ubuntu 19.10. Both the Threadripper 3970x and 3990x required dropping back to 18.04.

  • New results in this post are for Threadripper 3990x only. The other results are from previous testing.

Linpack

Notes:

  • I'm using the same HPL binary that was used for testing the 3970x, i.e. the pre-built multi-threaded HPL binary provided by AMD. This is the "MT" build but it still looks for MPI header files on start-up and uses the HPL.dat file for job run configuration. This is why an OpenMPI install is needed to run this benchmark.
  • AMD BLIS (a.k.a. AMD's BLAS library) version 2.0 with specific support for Zen2 was used.
  • Several combinations with MPI ranks together with OMP threads were tried. The best results obtained were using only OMP threads and the pre-built binary without MPI. 1 OMP thread per "real" core gave the best result. (WITH SMT DISABLED IN THE BIOS)

  • There is a detailed description of HPL Linpack testing for Threadripper 2990WX in the post "How to Run an Optimized HPL Linpack Benchmark on AMD Ryzen Threadripper -- 2990WX 32-core Performance". The 2990WX testing in this post and the result presented could probably be improved with the new BLIS lib.
  • The Intel CPU's were tested with the (highly) optimized Linpack benchmark program included with Intel MKL performance library.
  • A large problem size approx. 90% of available memory (128GB) was used in order to maximize performance results, Ns=116000.
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR12R2R4      116000  1024     1     1             662.21             1.5714e+03
HPL_pdgesv() start time Thu Feb  6 10:50:10 2020

HPL_pdgesv() end time   Thu Feb  6 11:01:12 2020

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   3.80355268e-03 ...... PASSED
================================================================================
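The Gflops figure in the output above can be cross-checked from N and the run time. The following is a minimal sketch, assuming the conventional LU operation count of 2/3 N^3 + 3/2 N^2 that HPL reports against; the function name is my own and is not part of HPL.

```python
def hpl_gflops(n: int, seconds: float) -> float:
    """Return the rate in GFLOPS for an N x N HPL solve that took `seconds`.

    Assumes the conventional operation count 2/3*N^3 + 3/2*N^2.
    """
    flops = (2.0 / 3.0) * n**3 + (3.0 / 2.0) * n**2
    return flops / seconds / 1.0e9


if __name__ == "__main__":
    # N and Time are taken from the result line above.
    print(f"{hpl_gflops(116000, 662.21):.1f} GFLOPS")
```

With N=116000 and Time=662.21 this gives roughly 1571 GFLOPS, in line with the 1.5714e+03 reported by the benchmark.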

Here is an HPL.dat file used. [This file automates using 3 problem sizes (Ns) and 5 block sizes (NBs). Also note that P and Q are set to 1, i.e. 1 MPI rank; parallelism was from OMP threads.]

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
3            # of problems sizes (N)
112000 114000 116000 Ns
5            # of NBs
512 640 768 896 1024  NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
1            Qs
...
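To make the size of this sweep concrete, here is a small sketch that lists the (N, NB) combinations from the file above and the memory needed for the N x N double precision matrix. It assumes the elided remainder of the file selects a single algorithm variant, so that each (N, NB) pair is one run; the helper name is hypothetical.

```python
from itertools import product

# Values copied from the HPL.dat listing above.
NS = [112000, 114000, 116000]
NBS = [512, 640, 768, 896, 1024]


def matrix_gib(n: int) -> float:
    """Memory for an N x N matrix of 8-byte doubles, in GiB."""
    return n * n * 8 / 2**30


# One run per (N, NB) pair, assuming a single algorithm variant.
runs = list(product(NS, NBS))

if __name__ == "__main__":
    print(f"{len(runs)} runs")
    for n in NS:
        print(f"N={n}: {matrix_gib(n):.1f} GiB for the matrix")
```

The matrix for N=116000 alone needs about 100 GiB, which is why these problem sizes are only practical on the 128GB configuration.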

The following environment variables were set for the Ryzen Linpack runs

export OMP_PROC_BIND=TRUE
export OMP_PLACES=cores
export OMP_NUM_THREADS=32   # (16 for 3950x ...)
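If you script runs across several processors, the same settings can be built per core count. This is only a sketch: the function name is hypothetical, and the thread count should be the number of physical cores of the CPU under test.

```python
import os


def hpl_env(physical_cores: int) -> dict:
    """Return a copy of the current environment with the OpenMP settings above."""
    env = dict(os.environ)
    env.update(
        {
            "OMP_PROC_BIND": "TRUE",   # pin threads so they do not migrate
            "OMP_PLACES": "cores",     # one place per physical core
            "OMP_NUM_THREADS": str(physical_cores),
        }
    )
    return env


if __name__ == "__main__":
    env = hpl_env(32)
    for key in ("OMP_PROC_BIND", "OMP_PLACES", "OMP_NUM_THREADS"):
        print(f"{key}={env[key]}")
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` together with the path to your own HPL binary.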

The AMD Threadripper 3990x results are not as high as expected and can likely be improved.

The following plot shows HPL Linpack results (in GFLOPS).

TR3990X Linpack

The TR3990x results are impressive for a processor with AVX2 (rather than AVX512), but I expect that these results could be better (see the "Difficulties" section). I will repeat this testing after Ubuntu 20.04 is released.

The Intel processors with AVX-512 vector units have an advantage for Linpack. Also, the Linpack used for the Intel processors is built with the BLAS library from Intel's excellent MKL (Math Kernel Library).

NAMD

NAMD is one of my favorite programs to use for benchmarking because it has great parallel scaling across cores (and cluster nodes). It does not significantly benefit from linking with the Intel MKL library and it runs on a wide variety of hardware and OS platforms. It's also a very important Molecular Dynamics research program.

NAMD also has very good GPU acceleration. Adding CUDA capable GPU's will increase throughput significantly. However, with NAMD and other codes like it, only a portion of the heavy compute can be offloaded to GPU. A good CPU is necessary to achieve balanced performance.

This plot shows the performance of a molecular dynamics simulation on the million atom "stmv" (satellite tobacco mosaic virus). These job runs are with CPU and with 1 or 2 RTX Titan GPU's added. Performance is in "day/ns" (days to compute a nanosecond of simulation time). This is the standard output for NAMD. If you prefer ns/day then just take the reciprocal.
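As a quick illustration of that conversion, here is a minimal sketch. The function name is my own, and the sample values are the ApoA1 GPU figures quoted later in this post.

```python
def ns_per_day(days_per_ns: float) -> float:
    """Convert NAMD's day/ns figure to ns/day by taking the reciprocal."""
    if days_per_ns <= 0:
        raise ValueError("day/ns must be positive")
    return 1.0 / days_per_ns


if __name__ == "__main__":
    for value in (0.031, 0.020):
        print(f"{value} day/ns = {ns_per_day(value):.1f} ns/day")
```

Lower is better for day/ns, and higher is better for ns/day.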

NAMD  3990X STMV

The Threadripper 3990x gave excellent performance for NAMD. Results are exceptionally good for CPU alone and with added GPU's. These are the best results I have ever obtained for these job runs.

This last set of results is using the smaller ApoA1 problem (it's still pretty big with 92000 atoms!). These results are CPU only.

*Results with one added RTX Titan are 0.031 day/ns, and with 2 x RTX Titan 0.020 day/ns. These results are similar to those with the 3970x and 1 or 2 2080Ti GPU's. The job runs are so fast that there is little difference, because communication dominates the calculation time.

TR 3990x  NAMD ApoA1

I expected the TR3990x 64-core CPU together with 2-4 high-end NVIDIA GPU's to "set the bar" for performance as a workstation platform for this class of applications. I believe this is indeed the case!

Conclusion

The AMD Threadripper 3990x is a milestone in computing. A 64-core desktop workstation processor was unimaginable a few years ago. This is definitely a "specialty" processor. This is a processor for large parallel computing problems that have excellent scalability. It will be a very compelling Scientific Workstation processor.

Happy computing! --dbk @dbkinghorn



Tags: AMD, HPL, linpack, NAMD, Threadripper
lordtux

Hi Donald, very nice review. One question: did you notice any application being limited by the memory bandwidth? I always see people complaining about this, but TR seems to be fighting very well against 6-channel Xeons, for example.

Posted on 2020-02-07 18:09:54
Donald Kinghorn

I did some testing with the HPCG benchmark during the last couple of hours I had access to the sys (hopefully I get access again soon)

The results were not wonderful but a colleague that has experience with that benchmark assured me that the results were not unexpected and not too bad...

I did tweet those results on Friday (7th). HPCG makes heavy demands on the memory subsystem ... (I think you may have seen these :-) ... I'm late responding to your comment here.) I will probably write this stuff up but maybe not until I get my hands on the system again.

with nx=ny=nz=104 (large enough to not run in cache)
GB/s Summary::Total with convergence and optimization phase overhead=70.3232
Final Summary::HPCG result is VALID with a GFLOP/s rating of=9.27327

(dual Xeon Scalable 16-core does about 4 times that. I'd like to do more optimization work on HPCG for Zen2! my result above was pretty much just a "reference" run)

Posted on 2020-02-10 20:08:23
lordtux

Many thanks for the answer Donald, I had seen your tweet.

Posted on 2020-02-11 01:34:05
MichaelSB

How did you cool this beast? Assuming it will run above 80% utilization for a week (a typical deep learning training run)

Posted on 2020-02-09 16:48:56
Methylzero

A Noctua NH-U14S TR4-SP3 cooler should be OK, probably, with a high airflow case. As long as you run it stock, no OC on the cores.

Posted on 2020-02-10 12:12:19
MichaelSB

Thanks, but I'm looking for actual experience rather than "should be", and "probably". In my case, there will also be four Quadro 8000 cards (blower type).

Posted on 2020-02-10 15:48:32

The system Don was testing on used a Noctua NH-U12S TR4-SP3. We had been using that cooler on previous Threadripper 3rd Gen processors up here in Labs, and continued to use it for our initial round of testing in the last couple of weeks. In our open-air test bed systems it was sufficient to keep the 3990X from thermally throttling, but since Don's testing was going to be putting it under extended load for longer periods than most of our other benchmarks I did throw a second fan on (in a push-pull configuration).

Our hardware qualification department found that, when inside one of our normal chassis, the U12S was borderline with a single fan. It was sufficient in both single-threaded and fully-threaded situations, though the temps were quite high, but with the right combination of workloads it could actually have *slight* thermal performance degradation when some (but not all) of the cores were active and running at a higher clock speed. In such situations, adding a second fan resolved the throttling but still left higher temperatures than they liked... but moving to the U14S dropped the temps by several more degrees Celsius, into a comfortable range. Because of that, we are going to be using the U14S going forward on our systems.

HOWEVER, it may not be a good choice for your specific situation. I say that because that heatsink is so wide that it can block the top PCI-Express slot on many motherboards. It doesn't on the particular board we are using, but that board only supports three full-size GPUs. Since you mentioned using four Quadro RTX 8000 video cards, I assume your board will have a different slot layout than ours - and there is a very high chance that the U14S would then block one of the slots and prevent you from having the video card configuration you want.

In your situation, the U12S with dual fans could be an option - if your chassis has sufficient airflow, which you will want since you are going to have four video cards - or else a nice AIO liquid cooler. Selecting one of those is highly chassis dependent, though, since you need to consider where you can mount the radiator and how that will impact the airflow within the system. Good luck :)

Posted on 2020-02-10 17:46:38
MichaelSB

Thank you for the detailed response! So if you decide to sell a TR 3970X + 4xQuadro 8000 workstation (btw do you plan to sell them?), which combination of motherboard and cooler would you use? There are currently just two boards (Gigabyte Aorus Xtreme and ASRock Creator) that support quad GPU systems with TR3, and only 2-3 AIO coolers with a proper TR base plate (Thermaltake Floe Triple Riing TR4 Edition seems to have the best ratings). As far as cases, there's Corsair Carbide 540 case (not my favorite design tbh), and some nice high airflow cases from Fractal Design (Meshify S2 looks good, but might not fit the XL-ATX Gigabyte board, need to check). Any advice?

Posted on 2020-02-10 18:34:17

Wait... are you going with the 3990X or 3970X? If you are doing the 3970X (per your last comment) then I would go with the U12S... and probably toss a second fan on, just to give you some extra headroom. I personally *far* prefer heatsinks over AIOs, due to a number of factors (less expensive, easier to install, fewer points of failure, no risk of catastrophic failure (leak), etc).

If you are going for the 3990X, then I don't have any specific advice as I don't have much recent experience with the larger AIOs :/

As for which motherboard... I guess the Gigabyte Aorus Xtreme? We used that in our early 3970X and 3960X testing, but it is physically too large for most of our current cases. I haven't done a good job of keeping up with chassis options outside of those we carry, but if Fractal Design has something large enough for that motherboard and with good airflow / fan layout (keeping in mind the potential mounting needs of an AIO) then that is probably the way to go. I've heard good things about Corsair's cases as well, but never used one personally.

In terms of what we actually sell, currently it is limited to three GPUs on the Threadripper platform, which conveniently allows us to use the U14S for cooling. We are still looking into qualifying a quad-GPU capable motherboard + chassis + cooling combination, but I don't know if we will end up finding a setup that passes our qualification process or not :(

Posted on 2020-02-10 18:42:55
MichaelSB

I haven't decided yet on 3990X vs 3970X - 32 higher frequency cores might actually be better to feed four GPUs than 64 slower ones. And if this is true, then 3970X would have higher utilization (and therefore require better cooling) than 3990X.

My main concern is having four GPUs in the case - without them I'd not hesitate to go with an air cooler, but I'm afraid the GPUs will heat up the inside of the case quite a bit, and then the air cooler will struggle. Can you please point me to your build with 3 GPUs on TR platform?

Posted on 2020-02-10 19:09:20

If you are using Quadro GPUs, then they won't be adding much heat inside the chassis. They use blower-style fans, which do a pretty good job of pushing the heat out the back of the card (and thus outside of the system / case). What you do need to ensure, with such cards, is that you have enough fresh air intake to keep those fans fed with cool air from in front or on the side of the system (you don't want to pull air from behind it back in, since that will be heated already).

One of the easiest places to see our options for a Threadripper with up to three GPUs is our V-Ray recommended system. Here is a link to that page, with the 3990X, a lot of RAM, and three RTX 8000s pre-selected (though you can, of course, adjust those options around as you see fit):

http://puget.systems/go/152334

Posted on 2020-02-10 19:23:38
Donald Kinghorn

Thanks for jumping in on this William!

Posted on 2020-02-10 20:09:16