Read this article at https://www.pugetsystems.com/guides/1631
Dr Donald Kinghorn (Scientific Computing Advisor)

AMD Threadripper 3970x Compute Performance Linpack and NAMD

Written on November 25, 2019 by Dr Donald Kinghorn


AMD Threadripper 3970x 32-core! ...the third new AMD processor I've had the pleasure of trying recently. I'm running it through the same double precision floating point performance tests as the recently tested Ryzen processors: Linpack and NAMD.

I've recently done testing with the Ryzen 3950x 16-core ("AMD Ryzen 3950x Compute Performance Linpack and NAMD") and with the Ryzen 3900x 12-core ("AMD 3900X (Brief) Compute Performance Linpack and NAMD"). Performance has been very impressive. The Ryzen 3950x and TR 3970x have benefited from the much improved AMD BLIS (BLAS) library v2.0.

In the past I was pretty impressed with performance but wished there were a more optimal BLAS library for the Zen2 architecture. There is now a version 2.0 of the AMD "BLIS" library, and it gives significantly better Linpack performance than the v1.3 used in older posts.

This post revisits the recent Ryzen posts and adds new results for the Threadripper 3970x. I'm including NAMD Molecular Dynamics results for my usual test molecule, STMV, as well as a smaller molecular system, ApoA1. ApoA1 seems to be a popular system for benchmarking NAMD on CPU. GPU acceleration results are reported for the ApoA1 jobs.

System Configuration


(see the 2 posts linked in the Introduction for the Ryzen configurations)

  • AMD Threadripper 3970x
  • Motherboard Gigabyte TRX40 AORUS EXTREME
  • Memory 8x DDR4-2933 16GB (128GB total)
  • 1TB Samsung 960 EVO NVMe M.2



  • I used Ubuntu 18.04 for this testing rather than the 19.10 used in the Ryzen testing. I had wanted to use 19.10 in order to have newer libs and kernel, but there were some motherboard issues that kept it from booting and not enough time to sort them out. 18.04 installed and worked fine with the TR 3970x.
  • New results in this post are for Threadripper 3970x only. The other results are from previous testing.



  • The pre-built multi-threaded HPL binary provided by AMD worked well, so I didn't bother rebuilding from source. This is the "MT" build, but it still looks for MPI header files on start-up and uses the HPL.dat file for job run configuration.
  • AMD BLIS (a.k.a. AMD's BLAS library) has been updated to version 2.0 with specific support for Zen2.
  • Several combinations of MPI ranks together with OMP threads were tried. The best results were obtained using only OMP threads and the pre-built binary without MPI; 1 OMP thread per "real" core, i.e. 32 OMP threads, gave the best result.
  • There is a detailed description of HPL Linpack testing for the Threadripper 2990WX in the post "How to Run an Optimized HPL Linpack Benchmark on AMD Ryzen Threadripper -- 2990WX 32-core Performance". The 2990WX result presented in that post could probably be improved with the new BLIS lib.
  • The Intel CPUs were tested with the (highly) optimized Linpack benchmark program included with the Intel MKL performance library.
  • A large problem size, approximately 90% of available memory (128GB), was used in order to maximize performance: Ns=114000.
T/V                N    NB     P     Q               Time                 Gflops
WR12R2R4      114000   768     1     1             746.42             1.3233e+03
HPL_pdgesv() start time Sat Nov 23 12:20:42 2019
HPL_pdgesv() end time   Sat Nov 23 12:33:08 2019
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   4.04947690e-03 ...... PASSED

Here is the HPL.dat file used. [This file automates using 3 problem sizes (Ns) and 3 block sizes (NBs); also note that P and Q are set to 1, i.e. 1 MPI rank; parallelism came from OMP threads.]

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
3            # of problems sizes (N)
112000 113000 114000 Ns
3            # of NBs
256 512 768  NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
1            Qs
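As a quick sanity check on problem sizes like those above, note that the HPL matrix is Ns x Ns double precision values, i.e. Ns² x 8 bytes. A small shell sketch of that arithmetic (using the largest Ns from this post):

```shell
# Approximate memory footprint of the HPL matrix for a given problem size Ns.
# Each matrix element is a double: 8 bytes.
NS=114000
GB=$(awk -v n="$NS" 'BEGIN { printf "%.1f", n * n * 8 / 2^30 }')
echo "$GB GiB"   # matrix footprint in GiB
```

This is only the matrix itself; the HPL run needs some additional working memory on top of it, which is why Ns is chosen comfortably below total RAM.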

The following environment variables were set for the Ryzen Linpack runs

export OMP_PLACES=cores
export OMP_NUM_THREADS=32   # (16 for 3950x ...)
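If you are adapting these settings to a different Zen 2 CPU, the thread count should match the number of physical cores (1 OMP thread per "real" core, as noted above). A small sketch for finding that automatically; this assumes a Linux system with lscpu from util-linux, and falls back to nproc (logical cores) otherwise:

```shell
# Count physical cores as unique (Core,Socket) pairs, ignoring SMT siblings.
PHYS_CORES=$(lscpu -p=Core,Socket 2>/dev/null | grep -v '^#' | sort -u | wc -l)
[ "$PHYS_CORES" -gt 0 ] 2>/dev/null || PHYS_CORES=$(nproc)  # fallback: logical cores
export OMP_PLACES=cores
export OMP_NUM_THREADS=$PHYS_CORES
echo "$OMP_NUM_THREADS"
```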

The AMD Threadripper 3970x gave better than expected results!

The following plot shows HPL Linpack results (in GFLOPS). Best results for Ryzen were with Ns=114000 and NB=768.

TR3970X Linpack

The TR3970x results are exceptionally good for a processor with AVX2!

The Intel processors with AVX-512 vector units have a big advantage for Linpack. Also, the Linpack used for the Intel processors is built with the BLAS library from Intel's excellent MKL (Math Kernel Library).


Now on to the real world! ... sort of ... NAMD is one of my favorite programs to use for benchmarking because it has great parallel scaling across cores (and cluster nodes). It does not significantly benefit from linking with the Intel MKL library and it runs on a wide variety of hardware and OS platforms. It's also a very important Molecular Dynamics research program.

When I said "sort of" above I'm referring to the fact that NAMD also has very good GPU acceleration. Adding CUDA capable GPUs will increase throughput by an order of magnitude. However, with NAMD and other codes like it, only a portion of the heavy compute can be offloaded to the GPU. A good CPU is necessary to achieve balanced performance. I like NAMD as a CPU benchmark because I believe it is an excellent representative of scientific applications and reflects the performance characteristics of many other programs in this domain.

This plot shows the performance of a molecular dynamics simulation of the million-atom "stmv" (satellite tobacco mosaic virus) system. These job runs are CPU only. Performance is in "day/ns" (days to compute a nanosecond of simulation time); this is the standard output for NAMD. If you prefer ns/day then just take the reciprocal.
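That conversion is just a reciprocal; a trivial sketch (the day/ns value here is made up for illustration, not a benchmark result):

```shell
# NAMD reports day/ns; ns/day is simply the reciprocal.
day_per_ns=0.25
ns_per_day=$(awk -v d="$day_per_ns" 'BEGIN { printf "%.3f", 1 / d }')
echo "$ns_per_day ns/day"   # → 4.000 ns/day
```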

NAMD Ryzen 3950X

The Threadripper 3970x gave excellent performance, significantly improving on the already great result from the last generation Threadripper 2990WX.

This last set of results uses the smaller ApoA1 problem (it's still pretty big with 92,000 atoms!).

I ran this job for two reasons: 1) to show how well the TR3970x does compared to the Xeon-W 2175 14-core, and 2) to provide a reality check on how much adding a GPU can increase performance for programs that have good GPU acceleration. Adding the NVIDIA 2080Ti GPUs increased performance by over a factor of ten!

TR 3970x + 2080Ti NAMD ApoA1

The 32-core TR3970x is very good with this Molecular Dynamics code! The upcoming 64-core version should scale perfectly and provide the needed CPU performance to keep up with the code running on multiple NVIDIA 2080Ti's. My guess is that the upcoming TR3990x 64-core CPU together with 2-4 NVIDIA 2080Ti's will "set the bar" for performance as a workstation platform for this class of applications. I'm looking forward to testing that!


At this point I really don't have any serious reservations about recommending any of the new AMD Zen 2 based processors for compute intensive workloads. They give excellent performance and value. This is not to say that I have reservations about recommending the new Intel Core-X or Xeon-W processors in this price class either. The Intel processors come with a solid, mature platform and also offer excellent performance (and now, with the new price cuts, good value). Intel also has the advantage of a very strong development ecosystem. You can get great performance with the new AMD processors, but to achieve that performance there may be times when you have to do a little extra work recompiling code or things like that.

I will be doing more CPU testing in a few weeks after all of this year's new processors from Intel and AMD are released. So, expect another post with LOTS of new CPUs in it!

Happy computing! --dbk @dbkinghorn


Tags: AMD, HPL, linpack, NAMD, Ryzen, Threadripper
Sergio Bástian Bishop

What about After Effects?

Posted on 2019-11-28 05:43:07

Our After Effects article for the new AMD Threadripper and Intel X-series is at https://www.pugetsystems.co...

Posted on 2019-11-28 05:44:29

Excellent review and hopefully you are the first to get a 64 core 3990x system for testing!!!

I really hope AMD comes out with a trx80 series as well with 8-channel memory and 128 PCIe 4.0 lanes so that we can have a motherboard with 4x16 pcie slots for massive quad gpu performance and hopefully TB3 for eGPU display when running quad compute only GPUs.

Looks like I will wait some more for 3080ti for quad PCIe 4.0 cards on 3990x to rip my Monte Carlo simulations!!!

Keep up the good work...

Posted on 2019-12-02 05:40:47
Donald Kinghorn

Good to hear from you :-)
We were lucky to get the 24 and 32 core TR's early. They were/are in pretty short supply. I am anxious to try the 64-core! There are several use cases where that could be fantastic. The molecular dynamics stuff should be great together with a couple of high end NV GPU's. I plan on doing a LOT of testing on the 3990x!

If they could get a 4 x X16 board going it would be a game changer. We are considering (??) EPYC for this, but it would be extra nice if we could do it on TR.

Around 2nd quarter 2020 things should get really interesting ... and your MC code should get a nice boost :-)

Posted on 2019-12-02 21:44:54
Uxia Pavlowa

So it looks like TR3 has higher performance than ASCI Red, the fastest supercomputer in the world from 1997 (used until 2005) :) with over 4,000 Pentium Pro processors and costing $46 million. It had 1.06 TFLOPS (1.3 peak) and TR3 has 1.3 TFLOPS (sustained) :)

Posted on 2019-12-02 13:05:55

Assuming those 3970x TFLOPS are double precision, a single rtx 2080ti gpu card for $1200 can do 13TFLOPS (peak single precision only though)!!!
64-core Threadripper 3990x + quad 3080ti PCIe 4.0 cards in a Gigabyte Aorus Extreme by the end of 2020...will be simply amazing

Posted on 2019-12-02 13:19:11

What BIOS version were you running on your Aorus Extreme? Just completed a build here on an Aorus TRX40 Master with a 3960x and can't get Ubuntu 18.04 LTS to boot...

Thank you for any input you can provide on your environment / experience.

Posted on 2019-12-22 05:43:43
Donald Kinghorn

I'm really not sure! I believe that was an early sample board (we are going with the TRX40 Pro for production) ... I can give you a list of the settings we are using on the Ultra (below). I didn't have trouble installing on that board with either Ubuntu 18.04 or 19.10. On the Ultra I could not get 19.10 to boot, but 18.04 went without trouble.

Try using the 18.04 "alternate" installer for server. Then install and use "tasksel" to load a desktop. I've had lots of failures trying to use Ubuntu's new installer! (Supposedly they are going to drop the reliable Debian installer for 20.04 and only use the "almost-never-works-for-me-on-new-hardware" new one.) The old Debian installer is here,

Here are the BIOS settings (other than fan control) we are using on the Ultra. Hope this helps! --Don
-Settings Tab
Platform Power
Wake on LAN => Disabled
IO Ports
Above 4G Decoding => Enabled
SATA Configuration
Chipset SATA Port Hot Plug => Disabled
Network Stack Configuration
Network Stack => Enabled
Ipv4 HTTP Support => Disabled
Ipv6 PXE Support => Disabled
Ipv6 HTTP Support => Disabled
LEDs in System Power On State => Off
Trusted Computing
Security Device Support => Disable

-Boot Tab
Full Screen LOGO Show => Disabled
LAN PXE Boot Option ROM => Enabled
Preferred Operating Mode => Advanced Mode

Posted on 2019-12-23 16:27:36
oscar barenys

nice article.. can you do an article using Mathematica 12.0 built in benchmark (MathematicaMark).. it uses MKL so same trick to enable AVX2 on AMD.. benchmark fft, sgemm, linear algebra.. Apple uses it..

Posted on 2020-01-17 02:31:03
Donald Kinghorn

I would love to do that. ... but, I don't have a license for MMA (or MATLAB, which I would also do). {If you have any comparison numbers feel free to post them here as a comment. I, and I'm sure others, would be interested in seeing that!}

I have used both Mathematica and MATLAB extensively in the past, i.e. when I was an academic and had free site license access. Great programs! I highly recommend both of them, but I can't afford them as a "civilian" working in the business world if I'm mainly using them for benchmarking and writing blog posts.

I would happily do benchmarking with these, and would love to do "recommended hardware" posts for them. But, I haven't been able to get cooperation from Wolfram or MathWorks in the past. ... I might try again :-) we are making a big benchmarking effort (on Windows) at Puget, so ... maybe...

Posted on 2020-01-17 15:59:04
oscar barenys

thanks for answering..
forgot to say that Mathematica offers a free "15 day" trial ("full featured" in regards to MathematicaMark) so at least a "single review" could be done without a paid license until trial expires..
waiting for a 3990x review :-)

Posted on 2020-02-06 14:19:36
Donald Kinghorn

3990x testing will be soon :-)

Posted on 2020-02-06 20:32:23
Vasista Adupa

Greetings Sir, are there any CPU bottlenecks observed with 2x NVIDIA RTX 2080Ti with the 3970x?

Posted on 2020-07-19 04:39:45
Donald Kinghorn

Not that I found... It's a good platform! The Gigabyte MB that was used has a PCIe 4.0 x16/x8/x16/x8 layout, so it is great for 2 cards at X16. They are separated enough that you could use cards with side fans (but I still personally prefer cards with blowers for cooling, even if they may be a little louder).

That NAMD result is very good! And, there was a nice improvement from adding the 2nd GPU. (There is enough CPU capability to keep up with the GPU's for that workload ... it is best with a balance of both CPU and GPU)

It is a very good platform for a lot of different workloads. It should be ready for the RTX 30xx cards too. Those should be PCIe 4 devices (I hope).

I don't know when we will see Ampere RTX cards, but hopefully before the end of the year. I am also very anxious to try the new TR Pro! In the meantime something like the 3970x (or xt) with 2 x 2080Ti's would be really nice! I feel that the 32-core 3970x is the sweet spot for the TR's ... I really like it for everything I've done with it.

Posted on 2020-07-20 15:20:47


Very interesting! I see that quite recently, Intel pulled the plug on enabling AMD CPUs to run MKL. However, regarding the AMD libraries, e.g., BLIS and others, you have posted links to source files and binary executables, but I have a very naive question -- How does one install these files and make them available to AMD CPUs such as the 3970X?

A somewhat related question: I am running Linux Mint 20 Xfce, which is based on Ubuntu 20.04 LTS. The "matching" CUDA version is 11.2. I would like to run AMBER20, but it currently will not accept CUDA versions greater than 10.2. I have not found a good way to install CUDA 10.2 in LM 20 (or Ubuntu 20.04) without running into conflicts with the Nvidia graphics driver.


Posted on 2020-12-18 19:56:29
Donald Kinghorn

Using BLIS or OpenBLAS is going to be different for different applications. If you have an app that you compile with a config file and make, then you may have options to specify the location of the BLAS lib. But in general you are often stuck with whatever the devs used. Since Intel has been dominant for the past 10 years, some folks started compiling against MKL by default. In those cases there is not much you can do without the source code or some other documented config option (like setting an environment variable or something) ... The good news is that it usually doesn't make that much difference except for benchmarking. Real-world applications are often not too dependent on BLAS/Linpack for overall performance. But it's good to be aware and to check for options.
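For the cases where you do control the build, the pattern is just pointing the link step at the BLIS install. A hedged sketch; the install prefix, the example compile line, and the thread count are illustrative assumptions, not a specific app's build procedure:

```shell
# Assumed install prefix for AMD BLIS; adjust to wherever you unpacked it.
BLIS_HOME=$HOME/opt/amd-blis
# Build-time: point the linker at BLIS instead of the default BLAS, e.g.
#   gcc myapp.c -L"$BLIS_HOME/lib" -lblis -lm -fopenmp -o myapp
# Run-time: make the shared lib findable and set BLIS threading.
export LD_LIBRARY_PATH="$BLIS_HOME/lib:${LD_LIBRARY_PATH:-}"
export BLIS_NUM_THREADS=32   # e.g. 1 thread per physical core on a 3970x
```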

AMBER 20 does support RTX30 GPU's so it will build against CUDA 11.2 (you would need at least 11.1)
You will need 2 things: 1) A Python install; they recommend using Miniconda, but Anaconda works too. A simple default install should be OK, but you will need to have at least "base" activated so all the Python paths are set.
2) You will need a recent CUDA toolkit. What you will want to do is get the "run" file.
I assume you have already installed the NVIDIA driver ... In that case you will want to execute the .run file with "sudo sh"; follow the prompts but DO NOT let it install the driver. After you have done that you should have a /usr/local/cuda directory for version 11.2, and your PATH will be set correctly the next time you log in. The AMBER build scripts should find your CUDA install and do the right thing (you may have to do export CUDA_HOME=/usr/local/cuda but it really should just find it).
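A sketch of that sequence (the .run file name pattern is illustrative; use the exact file you downloaded from NVIDIA):

```shell
# Toolkit-only install from the NVIDIA runfile; --toolkit skips the driver.
# (Run the install line manually; it needs the downloaded file and root privileges.)
#   sudo sh cuda_11.2.0_*_linux.run --silent --toolkit
# Afterwards the toolkit lives under /usr/local/cuda; make sure builds can find it:
export CUDA_HOME=/usr/local/cuda
export PATH="$CUDA_HOME/bin:$PATH"
echo "$CUDA_HOME"
```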

I hope that helps! It's not always easy getting code like that up and running

Posted on 2020-12-22 03:50:43


Thanks again. Now I have a new problem. I am installing the quantum chemistry program, GAMESS, on a system with an AMD 3970X CPU. GAMESS has to be installed from source, and I am trying to optimize the compiling by using AMD's AOCC compiler and optimize the math libraries by using AMD's AOCL libraries.

Unfortunately, after installing AOCL, if I then run ldconfig, I get errors indicating that, for example, "libfftw3.so is not a symbolic link", etc. Presumably, this is because Linux Mint 20 (based on Ubuntu 20.04) has its own FFTW3 libraries already installed, and if I try to uninstall these, I get errors indicating that there are many other packages dependent upon these libraries -- so much so, that it would break the system to uninstall them.

Is there a relatively painless method for dealing with this issue?

Thank you!

-- Hypersphere

Posted on 2021-01-03 18:10:20
Donald Kinghorn

Unfortunately not painless :-) You will probably have to compile your own fftw3 lib with AOCL and link to it with a -L "path" flag at build time. If you can do a static link to that then things should work. If you have to use an .so file then you may need to add

LD_LIBRARY_PATH="path to your fftw lib" at run time (or export the environment variable)
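Put concretely, that build-time/run-time pattern looks something like this (the install prefix and the example compile line are assumed examples, not GAMESS's actual build commands):

```shell
# Assumed prefix where you installed your own AOCC-built fftw3.
FFTW_HOME=$HOME/opt/fftw3
# Build-time: link against it explicitly with -L, e.g.
#   gfortran myprog.f90 -L"$FFTW_HOME/lib" -lfftw3 -o myprog
# Run-time (only needed for a shared .so, not a static link):
export LD_LIBRARY_PATH="$FFTW_HOME/lib:${LD_LIBRARY_PATH:-}"
echo "$LD_LIBRARY_PATH"
```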

This stuff is almost always painful to some degree. I'm a big fan of "docker" containers and will be doing a lot more work with them this year. Custom builds are a perfect case for this. I can see how doing optimized builds of GAMESS for Intel and AMD would be a service to the community. I'll add that to my list ... I'm planning to do a lot of benchmark work this year and want to put everything in containers. The benchmark containers should be good application containers too!

Posted on 2021-01-04 22:52:42

Hi Donald,

Happy New Year! Thanks for responding so soon to my inquiry.

I also sent a question about AOCC and AOCL to the folks at AMD, and I received a reply today. They said they are working with the
GAMESS developers on this, and that a future update of GAMESS is promised that will facilitate the use of AMD-optimized compilers and math libraries. (Regarding math libraries, as you have no doubt discovered already, Intel has shut the door on the former workaround for using MKL with AMD CPUs).
During this process, I discovered a nifty alternative package management system called "Spack". This worked quite well at installing AOCC and AOCL, but I still could not get the linking with GAMESS to work.

I finally got GAMESS to work on my system by reverting to gfortran and OpenBLAS. GAMESS was then built with no errors and it passed all 48 of the post-install tests.

Your work with containers looks quite interesting. I have not yet used this approach in any of my work, but I look forward to giving it a try.

Best wishes,


Posted on 2021-01-04 23:08:05
Donald Kinghorn

Happy New Year to you too! :-)
That is great news about AMD and GAMESS! This is exactly what needs to happen. Intel has been dominant for so long that packages haven't been optimized well for AMD's great new hardware. I expect this to change drastically this year.

I was going to suggest gcc/gfortran + OpenBLAS, but it seemed like you were close with AOCC. gcc + OpenBLAS is a pretty good build environment in general. The thing is, there are almost certainly performance gains that could be made, but it will take a fair bit of work to make that happen. I do expect this to be a good year for AMD.

I've committed to doing a bunch of work on the Intel platform for the start of the year, including oneAPI, which I'm anxious to play with ... but I'm looking forward to diving deeper into TR and EPYC after that :-)

Posted on 2021-01-04 23:19:16