Read this article at https://www.pugetsystems.com/guides/1692
Dr Donald Kinghorn (Scientific Computing Advisor)

Threadripper 3990x vs 3970x Performance and Scaling (HPL, Numpy, NAMD plus GPUs)

Written on March 6, 2020 by Dr Donald Kinghorn


Is 32-cores enough? I had some testing time again on an AMD Threadripper 32-core 3970x and thought it would be interesting to compare that to the 64-core 3990x.

I have two recent posts on the 3990x: Threadripper 3990x 64-core Parallel Scaling and AMD Threadripper 3990x 64-core Linpack and NAMD Performance (Linux). I'll be running the same testing jobs from those posts on the 3970x and generating charts with direct comparisons.

System Configuration

  • AMD Threadripper 3990x and 3970x
  • Motherboard Gigabyte TRX40 AORUS
  • Memory 8x DDR4-2933 16GB (128GB total)
  • 1TB Samsung 960 EVO M.2
  • NVIDIA RTX 2080Ti GPU and RTX Titan
  • Ubuntu 20.04 (pre-release)
  • Kernel 5.4.0-14-generic
  • gcc/g++ 9.2.1
  • AMD BLIS library v 2.0
  • HPL Linpack 2.2 (Using pre-compiled binary at link above)
  • OpenMPI 3.1.3 (installed from source)
  • NAMD 2.13 (Molecular Dynamics)
  • Anaconda Python: numpy with OpenBLAS

Amdahl's Law and Performance Charts

For a discussion and Python code examples of how the Amdahl's Law charts are generated, please see the 3990x parallel scaling post linked in the introduction. Those plots are reused here with an additional subplot overlay of the 3970x scaling results.
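As a quick illustration (my own minimal sketch, not the plotting code from that post), Amdahl's Law predicts the speedup on n cores for a job whose parallel fraction is p:

```python
def amdahl_speedup(p, n):
    """Predicted speedup on n cores for a job with parallel fraction p (Amdahl's Law)."""
    return 1.0 / ((1.0 - p) + p / n)

# Even a 99% parallel job falls well short of ideal scaling at 64 cores
for n in (1, 8, 16, 32, 64):
    print(f"{n:3d} cores: {amdahl_speedup(0.99, n):6.2f}x")
```

This is why the scaling curves bend over near max cores: the serial fraction, however small, caps the achievable speedup.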

In case it's useful for you, the performance charts are simple Seaborn bar charts from Pandas dataframes. Here's a code fragment for the HPL plot,

import pandas as pd
import seaborn as sns

dfhpl = pd.DataFrame({'CPU':[
    'TR 3990x 64-core AVX2 BLIS v2.0',
    'TR 3970x 32-core AVX2 BLIS v2.0'],
    'GFLOP/s':[1571, 1326]})

# Plot it
clrs = sns.color_palette("Reds_d", 2)
ax = sns.barplot(y="CPU", x="GFLOP/s", data=dfhpl, palette=clrs)
ax.set_title('HPL Linpack Benchmark: 3990x, 3970x\n(Higher is better)', fontsize=18)

# Add value labels next to the bars
y = dfhpl['GFLOP/s']
for i, v in enumerate(y):
    ax.text(v, i + .125, str(v), color='black', fontweight='bold')

HPL Linpack Performance and Scaling 3990x vs 3970x

HPL (Linpack) is provided by AMD with the BLIS library. For optimal performance the problem size (114000) was chosen to use approximately 88% of the 128GB system memory, with a block size of 768. These job runs use multi-threaded (MT) BLIS without SMT "hyperthreads". `OMP_NUM_THREADS` is set to the number of "real" cores (32 or 64) for the performance runs and is varied from 1 to max cores for the scaling runs.
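As a rough sanity check (my own back-of-the-envelope sketch, not from the HPL docs), the HPL matrix alone needs 8N² bytes in double precision:

```python
N = 114000                          # HPL problem size used in these runs
matrix_gib = 8 * N**2 / 1024**3     # 8 bytes per double-precision element
print(f"HPL matrix: {matrix_gib:.0f} GiB of the 128 GiB system memory")
```

The rest of memory is left for HPL's workspace and the OS; pushing N too high will cause swapping and ruin the result.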

Linpack performance

The 32-core 3970x performance is very good but I had hoped for better results from the 64-core 3990x. Ideally it would be near double that of the 3970x.

HPL scaling

The scaling fall-off near max-cores for the 3970x is similar to that of the 3990x.

Numpy OpenBLAS norm(A@B) Performance and Scaling 3990x vs 3970x

This is a simple numpy test computing the Frobenius norm of a matrix product.

For a discussion and python code examples of this Numpy job please see the 3990x parallel scaling post linked in the introduction.
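The core of the test looks something like this (a minimal sketch; the matrix size here is arbitrary for illustration, not the size used in the benchmark runs):

```python
import numpy as np

n = 2000                              # illustrative size only
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

# The matrix product runs in the threaded BLAS (OpenBLAS here);
# norm() defaults to the Frobenius norm for a 2-D array
result = np.linalg.norm(A @ B)
print(result)
```

Nearly all of the runtime is in the `A @ B` product, so this is effectively a BLAS dgemm throughput test.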

numpy performance

Again, I would like to see better relative performance with the 3990x.

Note: For these tests numpy with OpenBLAS gave better performance than using MKL (debug).

numpy scaling

Scaling for the 3970x is very good. It closely follows that of the 3990x up to 32-cores without significant drop-off near max cores.

NAMD ApoA1 Performance and Scaling 3990x vs 3970x

ApoA1 ~ 92,000 atoms, 500 time steps

NAMD scales well on CPU and has good GPU acceleration. I have included CPU-only and CPU + (2) GPU results.

Results are in days per nanosecond of simulation time (the standard NAMD performance metric), so lower is better.
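Since days/ns is inverted relative to throughput, a tiny conversion helper may be handy (my own convenience snippet, not part of NAMD):

```python
def days_per_ns_to_ns_per_day(days_per_ns):
    """Convert NAMD's days/ns timing to simulation throughput in ns/day."""
    return 1.0 / days_per_ns

# e.g. 0.5 days/ns corresponds to 2 ns of simulation per day
print(days_per_ns_to_ns_per_day(0.5))
```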

NAMD ApoA1 performance

The first thing to notice is that the 3990x + RTX Titan result is slightly worse than the 3970x + 2080Ti result. I believe the limitation here is the number of GPUs for the 3990x. NAMD does best with a balance between CPU cores and GPUs, and the 3990x would likely still benefit from 1 or 2 more GPUs. The 3970x is probably near the optimal balance with the 2 2080Ti's. The problem is that the Threadripper motherboards I know of only support 2 GPUs at full PCIe X16 (X16/X16). It may be fine to run 4 GPUs at X8 but I have not tested this.

NAMD ApoA1 scaling

The CPU scaling for the 3970x is very good, mirroring that of the 3990x out to 32-cores.

NAMD STMV Performance and Scaling 3990x vs 3970x

STMV ~ 1 million atoms, 500 time steps

NAMD STMV performance

Here we see relative CPU performance similar to that with ApoA1. The GPU performance for the 3990x is better than the 3970x in this case. This is a much larger problem and there are more forces to be computed on the GPUs. Again, I believe there would be a performance increase with more GPUs for the 3990x.

NAMD STMV scaling

The CPU scaling for this job run is similar to the ApoA1 results.


From these results it seems clear that the core utilization for the 3970x is very good. The 64-core 3990x does offer improved performance but the benefit beyond 40-48 cores can diminish significantly ... at least for the tests that I did in this post! I am sure there will be applications where the 3990x will give better performance scaling.

The 32-core 3970x is an easy recommendation for well-scaling parallel applications, where it should offer good hardware utilization. The 64-core 3990x is certainly worth consideration and does offer increased performance, but my feeling is that we are seeing scaling limitations from memory subsystem capability, which is limiting utilization of the hardware.

I will be testing AMD EPYC Rome and am happy to announce that Puget Systems will be qualifying platforms soon. My guess at this point is that for the 64+ core systems there will be better hardware utilization and overall performance with EPYC. I'll know soon!

Happy computing! --dbk @dbkinghorn

Tags: AMD, HPL, linpack, NAMD, Threadripper
Misha Engel

Having both the same TDP (280W) the 3990x seems to be the better choice for above workloads also considering the total system price and the upcoming new data center GPU's from both NVidia and AMD(both on PCIe-4 I guess).

Posted on 2020-03-07 14:57:01
Donald Kinghorn

I like the 3990x! I felt like I could have gotten more out of it too. It's an amazing processor and even though there is some (expected) scaling fall-off it is still very good, and many jobs should see an improvement in turn-around time. It probably is worth the extra cost for jobs that scale well and may be good for running multiple simultaneous workloads ... I didn't test that. The 3970x is also sweet! I would probably spend the extra money on the 3990x if budget was not an issue, but as I mentioned in the post the 3970x is a pretty easy recommendation. It looks like it gives really good performance utilization for the modest cost.

Posted on 2020-03-09 21:30:11
lemans24

Don, we are all crying out for a full comparison of 3990x vs Epyc Rome 64-core with full 8-way memory and quad 2080ti GPU's!!
(Tyan s8030 motherboard maybe??)
I really feel that once NVidia Ampere PCIE 4.0 cards come out, we will see more of a benefit to the 3990x.

For the money, the 3970x hits the sweet spot for price/performance as compared to the 3990x, and AMD really should have come out with 8-way memory to fully utilize the 3990x. Obviously from an economic view, it makes no sense for AMD to do this as the 3990x really should be priced around $3000.

3990x is currently way overpriced for me but I will be getting the 3960x and dual 3080ti PCIE 4.0 cards as the performance for my Monte Carlo simulations is purely based on the speed of the GPU cards.

Posted on 2020-03-11 22:49:16
Donald Kinghorn

Coming soon to a blog post near you :-) I am testing the Intel 3265W right now 24-core 64L part (we are using this in our new 4x X16/X16 platform ... no PLX chip needed!) It's pretty good, near the same price and performance as the 3990x. I am trying to get some EPYC Rome testing in the next couple of days ... it will be on a dual in the cloud so not really a good comparison, but it will do until the single socket systems we are qualifying get to me. I expect that we will have a 4x X16/X16 EPYC system too .... and yes, that could be really nice, especially if Ampere supports PCIe 4 ???
The 3970x is definitely in the sweet spot, a lot of compute value in that processor!

Posted on 2020-03-12 14:56:20
Wangzhi Zheng

Hi Dr. Kinghorn, I am wondering if you can do a comparison between AMD TR vs Intel x cpu on machine learning (e.g. tensorflow) performance? since Intel has a very good optimization on TF but with extreme shortage on x cpu right now, can 3rd TR beat 10th gen intel x cpu and therefore a good replacement?

Posted on 2020-06-09 14:39:20
Donald Kinghorn

That is a very good question! ... and reason for asking...

The new AMD processors are very good and supply seems to be pretty good right now too. I have done experiments with TensorFlow before, building it from source to add Intel MKL library calls for CPU but I don't think it made much difference. AMD should give good performance and TF is highly parallel so it should scale well on something like the 3970x 32-core. (I like that CPU a lot!)

Of course if you can add something like an RTX 2080Ti or even a 2070super (or 4 of them!) you will get a huge performance boost over CPU. However, not all workloads and workflows are suitable for GPU, and TensorFlow is a very versatile framework.

Also, I have a use-case of needing to support many users on a single server for classes, workshops etc. from a JupyterHub setup. It would be good for me to test the 3970x and 3990x for that.

We don't have many test platforms and they seem to be constantly running Windows benchmarks. I have to get a separate platform for any Linux based testing. I'll see if I can get access to some testing hardware.

This would make a great post, thanks! --Don

Posted on 2020-06-09 16:06:49
Wangzhi Zheng

Some of my own experience here: the anaconda version of TF 2.1 is nearly twice as fast as standard TF, as anaconda has optimized TF with intel MKL. This result is based on my laptop with Intel 9750H. I am not sure if avx512 in x cpu has even better performance gain in the context of TF, so it would be nice to compare 9900k (avx2) with x cpu (avx512) with optimized TF. Since Intel x is not available, natural alternative will be intel k cpu (go lower) or AMD TR (go higher). Without optimization, not sure if TR 3970x can beat an optimized 10980x (it will be nice if it could). In terms of test platform, most people I know in the data science industry are using windows, so no objection if you use windows to run the test.

Posted on 2020-06-09 16:36:02
Donald Kinghorn

That's interesting to see you got a 2x perf increase with the Anaconda build. That's about max for MKL-on-AVX2 efficiency. But, yes it should vectorize really well. When I did my build tests it was still young code and I didn't get much speedup. That's more motivation for the testing!

I can do this on Windows ... in more than one way i.e. native and on WSL2 (I'm hoping to be testing GPU compute support on WSL2 soon too :-)

Yes, this is interesting enough that I think I'll dive into it. There are a lot of comparisons that can be made. I'll try to do a portable benchmark if I can, something that could be shared ... I have other work going on but could really use an immersive distraction :-)

Also, just to let you know, Core-X is mostly still in shortage but the 10980X is available now, yay! We just removed our shortage tag from it. The Xeon W 32xx are also no longer in shortage. (I like these a lot, 64L with 4 PCIe X16 slots on a board, no PLX!) Supply looks OK for the AMD parts (I particularly like the 3970x TR!) We are also qualifying EPYC Rome (we had some motherboard problems but new boards are looking better)

Posted on 2020-06-10 15:28:07
Wangzhi Zheng

excited and look forward to seeing your test!

Posted on 2020-06-09 20:05:40
Hypersphere


How does the AMD 3970X compare with the Intel i9-10980XE with respect to MD simulation performance? For long-term simulations running each CPU at maximum load, does the AMD chip throttle because of its lower max temperature and higher energy consumption?


Posted on 2020-08-15 19:26:47
Donald Kinghorn

Hey, good to hear from you :-)

I don't have a direct comparison for you but you might find this post interesting
I have NAMD results in there near the end and compare to a 24-core Xeon 3265W (a really nice system for use with 4 GPU's!)

The TR's do really well for NAMD on CPU. For NAMD I personally think a 3970x with 2 2080Ti's is close to optimal for hardware utilization and value.

I'm not any more worried about the TR's than the Intel procs for thermal throttling in a properly cooled system. They should run under sustained load just fine, For a long running heavy load you usually reach thermal equilibrium pretty quickly. You have to watch out for ambient temperature increases because of the generated heat (especially this time of year). For a long heavy job the extra "internal overclocking" stuff will drop out pretty quick because of power draw. I've had testing jobs running on the TR's for 6-8 hours and didn't observe any performance problems. Jobs that go for days or weeks should be fine. ... but I really haven't tested that!

Posted on 2020-08-17 19:21:28
Hypersphere

What recommendations would you have for RAM configuration and speed for MD simulations with the 3970X and RTX2080Ti? I am considering an Asus ROG Zenith II Extreme Alpha motherboard.

Posted on 2020-08-17 20:07:51
Donald Kinghorn

That board looks like it is really well done. We settled on Gigabyte TRX40 AORUS PRO for the 3970x

Production and support have been pretty happy with that. That's not to discourage you from the ASUS, that looks like a very nice board!

Looks like that ASUS board supports some really high-clocked memory. I don't have any experience with that. We've been testing with 2933. I would personally be somewhat conservative but probably higher clocked than what we typically use (as long as it's not over-priced). I would recommend that you use modules with Samsung chips. Other than that, I would look at https://www.asus.com/us/Mot...

Posted on 2020-08-17 21:17:29
Hypersphere

Thanks. I always enjoy your articles and advice.

Posted on 2020-08-17 22:41:43
cyberseawolf

Did you manage to play around with MPI/OpenMP core distribution and NUMA domains? I spent a lot of time with HPL on Intel Xeon Skylake CPUs and I got a major boost when core distribution was "close-to-right".

Posted on 2021-01-06 11:58:05
Donald Kinghorn

I did not spend much time optimizing process layout and such for these tests. ... but, yes! that can sometimes make a significant difference. When I am working things out for best parallel performance I generally try several openMPI mappings and some openMP tuning. Most of the time it doesn't have a significant impact (these days) ... but sometimes there are surprises!

The EPYC and TR CPU are wonderful beasts and it can definitely be good to think of the "core clusters" as independent units. I've had good luck with map-by-L3 cache!

I plan to start in on some significant AMD testing towards the end of March. (new CPU's :-) I plan to make a significant effort for performance optimization!

Posted on 2021-01-06 16:43:05
cyberseawolf

Thank you Donald for you kind reply.
I totally agree that small differences between individual runs should be negligible. On the other hand, users perform those calculations several times in a single week so small savings can add up to a significant amount (I'm thinking about at least a couple of beers with a whisky instead of just one glass because "more calculations are needed").
Concerning "core clusters", I usually have the impression that it is an unwinnable crusade to educate common users to proficiently employ HPC resources (I was an admin and a user of my own cluster for a while). Do you have the same feeling during your experiments? In my experience, it indeed gives you a tremendous boost in performance if you spend enough time to really understand how the investigated app works.

I wish all the luck for your future studies. I'll definitely keep an eye on Puget articles to read about your new results!

Posted on 2021-01-06 18:40:08
Donald Kinghorn

I like your time saving analogy :-)
When I tested the first Threadripper I got the best results when I started thinking of it like a quad-Opteron! It's a bit different now since the interconnect and cache layout is much improved. I think 2021 will be a good year for AMD processors! El Capitan going in at Oak Ridge should spur a lot of development work ... for their GPUs too!

You are absolutely correct about performance gains from performance opts on the "admin" side. When you have the option to compile from source and tune for a particular arch it can make a big difference. I've also had times when I was very disappointed that significant effort made little difference ... but it is always worth a try when it is long-running code that is really important ... cheers :-) --Don

Posted on 2021-01-06 21:05:28