

Read this article at https://www.pugetsystems.com/guides/1692
Dr Donald Kinghorn (Scientific Computing Advisor)

Threadripper 3990x vs 3970x Performance and Scaling (HPL, Numpy, NAMD plus GPUs)

Written on March 6, 2020 by Dr Donald Kinghorn


Are 32 cores enough? I had some testing time again on an AMD Threadripper 32-core 3970x and thought it would be interesting to compare it to the 64-core 3990x.

I have two recent posts on the 3990x, Threadripper 3990x 64-core Parallel Scaling and AMD Threadripper 3990x 64-core Linpack and NAMD Performance (Linux). I'll be running the same testing jobs from those posts on the 3970x and generating charts with direct comparisons.

System Configuration

  • AMD Threadripper 3990x and 3970x
  • Motherboard Gigabyte TRX40 AORUS
  • Memory 8x DDR4-2933 16GB (128GB total)
  • 1TB Samsung 960 EVO M.2
  • NVIDIA RTX 2080Ti GPU and RTX Titan
  • Ubuntu 20.04 (pre-release)
  • Kernel 5.4.0-14-generic
  • gcc/g++ 9.2.1
  • AMD BLIS library v 2.0
  • HPL Linpack 2.2 (Using pre-compiled binary at link above)
  • OpenMPI 3.1.3 (installed from source)
  • NAMD 2.13 (Molecular Dynamics)
  • Anaconda Python: numpy with OpenBLAS

Amdahl's Law and Performance Charts

For a discussion and Python code examples of how the Amdahl's Law charts are generated, please see the 3990x parallel scaling post linked in the introduction. Those plots are reused here with an additional subplot overlay of the 3970x scaling results.
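The core of those charts is a one-line formula. Here's a minimal sketch of the Amdahl's Law speedup calculation (the parallel fractions shown are illustrative, not the fitted values from the scaling posts):

```python
def amdahl_speedup(p, n):
    """Amdahl's Law: predicted speedup on n cores for a job with parallel fraction p."""
    return 1.0 / ((1.0 - p) + p / n)

# Even a 99% parallel workload tops out well short of 64x on 64 cores
for n in (1, 8, 16, 32, 64):
    print(f"{n:3d} cores: {amdahl_speedup(0.99, n):6.2f}x")
```

The serial fraction (1 - p) dominates as core count grows, which is why the scaling curves in the charts below flatten near max cores.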

In case it's useful for you, the performance charts are simple Seaborn bar charts from Pandas dataframes. Here's a code fragment for the HPL plot,

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

dfhpl = pd.DataFrame({'CPU':[
    'TR 3990x 64-core AVX2 BLIS v2.0',
    'TR 3970x 32-Core AVX2 BLIS v2.0'],
    'GFLOP/s':[1571, 1326]})

# Plot it
clrs = sns.color_palette("Reds_d", 2)
ax = sns.barplot(y="CPU", x="GFLOP/s", data=dfhpl, palette=clrs)
ax.set_title('HPL Linpack Benchmark: 3990x, 3970x (Higher is better)', fontsize=18)

# Label each bar with its GFLOP/s value
y = dfhpl['GFLOP/s']
for i, v in enumerate(y):
    ax.text(v, i + .125, str(v), color='black', fontweight='bold')
plt.show()

HPL Linpack Performance and Scaling 3990x vs 3970x

HPL (Linpack) is provided by AMD with the BLIS library. For optimal performance the problem size (N = 114000) was chosen to use approximately 88% of the 128GB system memory, with a block size (NB) of 768. These job runs use multi-threaded (MT) BLIS without SMT "hyperthreads". `OMP_NUM_THREADS` is set to the number of "real" cores (32 and 64) for the performance runs and is varied from 1 to max cores for the scaling runs.
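The sizing rule behind that choice can be sketched in a few lines: HPL solves a dense N x N double-precision system, so N is picked so that 8*N^2 bytes fills most of RAM, then rounded down to a multiple of the block size NB. This is the generic rule of thumb; note the raw estimate comes out somewhat larger than the 114000 actually used, which leaves extra headroom for the OS and HPL work arrays.

```python
import math

mem_bytes = 128 * 1024**3      # 128 GB system memory
target_frac = 0.88             # aim to fill ~88% of memory
nb = 768                       # HPL block size (NB)

# 8 bytes per double-precision matrix element: solve 8*N^2 = frac*mem for N
n_est = math.sqrt(target_frac * mem_bytes / 8)
n = int(n_est // nb) * nb      # round down to a multiple of NB
print(n)
```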

Linpack performance

The 32-core 3970x performance is very good but I had hoped for better results from the 64-core 3990x. Ideally it would be near double that of the 3970x.

HPL scaling

The scaling fall-off near max-cores for the 3970x is similar to that of the 3990x.

Numpy OpenBLAS norm(A@B) Performance and Scaling 3990x vs 3970x

This is a simple numpy test computing the Frobenius norm of a matrix product.

For a discussion and python code examples of this Numpy job please see the 3990x parallel scaling post linked in the introduction.
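The shape of the test is simple; here's a minimal sketch (the matrix size below is an illustrative stand-in, smaller than what was used for the benchmark numbers):

```python
import time
import numpy as np

n = 2000                              # illustrative size; the benchmark runs used larger matrices
rng = np.random.default_rng(42)
A = rng.random((n, n))
B = rng.random((n, n))

t0 = time.perf_counter()
fnorm = np.linalg.norm(A @ B)         # Frobenius norm of the matrix product
elapsed = time.perf_counter() - t0

gflops = 2 * n**3 / elapsed / 1e9     # matrix multiply dominates: ~2n^3 flops
print(f"norm = {fnorm:.4e}  time = {elapsed:.3f} s  ~{gflops:.1f} GFLOP/s")
```

The `A @ B` product is what exercises the multi-threaded BLAS (OpenBLAS here), so the timing mostly reflects dgemm performance.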

numpy performance

Again, I would like to see better relative performance with the 3990x.

Note: For these tests numpy with OpenBLAS gave better performance than using MKL (debug).

numpy scaling

Scaling for the 3970x is very good. It closely follows that of the 3990x up to 32-cores without significant drop-off near max cores.

NAMD ApoA1 Performance and Scaling 3990x vs 3970x

ApoA1 ~ 92000 atoms 500 time steps

NAMD scales well on CPU and has good GPU acceleration. I have included CPU-only and CPU + (2) GPU results.

Results are in days per nanosecond of simulation time (the standard NAMD performance metric); lower is better.
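Since day/ns is a "lower is better" number, a quick reciprocal converts it to the more intuitive ns/day:

```python
def ns_per_day(day_per_ns):
    """Convert NAMD's day/ns timing (lower is better) to ns/day (higher is better)."""
    return 1.0 / day_per_ns

# e.g. a run reported at 0.25 day/ns advances 4 ns of simulation per day
print(ns_per_day(0.25))
```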

NAMD ApoA1 performance

The first thing to notice is that the 3990x + RTX Titan result is slightly worse than the 3970x + (2) 2080Ti result. I believe the limitation here is the number of GPUs for the 3990x. NAMD does best with a balance between CPU cores and GPUs, and the 3990x would likely still benefit from 1 or 2 more GPUs. The 3970x is probably near the optimal balance with the two 2080Tis. The problem is that Threadripper motherboards (that I know of) only support two full-bandwidth PCIe slots (X16/X16). It may be fine to run 4 GPUs at X8 but I have not tested this.

NAMD ApoA1 scaling

The CPU scaling for the 3970x is very good, mirroring that of the 3990x out to 32-cores.

NAMD STMV Performance and Scaling 3990x vs 3970x

STMV ~ 1 million atoms 500 time steps

NAMD STMV performance

Here we see relative CPU performance similar to that with ApoA1. The GPU performance for the 3990x is better than the 3970x in this case. This is a much larger problem and there are more forces to be computed on the GPUs. Again, I believe there would be a performance increase with more GPUs for the 3990x.

NAMD STMV scaling

The CPU scaling for this job run is similar to the ApoA1 results.


From these results it seems clear that the core utilization for the 3970x is very good. The 64-core 3990x does offer improved performance but the benefit beyond 40-48 cores can diminish significantly ... at least for the tests that I did in this post! I am sure there will be applications where the 3990x will give better performance scaling.

The 32-core 3970x is an easy recommendation for well-scaling parallel applications, where it should offer good hardware utilization. The 64-core 3990x is certainly worth consideration and does offer increased performance, but my feeling is that we are seeing scaling limitations caused by memory subsystem capability, which limits utilization of the hardware.

I will be testing AMD EPYC Rome and am happy to announce that Puget Systems will be qualifying platforms soon. My guess at this point is that for the 64+ core systems there will be better hardware utilization and overall performance with EPYC. I'll know soon!

Happy computing! --dbk @dbkinghorn

Looking for a GPU Accelerated Workstation?

Puget Systems offers a range of workstations that are tailor-made for your unique workflow. Our goal is to provide the most effective and reliable system possible so you can concentrate on your work and not worry about your computer.


Why Choose Puget Systems?

Built specifically for you

Rather than getting a generic workstation, our systems are designed around your unique workflow and are optimized for the work you do every day.

Fast Build Times

By keeping inventory of our most popular parts, and maintaining a short supply line to parts we need, we are able to offer an industry leading ship time of 7-10 business days on nearly all our system orders.

We're Here, Give Us a Call!

We make sure our representatives are as accessible as possible, by phone and email. At Puget Systems, you can actually talk to a real person!

Lifetime Support/Labor Warranty

Even when your parts warranty expires, we continue to answer your questions and even fix your computer with no labor costs.



Tags: AMD, HPL, linpack, NAMD, Threadripper
Misha Engel

Having the same TDP (280W), the 3990x seems to be the better choice for the above workloads, also considering the total system price and the upcoming new data center GPUs from both NVIDIA and AMD (both on PCIe 4, I guess).

Posted on 2020-03-07 14:57:01
Donald Kinghorn

I like the 3990x! I felt like I could have gotten more out of it too. It's an amazing processor and even though there is some (expected) scaling fall-off it is still very good, and many jobs should see an improvement in turn-around time. It is probably worth the extra cost for jobs that scale well, and may be good for running multiple simultaneous workloads ... I didn't test that. The 3970x is also sweet! I would probably spend the extra money on the 3990x if budget were not an issue. But, as I mentioned in the post, the 3970x is a pretty easy recommendation. It looks like it gives really good performance utilization for the modest cost.

Posted on 2020-03-09 21:30:11

Don, we are all crying out for a full comparison of the 3990x vs EPYC Rome 64-core with full 8-way memory and quad 2080Ti GPUs!!
(Tyan s8030 motherboard maybe??)
I really feel that once NVidia Ampere PCIE 4.0 cards come out, we will see more of a benefit to the 3990x.

For the money, the 3970x hits the sweet spot for price/performance as compared to the 3990x, and AMD really should have come out with 8-way memory to fully utilize the 3990x. Obviously from an economic view, it makes no sense for AMD to do this, as the 3990x really should be priced around $3000.

The 3990x is currently way overpriced for me, but I will be getting the 3960x and dual 3080Ti PCIe 4.0 cards, as the performance for my Monte Carlo simulations is purely based on the speed of the GPU cards.

Posted on 2020-03-11 22:49:16
Donald Kinghorn

Coming soon to a blog post near you :-) I am testing the Intel Xeon W-3265 right now, a 24-core part with 64 PCIe lanes (we are using this in our new 4x X16/X16 platform ... no PLX chip needed!). It's pretty good, near the same price and performance as the 3990x. I am trying to get some EPYC Rome testing in the next couple of days ... it will be on a dual socket in the cloud so not really a good comparison, but it will do until the single socket systems we are qualifying get to me. I expect that we will have a 4x X16/X16 EPYC system too ... and yes, that could be really nice, especially if Ampere supports PCIe 4 ???
The 3970x is definitely in the sweet spot, a lot of compute value in that processor!

Posted on 2020-03-12 14:56:20
Wangzhi Zheng

Hi Dr. Kinghorn, I am wondering if you could do a comparison between AMD TR and Intel X-series CPUs on machine learning (e.g. TensorFlow) performance? Since Intel has very good optimization for TF but there is an extreme shortage of X-series CPUs right now, can 3rd-gen TR beat 10th-gen Intel X-series CPUs and therefore be a good replacement?

Posted on 2020-06-09 14:39:20
Donald Kinghorn

That is a very good question! ... and reason for asking...

The new AMD processors are very good and supply seems to be pretty good right now too. I have done experiments with TensorFlow before, building it from source to add Intel MKL library calls for CPU but I don't think it made much difference. AMD should give good performance and TF is highly parallel so it should scale well on something like the 3970x 32-core. (I like that CPU a lot!)

Of course if you can add something like an RTX 2080Ti or even a 2070super (or 4 of them!) you will get a huge performance boost over CPU. However, not all workloads and workflows are suitable for GPU, and TensorFlow is a very versatile framework.

Also, I have a use-case of needing to support many users on a single server for classes, workshops etc. from a JupyterHub setup. It would be good for me to test the 3970x and 3990x for that.

We don't have many test platforms and they seem to be constantly running Windows benchmarks. I have to get a separate platform for any Linux based testing. I'll see if I can get access to some testing hardware.

This would make a great post, thanks! --Don

Posted on 2020-06-09 16:06:49
Wangzhi Zheng

Some of my own experience here: the Anaconda version of TF 2.1 is nearly twice as fast as standard TF, as Anaconda has optimized TF with Intel MKL. This result is based on my laptop with an Intel 9750H. I am not sure if AVX-512 in the X-series CPUs gives an even better performance gain in the context of TF, so it would be nice to compare the 9900K (AVX2) with an X-series CPU (AVX-512) with optimized TF. Since Intel X-series is not available, the natural alternatives are Intel K-series CPUs (go lower) or AMD TR (go higher). Without optimization, I am not sure if the TR 3970x can beat an optimized 10980XE (it would be nice if it could). In terms of test platform, most people I know in the data science industry are using Windows, so no objection if you use Windows to run the test.

Posted on 2020-06-09 16:36:02
Donald Kinghorn

It's interesting to see you got a 2x perf increase with the Anaconda build. That's about max for MKL-on-AVX2 efficiency. But, yes, it should vectorize really well. When I did my build tests it was still young code and I didn't get much speedup. That's more motivation for the testing!

I can do this on Windows ... in more than one way i.e. native and on WSL2 (I'm hoping to be testing GPU compute support on WSL2 soon too :-)

Yes, this is interesting enough that I think I'll dive into it. There are a lot of comparisons that can be made. I'll try to do a portable benchmark if I can, something that could be shared ... I have other work going on but could really use an immersive distraction :-)

Also, just to let you know, Core-X is mostly still in shortage but the 10980XE is available now, yay! We just removed our shortage tag from it. The Xeon W 32xx are also no longer in shortage. (I like these a lot, 64 PCIe lanes with 4 X16 slots on a board, no PLX!) Supply looks OK for the AMD parts (I particularly like the 3970x TR!) We are also qualifying EPYC Rome (we had some motherboard problems but new boards are looking better).

Posted on 2020-06-10 15:28:07
Wangzhi Zheng

excited and look forward to seeing your test!

Posted on 2020-06-09 20:05:40