Read this article at https://www.pugetsystems.com/guides/1321
Dr Donald Kinghorn (Scientific Computing Advisor)

AMD Threadripper and (1-4) NVIDIA 2080Ti and 2070 for NAMD Molecular Dynamics

Written on December 14, 2018 by Dr Donald Kinghorn

In my recent testing with the AMD Threadripper 2990WX I was impressed by the CPU-based performance with the molecular dynamics program NAMD. Of course, adding NVIDIA GPUs to the system gives a dramatic improvement since NAMD has good GPU acceleration. I like NAMD for many reasons, and one of them is that it makes a pretty good benchmark for looking at CPU/GPU performance. NAMD requires a balance between CPU and GPU for the best results. It is also not very sensitive to speedup from AVX vector units. NAMD generally scales well with lots of cores (or lots of cluster nodes). After some discussions I decided it would be good to look at multi-GPU performance with NAMD on Threadripper, the assumption being that there would be enough cores to keep up with NVIDIA's powerful new GPUs.

My last post AMD Threadripper 2990WX 32-core vs Intel Xeon-W 2175 14-core - Linpack NAMD and Kernel Build Time is good background for the present post and has some interesting comparisons with an Intel 14-core Xeon-W system.

I spent a long afternoon on the same basic system I used in the last post. I was able to get a little testing done with the 24-core Threadripper 2970WX, but most of the results use the 2990WX 32-core processor.

I had 2 "side fan" cooled NVIDIA RTX 2070 GPUs. It is not practical to use more than 2 of these types of cards in a system because of thermal throttling issues (very bad); see NVIDIA Dual-Fan GeForce RTX Coolers Ruining Multi-GPU Performance. A couple of days after doing the testing we got in our first batch of RTX 2070's with blower fans! You should be able to configure systems with these now.

We did have blower fan versions of the RTX 2080Ti so I was able to test with 1 to 4 of these great cards.

Test systems: AMD 2990WX and Intel Xeon-W 2175

The AMD Threadripper system I used was a test-bed build with the following main components,

AMD Hardware

  • AMD Ryzen Threadripper 2990WX 32-Core @ 3.00GHz (4.2GHz Turbo)
  • AMD Ryzen Threadripper 2970WX 24-Core @ 3.00GHz (4.0GHz Turbo)
  • Gigabyte X399 AORUS XTREME-CF Motherboard
  • 128GB DDR4 2666 MHz memory
  • Samsung 970 PRO 512GB M.2 SSD
  • NVIDIA RTX 2070
  • NVIDIA RTX 2080Ti


Testing Results

When I sat down in front of the system it had a TR 2970WX 24-core processor in it, so I did a few job runs with that before I swapped in the 2990WX. The first jobs I ran were CPU only. The results were very satisfying in that the scaling with an increasing number of threads was very uniform. It is interesting that NAMD performance improved uniformly with SMT "hyper-threads". This is not always the case; you often see that only "real" cores improve performance.

CPU results

The graph shows how well the SMT threads worked with NAMD. Note that lower is better! What is being reported is the default NAMD performance output in day/ns, i.e. days needed to do 1 nanosecond of simulation. Yes, this is a very compute-intensive task! Big jobs can run for weeks or months. My job runs were for 500 time steps of the simulation.


That is very good CPU performance for that job run! In an older post, NAMD Performance on Xeon-Scalable 8180 and 8 GTX 1080Ti GPUs, using a dual Xeon 8180 system with a total of 56 CPU cores I had a result of 2.93 day/ns with 32 cores. Those processors cost over $10,000 each. So the 32-core Threadripper is a bargain by comparison. [Using all 56 cores on that Intel system I got 1.68 day/ns]. Note: if you look at that older post you will see that I took the inverse of the normal NAMD output and reported ns/day. Keep that in mind if you make a comparison. (sorry about that)
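Since I flip between day/ns and ns/day across posts, here is a quick conversion sketch (it is just the reciprocal; the helper name below is mine, not anything NAMD outputs):

```python
def to_ns_per_day(days_per_ns: float) -> float:
    """Convert NAMD's default day/ns metric to ns/day (they are reciprocals)."""
    return 1.0 / days_per_ns

# Numbers from the dual Xeon 8180 results quoted above:
print(round(to_ns_per_day(2.93), 2))  # 32 cores: 2.93 day/ns -> 0.34 ns/day
print(round(to_ns_per_day(1.68), 2))  # 56 cores: 1.68 day/ns -> 0.6 ns/day
```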

GPU accelerated results

The first thing I should say about the GPU results is that, even with the good performance from the 32 cores of the 2990WX, it's just not enough to keep up with more than 1 or 2 of the new NVIDIA RTX GPUs. The range from the worst result (1 RTX 2070) to the best result (4 RTX 2080Ti's) is only a speedup of 1.6.

I'm not saying these results are bad! They are actually very good and they clearly show how much performance gain there is from adding even a "modest" GPU like the RTX 2070 which gives a speedup of nearly 5 over the CPU only result. However, by the time you have added 2 of the RTX 2070's or 2080Ti's you are being limited by the CPU.

In the older post I mentioned above, the dual Xeon 8180's provided enough CPU capability to get 0.438 day/ns with 1 GTX 1080Ti and using 2 1080Ti's gave 0.248 day/ns. Additional GPU's only made a small performance improvement over that, again being limited by CPU. (I tested with up to 8 GPU's).

Another thing to note in these results is the effect of the SMT "hyper-threads". With the CPU-only runs there was a nice improvement with more SMT threads. When the GPUs were added the results were not as predictable. With more than 1 GPU it seemed that the SMT threads were a detriment to performance.

The following table has all of the results of the testing.


In this chart you can see that there is not much performance difference for many of the configurations. Also note that there can be significant performance variation between job runs. I only did two job runs on each test configuration and took the best one. It is clear that the TR 2990WX is providing more CPU performance than what is balanced with 1 RTX 2070. Adding a second RTX 2070 or 1-2 RTX 2080Ti's provided more GPU performance than the CPU could effectively keep up with.

The following chart gives an easier-to-see view of the performance scaling as the system specs are improved.



No matter what CPU you have in your system, if you are running NAMD then adding an NVIDIA GPU will give a significant performance boost. Hopefully this post shows that, and also makes clear the need for significant CPU performance to efficiently balance with modern GPUs. My recommendation for an AMD CPU based NAMD system would be the TR 2990WX and either 2 RTX 2070's or 1 RTX 2080Ti.

I will be doing a more comprehensive test with many GPU's for jobs including NAMD ( but with more focus on Machine Learning/AI ). That will be using the new Intel Core-X processors. In general I personally prefer an Intel CPU with AVX512 vector units for the basis of any scientific workstation. However, the high core count AMD Threadripper did really well in this NAMD testing. ... but see the note below ...

As a last note, I had to cut my testing short because the system failed after a normal OS update that I did in preparation to install CUDA for CPU-GPU memory bandwidth testing. I had 4 RTX 2080Ti's in the system and had booted to that with no problems. After a simple "apt-get upgrade" the system would no longer get to the boot prompt. I didn't have the time to try to find the problem.

Happy computing! --dbk

Tags: Threadripper, Ryzen, 2990WX, NAMD, HPC, Linux
el farmacéutico

I am a complete noob when it comes to hardware for molecular modeling, and I was wondering how the performance of AMD graphics cards is for these kinds of tasks. I know AMD cards were preferred over NVIDIA for cryptomining because they were supposedly better suited for calculations, but I have never seen a benchmark using AMD cards for molecular modeling.
Would you mind explaining to me why AMD cards are not used for this? And if they are suited, are they better or worse than NVIDIA?
I thank you kindly for taking your time (and money) to do these tests; it has been extremely useful!

Posted on 2018-12-26 22:19:09
Alexey Trubitsyn

I was not the one who was asked, but nevertheless I hope this will help you. AMD cards are not used so widely for molecular modeling due to historical reasons: NVIDIA were the first to deal with GPU computing. People first tried to use triangles and textures to do scientific computations on a GPU. NVIDIA supported that approach and developed the first library that made it easier to write this kind of software. CUDA was very restricted at first, but due to the lack of competitors it got widely adopted.
The question "are they better or worse", as always with such questions, boils down to details like:
What exactly do you need in your project?
Performance? Both sides compete with each other from time to time. AMD cards may be slightly more efficient in terms of computation per USD.
Community support? CUDA has been used widely by academic researchers over the years. Many supercomputing centres have installed NVIDIA hardware, though OpenCL is catching up recently.
Programming productivity? NVIDIA CUDA programming is relatively simpler as it only needs to support NVIDIA's own GPUs. Unified Memory also might be a big deal for certain people.
Profiling and debugging capability? Both have their own software tools, which are about equal as far as I'm concerned.
Stable driver support? I have been using the NVIDIA driver on Linux with no major problems whatsoever, but I have had some compatibility difficulties with AMD drivers on Linux with various hardware.
Vendor independence? The main great feature of OpenCL is heterogeneous computing: the same code can be launched on a GPU, CPU, etc. With an AMD card you'll be using OpenCL, which can also run on an NVIDIA card, but you will certainly face performance issues with such a port. Take a look at this paper for details: https://www.spiedigitallibr...
P.S. If you are new to the field I would strongly recommend NVIDIA + CUDA for your tasks, mainly due to the tons of tutorials online, the bigger community, and the ease of getting your system up and running. Best of luck in your research!

Posted on 2019-01-13 17:14:47
Donald Kinghorn

Thank you Alexey! I had forgotten to add this post to my comment monitoring!

Posted on 2019-01-14 19:29:56
Fernando Bachega

Hi Dr Donald Kinghorn. Thanks for such a nice review.

I'm a NAMD user and considering buying the following CPU + GPU:

AMD Ryzen 7 2700X c/ Wraith Prism Cooler, Octa Core, Cache 20MB, 3.7GHz (Max Turbo 4.35GHz) AM4 - YD270XBGAFBOX

VGA EVGA NVIDIA GeForce GTX 1080 FTW 8GB, GDDR5, 256 Bits - 08G-P4-6286-KR

Do you think it's a good setup for someone with a limited budget?

Thanks a lot for your attention.

Posted on 2019-01-05 02:10:12
TA Nie

Not the writer, but that will be a decent machine and get the job done for sure.

Posted on 2019-01-14 17:52:04
Donald Kinghorn

Thank you for responding to the question ... I do agree ... and I have added myself to the comment notification list now :-)

Posted on 2019-01-14 19:32:00
Fernando Bachega

Thank you so much, cheers!

Posted on 2019-02-05 00:37:45
Fernando Bachega

Thanks a lot, cheers!

Posted on 2019-02-05 00:37:23

Really nice thread. Got 2 questions.
1. Is it advisable to also have a third small passive GPU to just run the graphics?
2. What kind of PSU is necessary for this kind of powerhouse, 2990+2*2080Ti? What are you using in your tests? Especially for processes that will last for more than a day?
Thank you for your time and those excellent articles.

Posted on 2019-03-08 18:56:57

I'll let Don answer the GPU question, but for power supply it is usually not too hard to calculate. You can go through and figure out the actual TDP (power draw) from the specs of each part, but in general I tend to just consider each individual CPU and GPU as needing roughly 250W, plus about 150W for the motherboard/RAM/rest of the system. Obviously this won't hold true if you have something like a dozen hard drives, but it is close enough in most cases.

So a single CPU plus dual GPU is 250W*3 = 750W, plus 150W for everything else to get you up to ~900W. After that, I typically tack on about 20% extra to account for efficiency loss and to give a little wiggle room and you get 1080W. So in this case, a 1000W PSU is probably cutting it too close, so I would go with a 1200W PSU. When doing GPU testing, we tend to just throw on a 1600W PSU so that we don't have to switch it out when we test with different GPU configurations, but going overboard on the PSU doesn't really affect anything. It just costs a bit more upfront and *technically* will be slightly less efficient since PSUs tend to be the best at converting AC to DC at about 80% of their peak rating.
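That rule of thumb can be written down as a tiny calculator (a rough sketch of the estimate described above; the 250W/150W figures and 20% headroom are the assumptions just stated, not measured values):

```python
def estimate_psu_watts(cpus: int, gpus: int,
                       per_device_w: int = 250,
                       base_system_w: int = 150,
                       headroom: float = 0.20) -> float:
    """Rough PSU sizing: ~250 W per CPU or GPU, ~150 W for the rest of
    the system, plus ~20% extra for efficiency loss and wiggle room."""
    load = (cpus + gpus) * per_device_w + base_system_w
    return load * (1 + headroom)

# 2990WX + 2x RTX 2080 Ti: (3*250 + 150) * 1.2 = 1080 W -> round up to a 1200 W PSU
print(estimate_psu_watts(cpus=1, gpus=2))
```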

Posted on 2019-03-08 19:06:09
Donald Kinghorn

My personal preference is to just use one of the compute cards for display. Modern cards are very good and have lots of head-room. A lot of the display hardware is separate from compute. The only thing that has much impact is memory, and a typical 2D display doesn't use much. In the early days of GPU compute we would drop the system into a text-based runlevel while running jobs (on Linux). It's not really necessary anymore. ... but ...

On the other hand, when the display GPU is under heavy load there will be some lag on the desktop. If you have a long-running heavy load and still want to do other work on the system, then it would be nice to have a separate display card. You will be using up PCIe lanes, but on a board with 3-4 X16 slots and 2 compute cards, adding a more modest display card would be nice (try to keep your compute cards on full X16). But don't cheap out too much! Something like a xx70, xx60 or xx50 card should be OK. It's a good idea to keep it in the same (or close) architecture. Remember the CUDA runtime comes from the display driver. On Windows I would be more inclined to add the extra card for display.

The other thing that can occasionally give you trouble is hardware device specification for the code you are running. There are inconsistencies in how different code reads device numbers. You could end up with some code that insists on starting on your display card. (There are ways around that but it can be annoying)

The 2080Ti is a great card, so what you are thinking of doing looks good! With 2 cards I would recommend getting the NVLINK bridge too. There are some jobs that it won't have much effect on, but for jobs that have GPU-GPU communication it makes a significant difference. I see this with the RNN code I test with, "Big_LSTM". For things like reinforcement learning it could be even more significant (just guessing on that)

Posted on 2019-03-11 18:57:16
Arthur Gonzales

Hello Dr. Kinghorn, Thank you for this post. I was wondering if I could ask your opinion on choosing a system for research. I'm doing mainly docking (Autodock, Autodock Vina) and MD simulations (GROMACS, NAMD). I have a budget to get one of the systems below. Which one would you recommend and why? Also, do you have any additional comments on the specs? Thanks so much!

System 1: INTEL CORE I9 9900K 8 Cores/16 Threads
32GB DDR4 RAM (16GB x2)

System 2: AMD THREADRIPPER 2990X 32 Cores/64 Threads
32GB DDR4 RAM (16GB X2)

Posted on 2019-04-09 03:49:43
Donald Kinghorn

In general I would stay away from "Coffee Lake" processors for this kind of work but, only for one reason, no AVX512. The 9900K has a high clock which is very nice but it is basically a "haswell AVX2" compute core. The newer compute hardware is in the core-X, Xeon-W and Xeon-SP. For NAMD (on CPU only) this is not a big deal but for GROMACS you should compile it with Intel MKL support! That will take advantage of AVX512. In general for numerically intensive applications I recommend getting a CPU with AVX512. To see this in action check out this post https://www.pugetsystems.co... Also, note that you want to use a new'ish NVIDIA GPU with the MD programs too. That can greatly improve the performance. You will see that in the above post too.

The Threadripper is an interesting option. It is also using AVX2 like Coffee Lake, but having all of those cores can be a big plus and can make up for raw per-core performance. I think TR is a good processor, but I would say its performance is more unpredictable because it is a more complex design than the Intel CPUs.

For Autodock my guess is that the Threadripper could be great! That code should scale really well since it has to try lots and lots of conformers and should be able to do those on all cores at once. I say "could be great" only because I haven't tested it myself. I would expect excellent performance.

So you actually are asking a hard question! :-)

[ In your specs I would bump that memory to at least 64GB and you should be fine with just 1 2080Ti. For the stuff you are looking at there is a large CPU dependent (bonded forces) component so the GPU will be waiting on CPU a lot of the time ... and the 2080Ti is amazingly fast! ]

My general recommendation for a multi-purpose scientific workstation is to go with core-X or Xeon-W (either one depending on memory performance sensitivity where Xeon-W may be better). AND, then use GPU acceleration whenever possible. A high core-count, core-X workstation with a 2080TI and 64-128GB mem along with NVMe storage in it is going to be a fantastic machine!

However, Threadripper (and EPYC) complicate that decision because that's a good value for that many raw cores! For programs that are not heavily vectorized or GPU accelerated that can have the advantage. My biggest hesitation is lack of testing.

I expect to do a heavy round of testing when the next-gen TR are released. I am collecting ideas for good benchmarks and can hopefully come up with a test suite that will have good performance discrimination for many types of applications.

Posted on 2019-04-09 16:28:49
Arthur Gonzales

Thank you so much, Dr. Kinghorn. I've since tested Autodock on someone else's Threadripper (24 cores) and it was fast but I think I will have to optimize it a bit more. I think I will go with the TR build since, as you've demonstrated, I can take advantage of the GPU at least in NAMD. I was also able on install MKL on the test machine (with the owner's permission) so I might be able to run GROMACS using the GPU as well. And thanks for the suggestion, I will bump up the RAM when I have saved enough.
I hope you can add GROMACS to your benchmark tests in the future.
Thanks again!

Posted on 2019-04-10 03:59:57

Thanks for this. With MD simulations, I like to think in terms of ns/day instead of days/ns.

Posted on 2019-04-14 20:44:08
Donald Kinghorn

I agree! I have sometimes reported the reciprocal ns/day but, NAMD has used days/ns from the beginning, and in the early days, performance was usually much better represented as "how many days is it going to take to get a nano-second of simulation" :-)

Posted on 2019-04-15 22:27:14

Donald Kinghorn:

1. Are there any special configurations in Ubuntu linux to get NAMD working with RTX2080Ti GPUs?
2. What about NVlink (with Ubuntu and NAMD)?
3. Have you tried Gromacs?

FYI, I had trouble getting Gromacs 2019.3 to compile correctly on a machine with an i9-9920X CPU and 2x RTX2080Ti GPUs with NVLink installed.

I am running Linux Mint 19.2 Xfce 64-bit, which is based on Ubuntu 18.04 LTS. My CUDA version is 10.1. Although Gromacs seemed to compile okay, when I ran "make check" it would pass 45 out of 46 tests, but it would always fail test 42 (complex regressions, which I think was making use of the GPUs).

Finally, I upgraded to linux kernel 5 and upgraded my gcc, g++, and gfortran to version 8. This resulted in success on all 46 tests.

I have not yet run MD benchmarks with Gromacs or NAMD.

Posted on 2019-09-17 12:37:45
Donald Kinghorn

You shouldn't need anything special other than an up-to-date display driver. I always use the graphics drivers ppa for that i.e. like,
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt-get install nvidia-driver-435

NAMD GPU should run fine using the binary cuda build download (it's "self contained" because of the charm++ backend, so only the cuda runtime in the driver is needed)

I don't think I have tried any MD codes with NVLINK (something to try :-) However, I wouldn't expect it to help much for this, since it is mostly just the non-bonded forces going to the GPU, and they are partitioned out to each card then accumulated. It's likely during a sim the GPUs will be waiting on the CPU (not each other)

I have not done a Gromacs build for a while. To do the build you would need the CUDA toolkit installed ... which you have ... I had trouble with the build before when I used the latest OpenMPI. I would use a slightly older version than the current 4.x; 3.1.4 is probably a good idea. ... but, it looks like you have things working.

Thanks for the heads-up about the failed test and the gcc 8 update fix

I am working on Docker containers for GPU accelerated codes. An up-to-date Gromacs would be a good one to do!

Posted on 2019-09-17 16:56:44

Hi Donald,

I forgot to mention that I installed the latest Nvidia driver available for the version 5 linux kernel -- 430.26. After that, I installed all the CUDA components except for the driver. CUDA 10.1 wants a much earlier driver -- 418, I think, and this did not work for me.

Regarding MPI, I remain confused by this. Is the MPI option needed only if you are running Gromacs on more than one "node"? I assume that my PC workstation would be considered a node, and I am not trying to link several PCs in a network or cluster. At first, I thought I might need MPI in order to use more than one CPU core in my machine, but my interpretation of the Gromacs docs is that multi-core functionality is built in. Perhaps you could shed some light on this question. Thanks!

Posted on 2019-09-17 17:36:01
Donald Kinghorn

I've just always compiled it with MPI ... it should also be using OpenMP threads for multi-core parallelism. If the docs are saying you don't need to build with MPI for single node use then, good, that saves you some serious hassle. ... let me look ...

Was my bad for not reading the current install doc a bit closer ... Gromacs has been around forever kind of like me :-) I used mdrun in the past

"GROMACS can run in parallel on multiple cores of a single workstation using its built-in thread-MPI. No user action is required in order to enable this." That's great. (Unless you do want to setup a cluster of course)

If your build fails with a complaint about mpi.h being missing then you would need an MPI install but otherwise it's great if that is all built into the gmx build.

The following notes got my attention too, (referring to SIMD instructions)

"However, certain desktop and server models (e.g. Xeon Bronze and Silver) come with only one AVX512 FMA unit and therefore on these processors AVX2_256 is faster (compile- and runtime checks try to inform about such cases). "

I knew about the CPU's with only 1 AVX512 FMA ( I removed them from config options for "Puget Systems Peak") There are only a few of them (low-end Xeon-SP) I believe all of the core-X and W processors have 2 units. [there are new ones coming out, I'll need to be sure to double check that]

This next note was a bit of surprise to me,

"Additionally, with GPU accelerated runs AVX2_256 can also be faster on high-end Skylake CPUs with both 512-bit FMA units enabled."

-DGMX_SIMD=AVX2_256 vs -DGMX_SIMD=AVX_512 could be an interesting experiment.

It looks like running cmake with defaults will likely do the right thing. Your i9 9920X does have 2 AVX512 FMA units I would be curious to see what cmake picked for that ... it should be in the log file when you run cmake

Posted on 2019-09-18 16:48:19
Mujeeb Mohamed

Hello Sir,

I am a 3D designer and animator, looking for a HEDT system for 3ds Max, 3D modelling, and rendering in V-Ray, Corona, and Lumion. After some research I came up with the configuration below. Please suggest whether this is good for the purpose or if there is any mismatch. Hoping to get a reply ASAP. I am a rookie in PC assembly and finding the right components, so your advice will help me a lot.
Thanks and regards.

Processor : AMD Ryzen Threadripper 2990wx 32C/64T
Motherboard : Asus Rog Zenith Extreme AMD X399 threadripper E-Atx Motherboard
Memory : G.Skill 64GB (8x8GB) TridentZ RGB series ddr4 PC4-23400 2933mhz AMD X399 Desktop Memory Model F4-2933C14Q2-64GTZRX
Graphics Card: Asus Strix Geforce RTX 2070 Super Advanced Edition 8GB GDDR6
Storage SSD: 3 x Sabrent 1TB Rocket NVMe PCIe M.2 2280 Internal SSD High Performance Solid State Drive (SB-ROCKET-1TB)
Storage HDD: Seagate 2TB firecuda Gaming SSHD sata 6GB/s HDD
PSU : HXi Series HX1000i High-Performance ATX Power Supply — 1000 Watt 80 Plus PLATINUM Certified PSU
Case: Thermaltake View71 Tempered Glass RGB plus edition CA-1I7-00F1WN-02
Monitor: LG 27UK850-W 27" 4k UHD IPS Monitor LG Led 27inch Monitor
Keyboard & Mouse:Thermaltake Tt esports Knucker 4-in1-3 Color Membrane Keyboard & 2400 Dpi Avago 5050 Optical Gaming Mouse

Posted on 2019-10-10 13:43:14
Donald Kinghorn

I'm not very familiar with those application but we do have recommended systems for them (and lots of performance analysis)
https://www.pugetsystems.co... Have a look at the configurations there and check out some of the articles.
Best wishes --Don

Posted on 2019-10-10 14:44:26
shushan hu

Hi, I'm planning to buy a machine for MD using LAMMPS and am wondering if the following specification is well matched. Your article is very useful, but I'm uncertain if there is a difference between NAMD and LAMMPS. Thanks.
CPU: AMD 3960x
RAM: 128GB
GPU: 2 x 2080Ti
HD: 512GB SSD.

Posted on 2019-12-18 08:50:32
Donald Kinghorn

There is some difference in how the code is optimized for CPU+GPU, but the basic effect is mostly the same. Both will have great CPU core scaling and get significant speedup from the GPUs. What you have spec'd will probably be bottlenecked by the CPU ... a little. I did test NAMD with the ApoA1 job on the 32-core 3970x (it was fantastic!). I have results for 1 and 2 2080Ti's https://www.pugetsystems.co...

I think you would probably get a good performance balance and hardware utilization with just 1 2080Ti and the 3960x. I did get a reasonable speedup with 2 using the 32-core 3970x, but I felt that even with that it would be a better balance with more cores ... I need more testing time! ... On the other hand, having 2 2080Ti's is nice! LAMMPS might balance a little better than NAMD (?) and it is also possible that future versions of the MD codes will move more load onto the GPUs.

I'm planning on doing more extensive testing when the 64-core TR shows up at the office (soon I hope) I'm expecting to do a "recommended system" for MD with that and 2-4 2080Ti I'm not sure where the ideal CPU core-count to GPU ratio will be. ... I'll try to test with more than just NAMD too ...

If I were you I would try to stretch the budget up to the 3970x if you can. ... If not, then I think you will still have a great system using the 3960x and either 1 or 2 of the 2080Ti's

Posted on 2019-12-18 18:30:50
shushan hu

Thanks a million Sir, but I have to use 3960X for now due to the budget limitation.

Posted on 2019-12-24 09:18:27