Dr Donald Kinghorn (Scientific Computing Advisor)

AMD Threadripper and (1-4) NVIDIA 2080Ti and 2070 for NAMD Molecular Dynamics

Written on December 14, 2018 by Dr Donald Kinghorn

In my recent testing with the AMD Threadripper 2990WX I was impressed by the CPU-based performance with the molecular dynamics program NAMD. Of course adding NVIDIA GPUs to the system gives a dramatic improvement since NAMD has good GPU acceleration. I like NAMD for many reasons and one of them is that it makes a pretty good benchmark for looking at CPU/GPU performance. NAMD requires a balance between CPU and GPU for the best results. It is also not very sensitive to speedup from AVX vector units. NAMD generally scales well with lots of cores (or lots of cluster nodes). After some discussions I decided it would be good to look at multi-GPU performance with NAMD on Threadripper. The assumption was that there would be enough cores to keep up with NVIDIA's powerful new GPUs.

My last post, AMD Threadripper 2990WX 32-core vs Intel Xeon-W 2175 14-core - Linpack NAMD and Kernel Build Time, is good background for the present post and has some interesting comparisons with an Intel 14-core Xeon-W system.

I spent a long afternoon on the same basic system I used in the last post. I was able to get a little testing done with the 24-core Threadripper 2970WX, but most of the results use the 32-core 2990WX processor.

I had 2 "side fan" cooled NVIDIA RTX 2070 GPUs. It is not practical to use more than 2 of these types of cards in a system because of thermal throttling issues (very bad), see NVIDIA Dual-Fan GeForce RTX Coolers Ruining Multi-GPU Performance. A couple of days after doing the testing we got in our first batch of RTX 2070s with blower fans! You should be able to configure systems with these now.

We did have blower-fan versions of the RTX 2080Ti, so I was able to test with 1 to 4 of these great cards.


Test systems: AMD 2990WX and Intel Xeon-W 2175

The AMD Threadripper system I used was a test-bed build with the following main components,

AMD Hardware

  • AMD Ryzen Threadripper 2990WX 32-Core @ 3.00GHz (4.2GHz Turbo)
  • AMD Ryzen Threadripper 2970WX 24-Core @ 3.00GHz (4.0GHz Turbo)
  • Gigabyte X399 AORUS XTREME-CF Motherboard
  • 128GB DDR4 2666 MHz memory
  • Samsung 970 PRO 512GB M.2 SSD
  • NVIDIA RTX 2070
  • NVIDIA RTX 2080Ti

Software


Testing Results

When I sat down in front of the system it had a TR 2970WX 24-core processor in it, so I did a few job runs with that before I swapped in the 2990WX. The first jobs I ran were CPU only. The results were very satisfying in that the scaling with an increasing number of threads was very uniform. It is interesting that NAMD performance improved uniformly with SMT "hyper-threads". This is not always the case and you often see that only "real" cores improve performance.

CPU results

The graph shows how well the SMT threads worked with NAMD. Note that lower is better! What is being reported is the default NAMD performance output in days/ns, i.e. the number of days needed to do 1 nanosecond of simulation. Yes, this is a very compute intensive task! Big jobs can run for weeks or months. My job runs were for 500 time steps of the simulation.
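
If you want to reproduce this kind of thread-scaling sweep, a small driver script is enough. The sketch below is only illustrative: it assumes a multicore NAMD build with namd2 on the PATH, and the input file name is a placeholder since the post does not say which simulation system was used. The +p and +setcpuaffinity options are standard NAMD/Charm++ launch flags.

    import re
    import subprocess

    INPUT = "input.namd"  # hypothetical benchmark config (the post used 500 time steps)

    for threads in (8, 16, 32, 48, 64):
        # Run NAMD with the requested number of worker threads pinned to cores
        result = subprocess.run(
            ["namd2", f"+p{threads}", "+setcpuaffinity", INPUT],
            capture_output=True, text=True, check=True,
        )
        # NAMD's benchmark output reports performance in days/ns; keep the last value
        timings = re.findall(r"([\d.]+)\s*days/ns", result.stdout)
        print(f"{threads} threads: {timings[-1] if timings else 'no benchmark line found'} days/ns")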

[Chart: NAMD CPU-only performance on Threadripper, days/ns vs. number of threads (lower is better)]

That is very good CPU performance for that job run! In an older post, NAMD Performance on Xeon-Scalable 8180 and 8 GTX 1080Ti GPUs, using a dual Xeon 8180 system with a total of 56 CPU cores, I had a result of 2.93 days/ns with 32 cores. Those processors cost over $10,000 each, so the 32-core Threadripper is a bargain by comparison. [Using all 56 cores on that Intel system I got 1.68 days/ns.] Note: if you look at that older post you will see that I took the inverse of the normal NAMD output and reported ns/day. Keep that in mind if you make a comparison. (sorry about that)
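
If you do want to compare against that older post, the conversion is just a reciprocal. A quick sketch using the numbers quoted above:

    def days_per_ns_to_ns_per_day(days_per_ns: float) -> float:
        # days/ns and ns/day are reciprocals of each other
        return 1.0 / days_per_ns

    print(days_per_ns_to_ns_per_day(2.93))  # ~0.34 ns/day (dual Xeon 8180, 32 cores)
    print(days_per_ns_to_ns_per_day(1.68))  # ~0.60 ns/day (dual Xeon 8180, all 56 cores)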

GPU accelerated results

The first thing I should say about the GPU results is that, even with the good performance from the 32 cores of the 2990WX, it's just not enough to keep up with more than 1 or 2 of the new NVIDIA RTX GPUs. The range from the worst result with one RTX 2070 to the best result with four RTX 2080Tis is only a speedup of 1.6.

I'm not saying these results are bad! They are actually very good and they clearly show how much performance gain there is from adding even a "modest" GPU like the RTX 2070, which gives a speedup of nearly 5 over the CPU-only result. However, by the time you have added 2 of the RTX 2070s or 2080Tis you are being limited by the CPU.

In the older post I mentioned above, the dual Xeon 8180s provided enough CPU capability to get 0.438 days/ns with 1 GTX 1080Ti, and using 2 1080Tis gave 0.248 days/ns. Additional GPUs only made a small performance improvement over that, again being limited by the CPU. (I tested with up to 8 GPUs.)

Another thing to note in these results is the effect of the SMT "hyper-threads". With the CPU-only runs there was a nice improvement with more SMT threads. When the GPUs were added the results were not as predictable. With more than 1 GPU it seemed that the SMT threads were a detriment to performance.

The following table has all of the results of the testing.

[Table: full NAMD test results for the CPU-only and GPU-accelerated configurations, days/ns]

In this table you can see that there is not much performance difference between many of the configurations. Also note that there can be significant performance variation between job runs. I only did two job runs on each test configuration and took the best one. It is clear that the TR 2990WX is providing more CPU performance than what is balanced with 1 RTX 2070. Adding a second RTX 2070 or 1-2 RTX 2080Tis provided more GPU performance than the CPU could effectively keep up with.

The following chart gives an easier-to-see, more uniform view of the performance scaling as the system specs are improved.

[Chart: NAMD performance scaling from CPU-only up through 4 x RTX 2080Ti, days/ns (lower is better)]

Recommendation

No matter what CPU you have in your system, if you are running NAMD then adding an NVIDIA GPU will be a significant performance boost. Hopefully this post shows that, and also makes clear the need for significant CPU performance to efficiently balance with modern GPUs. My recommendation for an AMD CPU based NAMD system would be the TR 2990WX and either 2 RTX 2070s or 1 RTX 2080Ti.

I will be doing a more comprehensive test with many GPUs for jobs including NAMD (but with more focus on Machine Learning/AI). That will be using the new Intel Core-X processors. In general I personally prefer an Intel CPU with AVX512 vector units as the basis of any scientific workstation. However, the high core count AMD Threadripper did really well in this NAMD testing. ... but see the note below ...

As a last note, I had to cut my testing short because the system failed after a normal OS update that I did in preparation to install CUDA for CPU-GPU memory bandwidth testing. I had 4 RTX 2080Tis in the system and had booted that configuration with no problems. After a simple "apt-get upgrade" the system would no longer get to the boot prompt. I didn't have the time to try to find the problem.

Happy computing! --dbk

Tags: Threadripper, Ryzen, 2990WX, NAMD, HPC, Linux
el farmacéutico

Hello!
I am a complete noob when it comes to hardware for molecular modeling, and I was wondering how AMD graphics cards perform for these kinds of tasks. I know AMD cards were preferred over NVIDIA for cryptomining because they were supposedly better suited for those calculations, but I have never seen a benchmark using AMD cards for molecular modeling.
Would you mind explaining to me why AMD cards are not used for this? And if they are suited, are they better or worse than NVIDIA?
I thank you kindly for taking your time (and money) to do these tests, it has been extremely useful!

Posted on 2018-12-26 22:19:09
Alexey Trubitsyn

I was not the one who was asked, but nevertheless I hope this will help you. AMD cards are not used so widely for molecular modeling due to historical reasons: NVIDIA were the first to deal with GPU computing. People first tried to use triangles and textures to do scientific computations on a GPU. NVIDIA supported that approach and developed the first library that made it easier to write this kind of software. CUDA was very restricted at first, but due to the lack of competitors it got widely adopted.
The question "are they better or worse", as it always does with such questions, boils down to details like:
What exactly do you need in your project?
Performance? Both sides compete with each other from time to time. AMD cards may be slightly more efficient in terms of computation per $USD.
Community support? CUDA has been used widely by academic researchers over the years, and many supercomputing centres have installed NVIDIA hardware. Though OpenCL is catching up recently.
Programming productivity? NVIDIA CUDA programming is relatively simpler as it only needs to support NVIDIA's own GPUs. Unified Memory also might be a big deal for certain people.
Profiling and debugging capability? Both have their own software tools, which are about equal in my opinion.
Stable driver support? I have been using the NVIDIA driver on Linux with no major problems whatsoever. I have had some compatibility difficulties with AMD drivers on Linux with various hardware.
Vendor independence? The main great feature of OpenCL is heterogeneous computing: the same code can be launched on GPU, CPU, etc. With an AMD card you'll be using OpenCL, which can run on NVIDIA cards as well, but you will certainly face performance issues in such a transition. Take a look at this paper for details: https://www.spiedigitallibr...
P.S. If you are new to the field I would strongly recommend NVIDIA + CUDA for your tasks, mainly due to the tons of tutorials online, the bigger community, and the ease of getting your system up and running. Best of luck in your research!

Posted on 2019-01-13 17:14:47
Donald Kinghorn

Thank you Alexey! I had forgotten to add this post to my comment monitoring!

Posted on 2019-01-14 19:29:56
Fernando Bachega

Hi Dr Donald Kinghorn. Thanks for such a nice review.

I'm a NAMD user and considering buying the following CPU + GPU:

AMD Ryzen 7 2700X c/ Wraith Prism Cooler, Octa Core, Cache 20MB, 3.7GHz (Max Turbo 4.35GHz) AM4 - YD270XBGAFBOX

VGA EVGA NVIDIA GeForce GTX 1080 FTW 8GB, GDDR5, 256 Bits - 08G-P4-6286-KR

Do you think it's a good setup for someone with a limited budget?

Thanks a lot for your attention.

Posted on 2019-01-05 02:10:12
TA Nie

Not the writer, but that will be a decent machine and get the job done for sure.

Posted on 2019-01-14 17:52:04
Donald Kinghorn

Thank you for responding to the question ... I do agree ... and I have added myself to the comment notification list now :-)

Posted on 2019-01-14 19:32:00
Fernando Bachega

Thank you so much, cheers!

Posted on 2019-02-05 00:37:45
Fernando Bachega

Thanks a lot, cheers!

Posted on 2019-02-05 00:37:23
trotos

Really nice thread. Got 2 questions.
1. Is it advisable to also have a third small passive GPU to just run the graphics?
2. What kind of PSU is necessary for this kind of powerhouse, 2990+2*2080Ti? What are you using in your tests? Especially for processes that will last for more than a day?
Thank you for your time and those excellent articles.

Posted on 2019-03-08 18:56:57

I'll let Don answer the GPU question, but for power supply it is usually not too hard to calculate. You can go through and figure out the actual TDP/TDW (power draw) from the specs of each part, but in general I tend to just consider each individual CPU and GPU as needing roughly 250W, plus about 150W for the motherboard/RAM/rest of the system. Obviously won't hold true if you have something like a dozen hard drives, but it is close enough in most cases.

So a single CPU plus dual GPU is 250W*3 = 750W, plus 150W for everything else to get you up to ~900W. After that, I typically tack on about 20% extra to account for efficiency loss and to give a little wiggle room and you get 1080W. So in this case, a 1000W PSU is probably cutting it too close, so I would go with a 1200W PSU. When doing GPU testing, we tend to just throw on a 1600W PSU so that we don't have to switch it out when we test with different GPU configurations, but going overboard on the PSU doesn't really affect anything. It just costs a bit more upfront and *technically* will be slightly less efficient since PSUs tend to be the best at converting AC to DC at about 80% of their peak rating.
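
That rule of thumb is easy to put into a few lines of code. This is just a sketch of the arithmetic described above (roughly 250W per CPU or GPU, ~150W for the rest of the system, plus ~20% headroom), not an exact power model:

    def estimate_psu_watts(num_cpus: int, num_gpus: int,
                           per_device_w: float = 250.0,
                           base_w: float = 150.0,
                           headroom: float = 0.20) -> float:
        # Rough load estimate plus headroom for efficiency loss and wiggle room
        load = (num_cpus + num_gpus) * per_device_w + base_w
        return load * (1.0 + headroom)

    # 2990WX + 2 x RTX 2080Ti from the question above:
    print(estimate_psu_watts(num_cpus=1, num_gpus=2))  # 1080.0 -> go with a 1200W PSU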

Posted on 2019-03-08 19:06:09
Donald Kinghorn

My personal preference is to just use one of the compute cards for display. Modern cards are very good and have lots of head-room. A lot of the display hardware is separate from compute. The only thing that has much impact is memory, and a typical 2D desktop doesn't use much. In the early days of GPU compute we would drop the system into a text-based runlevel while running jobs (on Linux). It's not really necessary anymore. ... but ...

On the other hand, when the display GPU is under heavy load there will be some lag on the desktop. If you have a long-running heavy load and still want to do other work on the system, it would be nice to have a separate display card. You will be using up PCIe lanes, but on a board with 3-4 X16 slots and 2 compute cards, adding a more modest display card would be nice (try to keep your compute cards on full X16). But don't cheap out too much! Something like a xx70, xx60 or xx50 card should be OK. It's a good idea to keep it in the same (or close) architecture. Remember the CUDA runtime is from the display driver. On Windows I would be more inclined to add the extra card for display.

The other thing that can occasionally give you trouble is hardware device selection for the code you are running. There are inconsistencies in how different codes read device numbers. You could end up with some code that insists on starting on your display card. (There are ways around that, but it can be annoying.)
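
One common workaround, sketched below, is to restrict which GPUs a compute job can see with the CUDA_VISIBLE_DEVICES environment variable before the job starts. The job command and the device numbering are placeholders here; check your own ordering with nvidia-smi since it may not match what the application reports.

    import os
    import subprocess

    # Hide GPU 0 (assumed here to be the display card) from the compute job.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES="1,2,3")

    # Hypothetical job launch - substitute your own GPU-accelerated command.
    subprocess.run(["namd2", "+p32", "input.namd"], env=env, check=True)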

The 2080Ti is a great card, so what you are thinking of doing looks good! With 2 cards I would recommend getting the NVLINK bridge too. There are some jobs that it won't have much effect on, but for jobs that have GPU-GPU communication it makes a significant difference. I see this with the RNN code I test with, "Big_LSTM". For things like reinforcement learning it could be even more significant (just guessing on that).

Posted on 2019-03-11 18:57:16
Arthur Gonzales

Hello Dr. Kinghorn, Thank you for this post. I was wondering if I could ask your opinion on choosing a system for research. I'm doing mainly docking (Autodock, Autodock Vina) and MD simulations (GROMACS, NAMD). I have a budget to get one of the systems below. Which one would you recommend and why? Also, do you have any additional comments on the specs? Thanks so much!

System 1: INTEL CORE I9 9900K 8 Cores/16 Threads
GIGABYTE Z390 MOTHERBOARD
32GB DDR4 RAM (16GB x2)
2x NVIDIA GEFORCE RTX 2080Ti 11GB

System 2: AMD THREADRIPPER 2990WX 32 Cores/64 Threads
MSI X399 MEG CREATION
32GB DDR4 RAM (16GB X2)
1x NVIDIA GEFORCE RTX 2080Ti 11GB

Posted on 2019-04-09 03:49:43
Donald Kinghorn

In general I would stay away from "Coffee Lake" processors for this kind of work but only for one reason: no AVX512. The 9900K has a high clock, which is very nice, but it is basically a "Haswell AVX2" compute core. The newer compute hardware is in the Core-X, Xeon-W and Xeon-SP. For NAMD (on CPU only) this is not a big deal, but for GROMACS you should compile it with Intel MKL support! That will take advantage of AVX512. In general for numerically intensive applications I recommend getting a CPU with AVX512. To see this in action check out this post https://www.pugetsystems.co... Also, note that you want to use a new'ish NVIDIA GPU with the MD programs too. That can greatly improve the performance. You will see that in the above post too.

The Threadripper is an interesting option. It is also using AVX2 like Coffee Lake, but having all of those cores can be a big plus and it can make up for raw per-core performance. I think TR is a good processor but I would say its performance is more unpredictable because it is a more complex design than the Intel CPUs.

For Autodock my guess is that the Threadripper could be great! That code should scale really well since it has to try lots and lots of conformers and should be able to do those on all cores at once. I say "could be great" only because I haven't tested it myself. I would expect excellent performance.

So you actually are asking a hard question! :-)

[ In your specs I would bump that memory to at least 64GB and you should be fine with just 1 2080Ti. For the stuff you are looking at there is a large CPU dependent (bonded forces) component so the GPU will be waiting on CPU a lot of the time ... and the 2080Ti is amazingly fast! ]

My general recommendation for a multi-purpose scientific workstation is to go with Core-X or Xeon-W (either one, depending on memory performance sensitivity, where Xeon-W may be better). AND, then use GPU acceleration whenever possible. A high core-count Core-X workstation with a 2080Ti, 64-128GB of memory, and NVMe storage is going to be a fantastic machine!

However, Threadripper (and EPYC) complicate that decision because that's a good value for that many raw cores! For programs that are not heavily vectorized or GPU accelerated that can have the advantage. My biggest hesitation is lack of testing.

I expect to do a heavy round of testing when the next gen TR are released. I am collecting ideas for good benchmarks and can hopefully come up with a test suite that will have good performance discrimination for many types of applications.

Posted on 2019-04-09 16:28:49
Arthur Gonzales

Thank you so much, Dr. Kinghorn. I've since tested Autodock on someone else's Threadripper (24 cores) and it was fast, but I think I will have to optimize it a bit more. I think I will go with the TR build since, as you've demonstrated, I can take advantage of the GPU at least in NAMD. I was also able to install MKL on the test machine (with the owner's permission) so I might be able to run GROMACS using the GPU as well. And thanks for the suggestion, I will bump up the RAM when I have saved enough.
I hope you can add GROMACS to your benchmark tests in the future.
Thanks again!

Posted on 2019-04-10 03:59:57
Hypersphere

Thanks for this. With MD simulations, I like to think in terms of ns/day instead of days/ns.

Posted on 2019-04-14 20:44:08
Donald Kinghorn

I agree! I have sometimes reported the reciprocal ns/day, but NAMD has used days/ns from the beginning, and in the early days performance was usually much better represented as "how many days is it going to take to get a nano-second of simulation" :-)

Posted on 2019-04-15 22:27:14