Molecular Dynamics Benchmarks GPU Roundup GROMACS NAMD2 NAMD 3alpha on 12 GPUs

Introduction

Molecular Dynamics is an important tool in many areas of scientific research. We get deep satisfaction from providing workstations for this research. It's a feeling of contributing to the common good and "making a difference". To this end we make a serious effort to recommend appropriate hardware. Part of that effort is benchmarking!

We have a new collection of GPU accelerated Molecular Dynamics benchmark packages put together for GROMACS, NAMD 2, and NAMD 3-alpha10. (The benchmark packages will be available to the public soon.) In this post we present results for,

  • 3 applications: GROMACS, NAMD 2 and NAMD 3alpha10
  • 8 MD simulations
  • 12 different NVIDIA GPUs
  • 96 total results
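The count of 96 follows from the benchmark matrix. A minimal sketch of that arithmetic is below; the benchmark and GPU names are the ones listed later in this post, and the dictionary layout is only an illustration, not the actual benchmark package.

```python
# Benchmark names per application, as listed in this post.
# NAMD 3alpha10 has only two entries because f1atpase does not run on it yet.
BENCHMARKS = {
    "GROMACS": ["MEM", "RIB", "PEP"],
    "NAMD 2": ["apoa1", "f1atpase", "stmv"],
    "NAMD 3alpha10": ["apoa1", "stmv"],
}

GPUS = [
    "RTX 3090 Ti", "RTX 3090", "RTX 3080 Ti", "RTX 3080", "RTX 3070 Ti",
    "RTX 2080 Ti", "GTX 1080 Ti",
    "RTX A6000", "RTX A5500", "RTX A5000", "RTX A4500", "RTX A4000",
]

# One job per (application, simulation, GPU) combination.
jobs = [
    (app, sim, gpu)
    for app, sims in BENCHMARKS.items()
    for sim in sims
    for gpu in GPUS
]

n_simulations = sum(len(sims) for sims in BENCHMARKS.values())
print(n_simulations, "simulations x", len(GPUS), "GPUs =", len(jobs), "results")
```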

GROMACS and NAMD Benchmarks


GROMACS 2022:

Benchmark References:

MEM:

de Groot, B. L.; Grubmüller, H.: Water permeation across biological membranes: Mechanism and dynamics of Aquaporin-1 and GlpF. Science 294, pp. 2353 – 2357 (2001)

RIB:

Bock, L. V.; Blau, C.; Schröder, G. F.; Davydov, I. I.; Fischer, N.; Stark, H.; Rodnina, M. V.; Vaiana, A. C.; Grubmüller, H.: Energy barriers and driving forces in tRNA translocation through the ribosome. Nature Structural and Molecular Biology 20 (12), pp. 1390 – 1396 (2013)

PEP:

Matthes, D.; Gapsys, V.; de Groot, B. L.: Driving forces and structural determinants of steric zipper peptide oligomer formation elucidated by atomistic simulations. Journal of Molecular Biology 421 (2-3), pp. 390 – 416 (2012)



NAMD v2.14 and v3 alpha10

Benchmark Source:

apoa1:

ApoA1 benchmark (92,224 atoms, periodic, PME)

f1atpase:

ATPase benchmark (327,506 atoms, periodic, PME)

stmv:

STMV (virus) benchmark (1,066,628 atoms, periodic, PME)

NAMD 3 is GPU resident, soon, no more CPU bottleneck!

NAMD v3 is currently at alpha 10 and already shows significant performance gains over v2. Because it is GPU resident, it will not be limited by CPU performance, which has been a bottleneck with v2 for several years now that recent NVIDIA GPUs are so fast. With v2, the force calculations on the GPU would finish so quickly that the GPU was left starved for work while waiting on the CPU cores to finish their part.

Not all methods from NAMD 2 are implemented yet. The f1atpase benchmark failed with the following message,

( Bribe us with coffee to get your feature GPU-Resident! :)
FATAL ERROR: CUDASOA integration is incompatible with the following options:
minimization; pressure profiling; Berendsen pressure;
multigrator; Lowe-Andersen; fixed atoms; GBIS; LCPO;
zero momentum; TCL forces; Colvars;
temperature coupling, rescaling, or reassignment;
water models other than TIP3; lonepairs or Drude.

Methodology

The benchmarks were run using Python wrapper scripts that collect system information and take user input for "jobs", GPU selection, CPU cores, repeats, output, etc. These scripts construct the appropriate command-lines, run the external benchmark applications, and then gather the performance output and statistics for each job run. The output files were then loaded into pandas data-frames in a Jupyter notebook, and the charts were generated with Plotly.
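As a rough illustration of what such a wrapper does, here is a minimal sketch, not the actual scripts used for this post. The function names are hypothetical, and the two log-line formats in the comments are assumptions about what GROMACS and NAMD 2 print; check them against your own output before relying on the regular expressions.

```python
import re
import statistics
import subprocess
import sys

# Assumed log formats (not taken from this post):
#   GROMACS: "Performance:       85.123        0.282"   (ns/day, hour/ns)
#   NAMD 2:  "Info: Benchmark time: 128 CPUs 0.002 s/step 0.025 days/ns 2000 MB memory"
GMX_RE = re.compile(r"^Performance:\s+([\d.]+)", re.MULTILINE)
NAMD_RE = re.compile(r"Benchmark time:.*?([\d.eE+-]+)\s+days/ns")


def parse_ns_per_day(text):
    """Return ns/day found in captured output, or None if nothing matches."""
    match = GMX_RE.search(text)
    if match:
        return float(match.group(1))
    days_per_ns = [float(v) for v in NAMD_RE.findall(text)]
    if days_per_ns:
        return 1.0 / days_per_ns[-1]  # last report, converted from days/ns to ns/day
    return None


def run_job(argv, repeats=3):
    """Run one benchmark command `repeats` times and summarize ns/day."""
    results = []
    for _ in range(repeats):
        proc = subprocess.run(argv, capture_output=True, text=True)
        perf = parse_ns_per_day(proc.stdout + proc.stderr)
        if perf is not None:
            results.append(perf)
    if not results:
        return None
    return {"mean": statistics.mean(results), "max": max(results), "runs": len(results)}


# Stand-in for a real benchmark: a Python one-liner that prints a GROMACS-style line.
demo = run_job([sys.executable, "-c", "print('Performance:   85.5   0.281')"], repeats=2)
print(demo)
```

A list of such summary dictionaries, one per job, is the kind of record that loads naturally into a pandas data-frame for charting.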

All jobs were run on an AMD Threadripper Pro 3995WX 64-core CPU platform. (The TR Pro 3995WX was chosen to help limit CPU bottlenecks and give better discrimination of GPU performance.)

GPUs: (NVIDIA Founders Editions)

  • RTX 3090 Ti
  • RTX 3090
  • RTX 3080 Ti
  • RTX 3080
  • RTX 3070 Ti
  • RTX 2080 Ti
  • GTX 1080 Ti
  • RTX A6000
  • RTX A5500
  • RTX A5000
  • RTX A4500
  • RTX A4000

Representative Command Lines

GROMACS

commandline: gmx mdrun -gpu_id 0 -pin on -noconfout -ntomp 64 -s gromacs/MEM/benchMEM.tpr -nsteps 10000

NAMD 2

commandline: namd2 +p128 +setcpuaffinity +idlepoll +isomalloc_sync +devices 0 namd/f1atpase/f1atpase.namd

NAMD 3alpha

commandline: namd3 +p1 +setcpuaffinity +idlepoll +isomalloc_sync +devices 0 namd3/apoa1/apoa1.namd

The input files for v3 alpha need this additional directive:

CUDASOAintegrate on

Note the "+p1" on the command-line. NAMD 3 is "GPU resident", essentially everything runs on the GPU and only 1 CPU core is used per GPU!
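If you have a NAMD 2 input file and want to try it with NAMD 3 alpha, the directive can be added with a few lines of Python. This is only a sketch: the helper name is hypothetical, and it assumes the keyword, if already present, starts its own line. The directive is placed at the top so that it is read before any run command in the file.

```python
def enable_gpu_resident(config_text):
    """Return NAMD config text with 'CUDASOAintegrate on' as its first line.

    Any existing CUDASOAintegrate setting is dropped first so the
    directive appears exactly once (keyword match is case-insensitive).
    """
    kept = [
        line
        for line in config_text.splitlines()
        if not line.strip().lower().startswith("cudasoaintegrate")
    ]
    return "\n".join(["CUDASOAintegrate on"] + kept) + "\n"


# Tiny made-up config fragment, used only to show the effect.
example = "structure apoa1.psf\ncudasoaintegrate off\nrun 500\n"
print(enable_gpu_resident(example))
```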

GROMACS did well without Hyper-Threading/SMT threads (64 threads on the 64-core CPU), while NAMD 2 did better with them (128 threads) when using GPU acceleration.
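The representative command lines above can be generated from a few parameters. The sketch below uses hypothetical helper names; the flags themselves are copied from the command lines shown in this post, with 64 OpenMP threads for GROMACS, 128 for NAMD 2, and 1 for NAMD 3 alpha.

```python
def gromacs_cmd(tpr, gpu_id=0, ntomp=64, nsteps=10000):
    """Build the GROMACS mdrun command line used for a benchmark input."""
    return (
        f"gmx mdrun -gpu_id {gpu_id} -pin on -noconfout "
        f"-ntomp {ntomp} -s {tpr} -nsteps {nsteps}"
    )


def namd_cmd(binary, config, procs, device=0):
    """Build a NAMD command line; `binary` is 'namd2' or 'namd3'."""
    return (
        f"{binary} +p{procs} +setcpuaffinity +idlepoll "
        f"+isomalloc_sync +devices {device} {config}"
    )


# Reproduce the three representative command lines from this post.
print(gromacs_cmd("gromacs/MEM/benchMEM.tpr"))
print(namd_cmd("namd2", "namd/f1atpase/f1atpase.namd", procs=128))
print(namd_cmd("namd3", "namd3/apoa1/apoa1.namd", procs=1))
```

Keeping the thread count and device index as parameters makes it easy to repeat the same job on each GPU or to compare runs with and without SMT threads.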

Results

There are 16 charts: 8 for the GeForce GPUs and 8 for the Professional GPUs. The X-axis (performance) uses the same scale for both.

There are anomalies!

  • GROMACS did not perform well on the A6000. This may be some quirk in how GROMACS handles the 48GB of memory available on the A6000.
  • The A4000 gave surprisingly poor performance on all tests. I had naively expected it to perform in line with the (excellent) A4500.
  • The RTX 3080 gave excellent performance with NAMD 2 on larger problem sizes. This was unexpected. I'm used to seeing flat performance across the higher-end GeForce GPUs since job runs are often bottlenecked by CPU performance. I cannot explain the 3080's good performance.
  • NAMD 3 alpha10 was a nice addition to the testing. This is GPU resident code, so performance scaled more consistently with GPU specs. It is still early development code, and the excellent performance illustrated in the charts below speaks well for future releases.

GROMACS:


GROMACS benchmark MEM bar chart GeForce GPUs
GROMACS MEM benchmark bar chart Pro GPUs

This is the first of the anomalies. GROMACS did not do well with the A6000. I repeated the benchmarks many times and the result was consistent. If you have an explanation or observation, please add it to the comments.


GROMACS RIB benchmark bar chart GeForce GPUs
GROMACS RIB benchmark bar chart Pro GPUs

GROMACS PEP benchmark bar chart GeForce GPUs
GROMACS PEP benchmark bar chart Pro GPUs

NAMD 2.14


NAMD apoa1 benchmark bar chart GeForce GPUs
NAMD apoa1 benchmark bar chart Pro GPUs

NAMD f1atpase benchmark bar chart GeForce GPUs
NAMD f1atpase benchmark bar chart Pro GPUs

This is another anomaly. I cannot explain the great performance of the RTX 3080. This was a "Founders Edition" card like all the rest.

Also note the apparent CPU bottleneck with the higher-end GPUs. The ATPase (f1atpase) benchmark uses more methods that are CPU dependent than the other NAMD benchmarks. Some of these methods have not yet been ported to NAMD 3, possibly because they are difficult to port entirely to GPU.


NAMD 2 stmv benchmark bar chart GeForce GPUs
NAMD 2 stmv benchmark bar chart Pro GPUs

NAMD v3 alpha10:


NAMD v3 alpha10 apoa1 benchmark bar chart GeForce GPUs
NAMD v3 alpha10 apoa1 benchmark bar chart Pro GPUs

Notice how the GPU resident code of NAMD 3 alpha scales more smoothly with the specs of the GPUs, and how significantly performance improves over NAMD 2. I'm looking forward to seeing the "production" version!


NAMD 3 alpha10 stmv benchmark bar chart GeForce GPUs
NAMD 3 alpha10 stmv benchmark bar chart Pro GPUs

GeForce RTX vs Pro RTX

There is no denying that GeForce "gaming" GPUs can make powerful compute devices. It was experimentation on gaming GPUs that started the whole idea of general GPU compute acceleration. NVIDIA capitalized on this in the best way possible, eventually becoming (in my opinion) the most important scientific hardware vendor in history. Early adopters used gaming GPUs, and Molecular Dynamics was one of the first areas of success.

Of course NVIDIA has higher-end Professional GPUs that used to be labeled as "Quadro" (after the quad buffering that was useful for stereo imaging). These (and AMD FirePro) have been the choice for serious workstation use for decades.

Beyond that there is the pinnacle of GPU compute acceleration offered by what was formerly "Tesla". GPUs like the A100 and upcoming Hopper H100 are astounding compute devices offering many advanced features such as excellent double precision (FP64) floating point.

However, Molecular Dynamics force calculations are generally performed at FP32. Thus GeForce and Professional GPUs usually suffice. CPUs have been needed for some parts of the code that were difficult to port to GPU, but the trend is to move as much as possible to the GPU since that is where the most compute performance exists. NAMD 3 alpha is an example of that.

In this post I used both GeForce and Professional GPUs since both are worthy of consideration. For compute they are mostly the same since they use the same GPU cores. The RTX A series Pro GPUs are more expensive than the GeForce RTX at the same performance level, but they offer a couple of advantages that might be worth the expenditure. They are certainly great GPUs!

  • The RTX Professional GPUs are highly reliable. We sell thousands of GPUs annually and the Pro series have very low failure rates in the field. We track all failures in the field as well as cards that are "dead on arrival" or fail during stress testing. We looked at the failures for NVIDIA Professional and realized that the failure rate was almost entirely due to a batch of cards that came in with a bad video socket. There were only 1 or 2 failures other than that. The RTX GeForce cards are also very reliable, but we have many more failures in the field and during testing. In the early to mid 2010s I would advise folks who were building multi-GPU systems with GeForce to buy extra cards in anticipation of a failure approximately every 6-12 months. They are more reliable these days, but the Professional cards are much better in this regard.
  • The Professional cards are better engineered and use the best selected components. They are the best choice for multi-GPU configurations because of superior cooling offered with their blower fans and often more compact form. Most of the GeForce GPUs use a cooling system that exhausts into the chassis and requires at least one card space between devices. The Professional GPUs exhaust out the rear of the case and can be placed next to each other.

Conclusions

I hope this testing is useful for you! It can be difficult to find performance test results on current hardware with a large enough test suite to inform decisions for system configurations. It's not feasible to test every job type and setup variation with these feature rich applications. Hopefully there is enough information in the text and charts to assist decisions.

This GPU performance "roundup" for Molecular Dynamics has been on my ToDo list for some time. The new benchmarking environment and automation were a big help. I have several ideas for improving the benchmark usability and will refactor my current code and then make it available to the public.

If I had to pick a "winner" it would have to be the RTX 3080. Really, though, all of these GPUs, except the A4000, give several factors of performance improvement over CPU alone. (See, for example, AMD Threadripper Pro 3995x HPL HPCG NAMD Performance Testing (Preliminary).)

Happy computing! –dbk @dbkinghorn

