Molecular Dynamics Performance on GPU Workstations — NAMD

Introduction

This is the first in a series of posts on GPU-accelerated molecular dynamics programs. Molecular dynamics codes spend a large part of their run time calculating forces between atoms using relatively simple potential functions that describe the interactions of the atoms in the molecular system being simulated. These calculations are performed repeatedly at very small time steps to simulate the dynamics of the system being studied. Many of these calculations can be computed to sufficient accuracy using single-precision floating point and are ideal for execution on GPUs. The result is that these codes can achieve very good performance on a modern GPU-accelerated workstation, giving job performance that was only achievable with CPU compute clusters a few years ago.

NAMD

NAMD is a molecular dynamics program developed and maintained by the Theoretical and Computational Biophysics Group at the University of Illinois at Urbana-Champaign.

It is proprietary software licensed by the University of Illinois and is made freely available, including source code, under a non-exclusive, non-commercial use license.

NAMD is a widely used molecular dynamics program capable of performing simulations on systems with millions of atoms. It is also highly parallel and is often installed on large compute clusters. The underlying parallelism is achieved by using the (very interesting) parallel objects framework charm++.

The group at UIUC working on NAMD was an early pioneer of using GPUs for compute acceleration, and NAMD gets very good performance acceleration from NVIDIA CUDA.

Obtaining NAMD

NAMD is available as source that you can compile yourself or in a variety of binary builds. I highly recommend that you just grab an appropriate binary, un-tar it, and go to work! Simple is good 🙂

The binary builds that I will use for testing are from the version 2.10 builds.

You will need to register on the site to obtain the programs. They need you to do that for several reasons, including having a record of downloads that they can use to help with obtaining their funding … always a good thing!

I’m doing my testing on Linux, but there are also binary builds for Mac OS X and Windows. I have tested the CUDA-accelerated Windows version and it works well with or without Cygwin, though Cygwin will make life easier in general on Windows.
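As an example of how simple this is, here is roughly what getting started looks like on Linux. This is just a sketch: the tar-ball name below is the 2.10 Linux multicore-CUDA build from the download page, and your_simulation.namd is a placeholder for whatever config file you want to run.

# unpack the binary build and run from the resulting directory
tar xzf NAMD_2.10_Linux-x86_64-multicore-CUDA.tar.gz
cd NAMD_2.10_Linux-x86_64-multicore-CUDA

# GPU-accelerated run on 16 threads (+idlepoll is recommended for CUDA builds)
./namd2 +p16 +idlepoll your_simulation.namd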

Test configurations

We are looking at single node GPU accelerated workstation performance and will test on two base computer systems.

The Peak Tower Single — single CPU with 1-4 GPUs
CPU: Intel Xeon E5 1660v3 8-core @ 3.0GHz
Memory: 64 GB DDR4 2133MHz Reg ECC
PCIe: (4) X16-X16 v3
The Peak Tower Dual — dual CPU with 1-4 GPUs
CPU: (2) Intel Xeon E5 2687v3 10-core @ 3.1GHz
Memory: 256 GB DDR4 2133MHz Reg ECC
PCIe: (4) X16-X16 v3

OS

The PC software for the testing was CentOS 7.1 (1503) plus updates. The NVIDIA drivers and CUDA environment were set up using the CUDA 7.5 repo for CentOS 7, cuda-repo-rhel7-7-5-local-7.5-18.x86_64.rpm, which includes the NVIDIA 352.39 kernel modules.
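The repo setup is the usual NVIDIA procedure; a sketch of the commands (run as root) for the rpm named above:

rpm -i cuda-repo-rhel7-7-5-local-7.5-18.x86_64.rpm
yum clean expire-cache
yum install cuda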

GPUs

This testing is focused on NVIDIA GeForce GPUs. We are really only interested in single-precision performance from the GPUs. The newer NVIDIA GPUs (Maxwell based) have exceptionally good single-precision performance, and GPU-accelerated molecular dynamics codes are carefully crafted to take advantage of this without degrading computational accuracy. The CPU provides double-precision accuracy where it is needed.

There has been some debate in the past about using GeForce cards for scientific computation. Tesla model cards from NVIDIA are designed for compute and offer features such as ECC error correction and a better cooling design, and they are built from higher “binned” parts — it’s the best stuff! In cluster applications it is highly recommended to use Tesla. However, newer GeForce cards have very good thermal and power design.

Our experience using cards from top-tier vendors like EVGA and ASUS has been excellent, with very low failure rates. It used to be that the most important consideration when picking cards for compute was to avoid anything that was overclocked. These days it’s actually hard to find any cards that are not overclocked! There is a reason for that … the newer chips and designs are just really good! Manufacturers will by default overclock cards that are well within design specs. You can argue that an overclocked card is using a higher binned part. Still, it makes me nervous because I’ve been doing this for a long time, but I concede that the newer cards like the 9xx GeForce cards and Titan X are just excellent cards. I would not be too concerned about overclocked cards, but I would still probably recommend avoiding “superclocked” cards. [ I think “superclocked” is the new “overclocked” ]

Video cards from top-tier manufacturers with good cooling hardware will give good performance and hold up well to heavy load. They are also inexpensive enough that if they show any sign of failure or inconsistency you should plan to replace them without hesitation. Budget for that! My personal expectation when using GeForce cards for compute is that you will be replacing some cards in 6 to 9 months if they are under constant heavy load. They may very well hold up for several years! Over the years I have had varying luck with GeForce cards, but it looks like the 900 series cards are excellent and performance is outstanding! However, if you are considering GPUs for multi-node clusters we would be much more inclined to recommend NVIDIA Tesla GPUs!

Video cards used for testing. The data is from nvidia-smi -q (a query example follows the notes below).

Card         | CUDA cores | GPU clock (MHz) | Memory clock (MHz)* | Application clock (MHz)** | FB memory (MiB)
Titan Black  | 2880       | 1202            | 3500                | 888                       | 6143
GTX 960 (OC) | 1024       | 1493            | 3600                | 1228                      | 2047
GTX 970 (OC) | 1664       | 1455            | 3505                | 1114                      | 4095
GTX 980      | 2048       | 1392            | 3505                | 1126                      | 4095
GTX 980 Ti   | 2816       | 1392            | 3505                | 1000                      | 6143
GTX TITAN X  | 3072       | 1392            | 3505                | 1000                      | 12287

Notes:
* Marketing magic often reports twice that number as MT/s.
** a.k.a. base clock.
The 960 and 970 used were factory overclocked versions.
The Titan Black is a “Kepler” based GPU; all others are “Maxwell” based.
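If you want to pull the same clock and memory data for your own cards, the relevant sections of the full nvidia-smi query can be selected directly:

# show only the clock and memory sections of the query output
nvidia-smi -q -d CLOCK,MEMORY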

Testing Simulations

The test simulation data and configuration files can be downloaded from the NAMD utilities page (see the download example after the list below). All jobs were run using the default configuration files (500 time steps).

ApoA1 benchmark [ apoa1.namd ]
Apolipoprotein A-I
92,224 atoms, periodic, PME (Particle Mesh Ewald)
ATPase benchmark [ f1atpase.namd ]
Adenosine tri-phosphate (ATP) synthase
327,506 atoms, periodic, PME
STMV benchmark [ stmv.namd ]
Satellite Tobacco Mosaic Virus
1,066,628 atoms, periodic, PME
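As a concrete example, here is roughly how to fetch and run one of the benchmarks. The URL follows the pattern used on the NAMD utilities page, and the namd2 path assumes you are running from the binary build directory:

# download and unpack the ApoA1 benchmark files
wget http://www.ks.uiuc.edu/Research/namd/utilities/apoa1.tar.gz
tar xzf apoa1.tar.gz

# run the default 500 time step simulation
./namd2 +p16 +idlepoll apoa1/apoa1.namd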

Results

The numbers mostly speak for themselves, but do read the “Notes:” section at the end of each table. The following tables present the most interesting data; however, I ran a lot more jobs than are listed here, so I’ve added an appendix to the post with more job runs. I ran GPU-accelerated jobs with as few as 2 CPU cores and up to full hyperthreading. I was looking for balance between CPU and GPU, i.e. when a job is limited by CPU or GPU resources.
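Those core-count sweeps are easy to script; a minimal sketch (the log file names are just my own convention):

# run the ApoA1 benchmark at increasing thread counts, saving each log
for p in 2 4 8 16; do
    ./namd2 +p$p +idlepoll apoa1/apoa1.namd > apoa1-p$p.log
done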

There are results for each of the two test platforms for “CPU only” job runs that show the excellent thread scaling of NAMD. It is essentially linear scaling with CPU core count. Also, hyperthreading improved performance in every case for CPU-only job runs. However, with GPU-accelerated job runs the Peak Single mostly benefited from hyperthreading, but the Peak Dual did better using only “real” cores, with the exception of the run with 4 TITAN X cards.

The GPU acceleration for NAMD is very good. Adding nearly any CUDA-capable NVIDIA GPU will significantly improve performance. There are diminishing returns when the GPU capability exceeds the CPU’s ability to keep up. I didn’t do a price/performance analysis, but it should be pretty obvious which configurations offered the best performance value. I’ll comment at the end of the post.

Caveats:

1) I mentioned it before and I want to reiterate — heavy compute on GeForce cards can shorten their lifetime! I believe it is perfectly fine to use these cards but keep in mind that you may fry one now and then, especially any overclocked cards!

2) The numbers should not be taken as definitive results. There can be considerable variation in both the run time and the all-important day/ns numbers. The “days per nanosecond” number has the most variability, since I just used the number reported at the last “benchmark” phase of each job run, and that is not necessarily the best result of all of the “benchmark” reports during the run. Also, these jobs were only run for 500 time steps; a “real” job would run significantly longer. I feel best about the “wall time” as a measure of performance.
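If you want to pull the same numbers out of your own runs, both measures appear in the NAMD log output; a sketch using the log names from the loop above:

# per-phase performance estimates ("Info: Benchmark time: ... days/ns ...")
grep "Benchmark time" apoa1-p16.log

# total wall-clock time reported at the end of the job
grep "WallClock" apoa1-p16.log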

Peak Tower Single — Xeon E5 1660v3 8-core @ 3.0GHz
[CPU only Results]

CPU cores | apoa1 wall time | apoa1 day/ns | f1atpase wall time | f1atpase day/ns | stmv wall time | stmv day/ns
1         | 417.5           | 9.58         | 1251.3             | 28.45           | 4177.1         | 95.32
2         | 209.8           | 4.78         |  630.3             | 14.22           | 2124.2         | 48.07
4         | 114.1           | 2.56         |  342.8             |  7.68           | 1148.7         | 25.98
8         |  59.0           | 1.29         |  176.2             |  3.87           |  585.9         | 13.14
16(HT)    |  51.5           | 1.11         |  151.9             |  3.29           |  514.8         | 11.33
Notes:
(HT) indicates that hyperthreading is being used.
“wall time” is in seconds

Peak Tower Single — Xeon E5 1660v3 8-core @ 3.0GHz
[GPU Acceleration Results]

Card        | CPU cores | apoa1 wall time | apoa1 day/ns | f1atpase wall time | f1atpase day/ns | stmv wall time | stmv day/ns
Titan Black | 16(HT)    | 14.4            | 0.29         | 44.7               | 0.92            | 107.2          | 2.11
GTX 960     | 16(HT)    | 21.0            | 0.44         | 64.1               | 1.36            | 141.9          | 2.89
GTX 970     | 16(HT)    | 14.7            | 0.30         | 45.5               | 0.94            | 104.3          | 2.03
GTX 980     | 16(HT)    | 13.2            | 0.26         | 40.4               | 0.82            |  95.1          | 1.82
GTX 980 Ti  | 16(HT)    | 11.1            | 0.21         | 32.9               | 0.64            |  82.3          | 1.53
TITAN X     | 16(HT)    | 10.3            | 0.20         | 32.3               | 0.63            |  80.7          | 1.46
(2)GTX 970  | 16(HT)    |  9.1            | 0.16         | 27.1               | 0.50            |  71.3          | 1.25
(2)TITAN X  | 16(HT)    |  7.9            | 0.13         | 23.1               | 0.38            |  66.7          | 1.10
Notes:
Hyperthreading was generally beneficial on this system, but see the Appendix 2 data for a comparison.
The Titan Black is listed first since it is the only Kepler-based card in the tests. The 900 series cards are listed in order of increasing performance.

Peak Tower Dual — Intel Xeon E5 2687v3 10-core @ 3.1GHz
[CPU only Results]

CPU cores | apoa1 wall time | apoa1 day/ns | f1atpase wall time | f1atpase day/ns | stmv wall time | stmv day/ns
1         | 418.5           | 9.53         | 1264.7             | 28.83           | 4220.0         | 96.29
2         | 214.5           | 4.89         |  643.9             | 14.47           | 2167.0         | 48.66
4         | 108.1           | 2.41         |  365.0             |  8.00           | 1150.1         | 26.91
8         |  60.7           | 1.33         |  182.2             |  3.96           |  612.9         | 13.59
10        |  48.8           | 1.08         |  155.6             |  3.21           |  494.6         | 11.65
16        |  31.7           | 0.67         |   93.6             |  2.01           |  313.9         |  6.93
20        |  25.7           | 0.54         |   78.6             |  1.62           |  268.3         |  5.51
40(HT)    |  23.6           | 0.47         |   67.5             |  1.38           |  228.0         |  4.80
Notes:
Parallel scaling is essentially linear even when utilizing all CPU cores!

Peak Tower Dual — Intel Xeon E5 2687v3 10-core @ 3.1GHz
[GPU Acceleration Results]

Card        | CPU cores | apoa1 wall time | apoa1 day/ns | f1atpase wall time | f1atpase day/ns | stmv wall time | stmv day/ns
Titan Black | 20        | 13.4            | 0.28         | 42.2               | 0.89            |  97.3          | 2.04
GTX 960     | 20        | 21.0            | 0.44         | 62.3               | 1.34            | 132.3          | 2.78
GTX 970     | 20        | 13.7            | 0.29         | 42.6               | 0.91            |  95.0          | 1.95
GTX 980     | 20        | 12.3            | 0.25         | 37.2               | 0.78            |  83.7          | 1.68
GTX 980 Ti  | 20        |  9.7            | 0.19         | 29.4               | 0.60            |  69.3          | 1.35
TITAN X     | 20        | 10.1            | 0.20         | 29.2               | 0.59            |  65.4          | 1.26
(2)GTX 970  | 20        |  8.1            | 0.15         | 24.1               | 0.47            |  55.9          | 1.03
(2)TITAN X  | 20        |  6.6            | 0.12         | 16.9               | 0.30            |  46.0          | 0.75
(4)GTX 970  | 20        |  6.1            | 0.10         | 15.3               | 0.25            |  42.5          | 0.68
(4)TITAN X  | 20        |  5.6            | 0.09         | 13.5               | 0.20            |  41.3          | 0.63
(4)TITAN X  | 40(HT)    |  6.5            | 0.09         | 15.0               | 0.19            |  42.7          | 0.54
Notes:
Hyperthreading degraded performance for all job runs except the 4 TITAN X run.
Beyond 2 TITAN X cards, adding more GPUs had marginal benefit. Using 4 cards effectively would likely require higher core-count CPUs to balance the GPU performance.

Conclusions and Recommendations

Running NAMD with GPU acceleration can increase performance nearly 6-fold over the CPUs alone! That is enough performance to allow moderate-sized MD simulations to be run in a reasonable amount of time on a single-node workstation.

What’s my favorite? Sure, I want the Peak Dual with 4 TITAN X’s 🙂 However, I think either a Peak Single with a GTX 980 Ti or a Peak Dual with 2 GTX 980 Ti’s or TITAN X’s would be hard to argue with. I didn’t test (2) GTX 980 Ti’s, but looking at the results I would expect performance near that of the TITAN X’s. Using 4 GTX 970’s on a dual-CPU system is not bad either!

Happy computing –dbk

Appendix 1 — Build NAMD from source (or don’t)

You can download the source and do your own build. I generally do this with scientific codes that I care about. You sometimes learn a lot about a program by digging into the source and looking at the various configure options. For programs that are available in binary format, you may be able to do a build that is better tailored to the hardware you will be running on. If you have the Intel compilers and MKL libraries, you can often get a better-performing build than a gcc-built executable.

I did a few experiments building NAMD from source and was only able to achieve a slightly better performing build … less than 10% speedup … you may or may not consider that worth the effort! I’m sure that with more configuration tweaking you could improve on my results. The other advantage of using a pre-built version is that it will likely be well tested. If you are building your own, you need to check your results carefully to be sure that you haven’t broken anything! Now, having said all that, here are some details of what I did to build from source.

This is not a How-To! It’s only some changes to the standard build processes that I tried. If you are not familiar with building large programs from source then this may not help you that much.

1. Get the source

Download the source tar-ball from the references I have listed earlier, un-tar it and cd into the NAMD_2.10_Source directory.
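For version 2.10 that looks like:

tar xzf NAMD_2.10_Source.tar.gz
cd NAMD_2.10_Source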

2. build charm++

The charm++ tar file is included in the NAMD source tar-ball and it has a very friendly question-guided build process. Just run ./build in the top-level charm++ directory and make appropriate selections. ( Running ./build --help will show you a full list of the build options )

This is what I ended up with to build charm++,

./build charm++ multicore-linux64 iccstatic ifort -j4 --with-production

3. Take care of the tcl dependency

I just used my system installed tcl and tcl-lib for this. [ yum install tcl tcl-devel ]

4. Edit the config files in the arch/ directory

You will need to take care of the files for your CUDA, tcl, and fftw environment, and create a main “arch” config file with the compiler options you want to use. I used a CUDA 7 install instead of the 6.5 version that is used in the binary build. For tcl I used my system-installed version. For the FFT code I used the fftw3 implementation in the Intel MKL. I used Intel Parallel Studio XE 2015 update 3 (the same as used for the charm++ build).

file : Linux-x86_64.cuda


CUDADIR=/usr/local/cuda
CUDAINCL=-I$(CUDADIR)/include
CUDALIB=-L$(CUDADIR)/lib64 -lcudart_static -lrt
CUDASODIR=$(CUDADIR)/lib64
LIBCUDARTSO=
CUDAFLAGS=-DNAMD_CUDA
CUDAOBJS=$(CUDAOBJSRAW)
CUDA=$(CUDAFLAGS) -I. $(CUDAINCL)
CUDACC=$(CUDADIR)/bin/nvcc -O3 --maxrregcount 32 $(CUDAGENCODE) -Xcompiler "-m64" $(CUDA)
CUDAGENCODE=-gencode arch=compute_20,code=sm_20 -gencode arch=compute_20,code=compute_20 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_35,code=compute_35 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_50,code=compute_50
file: Linux-x86_64.tcl


TCLDIR=
TCLINCL=
TCLLIB=-ltcl8.5 -ldl -lpthread
TCLFLAGS=-DNAMD_TCL
TCL=$(TCLINCL) $(TCLFLAGS)
file: Linux-x86_64.fftw3


FFTDIR=$(MKLROOT)
FFTINCL=-I$(MKLROOT)/include -I$(FFTDIR)/include/fftw
FFTLIB= -mkl
FFTFLAGS=-DNAMD_FFTW -DNAMD_FFTW_3
FFT=$(FFTINCL) $(FFTFLAGS)

Then create the main “arch”

file: Linux-x86_64-icc-cuda-op.arch


NAMD_ARCH = Linux-x86_64
CHARMARCH = multicore-linux64-ifort-iccstatic


FLOATOPTS = -O2 -xCORE-AVX2 -openmp  -ip


CXX = icpc
CXXOPTS =  $(FLOATOPTS)
CXXNOALIASOPTS =  $(FLOATOPTS)


CC = icc
COPTS =   $(FLOATOPTS)

With these files in place you can cd up from the arch directory into the main source directory and run the following to create your build configuration.

./config Linux-x86_64-icc-cuda-op --with-cuda --with-fftw3
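The build itself then follows the standard NAMD procedure from the directory that config creates; adjust the -j value to your core count:

cd Linux-x86_64-icc-cuda-op
make -j8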

My compiler options are -O2 for standard optimizations, -xCORE-AVX2 as an attempt to get the compiler to vectorize the code for Haswell AVX2, -openmp for the threading and -ip for single file interprocedural optimizations.

The result: …less than 10% speedup over the normal binary builds … maybe not worth the effort!??

If you replace the -xCORE-AVX2 flag with -no-vec you get essentially the same result, which means the code is not vectorizing on the CPU. That’s a shame really, but that’s how it is. NAMD scales very well across cores (and nodes too, but we won’t test that). We are also getting a wonderful speedup from the CUDA code on the GPU, and that’s the main point of this testing!
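That experiment is a one-flag edit to the arch file created above, followed by a rebuild; a sketch using sed:

# swap the Haswell vectorization flag for -no-vec, then rebuild
sed -i 's/-xCORE-AVX2/-no-vec/' arch/Linux-x86_64-icc-cuda-op.arch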

Note: I did get one compile warning related to vectorization. The following message refers to an IBM compiler-specific pragma.

src/OptPme.C(996): warning #161: unrecognized #pragma
  #pragma disjoint (*qmsg, *data)

I tried replacing this pragma with something similar for Intel but saw no performance gain. [ The pragma is basically telling the compiler it’s OK to vectorize because the pointers referred to are independent. ]

Appendix 2 — Extra Data

More data for your number viewing enjoyment!

Peak Tower Single — Xeon E5 1660v3 8-core @ 3.0GHz
[GPU Acceleration Results]

Card        | CPU cores | apoa1 wall time | apoa1 day/ns | f1atpase wall time | f1atpase day/ns | stmv wall time | stmv day/ns
Titan Black | 4         | 14.2            | 0.30         | 45.1               | 0.96            | 126.1          | 2.57
Titan Black | 8         | 14.0            | 0.29         | 44.4               | 0.95            | 103.6          | 2.14
Titan Black | 16(HT)    | 14.4            | 0.29         | 44.7               | 0.92            | 107.2          | 2.11
GTX 960     | 4         | 20.7            | 0.45         | 63.8               | 1.39            | 151.7          | 3.22
GTX 960     | 8         | 20.3            | 0.44         | 62.1               | 1.36            | 136.4          | 2.90
GTX 960     | 16(HT)    | 21.0            | 0.44         | 64.1               | 1.36            | 141.9          | 2.89
GTX 970     | 4         | 14.5            | 0.30         | 46.1               | 0.98            | 125.6          | 2.53
GTX 970     | 8         | 14.0            | 0.30         | 44.9               | 0.95            | 100.6          | 2.06
GTX 970     | 16(HT)    | 14.7            | 0.30         | 45.5               | 0.94            | 104.3          | 2.03
GTX 980     | 4         | 13.0            | 0.27         | 42.6               | 0.88            | 123.9          | 2.44
GTX 980     | 8         | 12.8            | 0.27         | 38.3               | 0.81            |  91.6          | 1.86
GTX 980     | 16(HT)    | 13.2            | 0.26         | 40.4               | 0.82            |  95.1          | 1.82
GTX 980 Ti  | 4         | 12.2            | 0.23         | 39.3               | 0.77            | 123.1          | 2.41
GTX 980 Ti  | 8         | 10.9            | 0.22         | 32.5               | 0.64            |  80.2          | 1.57
GTX 980 Ti  | 16(HT)    | 11.1            | 0.21         | 32.9               | 0.64            |  82.3          | 1.53
TITAN X     | 4         | 11.9            | 0.22         | 39.4               | 0.78            | 123.0          | 2.40
TITAN X     | 8         | 10.5            | 0.20         | 34.5               | 0.63            |  77.5          | 1.54
TITAN X     | 16(HT)    | 10.3            | 0.20         | 32.3               | 0.63            |  80.7          | 1.46
(2)GTX 970  | 4         | 12.2            | 0.22         | 37.9               | 0.70            | 120.7          | 2.34
(2)GTX 970  | 8         |  8.4            | 0.16         | 29.4               | 0.55            |  72.6          | 1.36
(2)GTX 970  | 16(HT)    |  9.1            | 0.16         | 27.1               | 0.50            |  71.3          | 1.25
(2)TITAN X  | 4         | 12.1            | 0.22         | 37.7               | 0.70            | 120.5          | 2.34
(2)TITAN X  | 8         |  8.5            | 0.14         | 25.0               | 0.46            |  71.6          | 1.31
(2)TITAN X  | 16(HT)    |  7.9            | 0.13         | 23.1               | 0.38            |  66.7          | 1.10

Peak Tower Dual — Intel Xeon E5 2687v3 10-core @ 3.1GHz
[GPU Acceleration Results]

Card        | CPU cores | apoa1 wall time | apoa1 day/ns | f1atpase wall time | f1atpase day/ns | stmv wall time | stmv day/ns
Titan Black | 10        | 13.7            | 0.29         | 43.1               | 0.91            | 102.4          | 2.09
Titan Black | 20        | 13.4            | 0.28         | 42.2               | 0.89            |  97.3          | 2.04
Titan Black | 40(HT)    | 14.2            | 0.28         | 44.3               | 0.89            | 105.0          | 2.07
GTX 960     | 10        | 21.3            | 0.44         | 63.4               | 1.37            | 136.7          | 2.88
GTX 960     | 20        | 21.0            | 0.44         | 62.3               | 1.34            | 132.3          | 2.78
GTX 960     | 40(HT)    | 21.8            | 0.44         | 64.9               | 1.35            | 140.3          | 2.80
GTX 970     | 10        | 14.1            | 0.29         | 43.6               | 0.93            |  99.8          | 2.06
GTX 970     | 20        | 13.7            | 0.29         | 42.6               | 0.91            |  95.0          | 1.95
GTX 970     | 40(HT)    | 14.7            | 0.29         | 45.0               | 0.91            | 101.3          | 1.97
GTX 980     | 10        | 12.6            | 0.26         | 38.5               | 0.80            |  90.7          | 1.83
GTX 980     | 20        | 12.3            | 0.25         | 37.2               | 0.78            |  83.7          | 1.68
GTX 980     | 40(HT)    | 13.1            | 0.26         | 39.6               | 0.79            |  90.4          | 1.69
GTX 980 Ti  | 10        | 10.3            | 0.20         | 30.7               | 0.63            |  77.5          | 1.52
GTX 980 Ti  | 20        |  9.7            | 0.19         | 29.4               | 0.60            |  69.3          | 1.35
GTX 980 Ti  | 40(HT)    | 10.6            | 0.20         | 32.1               | 0.61            |  77.5          | 1.40
TITAN X     | 10        | 10.5            | 0.21         | 30.1               | 0.61            |  76.5          | 1.60
TITAN X     | 20        | 10.1            | 0.20         | 29.2               | 0.59            |  65.4          | 1.26
TITAN X     | 40(HT)    | 10.9            | 0.21         | 31.3               | 0.60            |  72.4          | 1.29
(2)GTX 970  | 10        |  8.5            | 0.16         | 25.8               | 0.50            |  69.5          | 1.31
(2)GTX 970  | 20        |  8.1            | 0.15         | 24.1               | 0.47            |  55.9          | 1.03
(2)GTX 970  | 40(HT)    |  9.0            | 0.16         | 26.4               | 0.48            |  61.5          | 1.02
(2)TITAN X  | 10        |  7.4            | 0.13         | 32.7               | 0.36            |  62.3          | 1.09
(2)TITAN X  | 20        |  6.6            | 0.12         | 16.9               | 0.30            |  46.0          | 0.75
(2)TITAN X  | 40(HT)    |  7.4            | 0.12         | 19.3               | 0.31            |  50.6          | 0.76
(4)GTX 970  | 10        |  7.7            | 0.12         | 22.5               | 0.38            |  67.6          | 1.21
(4)GTX 970  | 20        |  6.1            | 0.10         | 15.3               | 0.25            |  42.5          | 0.68
(4)GTX 970  | 40(HT)    |  7.0            | 0.10         | 17.5               | 0.25            |  45.9          | 0.64
(4)TITAN X  | 10        |  7.4            | 0.11         | 22.0               | 0.36            |  67.3          | 1.20
(4)TITAN X  | 20        |  5.6            | 0.09         | 13.5               | 0.20            |  41.3          | 0.63
(4)TITAN X  | 40(HT)    |  6.5            | 0.09         | 15.0               | 0.19            |  42.7          | 0.54