The new NVIDIA GeForce GTX 1080 and GTX 1070 GPUs are out, and I’ve received a lot of questions about NAMD performance. The short answer is: performance is great! I’ve got some numbers to back that up below. We’ve got new Broadwell Xeon and Core-i7 CPUs thrown into the mix too. The new hardware refresh gives a nice step up in performance.
This post is a follow-up/refresh of an earlier post on NAMD performance on workstations with GPU acceleration. This post will mostly focus on new performance numbers.
NAMD
NAMD is a widely used molecular dynamics program developed and maintained by the Theoretical and Computational Biophysics Group at the University of Illinois at Urbana-Champaign.
It is proprietary software licensed by the University of Illinois and is made freely available, including source code, under a non-exclusive, non-commercial use license.
The group at UIUC working on NAMD was an early pioneer of using GPUs for compute acceleration, and NAMD achieves very good acceleration with NVIDIA CUDA.
Obtaining NAMD
NAMD is available as source that you can compile yourself or in a variety of binary builds.
The binary builds that I will use for testing are from the version 2.11 release:
- Linux-x86_64-multicore for CPU based SMP parallel tests
- Linux-x86_64-multicore-CUDA for GPU accelerated parallel tests
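As a reference point, here is a minimal sketch of launching each build from Python. The unpacked directory names and the thread count are assumptions for illustration; `+p`, `+idlepoll`, and `+devices` are standard NAMD command-line options.

```python
import subprocess

# CPU-only SMP run: +p sets the number of worker threads.
subprocess.run(
    ["./NAMD_2.11_Linux-x86_64-multicore/namd2", "+p24", "apoa1.namd"],
    check=True,
)

# GPU-accelerated run: +idlepoll keeps the host threads polling the GPU,
# and +devices selects which CUDA devices to use.
subprocess.run(
    ["./NAMD_2.11_Linux-x86_64-multicore-CUDA/namd2",
     "+p24", "+idlepoll", "+devices", "0,1", "apoa1.namd"],
    check=True,
)
```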
Test configurations
We are looking at single node GPU accelerated workstation performance and will test on three base system configurations.
- The Peak Tower Dual
  - CPU: (2) Intel Xeon E5-2687W v4 12-core @ 3.0GHz (3.2GHz All-Core-Turbo)
  - Memory: 256 GB DDR4 2133MHz Reg ECC
  - PCIe: (4) X16-X16 v3
- The Peak Tower Single
  - CPU: (1) Intel Xeon E5-2690 v4 14-core @ 2.6GHz (3.2GHz All-Core-Turbo)
  - Memory: 64 GB DDR4 2133MHz Reg ECC
  - PCIe: (4) X16-X16 v3
- The Peak Tower Single
  - CPU: Intel Core i7-6900K 8-core @ 3.2GHz (3.5GHz All-Core-Turbo)
  - Memory: 64 GB DDR4 2133MHz Reg ECC
  - PCIe: (4) X16-X16 v3
Note: All-Core-Turbo is the “real” clock speed, i.e., the speed when all of the cores are at 100% load.
OS
The system software for the testing was Ubuntu 16.04 plus updates. The NVIDIA driver was version 367.27 from the graphics-drivers PPA.
GPUs
I firmly believe using GeForce cards for scientific computation is OK, especially for workstations. However, I should note that Tesla cards from NVIDIA are designed for compute, offer some additional features, and are a better choice for multi-node cluster applications. NVIDIA will be releasing a PCIe version of the Pascal-based Tesla later this year. Those cards should have the same 1:2 double- to single-precision performance ratio as the GP100 SXM2 modules. The GeForce Pascal cards have very poor double precision performance. (However, most GPU-accelerated applications make very good use of single precision floating point on the GPU!)
Newer GeForce cards have very good thermal and power design. Our experience with cards from top-tier vendors like EVGA and ASUS has been excellent, with very low failure rates even under heavy computational load. It used to be that the most important consideration when picking cards for compute was to avoid anything that was overclocked. These days it’s actually hard to find any cards that are not overclocked! Manufacturers will by default overclock cards that are well within design specs. That makes me nervous because I’ve been doing this for a long time, but I concede that the newer Maxwell and Pascal GeForce cards are excellent designs. I would not be too concerned about overclocked cards, but I would still probably recommend avoiding “superclocked” cards. [I think “superclocked” is the new “overclocked”.]

Video cards from top-tier manufacturers with good cooling hardware will give good performance and hold up well to heavy load. They are also inexpensive enough that if they show any sign of failure or inconsistency you should plan to replace them without hesitation. Budget for that! My personal expectation when using GeForce cards for compute is that you will be replacing some cards in 6 to 9 months if they are under constant heavy load. They may very well hold up for several years, and by then you will be replacing them with faster cards anyway!
The GTX 1080 and 1070 cards used in this testing are “Founders Edition” cards.
Video cards used for testing (data from nvidia-smi):

| Card | CUDA cores | GPU clock MHz | Memory clock MHz* | Application clock MHz** | FB Memory MiB |
|---|---|---|---|---|---|
| GTX 1070 | 1920 | 1506 | 4004 | 1506 | 8110 |
| GTX 1080 | 2560 | 1607 | 5005 | 1607 | 8192 |
| TITAN X | 3072 | 1392 | 3505 | 1000 | 12287 |

Notes: * Marketing magic often reports twice that number as MT/s; ** a.k.a. base clock.
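If you want to pull those numbers yourself, a short query against nvidia-smi does it. This is a sketch; the field names come from `nvidia-smi --help-query-gpu` and should be verified against your driver version:

```python
import subprocess

# Query the fields shown in the table above, one CSV line per GPU.
FIELDS = "name,clocks.gr,clocks.mem,clocks.applications.graphics,memory.total"

out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=" + FIELDS, "--format=csv,noheader"],
    universal_newlines=True,
)
for line in out.strip().splitlines():
    print(line)
```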
Testing Simulations
The test simulation data and configuration files can be downloaded from the NAMD utilities page. All jobs were run using the default configuration files (500 time steps); a sketch of scripting all three runs follows the list below.
- ApoA1 benchmark [ apoa1.namd ]
  - Apolipoprotein A-I
  - 92,224 atoms, periodic, PME (Particle Mesh Ewald)
- ATPase benchmark [ f1atpase.namd ]
  - Adenosine triphosphate (ATP) synthase
  - 327,506 atoms, periodic, PME
- STMV benchmark [ stmv.namd ]
  - Satellite Tobacco Mosaic Virus
  - 1,066,628 atoms, periodic, PME
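Here is a sketch of scripting all three runs back to back with the CUDA build, keeping each log for later inspection. The paths and thread count are assumptions; adjust them for your install.

```python
import subprocess

NAMD = "./NAMD_2.11_Linux-x86_64-multicore-CUDA/namd2"

# Run each benchmark with its default configuration file (500 time steps)
# and save the output so the timing lines can be parsed afterwards.
for conf in ("apoa1.namd", "f1atpase.namd", "stmv.namd"):
    with open(conf.replace(".namd", ".log"), "w") as log:
        subprocess.run([NAMD, "+p24", "+idlepoll", conf],
                       stdout=log, check=True)
```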
Results
The numbers mostly speak for themselves but do read the “Notes:” section at the end of each table.
There are results for each of the test platforms for “CPU only” and “CPU + GPU” job runs. NAMD scaling is essentially linear with CPU core count. Also, hyperthreading improved performance in every case for CPU-only job runs. However, with GPU-accelerated job runs, hyperthreading slowed job run times.
The GPU acceleration for NAMD is very good. Adding nearly any NVIDIA CUDA-capable GPU will significantly improve performance. There are diminishing returns when the GPU capability exceeds the CPU’s ability to keep up. These results indicate that with fast modern GPUs this version of NAMD (2.11) is mostly CPU bound.
Caveats:
Heavy compute on GeForce cards can shorten their lifetime! I believe it is perfectly fine to use these cards, but keep in mind that you may fry one now and then!
The numbers should not be taken as definitive benchmark results! There can be considerable variation in both runtime and the all-important day/ns numbers. The “days per nanosecond” figure has the most variability, since I simply used the number reported at the last “benchmark” phase of each job run, which is not necessarily the best of all the “benchmark” reports during the run. Also, these jobs were only run for 500 time steps; a “real” job would run significantly longer.
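If you want the best rather than the last figure, it is easy to scan the log for every benchmark report and keep the minimum. A sketch, assuming the usual “Info: Benchmark time: ... days/ns” line format of NAMD 2.11 output:

```python
import re

# Matches the days/ns value in lines like:
#   Info: Benchmark time: 24 CPUs ... 0.504 days/ns ...
BENCH = re.compile(r"Benchmark time:.*?([0-9.]+) days/ns")

def best_days_per_ns(logfile):
    """Return the smallest days/ns over all benchmark reports in a log."""
    vals = []
    with open(logfile) as fh:
        for line in fh:
            m = BENCH.search(line)
            if m:
                vals.append(float(m.group(1)))
    return min(vals) if vals else None
```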
Peak Tower Dual – (2) Xeon E5-2687W v4 12-core @ 3.0GHz (3.2GHz) [CPU and GPU Acceleration Results]
|  | apoa1 wall time | day/ns | f1atpase wall time | day/ns | stmv wall time | day/ns |
|---|---|---|---|---|---|---|
| CPU | 19.7 | 0.370 | 55.7 | 1.09 | 181.4 | 3.73 |
| Titan X | 5.67 | 0.0954 | 16.4 | 0.289 | 48.3 | 0.851 |
| (2) Titan X | 4.27 | 0.0593 | 11.9 | 0.168 | 36.7 | 0.548 |
| GTX 1070 | 4.94 | 0.0757 | 14.7 | 0.246 | 56.0 | 0.796 |
| (2) GTX 1070 | 4.19 | 0.0477 | 11.7 | 0.154 | 36.4 | 0.532 |
| GTX 1080 | 4.45 | 0.0653 | 13.1 | 0.207 | 40.3 | 0.652 |
| (2) GTX 1080 | 4.08 | 0.0472 | 11.7 | 0.147 | 35.4 | 0.504 |
Notes:
- Hyperthreading was enabled for the CPU results (48 HT cores). For the GPU results, only “real” cores were used (24 cores).
- It is notable that the GTX 1070 outperformed the Titan X.
Peak Tower Single – Xeon E5-2690 v4 14-core @ 2.6GHz (3.2GHz) [CPU and GPU Acceleration Results]
|  | apoa1 wall time | day/ns | f1atpase wall time | day/ns | stmv wall time | day/ns |
|---|---|---|---|---|---|---|
| CPU | 31.3 | 0.629 | 88.0 | 1.83 | 297.2 | 6.33 |
| Titan X | 5.48 | 0.0896 | 17.0 | 0.278 | 52.5 | 0.944 |
| (2) Titan X | 5.29 | 0.0749 | 15.8 | 0.237 | 47.7 | 0.808 |
| GTX 1070 | 5.25 | 0.0785 | 16.3 | 0.261 | 51.4 | 0.908 |
| (2) GTX 1070 | 5.19 | 0.0708 | 16.1 | 0.238 | 47.9 | 0.811 |
| GTX 1080 | 5.11 | 0.0731 | 16.1 | 0.243 | 48.5 | 0.831 |
| (2) GTX 1080 | 5.13 | 0.0716 | 16.2 | 0.239 | 47.7 | 0.809 |
Notes:
- Hyperthreading was enabled for the CPU results (28 HT cores). For the GPU results, only “real” cores were used (14 cores).
- These results are CPU bound when GPU acceleration is used!
Peak Tower Single – Core i7-6900K 8-core @ 3.2GHz (3.5GHz) [CPU and GPU Acceleration Results]
|  | apoa1 wall time | day/ns | f1atpase wall time | day/ns | stmv wall time | day/ns |
|---|---|---|---|---|---|---|
| CPU | 45.8 | 0.975 | 134.0 | 2.92 | 457.0 | 10.0 |
| Titan X | 6.16 | 0.102 | 20.4 | 0.338 | 62.4 | 1.14 |
| (2) Titan X | 6.44 | 0.102 | 20.4 | 0.338 | 62.6 | 1.13 |
| GTX 1070 | 6.26 | 0.103 | 20.5 | 0.342 | 62.2 | 1.13 |
| (2) GTX 1070 | 6.38 | 0.102 | 20.6 | 0.338 | 62.7 | 1.13 |
| GTX 1080 | 6.41 | 0.102 | 20.5 | 0.343 | 61.9 | 1.12 |
| (2) GTX 1080 | 6.43 | 0.102 | 20.5 | 0.337 | 63.2 | 1.12 |
Notes:
- Hyperthreading was enabled for the CPU results (16 HT cores). For the GPU results, only “real” cores were used (8 cores).
- These results are CPU bound when GPU acceleration is used!
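As a quick arithmetic check on the overall speedup, here are the stmv ratios computed from the three tables above (CPU-only day/ns divided by the best two-GPU day/ns on each platform):

```python
# stmv day/ns from the tables above (smaller is better).
cpu_only = {"dual Xeon": 3.73, "single Xeon": 6.33, "Core i7": 10.0}
best_gpu = {"dual Xeon": 0.504, "single Xeon": 0.808, "Core i7": 1.12}

for system, cpu in cpu_only.items():
    print("{}: {:.1f}x speedup".format(system, cpu / best_gpu[system]))
# dual Xeon: 7.4x, single Xeon: 7.8x, Core i7: 8.9x
```

The smaller apoa1 benchmark pushes the ratio higher on the Core-i7 system (0.975 / 0.102 is roughly 9.6x), consistent with the 8-10x range quoted in the conclusions.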
Conclusions and recommendations
Running NAMD with GPU acceleration can increase performance by a factor of 8-10 over CPU alone! That is enough performance to allow moderately sized MD simulations to run in a reasonable amount of time on a single-node workstation. The folks at UIUC are constantly working on NAMD, and I’m sure that future versions will move more of the workload to the GPU, since there is still significant performance to be gained. Even though the Peak systems can accommodate four X16 GPUs, it seems that more than two (or even just two) may be more GPU compute performance than the CPUs can keep up with. I think either a Peak Single with a GTX 1080, or a Peak Dual with two GTX 1080s or two GTX 1070s and as much CPU as your budget will allow, would be excellent for NAMD at this point in time.
Happy computing –dbk