
https://www.pugetsystems.com

Read this article at https://www.pugetsystems.com/guides/1124
Dr Donald Kinghorn (Scientific Computing Advisor )

NAMD Performance on Xeon-Scalable 8180 and 8 GTX 1080Ti GPUs

Written on March 9, 2018 by Dr Donald Kinghorn


I have been doing validation and performance testing on a very nice dual Xeon-Scalable system that supports up to 8 GPU's. It's been a very impressive system! This post will look at the molecular dynamics program, NAMD. NAMD has good GPU acceleration but is heavily dependent on CPU performance as well. It achieves best performance when there is a proper balance between CPU and GPU. The system under test has 2 Xeon 8180 28-core CPU's. That's the current top of the line Intel processor. We'll see how many GPU's we can add to those Xeon 8180 CPU's to get optimal CPU/GPU compute balance with NAMD.

I recently wrote about performance with TensorFlow that shows off the GPU performance on this system, TensorFlow Scaling on 8 1080Ti GPUs - Billion Words Benchmark with LSTM on a Docker Workstation Configuration. For that testing the Xeon 8180's were way more capable CPU's than were needed for that workload. Here I'll focus more on the CPU performance (but still include GPU's for some of the testing).


Hardware

The relevant components of the system under test were as follows:

  • Motherboard: TYAN S7109GM2NR-2T [Dual root complex with 4 PLX PEX8747 PCIe switches] In chassis B7109F77DV14HR-2T-N
  • CPU's: 2 x Intel Xeon Scalable Platinum 8180 CPU @ 2.50GHz 28-Core
  • Memory: 768GB DDR4 REG ECC 12 x 64GB 2666MHz
  • GPU's: 8 x NVIDIA 1080Ti

Why 1080Ti's? That's what I had 8 of! ... besides, they are great GPU's for code that is optimized with single precision GPU acceleration like NAMD.


NAMD

NAMD is a molecular dynamics program developed by the Theoretical and Computational Biophysics Group at the University of Illinois at Urbana-Champaign. The group at UIUC working on NAMD were early pioneers in using GPU's for compute acceleration. NAMD has very good GPU acceleration, but its performance also depends heavily on the CPU. I consider it "CPU bound".

Two NAMD versions used for testing

I used two NAMD builds for this testing.

  • For the GPU acceleration and scaling testing I used the GPU optimized version 2.12 docker image from the NVIDIA NGC registry. (information below).
  • For the CPU scaling testing I used the CPU binary build version 2.12 from the NAMD site at UIUC.

There is a detailed description of how to set up an Ubuntu 16.04 workstation for use with docker and NVIDIA NGC in the following posts,


Running the Million Atom Simulation, STMV (Satellite Tobacco Mosaic Virus) Benchmark.

GPU Job Runs

In the post in the links above, "Part 4 Accessing the NGC Registry", there is information about accessing the NGC registry. For this testing I used a container instance from the HPC directory on NGC, nvcr.io/hpc/namd:2.12-171025. That container instance contains a directory with NAMD builds for multicore + CUDA in 3 varieties, "standard", "memory optimized", and a build with Infiniband support. There is an "examples" directory with the needed files for jobs apoa1 and stmv. apoa1 is way too small for testing on this machine but the stmv job is a good standard benchmark. The example configuration for stmv in this container image is set up to use the memory optimized runtime. I tested that and the performance was not as good as with the "standard" CUDA build.

I used the stmv input file that I have used in all of my past NAMD testing posts. I have it configured for 500 time steps and I use the last reported "day/ns" value as my benchmark. This input file and its supporting files are available from the NAMD Utilities page.

After docker log-in to the NGC registry the NAMD container is started with the following command,

docker run --runtime=nvidia --rm -it -v $HOME/projects:/projects nvcr.io/hpc/namd:2.12-171025

The stmv job directory I use is in "projects" on the host so I bind that directory into the container.

From the "projects" directory, where I have the stmv input files, the command to run the job is,

/opt/namd/namd-multicore +p56 +setcpuaffinity +idlepoll +devices 0,1,2,3,4,5,6,7  stmv.namd
  • The +devices flag is followed by the list of GPU's to be used. In that example it would be all 8 GPU's. I varied that from 1 to 8 GPU's for the testing results.
  • +p56 was used for the GPU testing to provide all 56 CPU cores.
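The GPU runs described above can be sketched as a small Python helper that builds one command line per GPU count. This is only an illustration: the binary path and flags are copied from the command shown, the GPU counts are the ones reported in the results table, and the choice of device IDs 0..n-1 for the smaller runs is my assumption (the post does not say which devices were selected). The helper name `namd_gpu_command` is hypothetical and nothing is executed.

```python
# Sketch: build the NAMD command line for each GPU count tested.
# Binary path and flags are copied from the command in the post.
NAMD_BIN = "/opt/namd/namd-multicore"
GPU_COUNTS = [1, 2, 4, 6, 8]  # GPU counts reported in the results table


def namd_gpu_command(num_gpus, cores=56, config="stmv.namd"):
    """Return the command string for a run on GPUs 0..num_gpus-1.

    Assumption: the first num_gpus device IDs are used.
    """
    devices = ",".join(str(i) for i in range(num_gpus))
    return (f"{NAMD_BIN} +p{cores} +setcpuaffinity +idlepoll "
            f"+devices {devices} {config}")


# Only build and print the commands; running them is left to the reader.
commands = [namd_gpu_command(n) for n in GPU_COUNTS]
for cmd in commands:
    print(cmd)
```

The same pattern would cover the CPU-only runs by dropping the +devices flag and varying the cores argument instead.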

CPU Job Runs

For CPU performance scaling testing the non-CUDA multicore NAMD build discussed in the NAMD version section was used on the host system i.e. not with docker.

The way I had my directory layout configured the job run command lines were,

../../NAMD_2.12_Linux-x86_64-multicore/namd2 +p56 +setcpuaffinity +idlepoll stmv.namd
  • The +p flag was varied from 1 to 56 cores.

Note: It really doesn't make sense in most cases to run NAMD on CPU-only, because there is a very good performance boost from adding GPU's. However, I wanted to look at CPU performance on these high-end Xeon 8180 CPU's and NAMD scales very well on CPU (and multi-node).


Results

The NAMD performance on this system is better than any system I have tested previously, by a factor greater than 2. That includes previous generation quad-socket high-end Xeon systems with multiple GPU's.

For a performance measure of the job run results in the tables and plots I am using Nano-seconds Per Day (ns/day) rather than "Day Per Nano-second" (day/ns) which is what is reported in the NAMD job output.
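Since ns/day is just the reciprocal of day/ns, the conversion is easy to script. The sketch below pulls the days/ns value out of a NAMD "Benchmark time" line and inverts it. The sample line is the ATPase output quoted in the comments at the end of this post, not an STMV result, and the function names are hypothetical.

```python
import re

# A "Benchmark time" line as printed in NAMD job output. This sample is the
# ATPase line quoted in the comments of this post.
sample = ("Info: Benchmark time: 14 CPUs 0.0146536 s/step "
          "0.169602 days/ns 997.723 MB memory")


def days_per_ns(line):
    """Extract the days/ns value from a NAMD 'Benchmark time' line."""
    match = re.search(r"([0-9.]+)\s+days/ns", line)
    if match is None:
        raise ValueError("no days/ns value found in line")
    return float(match.group(1))


def ns_per_day(line):
    """ns/day is the reciprocal of days/ns."""
    return 1.0 / days_per_ns(line)


print(f"{ns_per_day(sample):.3f} ns/day")
```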

GPU accelerated results

NAMD STMV Benchmark on Dual Xeon 8180 (56 total cores) and 1-8 1080Ti GPU's

Number of GPU's   Simulation Nano-seconds Per Day   Performance Increase   % Efficiency
1                 2.288                             1                      100%
2                 4.032                             1.76                   88.1%
4                 4.975                             2.17                   54.4%
6                 5.291                             2.31                   38.5%
8                 5.882                             2.57                   32.1%

The multi-GPU scaling in this table is not surprising since this code is CPU bound when so many GPU's are used. What is surprising, is that nearly 6 nano-seconds of dynamics simulation for this million atom system can be achieved in 1 day on a single node. That is very good! The plot below will better illustrate the scaling.
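For reference, the "Performance Increase" and "% Efficiency" columns follow directly from the ns/day values: the increase is the ratio to the single-GPU result, and the efficiency is that ratio divided by the number of GPU's. A minimal Python sketch using the numbers from the table (variable names are mine):

```python
# ns/day values from the GPU table, keyed by number of GPUs.
ns_per_day = {1: 2.288, 2: 4.032, 4: 4.975, 6: 5.291, 8: 5.882}

baseline = ns_per_day[1]  # single-GPU result
rows = []
for gpus, perf in sorted(ns_per_day.items()):
    speedup = perf / baseline      # "Performance Increase" column
    efficiency = speedup / gpus    # "% Efficiency" column, as a fraction
    rows.append((gpus, speedup, efficiency))
    print(f"{gpus} GPUs: speedup {speedup:.2f}, efficiency {efficiency:.1%}")
```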

Fitting the data to an Amdahl's Law curve gives a parallel fraction of P = 0.70. [That means that the maximum speedup achievable is unlikely to exceed 1/(1-P) = 3.3 with any number of GPU's in the system.] More CPU performance to balance the GPU's would likely be needed to get better overall scaling.

Here's the expression of Amdahl's Law that was used for the regression fit of the data,

performance (ns/day) = 2.288 (ns/day) / ((1 - P) + (P / num_GPUs))

Below is a plot of that curve,

NAMD GPU scaling
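A fit of this kind can be reproduced with a few lines of Python. The sketch below is not necessarily the regression method used for the plot; it is a simple brute-force least-squares scan over P, chosen so that it needs nothing beyond the standard library. The data points are the ns/day values from the GPU table, and the function names are hypothetical.

```python
# Measured results from the GPU table: (number of GPUs, ns/day).
data = [(1, 2.288), (2, 4.032), (4, 4.975), (6, 5.291), (8, 5.882)]
BASE = 2.288  # single-GPU performance in ns/day


def amdahl(n, p, base=BASE):
    """Predicted ns/day on n GPUs for parallel fraction p (Amdahl's Law)."""
    return base / ((1.0 - p) + p / n)


def sum_sq_error(p):
    """Sum of squared residuals between the model and the measurements."""
    return sum((amdahl(n, p) - perf) ** 2 for n, perf in data)


# Brute-force least squares: scan p from 0.000 to 0.999 in steps of 0.001.
candidates = [i / 1000 for i in range(1000)]
p_fit = min(candidates, key=sum_sq_error)
max_speedup = 1.0 / (1.0 - p_fit)  # limit as the GPU count grows

print(f"P = {p_fit:.3f}, asymptotic speedup = {max_speedup:.2f}")
```

With these five data points the scan should land close to the P = 0.70 quoted above. Swapping in the CPU-only table and its 1-core baseline would give the corresponding fit for the CPU scaling case.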

CPU Only Results

Part of the purpose of this testing is to look at the scaling performance of these high-end Xeon 8180 CPU's. NAMD actually scales nearly perfectly in parallel. However, the Xeon-Scalable CPU's reduce the core clock as the number of in-use cores increases. (And, there are separate clocks for core-only, AVX2 and the new AVX512 vector units!) On the Xeon 8180 CPU's the AVX512 clock changes from 3.5GHz for 1-2 cores down to 2.3GHz for 25-28 cores. It's that clock reduction that actually reduces the apparent parallel scaling with NAMD. Still, it is interesting to see the "real" performance scaling on these (amazingly good) processors. I added the AVX512 core clock speed in the last column of the table.

NAMD STMV Benchmark on Dual Xeon 8180 with from 1 to 56 total cores

Number of CPU cores   Simulation Nano-seconds Per Day   Performance Increase   % Efficiency   AVX512 Clock (GHz)
1                     0.0141                            1                      100%           3.5
2                     0.0275                            1.95                   97.6%          3.5
4                     0.0520                            3.69                   92.3%          3.5
8                     0.102                             7.20                   90.1%          3.3
16                    0.197                             14.0                   87.6%          3.2
24                    0.263                             18.7                   77.9%          3.1
32                    0.341                             24.2                   75.6%          2.8
40                    0.448                             31.8                   79.5%          2.6
48                    0.507                             36.0                   75.0%          2.4
56                    0.595                             42.2                   75.4%          2.3

A regression plot using a fit to Amdahl's Law, similar to how it was done for the GPU scaling case, gives the following,

NAMD CPU scaling

I hope you found this interesting. This is definitely the highest performance single node system I've ever tested!

Happy computing --dbk

Tags: Docker, NAMD, NVIDIA, Linux, NGC
Tugrul_512bit

Thank you for this informative benchmark. How efficiently does CPU add to 1 GPU to get close to value of (2.288 + 0.595) ns per day? I guess benchmarks were done with single precision setting. How would CPU compare to GPU when with double precision?

Posted on 2018-03-12 11:30:10
Donald Kinghorn

I keep forgetting to check comments ... sorry ...
Good questions! Jobs were run with default mixed precision for NAMD, single on the GPU. If I had tried to use double on the GPU side with those 1080Ti's it would have been a waste because it's badly crippled. (the new Titan V is not crippled!) The GPU's just add so much performance to NAMD it's hard not to use them. But, NAMD scales really well on CPU clusters too... I wish I could have had a little more time on the system! That would have been a pretty good system for finding the optimal "sweet spot" of CPU + GPU. I tested on a TitanV recently in this post
https://www.pugetsystems.co...
That system had a 16-core SkylakeX CPU which I'm guessing is around 70% of one of those crazy 8180's. That box with 1 Titan V did .480 day/ns with NAMD. That is not enough CPU really??? I don't know if there would be any improvement by adding a second Titan V or not to that. ... it may not be enough CPU to balance that GPU as it is.

I think I'll get my hands on another dual CPU box with more modest but still great Xeon's in it and a board that will do 4 GPU's ... and hopefully those GPU's will be Titan V's. I love those cards! The double precision is not crippled and it's really impressive. We had someone recently get a system from us for running jobs with LAMMPS in double precision on the TitanV ... I have not tested that myself but would like to.

Posted on 2018-03-19 17:07:33
MagicWax

In the GPU scaling test, were all CPU cores utilized? If not, then maybe having a CPU with a very high single threaded performance would be better if you have many GPUs. Maybe try this on a 7900X (possibly with all cores locked at the turbo clock, just to see if more clockspeed really helps)

Posted on 2018-03-14 17:39:50
Donald Kinghorn

... take a look at the post I mentioned in the comment above. That was using a SkylakeX 16-core and a TitanV. I feel that it was not quite enough CPU for the Titan V but would probably be near optimal with a 1080Ti.

There is a definite balance for optimal performance on a demanding code like NAMD. The new batch of hardware has a lot of quirks (like multiple clocks) so it's hard to guess what is really optimal. I love the new single socket Intel CPU's Skylake-X,W. I think a box with the 18-core X or W and a Titan V would be an incredible scientific workstation. I'm old enough to be completely blown away by performance ... A single CPU that will do around 1TFLOP Linpack double and 8+ TFLOP single and 4+ TFLOP double on the GPU! That is really amazing if you think about it. That was top 500 super computer range not long ago.

I hear you in general! I'd like to do a good round of testing with NAMD, LAMMPS and maybe GROMACS. To see what is optimal for a workstation class system. It would be good to look at double precision on the GPU too!

Posted on 2018-03-19 17:18:30
nishank93

Hi Donald, would you be able to try this benchmark (or the ATPase benchmark preferably, since its system size is closer to stuff we normally run) again with NAMD 2.13? The changelog on 2.13 states they have offloaded more tasks off the CPU and to the GPU, so we might see better scaling.

Posted on 2018-06-14 05:52:17
Donald Kinghorn

Hi, I used the STMV job since it was large enough to get some reasonable scaling. I did try ATPase but multi-GPU scaling was not very good. The GPU's are just so fast. However, you are helping to motivate me for some testing I was thinking about getting started on today! I want to look at performance pros and cons of using a dual socket Xeon-SP vs a single Xeon-W. I would add in GPU's until CPU performance became completely limiting. I am really impressed with the MD performance you can get with something like an 18-core i9 CPU together with a single NVIDIA 1080Ti. It would be interesting to see if I can double that performance with an appropriately configured dual Xeon. It would much more than double the cost though!

I don't have access to that (great) 8 x X16 dual Xeon system anymore but we are finalizing qualification on some dual Xeon motherboards and I was going to see if I can grab one for testing today.

I did just run 500 time steps of ATPase on my personal system which is a 14-core Xeon-W 2175 with a 1080Ti GPU. You may already know this level of performance,
---
Info: Benchmark time: 14 CPUs 0.0146536 s/step 0.169602 days/ns 997.723 MB memory
TIMING: 500 CPU: 7.66165, 0.01456/step Wall: 7.69987, 0.0145519/step, 0 hours remaining, 997.722656 MB of memory in use.
---
That's pretty good!
That's around what I saw with a dual Xeon E5 2687 v4 and 2 Titan X GPU's (or 2 1070's) in the following post
https://www.pugetsystems.co...

Posted on 2018-06-14 18:25:27
nishank93

Thank you for the quick response! We're seeing about a 1.6 to 2-fold performance increase with NAMD 2.13 vs 2.12 using a 8700K+GTX 1080 combination, but performance scaling drops off quickly after enabling the 5th and 6th cores.

We're simply trying to determine the number of CPU cores per GPU required in NAMD 2.13 for the kinds of jobs we run before purchasing something like a dual E5-2650 v4 system with 8 1080Ti GPUs.

Posted on 2018-06-15 05:42:50
Donald Kinghorn

That is really a nice speedup from 2.12 to 2.13 ... I'll have to check that out. What I ran was 2.12 from the NVIDIA NGC docker container.
[ I'm not going to get the dual for testing for a while, probably at least a week or so. ]

It's puzzling that you are seeing a drop off when going to 5,6 cores. I would expect it to help because the 1080 should easily outperform the CPU.

I'll do an experiment on my system using 2.13 with my 2175-W 14-core and 1080Ti. It may be a couple of days before I get to it but I am curious ... I'll put the results here in the comments but it may be a few days ...

Posted on 2018-06-15 21:21:31
nishank93

We tried the same set of tests I described, except with 32 GB RAM this time (vs 16 GB before). We no longer see the dropoff after enabling the 5th and 6th cores, and the ns/day figure also improves with 32 GB RAM.
Looking forward to your 14-core+1080Ti benchmark! :)

Posted on 2018-06-21 04:03:05
Hypersphere

I was curious to see how my now rather old system (i7-5960X OC @ 4.4 GHz + 1 x GTX 1080Ti) would perform with the STMV system in explicit solvent and NVT conditions using YASARA 18.4.24, which employs OpenCL for GPU acceleration rather than CUDA. It got 2.04 ns/day. This would seem to compare favorably with the 2.288 ns/day achieved with NAMD on the dual Xeon 8180 and 1x GTX 1080Ti. Now I am in the process of configuring a new system with an i9-9920X CPU and 2x RTX 2080 Ti GPUs. I am eager to see how this new configuration will perform in MD simulations with both NAMD and YASARA.

Posted on 2019-04-20 19:27:21
Donald Kinghorn

That was a pretty nice result on your old system with YASARA. The 9920X is nice and you will be happy to have the 2080Ti's but those GPU's are really fast. They will likely be waiting on the CPU ... At least in NAMD not sure about YASARA ... be sure to let me (us) know... Thanks!

Posted on 2019-04-22 17:13:56