Read this article at https://www.pugetsystems.com/guides/1958
Dr Donald Kinghorn (Scientific Computing Advisor)

RTX3070 (and RTX3090 refresh) TensorFlow and NAMD Performance on Linux (Preliminary)

Written on October 29, 2020 by Dr Donald Kinghorn

Introduction

This post is a results refresh that adds "preliminary" findings for the new RTX3070 GPU. Results from the earlier RTX3090 post are included, with a few of the jobs rerun.

RTX3090 TensorFlow and NAMD Performance on Linux (Preliminary)

My colleagues have had mostly good results on various Windows applications with the RTX3070, and I believe it is also a very good gaming card. My testing is concerned with compute performance (ML/AI and molecular modeling).

The RTX3070 has only 8GB of memory, making it less suitable for ML/AI and other compute work. However, at $500 I was hopeful that it would be a nice GPU for entry-level compute tasks in a modest workstation build. Based on my testing so far, I would recommend saving up for an RTX3080 or RTX3090 instead. (This recommendation may change after new drivers and CUDA updates are released.)

This round of testing had far fewer problems than the previous one. There are new drivers now and updates to the NVIDIA NGC containers I've been using.

I used my favorite container platform, NVIDIA Enroot. This is a wonderful user-space tool for running Docker (and other) containers in a user-owned "sandbox" environment, and I plan to write about it soon.

There were no significant job run problems! The NGC containers tagged 20.10 for TF1 and TF2 are working correctly.

  • TensorFlow 2 is now running properly. NGC container tagged 20.10-tf2-py3 is working (but not tested in this post)
  • The ptxas assembler is running correctly.

I used the latest containers from NVIDIA NGC for TensorFlow 1.15.

Test system

Hardware

  • Intel Xeon 3265W: 24-cores (4.4/3.4 GHz)
  • Motherboard: Asus PRO WS C621-64L SAGE/10G (Intel C621-64L EATX)
  • Memory: 6x REG ECC DDR4-2933 32GB (192GB total)
  • NVIDIA RTX3070 and RTX3090 (plus old results for RTX3080, RTX Titan and RTX 2080Ti)

Software

  • Ubuntu 20.04 Linux
  • Enroot 3.3.1
  • NVIDIA Driver Version: 455.38
  • nvidia-container-toolkit 1.3.0-1
  • NVIDIA NGC containers
  • nvcr.io/nvidia/tensorflow:20.10-tf1-py3
  • nvcr.io/hpc/namd:2.13-singlenode
  • nvcr.io/nvidia/cuda:11.0-runtime-ubuntu20.04 (with the addition of OpenMPI 4 for HPCG)

Test Jobs

  • TensorFlow-1.15: ResNet50 v1, fp32 and fp16
  • NAMD-2.13: apoa1, stmv
  • HPCG (High Performance Conjugate Gradient) "HPCG 3.1 Binary for NVIDIA GPUs Including Ampere based on CUDA 11"

Example Command Lines

  • docker run --gpus all --rm -it -v $HOME:/projects nvcr.io/nvidia/tensorflow:20.10-tf1-py3
  • docker run --gpus all --rm -it -v $HOME:/projects nvcr.io/hpc/namd:2.13-singlenode
  • python nvidia-examples/cnn/resnet.py --layers=50 --batch_size=32 --precision=fp32
  • python nvidia-examples/cnn/resnet.py --layers=50 --batch_size=64 --precision=fp16
  • namd2 +p24 +setcpuaffinity +idlepoll +devices 0 apoa1.namd
  • OMP_NUM_THREADS=24 ./xhpcg-3.1_cuda-11_ompi-4.0_sm_60_sm70_sm80

Note: I listed docker command lines above for reference. I actually ran the containers with Enroot.

Job run info

  • The batch size used for TensorFlow 1.15 ResNet50 v1 was 32 at fp32 and 64 at fp16 for the RTX3070. The RTX3090 used 192 for both fp32 and fp16.
  • The HPCG benchmark used problem dimensions 128x128x128 on the RTX3070 (reduced to fit its 8GB of memory). The RTX3090 run used 256x256x256.

HPCG output for RTX3070

1x1x1 process grid
128x128x128 local domain
SpMV  =   64.2 GF ( 404.3 GB/s Effective)   64.2 GF_per ( 404.3 GB/s Effective)
SymGS =   77.5 GF ( 598.2 GB/s Effective)   77.5 GF_per ( 598.2 GB/s Effective)
total =   73.3 GF ( 555.9 GB/s Effective)   73.3 GF_per ( 555.9 GB/s Effective)
final =   72.3 GF ( 548.7 GB/s Effective)   72.3 GF_per ( 548.7 GB/s Effective)

HPCG output for RTX3090

1x1x1 process grid
256x256x256 local domain
SpMV  =  132.1 GF ( 832.1 GB/s Effective)  132.1 GF_per ( 832.1 GB/s Effective)
SymGS =  162.5 GF (1254.3 GB/s Effective)  162.5 GF_per (1254.3 GB/s Effective)
total =  153.8 GF (1166.5 GB/s Effective)  153.8 GF_per (1166.5 GB/s Effective)
final =  145.9 GF (1106.4 GB/s Effective)  145.9 GF_per (1106.4 GB/s Effective)
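The result lines above are regular enough to pull into a script when collecting numbers from several runs. The following is a minimal Python sketch and was not part of the original testing. The helper name parse_hpcg and the regular expression are illustrative assumptions based only on the output format shown above, and other HPCG builds may print a different layout.

```python
import re

# Sample text copied from the RTX3070 run shown above.
HPCG_OUTPUT = """\
1x1x1 process grid
128x128x128 local domain
SpMV  =   64.2 GF ( 404.3 GB/s Effective)   64.2 GF_per ( 404.3 GB/s Effective)
SymGS =   77.5 GF ( 598.2 GB/s Effective)   77.5 GF_per ( 598.2 GB/s Effective)
total =   73.3 GF ( 555.9 GB/s Effective)   73.3 GF_per ( 555.9 GB/s Effective)
final =   72.3 GF ( 548.7 GB/s Effective)   72.3 GF_per ( 548.7 GB/s Effective)
"""

# Matches the start of a result line, e.g. "final =   72.3 GF ( 548.7 GB/s"
LINE_RE = re.compile(r"^(\w+)\s*=\s*([\d.]+)\s+GF\s+\(\s*([\d.]+)\s+GB/s")


def parse_hpcg(text):
    """Return {label: (gflops, effective_gb_per_s)} for each result line.

    Hypothetical helper: lines that do not look like results are skipped.
    """
    results = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            label, gflops, bandwidth = match.groups()
            results[label] = (float(gflops), float(bandwidth))
    return results


rtx3070 = parse_hpcg(HPCG_OUTPUT)
gflops, bandwidth = rtx3070["final"]
print(f"final: {gflops} GFLOPS, {bandwidth} GB/s effective")
```

The "final" entry is the figure reported as the HPCG result for each GPU.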

Results

These results were run on the system, software and GPUs listed above.

Benchmark Job                   RTX3090           RTX3080 (old)     RTX Titan (old)   RTX 2080Ti (old)  RTX3070
TensorFlow 1.15, ResNet50 FP32  577 images/sec    462 images/sec    373 images/sec    343 images/sec    258 images/sec
TensorFlow 1.15, ResNet50 FP16  1311 images/sec   1023 images/sec   1082 images/sec   932 images/sec    254 images/sec
NAMD 2.13, Apoa1 (old)          0.0264 day/ns     0.0285 day/ns     0.0306 day/ns     0.0315 day/ns     0.0352 day/ns
                                (37.9 ns/day)     (35.1 ns/day)     (32.7 ns/day)     (31.7 ns/day)     (28.4 ns/day)
NAMD 2.13, STMV (old)           0.3398 day/ns     0.3400 day/ns     0.3496 day/ns     0.3528 day/ns     0.355 day/ns
                                (2.94 ns/day)     (2.94 ns/day)     (2.86 ns/day)     (2.83 ns/day)     (2.82 ns/day)
HPCG Benchmark 3.1              145.9 GFLOPS      119.3 GFLOPS      Not run           93.4 GFLOPS       72.3 GFLOPS

Note: (old) means that the results were not updated from those presented in the first RTX3090 performance post. The RTX3090 TensorFlow results are updated using the new driver and updated NGC TF1 container. The HPCG and NAMD results for the RTX3090 are from the older post (I did recheck them but there was little change).
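NAMD reports day/ns (lower is better), and the ns/day figures in parentheses are simply the reciprocal. As a small Python sketch of that conversion, using the day/ns values from the results above (the dictionary and function names are illustrative, not from the benchmark tooling):

```python
# day/ns values copied from the NAMD results above (lower is better).
apoa1_day_per_ns = {
    "RTX3090": 0.0264,
    "RTX3080": 0.0285,
    "RTX Titan": 0.0306,
    "RTX 2080Ti": 0.0315,
    "RTX3070": 0.0352,
}
stmv_day_per_ns = {
    "RTX3090": 0.3398,
    "RTX3080": 0.3400,
    "RTX Titan": 0.3496,
    "RTX 2080Ti": 0.3528,
    "RTX3070": 0.355,
}


def ns_per_day(day_per_ns):
    """Convert NAMD's day/ns figure to ns/day (higher is better)."""
    return 1.0 / day_per_ns


for name, table in (("apoa1", apoa1_day_per_ns), ("stmv", stmv_day_per_ns)):
    for gpu, value in table.items():
        print(f"{name:6s} {gpu:12s} {ns_per_day(value):6.2f} ns/day")
```

The same reciprocal makes it easy to compare against results that other tools report directly in ns/day.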

Performance Charts

Results from GPU testing prior to the release of the RTX3080 are not included in the charts. They are not strictly comparable because of improvements in CUDA and TensorFlow for the RTX20 series GPUs.


TensorFlow 1.15 (CUDA11) ResNet50 benchmark. NGC container nvcr.io/nvidia/tensorflow:20.10-tf1-py3

TensorFlow ResNet50 FP32

The FP32 results for the RTX3070 show performance on par with an older RTX 2080 (not tested).

TensorFlow ResNet50 FP16

The fp16/Tensor Core performance is very poor for the RTX3070. My guess is that this is an issue with the early release driver. I will retest at a later time when I do a more complete GPU performance roundup.

Note that the fp16 performance for the RTX3090 is significantly improved with the new container build and driver update. The previous post for the RTX3090 had 1163 img/sec.
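To put numbers on the fp16 observations, the ratio of fp16 to fp32 throughput can be computed from the ResNet50 results above. This is a minimal Python sketch using only the images/sec figures reported in this post. The names are illustrative, and the batch sizes differ between GPUs, so the ratios are a rough guide rather than a controlled comparison.

```python
# ResNet50 images/sec copied from the results above.
images_per_sec = {
    "RTX3090": {"fp32": 577, "fp16": 1311},
    "RTX3080": {"fp32": 462, "fp16": 1023},
    "RTX Titan": {"fp32": 373, "fp16": 1082},
    "RTX 2080Ti": {"fp32": 343, "fp16": 932},
    "RTX3070": {"fp32": 258, "fp16": 254},
}


def fp16_speedup(gpu):
    """Ratio of fp16 to fp32 throughput for one GPU (1.0 means no gain)."""
    result = images_per_sec[gpu]
    return result["fp16"] / result["fp32"]


for gpu in images_per_sec:
    print(f"{gpu:12s} fp16/fp32 = {fp16_speedup(gpu):.2f}x")

# Gain for the RTX3090 fp16 result relative to the previous post (1163 img/sec).
previous_rtx3090_fp16 = 1163
refresh_gain_percent = (images_per_sec["RTX3090"]["fp16"] / previous_rtx3090_fp16 - 1) * 100
print(f"RTX3090 fp16 refresh gain: {refresh_gain_percent:.1f}%")
```

A ratio below 1.0 for the RTX3070 means fp16 ran slightly slower than fp32, while the other cards more than double their throughput.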


NAMD 2.13 (CUDA11) apoa1 and stmv benchmarks. NGC container nvcr.io/hpc/namd:2.13-singlenode

NAMD Apoa1

NAMD STMV

These molecular dynamics simulation tests with NAMD are almost surely CPU bound. There needs to be a balance between CPU and GPU. These GPUs are so high performance that even the excellent 24-core Xeon 3265W is probably not enough. I will do testing at a later time using AMD Threadripper and EPYC high core count platforms.


HPCG 3.1 (xhpcg-3.1_cuda-11_ompi-4.0_sm_60_sm70_sm80) nvcr.io/nvidia/cuda:11.0-runtime-ubuntu20.04 (with the addition of OpenMPI 4)

HPCG

HPCG is an interesting benchmark as it is significantly memory bound. The high performance memory on the GPUs has a large performance impact. The Xeon 3265W yields 14.8 GFLOPS. The RTX3090 is nearly 10 times that performance! The performance of the RTX3070 is limited by its memory size and lower memory bandwidth.
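The "nearly 10 times" figure follows directly from the numbers in this post. A short Python check, using only the GFLOPS values reported above (variable names are illustrative):

```python
# HPCG results in GFLOPS, copied from this post.
cpu_gflops = 14.8  # Xeon 3265W
gpu_gflops = {
    "RTX3090": 145.9,
    "RTX3080": 119.3,
    "RTX 2080Ti": 93.4,
    "RTX3070": 72.3,
}

# Speedup of each GPU over the CPU result.
speedup_vs_cpu = {gpu: value / cpu_gflops for gpu, value in gpu_gflops.items()}

for gpu, speedup in speedup_vs_cpu.items():
    print(f"{gpu:12s} {speedup:4.1f}x the Xeon 3265W")
```

Keep in mind that the RTX3070 ran a smaller problem size (128x128x128) than the RTX3090 (256x256x256), so the two GPU results are not from identical runs.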

Conclusions

The new RTX3070 GPU does not show compelling compute performance in this testing. That may change when better updates for the GA104 chip are available. For now I would recommend the RTX3080 or, better, the RTX3090 for compute rather than the RTX3070.

I can tell you that some of the nice features on the Ampere Tesla GPUs are not available on the GeForce RTX30 series. There is no MIG (Multi-instance GPU) support, and the double precision floating point performance is very poor compared to the Tesla A100 (I compiled and ran nbody as a quick check). There is also no P2P support on the PCIe bus. However, for the many applications where fp32 and fp16 are appropriate, these new GeForce RTX30 GPUs look like they will make for very good and cost-effective compute accelerators.

Happy computing! --dbk @dbkinghorn


Tags: NAMD, NVIDIA, TensorFlow, RTX30 series, Machine Learning, Molecular Dynamics
Ampere

https://developer.nvidia.co...
https://docs.nvidia.com/cud...

Last updated October 29, 2020
CUDA 11.1 Update 1 is a minor update that is binary compatible with CUDA 11.1. This release will work with all versions of the R450 NVIDIA driver.

Posted on 2020-10-30 13:03:49
Xep

Hi Dr.

Very good testing as always. Although GA103 dont exist, I think you meant GA104.

I am curious to see if you can take a 3090, RTX Titan and A100 instance through a good batch of ML benchmarks specifically NLP models at FP16. I think it might really interesting to see how 3090 does there as it have driver capped tensor core performance to half its original performance whereas the older RTX Titan does not.

Posted on 2020-10-30 16:59:26
Steve Phillips

I’m getting 30% slower results using the tf 1.15 container than running without docker. I’m not using enroot, but otherwise looks identical. I’m testing with a gtx 1070 though, in readiness for if I manage to get a 3000 series card. Any ideas as to which I’m getting this performance hit are most welcome, as I’m stumped

Posted on 2020-11-02 20:58:58
hoohoo

Thanks, Professor!

Posted on 2020-11-04 00:21:10
Angga Febrian Sahid

Could you please tell me how to install tensorflow 1.15 alongside RTX 3080.
since i found so many people posting about tf 1.15 could not used with RTX 3080

Thank you

Posted on 2020-11-20 02:04:33
Hypersphere

Donald,

Thanks for these results. I am especially interested in the STMV MD benchmarks. I am in the process of configuring and testing an AMD Threadripper 3970X water-cooled CPU with an RTX 3090 GPU. With YASARA-Structure 20.10.04 on Linux Mint 20, I was able to get up to 7 ns/day with 32-48 out of 64 threads and the GPU. YASARA uses OpenCL rather than CUDA for GPU acceleration. You might want to give YASARA-Structure a try. It is a commercial program, but the prices are quite reasonable and the licensing policies are very flexible.

I've been looking for STMV benchmark results for somewhat comparable systems, i.e., single or perhaps double processor CPU and a single GPU card. Most results fall in the range 0.5 to 7.5 ns/day. However, results posted for AMBER20 appear to be as high as 31 ns/day. I am puzzled by the AMBER results and wonder if I am missing something about how they ran the benchmark.

The AMBER results may be found at the following link:

https://ambermd.org/GPUPerf...

Best wishes,

--Hypersphere

Posted on 2020-11-30 19:52:15