RTX3070 (and RTX3090 refresh) TensorFlow and NAMD Performance on Linux (Preliminary)

Introduction

This post is a results refresh to include "preliminary" findings for the new RTX3070 GPU. Results from the RTX3090 post will be included, with a few job refreshes.

RTX3090 TensorFlow and NAMD Performance on Linux (Preliminary)

My colleagues have had mostly good results on various Windows applications with the RTX3070, and I believe it is also a very good gaming card. My testing is concerned with compute performance (ML/AI and molecular modeling).

The RTX3070 has only 8GB of memory, making it less suitable for ML/AI and other computing work. However, at $500 I was hopeful that it would be a nice GPU for entry level compute tasks in a modest workstation build. Based on my testing so far, I would recommend saving up for an RTX3080 or RTX3090. (This recommendation may change after new drivers and CUDA updates are released.)

This round of testing had far fewer problems than the previous round. There are new drivers now, and updates to the NVIDIA NGC containers I've been using.

I used my favorite container platform, NVIDIA Enroot. This is a wonderful user space tool for running docker (and other) containers in a user owned "sandbox" environment, and I plan to write about it soon.

There were no significant job run problems! The NGC containers tagged 20.10 for TF1 and TF2 are working correctly.

TensorFlow 2 is now running properly. The NGC container tagged 20.10-tf2-py3 is working (but was not tested in this post).
The ptxas assembler is running correctly.

I used the latest containers from NVIDIA NGC for TensorFlow 1.15.

IMPORTANT NOTICE

The RTX3070 is a new variation on the Ampere architecture, i.e. GA104. The RTX3080 and RTX3090 are GA102. A new NVIDIA Linux driver was released on launch day (Thu. Oct 29, 2020). It does properly recognize the RTX3070, but it is still an early "short term" release. I believe this driver (and CUDA version) is not working correctly for fp16/Tensor cores on this GPU!

Test system

Hardware

Intel Xeon 3265W: 24-cores (4.4/3.4 GHz)
Motherboard: Asus PRO WS C621-64L SAGE/10G (Intel C621-64L EATX)
Memory: 6x REG ECC DDR4-2933 32GB (192GB total)
NVIDIA RTX3070 and RTX3090 (old results for RTX3080, RTX Titan and RTX 2080Ti)

Software

Ubuntu 20.04 Linux
Enroot 3.3.1
NVIDIA Driver Version: 455.38
nvidia-container-toolkit 1.3.0-1
NVIDIA NGC containers
nvcr.io/nvidia/tensorflow:20.10-tf1-py3
nvcr.io/hpc/namd:2.13-singlenode
nvcr.io/nvidia/cuda:11.0-runtime-ubuntu20.04 (with the addition of OpenMPI 4 for HPCG)

Test Jobs

TensorFlow-1.15: ResNet50 v1, fp32 and fp16
NAMD-2.13: apoa1, stmv
HPCG (High Performance Conjugate Gradient) "HPCG 3.1 Binary for NVIDIA GPUs Including Ampere based on CUDA 11"

Example Command Lines

docker run --gpus all --rm -it -v $HOME:/projects nvcr.io/nvidia/tensorflow:20.10-tf1-py3
docker run --gpus all --rm -it -v $HOME:/projects nvcr.io/hpc/namd:2.13-singlenode
python nvidia-examples/cnn/resnet.py --layers=50 --batch_size=32 --precision=fp32
python nvidia-examples/cnn/resnet.py --layers=50 --batch_size=64 --precision=fp16
namd2 +p24 +setcpuaffinity +idlepoll +devices 0 apoa1.namd
OMP_NUM_THREADS=24 ./xhpcg-3.1_cuda-11_ompi-4.0_sm_60_sm70_sm80

Note: I listed docker command lines above for reference. I actually ran the containers with Enroot.

Job run info

The batch size used for TensorFlow 1.15 ResNet50 v1 on the RTX3070 was 32 at fp32 and 64 at fp16. The RTX3090 used 192 for both fp32 and fp16.
The HPCG benchmark used problem dimensions 128x128x128 on the RTX3070 (reduced from the 256x256x256 used on the RTX3090 to fit in its 8GB of memory).

HPCG output for RTX3070

1x1x1 process grid
128x128x128 local domain
SpMV  =   64.2 GF ( 404.3 GB/s Effective)   64.2 GF_per ( 404.3 GB/s Effective)
SymGS =   77.5 GF ( 598.2 GB/s Effective)   77.5 GF_per ( 598.2 GB/s Effective)
total =   73.3 GF ( 555.9 GB/s Effective)   73.3 GF_per ( 555.9 GB/s Effective)
final =   72.3 GF ( 548.7 GB/s Effective)   72.3 GF_per ( 548.7 GB/s Effective)

HPCG output for RTX3090

1x1x1 process grid
256x256x256 local domain
SpMV  =  132.1 GF ( 832.1 GB/s Effective)  132.1 GF_per ( 832.1 GB/s Effective)
SymGS =  162.5 GF (1254.3 GB/s Effective)  162.5 GF_per (1254.3 GB/s Effective)
total =  153.8 GF (1166.5 GB/s Effective)  153.8 GF_per (1166.5 GB/s Effective)
final =  145.9 GF (1106.4 GB/s Effective)  145.9 GF_per (1106.4 GB/s Effective)

Results

These results were run on the system, software and GPUs listed above.

Benchmark Job	RTX3090	RTX3080 (old)	RTX Titan (old)	RTX 2080Ti (old)	RTX3070
TensorFlow 1.15, ResNet50 FP32	577 images/sec	462 images/sec	373 images/sec	343 images/sec	258 images/sec
TensorFlow 1.15, ResNet50 FP16	1311 images/sec	1023 images/sec	1082 images/sec	932 images/sec	254 images/sec
NAMD 2.13, Apoa1 (old)	0.0264 day/ns (37.9 ns/day)	0.0285 day/ns (35.1 ns/day)	0.0306 day/ns (32.7 ns/day)	0.0315 day/ns (31.7 ns/day)	0.0352 day/ns (28.4 ns/day)
NAMD 2.13, STMV (old)	0.3398 day/ns (2.94 ns/day)	0.3400 day/ns (2.94 ns/day)	0.3496 day/ns (2.86 ns/day)	0.3528 day/ns (2.83 ns/day)	0.355 day/ns (2.82 ns/day)
HPCG Benchmark 3.1	145.9 GFLOPS	119.3 GFLOPS	Not run	93.4 GFLOPS	72.3 GFLOPS

Note: (old) means that the results were not updated from those presented in the first RTX3090 performance post. The RTX3090 results are updated using the new driver and updated NGC TF1 container. The HPCG and NAMD results for the 3090 are from the older post (I did recheck them but there was little change).
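NAMD reports timings as day/ns, where lower is better, and the table also gives the reciprocal ns/day. The short sketch below reproduces that conversion from the table values and expresses each GPU relative to the RTX3090. The dictionary layout and function names are my own, chosen for illustration only.

```python
# NAMD 2.13 timings in day/ns, copied from the results table (lower is faster).
NAMD_DAY_PER_NS = {
    "apoa1": {
        "RTX3090": 0.0264,
        "RTX3080": 0.0285,
        "RTX Titan": 0.0306,
        "RTX 2080Ti": 0.0315,
        "RTX3070": 0.0352,
    },
    "stmv": {
        "RTX3090": 0.3398,
        "RTX3080": 0.3400,
        "RTX Titan": 0.3496,
        "RTX 2080Ti": 0.3528,
        "RTX3070": 0.355,
    },
}


def ns_per_day(day_per_ns):
    """Convert a day/ns timing into simulated nanoseconds per day."""
    return 1.0 / day_per_ns


def relative_throughput(timings, baseline):
    """Throughput of each GPU as a fraction of the baseline GPU.

    Timings are day/ns, so a smaller value is faster and the ratio is
    baseline / value rather than value / baseline.
    """
    base = timings[baseline]
    return {gpu: base / value for gpu, value in timings.items()}


for job, timings in NAMD_DAY_PER_NS.items():
    relative = relative_throughput(timings, "RTX3090")
    for gpu, value in timings.items():
        print(f"{job:6s} {gpu:11s} {ns_per_day(value):6.2f} ns/day "
              f"({relative[gpu]:.0%} of RTX3090)")
```

On apoa1 the RTX3070 comes out at about three quarters of the RTX3090 throughput, while on stmv the two are within a few percent of each other.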

Performance Charts

Results from GPU testing prior to the release of the RTX3080 are not included in the charts. They are not strictly comparable because of improvements in CUDA and TensorFlow for the RTX20 series GPUs.

TensorFlow 1.15 (CUDA11) ResNet50 benchmark. NGC container nvcr.io/nvidia/tensorflow:20.10-tf1-py3

The FP32 results for the RTX3070 show performance on par with an older RTX 2080 (not tested).

The fp16/Tensor core performance is very poor for the RTX3070. I'm assuming that this is an issue with the early release driver, but I have not confirmed that. I will retest at a later time when I do a more complete GPU performance roundup.

Note that the fp16 performance for the RTX3090 is significantly improved with the new container build and driver update. The previous post for the RTX3090 had 1163 img/sec.
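One way to see the fp16 problem on the RTX3070 is to look at the ratio of fp16 to fp32 throughput for each GPU. The sketch below computes that ratio from the table values, plus the percentage change in the RTX3090 fp16 result since the previous post. The names are illustrative, and keep in mind that batch sizes differed between GPUs (32/64 on the RTX3070, 192 on the RTX3090), so this is only a rough comparison.

```python
# ResNet50 training throughput in images/sec, copied from the results table.
RESNET50_IMAGES_PER_SEC = {
    "RTX3090": {"fp32": 577, "fp16": 1311},
    "RTX3080": {"fp32": 462, "fp16": 1023},
    "RTX Titan": {"fp32": 373, "fp16": 1082},
    "RTX 2080Ti": {"fp32": 343, "fp16": 932},
    "RTX3070": {"fp32": 258, "fp16": 254},
}


def fp16_speedup(result):
    """Ratio of fp16 to fp32 throughput; above 1.0 means fp16 is faster."""
    return result["fp16"] / result["fp32"]


def percent_change(old, new):
    """Percentage change going from an old result to a new one."""
    return 100.0 * (new - old) / old


for gpu, result in RESNET50_IMAGES_PER_SEC.items():
    print(f"{gpu:11s} fp16/fp32 = {fp16_speedup(result):.2f}x")

# RTX3090 fp16: 1163 img/sec in the previous post, 1311 img/sec in this one.
print(f"RTX3090 fp16 change: {percent_change(1163, 1311):+.1f}%")
```

Every GPU except the RTX3070 gets better than a 2x gain from fp16, while the RTX3070 ratio is slightly below 1.0. That is the behavior that makes me suspect the Tensor cores are not being used properly.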

NAMD 2.13 (CUDA11) apoa1 and stmv benchmarks. NGC container nvcr.io/hpc/namd:2.13-singlenode

These molecular dynamics simulation tests with NAMD are almost surely CPU bound. There needs to be a balance between CPU and GPU. These GPUs are so high performance that even the excellent 24-core Xeon 3265W is probably not enough. I will do testing at a later time using AMD Threadripper and EPYC high core count platforms.

HPCG 3.1 (xhpcg-3.1_cuda-11_ompi-4.0_sm_60_sm70_sm80) nvcr.io/nvidia/cuda:11.0-runtime-ubuntu20.04 (with the addition of OpenMPI 4)

HPCG is an interesting benchmark as it is significantly memory bound. The high performance memory on the GPUs has a large performance impact. The Xeon 3265W yields 14.8 GFLOPS. The RTX3090 is nearly 10 times that performance! The performance of the RTX3070 is limited by its memory size and lower memory bandwidth.
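The "nearly 10 times" figure can be checked directly from the numbers in this post. The sketch below computes each GPU's HPCG result relative to the Xeon 3265W, and compares the RTX3070 to the RTX3090 on both GFLOPS and the effective bandwidth from the "final" lines of the HPCG output. The names are illustrative, and note that the two GPUs ran different local domain sizes (128x128x128 vs 256x256x256), so the comparison is approximate.

```python
# HPCG 3.1 results in GFLOPS, copied from the text and results table.
HPCG_GFLOPS = {
    "Xeon 3265W (CPU)": 14.8,
    "RTX3090": 145.9,
    "RTX3080": 119.3,
    "RTX 2080Ti": 93.4,
    "RTX3070": 72.3,
}

# Effective bandwidth in GB/s from the "final" lines of the HPCG output.
HPCG_FINAL_GB_PER_SEC = {"RTX3090": 1106.4, "RTX3070": 548.7}


def speedup_over(results, baseline):
    """Each result divided by the baseline result (higher is faster)."""
    base = results[baseline]
    return {name: value / base for name, value in results.items()}


vs_cpu = speedup_over(HPCG_GFLOPS, "Xeon 3265W (CPU)")
for name, ratio in vs_cpu.items():
    print(f"{name:17s} {HPCG_GFLOPS[name]:6.1f} GFLOPS  {ratio:5.2f}x CPU")

gflops_ratio = HPCG_GFLOPS["RTX3070"] / HPCG_GFLOPS["RTX3090"]
bandwidth_ratio = HPCG_FINAL_GB_PER_SEC["RTX3070"] / HPCG_FINAL_GB_PER_SEC["RTX3090"]
print(f"RTX3070 / RTX3090: GFLOPS {gflops_ratio:.2f}, "
      f"effective bandwidth {bandwidth_ratio:.2f}")
```

The GFLOPS ratio and the effective bandwidth ratio between the two cards are almost identical, at roughly one half, which is what you would expect from a memory bound benchmark.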

Conclusions

The new RTX3070 GPU does not show compelling compute performance in this testing. That may change when better driver and CUDA updates for the GA104 chip are available. For now I would recommend an RTX3080 or, better, the RTX3090 for compute rather than the RTX3070.

I can tell you that some of the nice features on the Ampere Tesla GPUs are not available on the GeForce RTX30 series. There is no MIG (Multi-Instance GPU) support, and the double precision floating point performance is very poor compared to the Tesla A100 (I compiled and ran nbody as a quick check). There is also no P2P support on the PCIe bus. However, for the many applications where fp32 and fp16 are appropriate, these new GeForce RTX30 GPUs look like they will make for very good and cost effective compute accelerators.

Happy computing! –dbk @dbkinghorn