Dr Donald Kinghorn (Scientific Computing Advisor)

RTX3090 TensorFlow, NAMD and HPCG Performance on Linux (Preliminary)

Written on September 24, 2020 by Dr Donald Kinghorn

Introduction

The second new NVIDIA RTX30 series card, the GeForce RTX3090, has been released.

The RTX3090 is loaded with 24GB of memory, making it a good replacement for the RTX Titan... at significantly less cost! The performance for Machine Learning and Molecular Dynamics on the RTX3090 is quite good, as expected.

This post is a follow-on to last week's post on the RTX3080:

RTX3080 TensorFlow and NAMD Performance on Linux (Preliminary)

Testing with the RTX3090 went more smoothly than with the RTX3080, which had been uncomfortably rushed and problematic.

I was able to use my favorite container platform, NVIDIA Enroot. This is a wonderful user-space tool for running docker (and other) containers in a user-owned "sandbox" environment. Last week I had some difficulties that were related to an incomplete installation of the driver components. Expect to see a series of posts soon introducing and describing usage of Enroot!

The HPCG (High Performance Conjugate Gradient) benchmark was added for this testing.

There were the same failures with the RTX3090 as with the RTX3080:

  • TensorFlow 2 failed to run properly with a fatal error in BLAS calls
  • My usual LSTM benchmark failed with mysterious memory allocation errors
  • The ptxas assembler failed to run. This left PTX compilation to the driver, which caused slow start-up times for TensorFlow (a few minutes). See the output below:
2020-09-22 11:42:03.984823: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312]
Internal: ptxas exited with non-zero error code 65280, output: ptxas fatal   : Value 'sm_86' is not defined for option 'gpu-name'

Relying on driver to perform ptx compilation. This message will be only logged once.

The reference to "sm_86" is referring to the "compute level", 8.6, for the GA102 chip. The Ampere GA100 chip has the code "8.0" i.e. sm_80.

I used containers from NVIDIA NGC for TensorFlow 1.15, NAMD 2.13 and CUDA for HPCG. All of these applications were built with CUDA 11.

The current CUDA 11.0 does not have full support for the GA102 chip used in the RTX3090 and RTX3080 (sm_86).

The results in this post are not optimal for the RTX30 series. These are preliminary results that will likely improve with updates to CUDA and the driver.
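A quick way to confirm this lack of support is to try compiling a trivial CUDA file for sm_86 with the CUDA 11.0 toolkit. This is just a sketch, assuming nvcc from CUDA 11.0 is on the PATH:

echo 'int main(){return 0;}' > check.cu
nvcc -arch=sm_86 -c check.cu   # on CUDA 11.0 this fails with a "'sm_86' is not defined" error
nvcc -arch=sm_80 -c check.cu   # succeeds, compute level 8.0 (GA100) is supported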

Test system

Hardware

  • Intel Xeon 3265W: 24 cores (4.4/3.4 GHz)
  • Motherboard: Asus PRO WS C621-64L SAGE/10G (Intel C621-64L EATX)
  • Memory: 6x REG ECC DDR4-2933 32GB (192GB total)
  • GPUs: NVIDIA RTX3090, RTX3080, RTX Titan, and RTX2080Ti

Software

  • Ubuntu 20.04 Linux
  • Enroot 3.3.1
  • NVIDIA Driver Version: 455.23.04
  • nvidia-container-toolkit 1.3.0-1
  • NVIDIA NGC containers
  • nvcr.io/nvidia/tensorflow:20.08-tf1-py3
  • nvcr.io/hpc/namd:2.13-singlenode
  • nvcr.io/nvidia/cuda:11.0-runtime-ubuntu20.04 (with the addition of OpenMPI 4 for HPCG)

Test Jobs

  • TensorFlow-1.15: ResNet50 v1, fp32 and fp16
  • NAMD-2.13: apoa1, stmv
  • HPCG (High Performance Conjugate Gradient) "HPCG 3.1 Binary for NVIDIA GPUs Including Ampere based on CUDA 11"

Example Command Lines

  • docker run --gpus all --rm -it -v $HOME:/projects nvcr.io/nvidia/tensorflow:20.08-tf1-py3
  • docker run --gpus all --rm -it -v $HOME:/projects nvcr.io/hpc/namd:2.13-singlenode
  • python nvidia-examples/cnn/resnet.py --layers=50 --batch_size=96 --precision=fp32
  • python nvidia-examples/cnn/resnet.py --layers=50 --batch_size=192 --precision=fp16
  • namd2 +p24 +setcpuaffinity +idlepoll +devices 0 apoa1.namd
  • OMP_NUM_THREADS=24 ./xhpcg-3.1_cuda-11_ompi-4.0_sm_60_sm70_sm80

Note: I listed docker command lines above for reference. I actually ran the containers with enroot; a rough enroot equivalent is sketched below.
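This is a sketch of the enroot workflow for the TensorFlow container, assuming the .sqsh file name shown is what enroot import generates (it may differ on your system):

enroot import docker://nvcr.io#nvidia/tensorflow:20.08-tf1-py3
enroot create --name tf1 nvidia+tensorflow+20.08-tf1-py3.sqsh
enroot start --rw --mount $HOME:/projects tf1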

Job run info

  • The batch size used for TensorFlow 1.15 ResNet50 v1 was 96 at fp32 and 192 at fp16 for all GPUs except the RTX3090, which used 192 for both fp32 and fp16 (using a batch_size of 384 gave worse results!)
  • The HPCG benchmark used defaults with the problem dimensions 256x256x256; a sketch of the corresponding input file is shown below
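The NVIDIA HPCG binary reads its parameters from an hpcg.dat file in the working directory. A minimal sketch matching the 256x256x256 local domain above, assuming the standard HPCG input format (the third line is the local domain nx ny nz, the fourth is the target run time in seconds; the run time shown here is illustrative):

HPCG benchmark input file
Sandia National Laboratories; University of Tennessee, Knoxville
256 256 256
60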

HPCG output for the RTX3090:

1x1x1 process grid
256x256x256 local domain
SpMV  =  132.1 GF ( 832.1 GB/s Effective)  132.1 GF_per ( 832.1 GB/s Effective)
SymGS =  162.5 GF (1254.3 GB/s Effective)  162.5 GF_per (1254.3 GB/s Effective)
total =  153.8 GF (1166.5 GB/s Effective)  153.8 GF_per (1166.5 GB/s Effective)
final =  145.9 GF (1106.4 GB/s Effective)  145.9 GF_per (1106.4 GB/s Effective)

Results

These results were run on the system, software, and GPUs listed above.

Benchmark Job                     RTX3090           RTX3080           RTX Titan         RTX 2080Ti
TensorFlow 1.15, ResNet50 FP32    561 images/sec    462 images/sec    373 images/sec    343 images/sec
TensorFlow 1.15, ResNet50 FP16    1163 images/sec   1023 images/sec   1082 images/sec   932 images/sec
NAMD 2.13, Apoa1                  0.0264 day/ns     0.0285 day/ns     0.0306 day/ns     0.0315 day/ns
                                  (37.9 ns/day)     (35.1 ns/day)     (32.7 ns/day)     (31.7 ns/day)
NAMD 2.13, STMV                   0.3398 day/ns     0.3400 day/ns     0.3496 day/ns     0.3528 day/ns
                                  (2.94 ns/day)     (2.94 ns/day)     (2.86 ns/day)     (2.83 ns/day)
HPCG Benchmark 3.1                145.9 GFLOPS      119.3 GFLOPS      Not run           93.4 GFLOPS

Note that the results using TensorFlow 1.15 are much improved for the older RTX20 series GPUs compared to past testing I have done using earlier NGC containers with TensorFlow 1.13. This is especially true for the fp16 results. I feel there is a possibility of significantly better results for the RTX30 series after they have become fully supported.

Performance Charts

Results from past GPU testing are not included since they are not strictly comparable because of improvements in CUDA and TensorFlow.


TensorFlow 1.15 (CUDA11) ResNet50 benchmark. NGC container nvcr.io/nvidia/tensorflow:20.08-tf1-py3

TensorFlow ResNet50 FP32

The FP32 results show a good performance increase for the RTX30 GPUs, and I expect performance to improve further when they are more fully supported.

TensorFlow ResNet50 FP16

I feel that the FP16 results should be much higher for the RTX30 GPUs since this should be a strong point; I expect improvement with a CUDA update. The surprising result was how much better the RTX20 GPUs performed with CUDA 11 and TensorFlow 1.15. My older results with CUDA 10 and TensorFlow 1.13 were 653 img/s for the RTX Titan and 532 img/s for the 2080Ti!


NAMD 2.13 (CUDA11) apoa1 and stmv benchmarks. NGC container nvcr.io/hpc/namd:2.13-singlenode

NAMD Apoa1

NAMD STMV

These Molecular Dynamics simulation tests with NAMD are almost surely CPU bound. There needs to be a balance between CPU and GPU, and these GPUs are so fast that even the excellent 24-core Xeon 3265W is probably not enough. I will do testing at a later time using AMD Threadripper platforms. A quick way to probe for a CPU bottleneck is sketched below.
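One simple check, using the apoa1 command from above, is to rerun the same job with different CPU core counts and watch the day/ns number. This is a sketch, not something I ran for this post:

namd2 +p12 +setcpuaffinity +idlepoll +devices 0 apoa1.namd
namd2 +p24 +setcpuaffinity +idlepoll +devices 0 apoa1.namd

If performance keeps improving as +p goes up, the GPU is being starved by the CPU.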


HPCG 3.1 (xhpcg-3.1_cuda-11_ompi-4.0_sm_60_sm70_sm80) nvcr.io/nvidia/cuda:11.0-runtime-ubuntu20.04 (with the addition of OpenMPI 4)

HPCG

I did not have the HPCG benchmark set up when I had access to the RTX Titan. HPCG is an interesting benchmark in that it is significantly memory bound. The high-performance memory on the GPUs has a large performance impact. The Xeon 3265W yields 14.8 GFLOPS. The RTX3090 is nearly 10 times that performance!

Conclusions

The new RTX30 series GPUs look to be quite worthy successors to the already excellent RTX20 series GPUs. I am also expecting that the compute performance exhibited in this post will improve significantly after the new GPUs are fully supported with a CUDA and driver update.

I can tell you that some of the nice features on the Ampere Tesla GPUs are not available on the GeForce RTX30 series. There is no MIG (Multi-Instance GPU) support, and the double precision floating point performance is very poor compared to the Tesla A100 (I compiled and ran nbody as a quick check). However, for the many applications where fp32 and fp16 are appropriate, these new GeForce RTX30 GPUs look like they will make very good and cost-effective compute accelerators.
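For anyone who wants to reproduce that quick nbody check, the sample ships with the CUDA toolkit samples. This is a sketch, assuming the CUDA 11 samples are installed under /usr/local/cuda/samples:

cd /usr/local/cuda/samples/5_Simulations/nbody && make
./nbody -benchmark -numbodies=256000           # single precision GFLOP/s
./nbody -benchmark -fp64 -numbodies=256000     # double precision GFLOP/s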

Happy computing! --dbk @dbkinghorn


Tags: NAMD, NVIDIA, TensorFlow, RTX30 series, Machine Learning, Molecular Dynamics
hoohoo

Thanks, Professor! Looking forward to your full analysis.

Posted on 2020-09-24 16:01:29
ibmua

Did you enable XLA in TensorFlow? This is critical for adequate comparison of 30xx to Titan RTX.

I feel that the FP16 results should be much higher for the RTX30 GPUs since this should be a strong point, I expect improvement with CUDA a update.


Opposite. Apparently NVidia castrated 30xx GPUs for us to have to buy Quadros this time around. On the 20xx series they had FP16 perf = 2x FP32 perf for CUDA cores, which is not the case for the 30xx series. And Tensor Core perf is neutered for both 30xx and 20xx series so that FP16 Tensor Core TFLOPS with FP32 accumulate, which is used for mixed precision training, is half that of FP16 accumulate, which is apparently only considered good for inference. https://www.nvidia.com/cont... On full uncut cards they have the same FP32 accumulate as FP16 accumulate, and for A100, FP16 is not even 2x as for Volta and full uncut Turing cards, but actually 4x the FP32 performance. But that is probably special to A100, it has very low FP32; my guess is Quadros will still have FP16 perf = 2x FP32.

Titan RTX still has a much much higher FP16 speed. So XLA should be enabled for us to know which is actually better.

Posted on 2020-09-24 16:10:34
Donald Kinghorn

Thanks (again), our discussion on the 3080 post was enlightening. https://www.pugetsystems.co...
Just to clarify for readers on this post: XLA did not compile for this round of RTX30 testing.

I'm still hopeful for some improvements ... I'm a bit surprised by the info you shared (I appreciate it)! We'll be getting in some Tesla A100's soon ... and Quadro's when they are released so I'll try to get a bunch of consistent/repeatable perf data together.

Posted on 2020-09-25 17:29:44
ibmua

I've tried running on nvcr.io/nvidia/tensorflow:2... . Any clue why my fp16 perf is very different? Around 600 images/sec for both 2080 ti and 3080. Also, regarding XLA compilation, you can try a different test like this:
python nvidia-examples/resnet50v1.5/main.py --arch=se-resnext101-32x4d --batch_size=64 --warmup_steps 200 --data_dir=/imagenet/tf/ --gpu_memory_fraction 0.95 --precision fp32 --results_dir=/results_dir/ --mode=training_benchmark --use_tf_amp --use_xla
--use_xla switches XLA on and --use_tf_amp switches mixed precision on. Strangely, for me training is 2x faster without XLA on the 2080 ti, and while it's faster on the 3080, it's still 114 img/sec while non-XLA mixed precision 2080 ti perf for me is 220. V100 perf is reportedly 475 https://github.com/NVIDIA/D... , so >2x higher than what I got.

Posted on 2020-10-16 19:41:03
Donald Kinghorn

hummm, I have access to a sys with a founders edition RTX3090 let me check out the 20.09 NGC container build ...

using TF1.15 with tag 20.08-tf1-py3 gives me 1132 img/sec
python nvidia-examples/cnn/resnet.py --layers=50 --batch_size=192 --precision=fp16

10 10.0 105.6 3.775 4.748 1.62000
20 20.0 1128.6 0.054 1.031 1.24469
30 30.0 1130.3 0.093 1.074 0.91877
40 40.0 1128.8 0.014 0.997 0.64222
50 50.0 1126.2 0.076 1.060 0.41506
60 60.0 1132.4 0.145 1.131 0.23728
70 70.0 1124.6 0.097 1.084 0.10889
80 80.0 1128.3 0.037 1.025 0.02988
90 90.0 868.1 0.001 0.989 0.00025

using TF1.15 with tag 20.09-tf1-py3 gives me 747 img/sec !!!
python nvidia-examples/cnn/resnet.py --layers=50 --batch_size=192 --precision=fp16

10 10.0 100.4 3.807 4.779 1.62000
20 20.0 741.8 0.063 1.040 1.24469
30 30.0 741.7 0.071 1.049 0.91877
40 40.0 743.8 0.055 1.035 0.64222
50 50.0 746.6 0.365 1.347 0.41506
60 60.0 747.0 0.163 1.148 0.23728
70 70.0 744.9 0.030 1.016 0.10889
80 80.0 745.6 0.005 0.992 0.02988
90 90.0 616.0 0.001 0.989 0.00025

yikes! that's a massive performance drop. I don't know what is going on in the 20.09-tf1 build.

Thanks for letting me know about this. I've been waiting for a container built with links to cuda 11.1 and cuDNN 8.4 ... I did look at the 20.09 container but only to see what libs were linked in. Hopefully the 20.10 container will fix this. I really really hope that the 20.08 fp16 results aren't junk!!

Posted on 2020-10-16 22:41:10
Donald Kinghorn

I'm off next week and have a busy week after that so I won't get back to it for a while but I think we need to do a more real-world evaluation to make sure that we are getting valid results!

Posted on 2020-10-16 22:49:16
ibmua

Yeah, I decided to test 20.08 and also had a much better result for the 3080 than with 20.09 - 870 img/sec without OC, 900 img/sec with OC, batch size 128 (I can't easily run a batch of 192 due to an out-of-GPU-memory complaint). 1000/sec for a 160 batch. And for se-resnext101 a relatively massive 395 img/sec with some OC or 375 without - 2x higher than the 2080 ti and 80% of NVidia's stated V100 performance for a larger batch size - they run a 96 batch, I run a 64 batch.
The OC I'm doing is mostly memory overclocking. Mem overclocks easily on the 3080, +1 GHz is basically guaranteed, ~+1.5 GHz is possible.

Opened an issue on TF https://github.com/tensorfl...

So apparently the memory is about 2x faster de-facto, as also witnessed by ~2x higher Ethereum hash rate, so depending on math intensity of a neural network the speedup vs 2080ti is going to be between 15% (for completely math-bound) and 100% (for completely mem speed bound).

Posted on 2020-10-17 15:34:28
James

Amazing job! Thank you for the very early comparison!

According to NVidia the 3rd generation Tensor Cores are highly optimized for FP32 and in the A100 reach up to 160 TFLOPS FP32, or 10x the performance of V100. It looks like the RTX 30xx don't have exactly the same Tensor Cores, but instead have way more CUDA and RT Cores.

So RTX 30xx for Gaming. A100 and Quadro for HPC.

Posted on 2020-09-25 08:48:10
Donald Kinghorn

Yes, this is looking like how it's going to play out. Sometime over the next month or so I'll try to get some good comparative testing in on all 3 "flavors" of the GPUs

Posted on 2020-09-25 17:32:10
Maximilian Schneider

Thank you for the detailed benchmarks Dr. Kinghorn, really appreciate that you reran the benchmarks for the 2080ti with updated drivers and frameworks!

A lot of deep learning research revolves around transformer networks, which show very different scaling than CNNs. Huggingface.co has a simple benchmark that can be easily set up for tf/pytorch: https://huggingface.co/tran...

I would really appreciate it if you could publish those numbers as well, as we are currently holding off on any further hardware purchases until those are out :)

Posted on 2020-09-25 13:51:33
ibmua

Yeah, definitely thanks a lot for 2080 Ti update. Quite stunning, I expected the difference to be 25%, but no, 9-10% and that's with -1GB. LOL! 2080 Ti FTW! You saved me quite a lot of money there, as I was just going to go exchange $400 1080ti Strix OC for 3080 Gaming OC on the aftermarket with +$800 on my part. =) Happy I ordered 2x 2080 Ti Turbos with NVLink for less than $1300 today.

Posted on 2020-09-25 17:50:13
Donald Kinghorn

:-) 2080Ti has just been a great card! We still have a bunch for builds but expect them to go fast for multi-GPU configs

Multi-GPU is going to be a challenge for RTX30 ... looking forward to getting the Gigabyte blower cards in!

Posted on 2020-09-25 17:57:18
Donald Kinghorn

Nice! Yes, I'd like to get a transformer network benchmark going. Thanks for the link! I'll give it a shot and see how it goes.

I may end up rebuilding TF (and PyTorch) against CUDA 11.1 to see if I can get things working better with RTX30 ... Ideally I'd use NVIDIA's containers on NGC but I don't know when they will get updated (I expect soon since they are keeping up a pretty good release cadence)

I'll keep you posted!

Posted on 2020-09-25 17:52:23
mert gölcük

Thanks! Your benchmarks are amazing.

The difference between the RTX2080Ti and RTX30** GPUs is not much for the STMV system. This result is very surprising and unusual, I think, when I compare the CUDA core counts.

I just wonder, do you plan to add NAMD3 to the benchmarks? I know it is still an alpha version, but with the new GPUs it could give surprising results.

Thanks

Posted on 2020-09-25 18:21:52
QwwqWq

Maybe the reason for the poor FP16 performance is this:
RTX 3090 has been purposely nerfed by Nvidia at driver level.
https://www.reddit.com/r/Ma...

Posted on 2020-09-25 18:26:07

Does the 3090 show up in lspci as a 3090? In my case (a Gigabyte Eagle 3090 OC on Ubuntu 20.04) it's showing up as "NVIDIA Corporation Device 2204", and nvidia-smi 455.23.04 does not recognize it at all.

Posted on 2020-09-25 19:49:54
Brian Scanlon

Many thanks for such an insightful review. I find it interesting that the FP32 capabilities have improved greatly, albeit at the loss of power efficiency and absence of half precision training boost. While it may provide an advantage in some scenarios, I will not be upgrading from my 2x2080ti setup for a while. Thanks also for rerunning tests with CUDA11.

I am also a fan of the large memory, and hope this trend holds true in years to come. While games may not have a use case for more memory, it opens the door to different modelling approaches in the ML space.

Posted on 2020-09-26 10:18:36
David Manouchehri

Have you had the chance to test out NVLink with the RTX 3090 yet?

Posted on 2020-09-28 04:15:24

NVLink bridges for these new RTX 3090 cards do not seem to exist yet, and I have yet to see any substantial info about them from NVIDIA. Moreover, the multi-fan cards available so far are not ideal for use in multi-GPU configurations which NVLink would imply.

To more directly answer your question, though - no, we have not yet tested NVLink on these cards :)

Posted on 2020-09-28 04:19:21
Donald Kinghorn

we're curious about this too. We are also curious about P2P over PCIe 3 vs 4 ... as William suggested we'll check all of this out when the NVLINK bridge shows up

Posted on 2020-09-28 16:16:52
Kyungmin Lee

Thanks, Donald Kinghorn! (I decided to buy RTX 3090 cards because of your experiment results!)

I think CuDNN still does not fully support the Ampere architecture.
And the TensorFlow build scripts also do not support cuda-11.1 yet. (https://github.com/tensorfl...

Is it already fully supported by the current version of CuDNN?
Otherwise, could you give me any advice to fully utilize the RTX30 series cards?

Posted on 2020-09-28 12:56:09
ibmua

To fully utilize those cards you'll have to wait till people write sparse kernels for them for a 2x boost.

Posted on 2020-09-28 14:45:27
Donald Kinghorn

I think you will be quite happy with the 3090! I may get one too.

As ibmua suggests below, we're going to need to have some patience while folks get a chance to work out optimizations for the RTX30's (and Ampere in general). I think we'll see good progress after GTC in a few weeks.

Posted on 2020-09-28 16:27:51
Donald Kinghorn

... I just looked at the CuDNN release notes; 8.0.3 does list the A100 but does not mention GA102. Probably need another update. Thanks for mentioning that! I was thinking about trying a TF build with cuda 11.1 this week but it looks like it might be better to wait a bit.

Posted on 2020-09-28 16:48:18
Kyungmin Lee

Thanks~! I'm trying to re-build my TF using CUDA 11.1 and cuDNN 8.0.4 (libcudnn8_8.0.4.30-1+cuda11.1_amd64.deb is now available).

Posted on 2020-09-29 00:52:12
Donald Kinghorn

Cool, I see it's available from dev ... I may take a shot at this too :-) Thanks!

Posted on 2020-09-29 15:57:08
KYUNG MIN LEE

This cudart SO version issue was fixed about 6 hours ago. (I would not expect this change to actually improve performance, though.)
https://github.com/tensorfl...

Posted on 2020-10-08 00:12:51
Donald Kinghorn

thanks for keeping me posted!

I'm holding off on doing a build for now. I expect NV to update the NGC container images before the end of the month so they might save us the trouble :-)

Posted on 2020-10-08 01:08:32
Shreeyak Sajjan

It's been a month!! :)
Any updates to this post? I'd love to see if the numbers have improved since Sep.

Posted on 2020-10-23 10:47:51
Ernst Stavro Blofeld

Would you please benchmark the card with a big Transformer? Thanks!

Posted on 2020-10-23 19:18:12