Read this article at https://www.pugetsystems.com/guides/1885
Dr Donald Kinghorn (Scientific Computing Advisor)

RTX3080 TensorFlow and NAMD Performance on Linux (Preliminary)

Written on September 17, 2020 by Dr Donald Kinghorn

Introduction

The much anticipated NVIDIA GeForce RTX3080 has been released.

How good is the performance for Machine Learning and Molecular Dynamics on the RTX3080?

  • First tests look exceptionally promising! Initial results with TensorFlow running ResNet50 training look to be significantly better than the RTX 2080 Ti.
  • NAMD molecular dynamics performance was as good as I've seen, and was basically CPU bound with just one RTX3080 GPU on a 24-core Intel Xeon W-3265.

The testing was problematic though.

  • I had to wait until the official launch on September 17th to get a usable Linux display driver. (arrrggg!) This was annoying because the Tesla A100 had been supported for a couple of months.
  • I only had a couple of hours of access to the card for testing, and the scripts I had set up to automate the testing failed on the RTX3080 with the container application I was using. (a libnvidia-container error)
  • I tried using the Anaconda build of TensorFlow, but it failed on the RTX3080 with an error in the cuBLAS library. (that was with Google's ResNet50 benchmark)
  • I had also tried the above earlier in the week using a pre-release Windows driver and hit the same error.

But I got things working by doing a quick install of Docker and then using containers from NVIDIA NGC (which I had tried earlier with an alternative to Docker).
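For reference, here is a minimal sketch of that quick setup on Ubuntu 20.04. The package names are the standard ones, but this assumes NVIDIA's container toolkit repository has already been added per their install guide, so treat it as a starting point rather than a recipe:

    # Docker from the Ubuntu repos, plus NVIDIA's runtime hook for --gpus
    # (assumes the nvidia-container-toolkit apt repo is already configured)
    sudo apt-get update
    sudo apt-get install -y docker.io nvidia-container-toolkit
    sudo systemctl restart docker
    # pull the NGC TensorFlow container used for the testing below
    docker pull nvcr.io/nvidia/tensorflow:20.08-tf1-py3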

This is very brief early testing! I expect performance and compatibility to improve considerably soon. I'll be doing more thorough testing after a new driver revision and after software developers have had a chance to do more optimizations and debugging.

Test system

Hardware

  • Intel Xeon W-3265: 24 cores (4.4 GHz turbo / 3.4 GHz base)
  • Motherboard: Asus PRO WS C621-64L SAGE/10G (Intel C621-64L EATX)
  • Memory: 6x REG ECC DDR4-2933 32GB (192GB total)
  • NVIDIA RTX3080 and RTX Titan

Software

  • Ubuntu 20.04 Linux
  • Docker version 19.03.12
  • NVIDIA Driver Version: 455.23.04
  • nvidia-container-toolkit 1.3.0-1
  • NVIDIA NGC containers
    • nvcr.io/nvidia/tensorflow:20.08-tf1-py3
    • nvcr.io/hpc/namd:2.13-singlenode

Test Jobs

  • TensorFlow-1.15: ResNet50 v1, fp32 and fp16
  • NAMD-2.13: apoa1, stmv

Example Command Lines

  • docker run --gpus all --rm -it -v $HOME:/projects nvcr.io/nvidia/tensorflow:20.08-tf1-py3
  • docker run --gpus all --rm -it -v $HOME:/projects nvcr.io/hpc/namd:2.13-singlenode
  • python nvidia-examples/cnn/resnet.py --layers=50 --batch_size=96 --precision=fp32
  • python nvidia-examples/cnn/resnet.py --layers=50 --batch_size=192 --precision=fp16
  • namd2 +p24 +setcpuaffinity +idlepoll +devices 0 apoa1.namd
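Before launching the benchmark jobs, a quick sanity check that the container runtime can actually see the GPU is worthwhile (this just runs nvidia-smi inside the TensorFlow image and exits):

    docker run --gpus all --rm nvcr.io/nvidia/tensorflow:20.08-tf1-py3 nvidia-smi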

Results

These results were run on the system and software listed above, using the RTX Titan and RTX 3080.

Benchmark Job                    RTX 3080                       RTX Titan
TensorFlow 1.15, ResNet50 FP32   462 images/sec                 373 images/sec
TensorFlow 1.15, ResNet50 FP16   1023 images/sec                1082 images/sec
NAMD 2.13, ApoA1                 0.0285 day/ns (35.11 ns/day)   0.0306 day/ns (32.68 ns/day)
NAMD 2.13, STMV                  0.3400 day/ns (2.941 ns/day)   0.3496 day/ns (2.860 ns/day)
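A note on units: NAMD reports benchmark timing in days of wall-clock time per nanosecond of simulated time (lower is better); the ns/day numbers in parentheses are just the reciprocal. For example, for the RTX Titan ApoA1 run:

    python -c "print(1 / 0.0306)"    # 32.68 ns/day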

I had tried to run the Big_LSTM benchmark that I have run in the past, but it failed with out-of-memory errors no matter how small I made the batch_size. I had also run with TensorFlow 2.2 on the RTX Titan but did not have time to do this on the RTX 3080.

You can see that the $700 RTX 3080 gave excellent performance compared to the much more expensive RTX Titan (which has 24GB of expensive memory). I didn't compare directly to other cards because the RTX Titan was what I had available at the time.

Note: these results for the RTX Titan are much improved over past testing that I have done using earlier versions of the NGC TensorFlow container. This is especially true for the fp16 result, as you will see in the charts below.

Comparative Charts

I will not present any charts with NAMD results because more CPU cores will be needed to balance performance and show better discrimination between different GPUs. My guess is that one or two RTX 3080s would be excellent on an AMD TR 3990x or 3970x platform for NAMD.
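If you want to see that CPU-bound behavior yourself, a simple sweep over the NAMD core count makes it visible; with a single fast GPU, the day/ns number should keep improving as +p goes up, right to the limit of the CPU. A sketch, assuming apoa1.namd is in the working directory:

    # sweep CPU core counts with one GPU and pull out the benchmark lines
    for p in 8 16 24; do
        namd2 +p$p +setcpuaffinity +idlepoll +devices 0 apoa1.namd | grep Benchmark
    done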

These are results from older testing with new results mixed in!

TensorFlow ResNet50 FP32

You can see from the plot that the RTX3080 is approaching the performance of two RTX 2080Ti GPUs!

TensorFlow ResNet50 FP16

Again, the RTX3080 is doing very well with mixed precision fp16. I expect this number to improve with a new driver and some CUDA patches. There is a dramatic improvement for the RTX Titan at fp16: 1082 img/sec vs. 653 img/sec from the older testing!

NOTE: These are mixed results, using numbers from testing with an older NGC TensorFlow-1.13 container. There have been significant improvements in the new TensorFlow-1.15 build.

Conclusions

NVIDIA is keeping the "spirit" of Moore's Law alive! The "Ampere" GPU based RTX 3080 is a significant step forward in performance-per-dollar. The results presented in this post are preliminary. They will only get better as the driver matures and as software developers tune their applications for better performance on the architecture.

I can tell you that some of the nice features of the Ampere Tesla GPUs are not available on the GeForce 30 series. There is no MIG (Multi-Instance GPU) support, and the double precision floating point performance is very poor compared to the Tesla A100 (I compiled and ran nbody as a quick check). However, for the many applications where fp32 and fp16 are appropriate, these new GeForce RTX30 GPUs look like they will make very good and cost effective compute accelerators.
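For anyone who wants to reproduce that quick fp64 check: the nbody sample that ships with the CUDA toolkit has a double precision benchmark flag. The sample path below is typical for a CUDA 10/11 install; adjust for your setup:

    cd /usr/local/cuda/samples/5_Simulations/nbody && make
    ./nbody -benchmark           # single precision GFLOP/s
    ./nbody -benchmark -fp64     # double precision GFLOP/s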

Happy computing! --dbk @dbkinghorn



Tags: NAMD, NVIDIA, TensorFlow, RTX30 series, Machine Learning, Molecular Dynamics
MisterWU

OMG

Posted on 2020-09-18 04:45:18
Donald Kinghorn

yes, that was my feeling too :-) Expect more (better) testing over the next few weeks. Looking forward to the 3090 and multi-GPU!

Posted on 2020-09-18 19:04:34
Adam

Could you please report time to a specific accuracy (or at least accuracy on the validation set after some time)?

Posted on 2020-09-18 11:45:13
Donald Kinghorn

The ResNet50 job was run purely as a benchmark with synthetic data (random pixels) .... and I only had the card for a couple of hours :-) I hear you though.

Posted on 2020-09-18 19:02:40
a guy with a 2060 rtx

It looks like you guys didn't test the 2060 RTX. Are you able to test it & add it to those charts?

Posted on 2020-09-18 15:03:57
Donald Kinghorn

The last update on those charts before this post was when the 2070super came out. I've just never tested with 2060. At some point it would be nice to do a complete update, but it is a huge amount of work! (If we even have all of the cards available) I'm working up a better automated GPU compute testing suite so hopefully this will happen!

Posted on 2020-09-18 18:57:16
Sebastian Guerraty

Damn it, I was really looking forward to having good f64 performance based on the fact that it's the "same architecture" as the A100. It seems like the change to a different chip led to some features being limited, or maybe it's just NVIDIA preferring that some customers buy their Quadro products.
Would be interesting to see the impact of PCIe 4.0 vs 3.0 with these GPUs for scientific applications in the future (might be relevant when choosing a CPU platform).
Can't wait to see the results of the 3090. Great work and thanks for sharing it :)

Posted on 2020-09-19 03:56:57
k8s_1

They have to leave something for the professional market; there's no reason for them to waste transistors on fp64.
My guess is that PCIe 4.0 will mainly help scaling for multi-GPU setups, or scenarios where you need to communicate with the GPU very often.

Posted on 2020-09-19 05:48:58
Donald Kinghorn

Yes! I was hoping for good double precision too. The Titan V is still the best bet for a reasonably priced card. I also wanted virtualization support (and MIG) but ... nope. Still, as a replacement for the 2080Ti it's really good!

I'm curious about PCIe performance too. I'll for sure be testing that. Our production folks are telling me that the Threadripper boards are looking to have the best layout for multi-GPU ... so that may end up being the platform of choice ??? we'll test everything :-)

Posted on 2020-09-21 15:12:06
Sebastian Guerraty

Any update on whether they will support SR-IOV on the RTX 3090? Since they are marketing it as a Titan-level card, it makes sense that it might get more of a "prosumer" feature set.
I am totally ignorant about virtualization, but would this help GPU performance when using WSL 2 to train models on Linux while programming on Windows?

Posted on 2020-09-21 21:47:35
Donald Kinghorn

I'm not sure how this really works ... it looks like what NVIDIA is calling GPUDirect Storage will be enabled (not 100% sure). I don't understand the tie-in with SR-IOV for this ??

I also don't know how they are doing the GPU virtualization in WSL2 :-) ... I was wondering about that because it seems like it "shouldn't" work for GeForce, but it does! It's actual shared virtualization of the GPU. I have tested that and it's not too bad for compute. The latest update is only between 15-30% slower than Windows native. It's pretty cool really! It worked well at fp32, but fp16 performance was bad for some reason.

I have even done a JupyterHub install on WSL2 by rewriting the systemd unit file as an old-school rc script ... worked! I'm also going to try a python wrapper for systemd that can be used with docker containers.

It keeps getting better and better! :-)

Posted on 2020-09-22 01:02:08
Scientism

Please include Titan V in your final benchmark -- very interesting to know how its performance changed due to software improvements, and how good it is compared to the new 30 series cards. Also interested in synthetic benchmarks that reveal "FP16 Tensor TFLOPS with FP32 Accumulate" differences between pro and consumer cards.

Posted on 2020-09-19 18:51:53
Donald Kinghorn

I think auto-mixed precision is going to work with these. I'm going to have to do some coding to check some of this stuff. I'm curious about TF32! Which could solve a lot of problems ???

I'll get 3090 results soon, but they will probably just be limited to comparisons like what I have in this post. I'll want to get a fully refreshed benchmark setup and a good comparison post up. Before I do that I'll need a better driver update, a CUDA update ... and some TF patches.

Right now everything is hit or miss because nothing is compiled for the new "compute capability": GA102 is sm_86 and GA100 is sm_80. TensorFlow 2.x is not working (at least for me).

I expect the October release cycle from the devs at NV will have some things sorted out :-)

Posted on 2020-09-21 15:29:22
Donald Kinghorn

... I'll do that right now. (I've got a TitanV in a system at home :-)

Running the same NGC TF 1.15 container as in this post, with batch_size 96 and 192:
fp32: 372 img/s (old 288)
fp16: 1137 img/s (old 624)
Same massive improvement at fp16!

This is part of why I think a CUDA and driver update will improve the 3080 and 3090 results significantly ... I think the current launch release of the driver is not optimal on sm_86.

Posted on 2020-09-21 16:11:06
Scientism

BTW, I'm reading through all of the HPC blog, and the human detection / anti-bot machine learning "captcha" is unbearable. It forgets that I'm a human within 0.5-1 hour, and I have to find motorcycles again and again.

Posted on 2020-09-21 06:47:47
Donald Kinghorn

:-)

Posted on 2020-09-21 15:32:17
Stefan Fabian

These graphs are very misleading and have absolutely no informational value due to the different versions used.
The same card has a significant difference in performance in that graph (653 vs 1082).
Please either update or just remove the graph because comparing results for different versions holds no value and only serves to confuse readers.
The comparison between the RTX Titan and the RTX 3080 is interesting, though.
Would be a great follow-up to compare it to the 2080 Ti with the same version and up-to-date drivers.

Posted on 2020-09-22 10:41:14
Donald Kinghorn

Yes, you have to read the text for clarification on that. Copied from the post ...
...
Note: these results for the RTX Titan are much improved over past testing that I have done using earlier versions of the NGC TensorFlow container. This is especially true for the fp16 result, as you will see in the charts below.

NOTE: These are mixed results, using numbers from testing with an older NGC TensorFlow-1.13 container. There have been significant improvements in the new TensorFlow-1.15 build.
...

That is why I included the new numbers on the charts. This is important! Sorry if that was not clear to you. Also, see the comments below. I provided a new result for the Titan V, per a kind request.

Posted on 2020-09-22 15:36:00
Donald Kinghorn

... you do have a point! I think I may leave out old results for the RTX3090 testing. The comparison charts need a complete redo in light of the greatly increased performance for older GPUs with fp16 using the TensorFlow 1.15 NGC container.

Also, I'm pretty sure there will be better performance with the RTX30 once a CUDA update with support for compute level 8.6 is released ...

I'll see if I can get some 2080Ti results updated for the RTX3090 post. That's probably the most important comparison ... still have to keep in mind what I just mentioned above.

Thanks for your understanding :-) Expect a good comparison after we get a new CUDA and driver update

Posted on 2020-09-22 19:05:17
Ampere

https://developer.nvidia.co...
https://docs.nvidia.com/cud...

"Added support for NVIDIA Ampere GPU architecture based GA10x GPUs GPUs (compute capability 8.6), including the GeForce RTX-30 series."

Posted on 2020-09-23 19:28:13
Donald Kinghorn

Thank You! I have been checking on that every day ... except today, because I was busy finishing up my post on the RTX3090 .... ha ha! I have comments all over that post saying that we will need to see a CUDA update before we are confident in the results :-)

I will still need to wait for the NGC container updates, but that should happen soon. Once that happens I'll try to get more comprehensive testing done ... including multi-GPU.

I'll add a note to tomorrow's post on the RTX3090. Thanks again --Don

Posted on 2020-09-23 22:14:14
ibmua

Was XLA compilation used? It's a critical question, because XLA speeds things up massively by removing unnecessary memory movements, and if it wasn't used, the 3080 is likely to get dethroned by the Titan RTX big time due to NVIDIA's FP16 DL training tensor and CUDA core castration of the 30xx cards.

Posted on 2020-09-24 08:32:18
Donald Kinghorn

Thanks, interesting observation! Here's a snip from the output of an RTX3090 run (all test jobs were run the same way).
XLA was flagged to be used, but I don't know if it actually was or not! ... I didn't redirect stderr to my output files! arrrggg! I'm guessing it did not compile.

check out the RTX3090 post too
https://www.pugetsystems.co...

WARNING: Detected NVIDIA GeForce RTX 3090 GPU, which is not yet supported in this version of the container
ERROR: No supported GPU(s) detected to run this container

NOTE: MOFED driver for multi-node communication was not detected.
Multi-node communication performance may be reduced.

PY 3.6.9 (default, Jul 17 2020, 12:50:27)
[GCC 8.4.0]
TF 1.15.3
Script arguments:
--layers 50
--batch_size 384
--num_iter 90
--iter_unit epoch
--display_every 10
--precision fp16
--use_xla True
--predict False

+++ ... I went into the office and put a 3090 back in my test system ... looks like XLA did not compile.

You get this,

2020-09-24 12:58:35.403350: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x969a450 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-09-24 12:58:35.403408: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce RTX 3090, Compute Capability 8.6

then later,

2020-09-24 12:58:38.036758: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:70] Can't find ptxas binary in ${CUDA_DIR}/bin. Will back to the GPU driver for PTX -> sass compilation. This is OK so long as you don't see a warning below about an out-of-date driver version.
2020-09-24 12:58:38.036813: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:71] Searched for CUDA in the following directories:
2020-09-24 12:58:38.036831: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:74] /usr/local/cuda
2020-09-24 12:58:38.036841: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:74] /usr/local/cuda
2020-09-24 12:58:38.036850: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:74] .
2020-09-24 12:58:38.036858: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:76] You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions. For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.
2020-09-24 12:58:38.866740: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1648] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.

ptxas does not work (it is there and on the PATH despite what the messages suggest ... it works fine with RTX20), so the PTX is compiled by the driver, giving a slow start-up for the job.
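Following the hints in those messages, two environment settings are worth trying. Both come straight from the TensorFlow warnings above; whether they actually help on RTX 30 before a CUDA 11.1-based container ships is an open question:

    # tell XLA where the CUDA install (and ptxas) lives
    export XLA_FLAGS=--xla_gpu_cuda_data_dir=/usr/local/cuda
    # or add the HLO profile flag to confirm XLA is actually active
    export XLA_FLAGS="--xla_gpu_cuda_data_dir=/usr/local/cuda --xla_hlo_profile"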

We'll get all of this sorted out and working right but it will take some time for everything to get proper support. CUDA 11.1 was just released yesterday and it has support for sm_86

Posted on 2020-09-24 20:21:48
ibmua

Thanks for taking the time! Some XLA benchmarks from others for reference https://bizon-tech.com/gpu-... - though do mind that those are numbers for different batch sizes. It's just that the relative difference in after-XLA perf is very noticeable.

Posted on 2020-09-25 06:32:45
Adupa Vasista

OMG, for NAMD the RTX 3080 is better than the Titan. Marvellous. Very excited to see the 3090 in action, and SLI. Also, it is sad that NVIDIA is revoking SLI support :( I have heard that the new NAMD 3 is up to 1.9x faster than NAMD 2.13. Can you give the new NAMD 3 a try with the 3090? Any particular reason to go with an Intel Xeon rather than an AMD 3950X or a Threadripper? Thank you.

Posted on 2020-09-30 04:09:56
Donald Kinghorn

Yes, single precision performance is very good! I have tested the RTX3090 with the same setup and same tests. https://www.pugetsystems.co...

I did run tests with NAMD 3, but the results were close enough that I reported 2.13 since it is easier to compare with older results. I will need to test on a system with more CPU performance to balance the GPUs to see the performance difference with v3.

I will redo NAMD testing on systems with more capable CPUs to balance out the GPUs. I'll test with TR 3970x and 3990x for sure :-) (I like the TR 3970x a lot) ... looking forward to Zen 3 too! Intel is very good, but I would go dual socket Xeon for best performance with NAMD. EPYC would also be good, and it looks like I will be testing that soon too :-)

The RTX 3090 does have NVLINK (higher-end Quadros should too). I will test dual RTX3090 once overall support is a bit better. ... I am enjoying the testing :-)

Posted on 2020-09-30 17:33:08
Adupa Vasista

Testing, testing, testing. Looks like you have a lot on your mind, sir. :)

Posted on 2020-09-30 18:24:21