Read this article at https://www.pugetsystems.com/guides/1885
Dr Donald Kinghorn (Scientific Computing Advisor)

RTX3080 TensorFlow and NAMD Performance on Linux (Preliminary)

Written on September 17, 2020 by Dr Donald Kinghorn


The much anticipated NVIDIA GeForce RTX3080 has been released.

How good is the performance for Machine Learning and Molecular Dynamics on the RTX3080?

  • First tests look exceptionally promising! Initial results with TensorFlow running ResNet50 training look significantly better than the RTX2080Ti.
  • NAMD molecular dynamics performance was as good as I've seen, and was basically CPU bound with just one RTX3080 GPU on an Intel Xeon 24-core 3265W.

The testing was problematic though.

  • I had to wait until the official launch on September 17th to get a usable Linux display driver. (arrrggg!) This was annoying because the Tesla A100 had been supported for a couple of months.
  • I only had a couple of hours of access to the card for testing, and the scripts I had set up to automate the testing failed on the RTX3080 with the container application I was using (a libnvidia-container error).
  • I tried TensorFlow from the Anaconda build, but it failed on the RTX3080 with an error in the cuBLAS library. (That was with Google's ResNet50 benchmark.)
  • I had also tried the above earlier in the week using a pre-release Windows driver and hit the same error.

But I got things working by doing a quick install of Docker and then using containers from NVIDIA NGC (which I had tried earlier with an alternative to Docker).

This is very brief early testing! I expect performance and compatibility to improve considerably soon. I'll be doing more thorough testing after a new driver revision and after software developers have had a chance to do more optimizations and debugging.

Test system


  • Intel Xeon 3265W: 24-cores (4.4/3.4 GHz)
  • Motherboard: Asus PRO WS C621-64L SAGE/10G (Intel C621-64L EATX)
  • Memory: 6x REG ECC DDR4-2933 32GB (192GB total)


  • Ubuntu 20.04 Linux
  • Docker version 19.03.12
  • NVIDIA Driver Version: 455.23.04
  • nvidia-container-toolkit 1.3.0-1
  • NVIDIA NGC containers
    • nvcr.io/nvidia/tensorflow:20.08-tf1-py3
    • nvcr.io/hpc/namd:2.13-singlenode

Test Jobs

  • TensorFlow-1.15: ResNet50 v1, fp32 and fp16
  • NAMD-2.13: apoa1, stmv

Example Command Lines

  • docker run --gpus all --rm -it -v $HOME:/projects nvcr.io/nvidia/tensorflow:20.08-tf1-py3
  • docker run --gpus all --rm -it -v $HOME:/projects nvcr.io/hpc/namd:2.13-singlenode
  • python nvidia-examples/cnn/resnet.py --layers=50 --batch_size=96 --precision=fp32
  • python nvidia-examples/cnn/resnet.py --layers=50 --batch_size=192 --precision=fp16
  • namd2 +p24 +setcpuaffinity +idlepoll +devices 0 apoa1.namd
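Incidentally, pulling an average throughput number out of the benchmark output is easy to script. The sketch below is purely illustrative: the column layout it assumes (whitespace-separated fields with img/sec in the third column) is my assumption and may not match the NGC container's exact output.

```python
import statistics

# Hypothetical sample of resnet.py per-iteration output; the real format
# may differ between NGC container releases (this layout is an assumption).
SAMPLE_LOG = """\
  10  10.0  455.2  0.912
  20  10.0  461.8  0.905
  30  10.0  463.1  0.901
"""

def mean_img_per_sec(log_text: str) -> float:
    """Average the assumed third column (img/sec) over all data lines."""
    rates = []
    for line in log_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            try:
                rates.append(float(fields[2]))
            except ValueError:
                continue  # skip any header or non-numeric lines
    return statistics.mean(rates)

print(f"{mean_img_per_sec(SAMPLE_LOG):.1f} img/sec")
```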


These results were run on the system and software listed above, using the RTX Titan and the RTX 3080.

Benchmark Job                    RTX 3080                       RTX Titan
TensorFlow 1.15, ResNet50 FP32   462 images/sec                 373 images/sec
TensorFlow 1.15, ResNet50 FP16   1023 images/sec                1082 images/sec
NAMD 2.13, ApoA1                 0.0285 day/ns (35.11 ns/day)   0.0306 day/ns (32.68 ns/day)
NAMD 2.13, STMV                  0.3400 day/ns (2.941 ns/day)   0.3496 day/ns (2.860 ns/day)
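NAMD reports its benchmark timing in day/ns; throughput in ns/day is just the reciprocal. A tiny sanity check of the table's numbers (small differences from the ns/day values shown above are rounding in NAMD's own output):

```python
def ns_per_day(day_per_ns: float) -> float:
    """NAMD prints day/ns; the reciprocal gives the more intuitive ns/day."""
    return 1.0 / day_per_ns

# day/ns values from the results table above (RTX 3080 column).
print(f"ApoA1: {ns_per_day(0.0285):.2f} ns/day")  # ~35.09
print(f"STMV:  {ns_per_day(0.3400):.3f} ns/day")  # ~2.941
```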

I had tried to run the Big_LSTM benchmark that I have run in the past, but it failed with out-of-memory errors no matter how small I made the batch_size. I had also run with TensorFlow 2.2 on the RTX Titan but did not have time to do this on the RTX 3080.

You can see that the $700 RTX 3080 gave excellent performance compared to the much more expensive RTX Titan (which has 24GB of expensive memory). I didn't compare directly to other cards because the RTX Titan was what I had available at the time.
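The value argument can be made concrete with a quick performance-per-dollar calculation. The $700 RTX 3080 price comes from the text; the $2,499 figure for the RTX Titan is my assumption for its launch price:

```python
# FP32 ResNet50 results from the table above; prices as noted in the lead-in
# (the RTX Titan price is an assumed launch MSRP, not from the original post).
cards = {
    "RTX 3080":  (462, 700),
    "RTX Titan": (373, 2499),
}

for name, (img_s, price) in cards.items():
    print(f"{name}: {img_s / price:.3f} img/sec per dollar")

ratio = (462 / 700) / (373 / 2499)
print(f"RTX 3080 advantage: {ratio:.1f}x")
```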

Note that these results for the RTX Titan are much improved over past testing that I have done using earlier versions of the NGC TensorFlow container. This is especially true for the fp16 result, as you will see in the chart below.

Comparative Charts

I will not present any charts with NAMD results because more CPU cores will be needed to balance performance and show better discrimination between different GPUs. My guess is that one or two RTX 3080s would be excellent on an AMD TR 3990X or 3970X platform for NAMD.

These are results from older testing with new results mixed in!

TensorFlow ResNet50 FP32

You can see from the plot that the RTX 3080 is approaching the performance of 2 RTX 2080Ti GPUs!

TensorFlow ResNet50 FP16

Again the RTX 3080 is doing very well with mixed precision fp16. I expect this number to improve with a new driver and some CUDA patches. There is a dramatic improvement for the RTX Titan at fp16: 1082 img/sec vs 653 img/sec from the older testing!
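To put the improvement in ratio terms (a quick sketch using the numbers quoted above, not part of the original benchmarking):

```python
# RTX Titan ResNet50 FP16: old NGC TF 1.13 container vs the TF 1.15 container.
old_img_s, new_img_s = 653, 1082
print(f"container/library update alone: {new_img_s / old_img_s:.2f}x")

# FP16 vs FP32 on the RTX 3080, from the results table in this post.
print(f"fp16 vs fp32 on the RTX 3080: {1023 / 462:.2f}x")
```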

NOTE: These are mixed results, using numbers from testing with an older NGC TensorFlow-1.13 container. There have been significant improvements in the new TensorFlow-1.15 build.


NVIDIA is keeping the "spirit" of Moore's Law alive! The "Ampere" GPU based RTX 3080 is a significant step forward in performance-per-dollar. The results presented in this post are preliminary. They will only get better as the driver matures and as software developers tune their applications for better performance on the architecture.

I can tell you that some of the nice features on the Ampere Tesla GPUs are not available on the GeForce 30 series. There is no MIG (Multi-instance GPU) support and the double precision floating point performance is very poor compared to the Tesla A100 ( I compiled and ran nbody as a quick check). However, for the many applications where fp32 and fp16 are appropriate these new GeForce RTX30 GPUs look like they will make for very good and cost effective compute accelerators.

Happy computing! --dbk @dbkinghorn


Tags: NAMD, NVIDIA, TensorFlow, RTX30 series, Machine Learning, Molecular Dynamics


Posted on 2020-09-18 04:45:18
Donald Kinghorn

yes, that was my feeling too :-) Expect more (better) testing over the next few weeks. Looking forward to the 3090 and multi-GPU!

Posted on 2020-09-18 19:04:34

Could you please report time to a specific accuracy (or at least accuracy on the validation set after some time)

Posted on 2020-09-18 11:45:13
Donald Kinghorn

The ResNet50 job was run purely as a benchmark with synthetic data (random pixels) .... and I only had the card for a couple of hours :-) I hear you though.

Posted on 2020-09-18 19:02:40
a guy with a 2060 rtx

It looks like you guys didn't test the 2060 RTX. Are you able to test it & add it to those charts?

Posted on 2020-09-18 15:03:57
Donald Kinghorn

The last update on those charts before this post was when the 2070super came out. I've just never tested with 2060. At some point it would be nice to do a complete update, but it is a huge amount of work! (If we even have all of the cards available) I'm working up a better automated GPU compute testing suite so hopefully this will happen!

Posted on 2020-09-18 18:57:16
Sebastian Guerraty

Damn it, I was really looking forward to having good f64 performance based on the fact that it's the "same architecture" as the A100. It seems like the change to a different chip led to some features being limited, or maybe it's just NVIDIA preferring that some customers buy their Quadro products.
Would be interesting to see the impact of PCIe 4.0 vs 3.0 with these GPUs for scientific applications in the future (might be relevant when choosing a CPU platform).
Can't wait to see the results of the 3090. Great work and thanks for sharing it :)

Posted on 2020-09-19 03:56:57

They have to leave something for the professional market, there's no reason for them to waste transistors on fp64.
My guess is that pcie 4.0 will mainly help scaling for multi gpu setups or scenarios where you need to communicate with the GPU very often.

Posted on 2020-09-19 05:48:58
Donald Kinghorn

Yes! I was hoping for good double precision too. The Titan V is still the best bet for a reasonably priced card. I also wanted virtualization support (MIG) but ...nope. Still, as a replacement for the 2080Ti it's really good!

I'm curious about PCIe performance too. I'll for sure be testing that. Our production folks are telling me that the Threadripper boards are looking to have the best layout for multi-GPU ... so that may end up being the platform of choice ??? we'll test everything :-)

Posted on 2020-09-21 15:12:06
Sebastian Guerraty

Any update on whether they will support SR-IOV on the RTX 3090? Since they are marketing it as a Titan-level card, it makes sense that it might get more of a "prosumer" feature set.
I am totally ignorant about virtualization, but would this help GPU performance when using WSL 2 to train models on Linux while programming on Windows?

Posted on 2020-09-21 21:47:35
Donald Kinghorn

I'm not sure how this really works ... it looks like what NVIDIA is calling GPUDirect Storage will be enabled (not 100% sure). I don't understand the tie-in with SR-IOV for this ??

I also don't know how they are doing the GPU virtualization in WSL2 :-) ... I was wondering about that because it seems like it "shouldn't" work for GeForce, but it does! It's actual shared virtualization of the GPU. I have tested that and it's not too bad for compute. The latest update is only between 15-30% slower than Windows native. It's pretty cool really! It worked well at fp32, but fp16 performance was bad for some reason.

I have even done a JupyterHub install on WSL2 by rewriting the systemd unit file as an old-school rc script ... worked! I'm also going to try a python wrapper for systemd that can be used with docker containers.

It keeps getting better and better! :-)

Posted on 2020-09-22 01:02:08

Please include Titan V in your final benchmark -- very interesting to know how its performance changed due to software improvements, and how good it is compared to the new 30 series cards. Also interested in synthetic benchmarks that reveal "FP16 Tensor TFLOPS with FP32 Accumulate" differences between pro and consumer cards.

Posted on 2020-09-19 18:51:53
Donald Kinghorn

I think auto-mixed precision is going to work with these. I'm going to have to do some coding to check some of this stuff. I'm curious about TF32! Which could solve a lot of problems ???

I'll get 3090 results soon but will probably just be limited to comparisons like what I have in this post. I'll want to get a fully refreshed benchmark setup and a good comparison post up. Before I do that I'll need a better driver update, a CUDA update ... and some TF patches.

Right now everything is hit or miss because nothing is compiled for the new "compute capability": GA102 is sm_86, GA100 is sm_80. TensorFlow 2.x is not working (at least for me).

I expect the October release cycle of the devs at NV will have some things sorted out :-)

Posted on 2020-09-21 15:29:22
Donald Kinghorn

... I'll do that right now. (I've got a TitanV in a system at home :-)

Running the same NGC TF 1.15 container as in this post with batch_size 96 and 192
fp32 372 img/s (old 288)
fp16 1137 img/s (old 624)
same massive improvement at fp16!

This is part of why I think a CUDA and driver update will improve the 3080 and 3090 results significantly ... I think the current launch release of the driver is not optimal on sm_86.

Posted on 2020-09-21 16:11:06

BTW, I'm refreshing the hpc blog a lot, and the human detection / anti-bot machine learning "captcha" is unbearable. It forgets that I'm a human within 0.5-1 hour, and I have to find motorcycles again and again.

Posted on 2020-09-21 06:47:47
Donald Kinghorn


Posted on 2020-09-21 15:32:17
Stefan Fabian

These graphs are very misleading and have absolutely no informational value due to the different versions used.
The same card has a significant difference in performance in that graph (653 vs 1082).
Please either update or just remove the graph because comparing results for different versions holds no value and only serves to confuse readers.
The comparison between the RTX Titan and the RTX 3080 is interesting, though.
Would be a great follow up to compare it to the 2080 Ti with the same version and up to date drivers.

Posted on 2020-09-22 10:41:14
Donald Kinghorn

yes, you have to read the text for clarification on that. Copied from the post ...
Note that these results for the RTX Titan are much improved over past testing that I have done using earlier versions of the NGC TensorFlow container. This is especially true for the fp16 result, as you will see in the chart below.

NOTE: These are mixed results, using numbers from testing with an older NGC TensorFlow-1.13 container. There have been significant improvements in the new TensorFlow-1.15 build.

That is why I included the new numbers on the charts. This is important! Sorry if that was not clear to you. Also, see the comments below. I provided a new result for the Titan V, per a kind request.

Posted on 2020-09-22 15:36:00
Donald Kinghorn

... you do have a point! I think I may leave out old results for the RTX3090 testing. The comparison charts need a complete redo in light of the greatly increased performance for older GPUs with fp16 using the TensorFlow 1.15 NGC container.

Also, I'm pretty sure there will be better performance with the RTX30 once a CUDA update with support for compute level 8.6 is released ...

I'll see if I can get some 2080Ti results updated for the RTX3090 post. That's probably the most important comparison ... still have to keep in mind what I just mentioned above.

Thanks for your understanding :-) Expect a good comparison after we get a new CUDA and driver update

Posted on 2020-09-22 19:05:17


"Added support for NVIDIA Ampere GPU architecture based GA10x GPUs (compute capability 8.6), including the GeForce RTX-30 series."

Posted on 2020-09-23 19:28:13
Donald Kinghorn

Thank You! I have been checking on that every day ... except today, because I was busy finishing up my post on the RTX3090 .... ha ha! I have comments all over that post saying that we will need to see a CUDA update before we are confident with the results :-)

I will still need to wait for the NGC container updates, but that should happen soon. Once that happens I'll try to get more comprehensive testing done ... including multi-GPU.

I'll add a note in tomorrow's post on the RTX3090. Thanks again --Don

Posted on 2020-09-23 22:14:14

Was XLA compilation used? It's a critical question, because XLA speeds things up massively by removing unnecessary memory movements and if it wasn't used, 3080 is likely to get dethroned by Titan RTX big time due to NVidia's FP16 DL training tensor and cuda core castration of 30xx cards.

Posted on 2020-09-24 08:32:18
Donald Kinghorn

Thanks, interesting observation! Here's a snip from the output on a RTX3090 run (all test jobs were run the same way)
XLA was flagged to be used, but I don't know if it actually was or not! ... I didn't redirect stderr to my output files! arrrggg! I'm guessing it did not compile.

check out the RTX3090 post too

WARNING: Detected NVIDIA GeForce RTX 3090 GPU, which is not yet supported in this version of the container
ERROR: No supported GPU(s) detected to run this container

NOTE: MOFED driver for multi-node communication was not detected.
Multi-node communication performance may be reduced.

PY 3.6.9 (default, Jul 17 2020, 12:50:27)
[GCC 8.4.0]
TF 1.15.3
Script arguments:
--layers 50
--batch_size 384
--num_iter 90
--iter_unit epoch
--display_every 10
--precision fp16
--use_xla True
--predict False

+++ ... I went into the office and put a 3090 back in my test system ... looks like XLA did not compile.

You get this,

2020-09-24 12:58:35.403350: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x969a450 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-09-24 12:58:35.403408: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce RTX 3090, Compute Capability 8.6

then later,

2020-09-24 12:58:38.036758: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:70] Can't find ptxas binary in ${CUDA_DIR}/bin. Will back to the GPU driver for PTX -> sass compilation. This is OK so long as you don't see a warning below about an out-of-date driver version.
2020-09-24 12:58:38.036813: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:71] Searched for CUDA in the following directories:
2020-09-24 12:58:38.036831: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:74] /usr/local/cuda
2020-09-24 12:58:38.036841: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:74] /usr/local/cuda
2020-09-24 12:58:38.036850: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:74] .
2020-09-24 12:58:38.036858: W tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:76] You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions. For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.
2020-09-24 12:58:38.866740: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1648] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.

ptxas does not work (it is there and on the PATH despite what the messages suggest ... it works fine with RTX20), so the PTX is compiled by the driver, giving a slow start-up for the job.

We'll get all of this sorted out and working right, but it will take some time for everything to get proper support. CUDA 11.1 was just released yesterday and it has support for sm_86.

Posted on 2020-09-24 20:21:48

Thanks for taking the time! Some XLA benchmarks from others for reference https://bizon-tech.com/gpu-... - though do mind that those are numbers for different batch sizes. It's just that the relative difference in after-XLA perf is very noticeable.

Posted on 2020-09-25 06:32:45
Adupa Vasista

OMG, for NAMD the RTX 3080 is better than the Titan. Marvellous. Very excited to see the 3090 in action, and SLI. Also, it is sad that NVIDIA is revoking SLI support :. I have heard that the new NAMD 3 is 1.9x faster than the NAMD 1.94 version. Can you give this new NAMD 3 a try with the 3090? Any particular reason to go with an Intel Xeon rather than an AMD 3950 or a Threadripper? Thank you.

Posted on 2020-09-30 04:09:56
Donald Kinghorn

Yes, single precision performance is very good! I have tested the RTX3090 with the same setup and same tests. https://www.pugetsystems.co...

I did run tests with NAMD 3, but the results were close enough that I reported 2.13 since it is easier to compare with older results. I will need to test on a system with more CPU performance to balance the GPUs to see the performance difference with v3.

I will redo NAMD testing on systems with more capable CPUs to balance out the GPUs. I'll test with TR 3970X and 3990X for sure :-) (I like the TR 3970X a lot) ... looking forward to Zen 3 too! Intel is very good, but I would go dual-socket Xeon for best performance with NAMD. EPYC would also be good, and it looks like I will be testing that soon too :-)

The RTX 3090 does have NVLINK (higher-end Quadros should too). I will test dual RTX3090 once overall support is a bit better. ... I am enjoying the testing :-)

Posted on 2020-09-30 17:33:08
Adupa Vasista

Testing, testing, testing. Looks like you have a lot on your mind, sir. :)

Posted on 2020-09-30 18:24:21
Donald Kinghorn

Hi Adupa, thought I'd bump this thread a bit. Take a look at the comment above :-) Also, I just posted some results for the Threadripper Pro 3995WX. Really great CPU, and I got the best NAMD results I've ever had with it and a couple of GPUs (I used A6000's, but 3090's would be great).

Posted on 2021-03-12 17:20:49
Julio Carvalho

Whoa, just saw this! What a pleasant surprise :)

Hey Donald, I'm the main NAMD 3.0 developer, and to get good performance you need to slightly alter the configuration files used as inputs for NAMD (mainly by including the CUDASOAIntegrate on flag).

We do have a webpage on this with more details. We also did an NVIDIA blog post a few months ago.

If you ever redo benchmarks on NAMD 3.0, please include these changes in your configuration files. It will most certainly improve your performance numbers on modern GPU architectures.

We also have support for tightly-coupled GPUs (with NVLink) on our most recent alpha version. Don't know if this is of any interest, but we can scale across multiple GPUs on a single node now.

Send us an email to namd-l@ks.uiuc.edu if you face any problems. I'm happy to help you get the best numbers possible.

Posted on 2021-03-12 07:43:08
Donald Kinghorn

Julio! Thank you so much for reaching out! I am, of course, anxious to properly try out version 3. I really appreciate your advice and will dig into the resources you mention. I have a more comprehensive NAMD post planned and should be able to get started on it sometime in the next few weeks. We want to have a recommended system for NAMD that is the best configuration we can do for a single-node workstation. I'll be sure to reach out with questions :-)

Best wishes --Don

Posted on 2021-03-12 17:17:45
Игорь Поляков

Donald Kinghorn hey there!

I would strongly suggest you test with the new NAMD 3.0 alpha GPU-resident version. It makes a HUGE difference for all the new cards and doesn't depend on CPU power. The old NAMD scheme, with the CPU-bound integration calculations, was slow (because of the CPU or the PCIe speed) for GPUs past the Pascal chips.

The link explaining NAMD 3.0 is here https://www.ks.uiuc.edu/Res...

Posted on 2020-10-28 20:32:06
Donald Kinghorn

I did test with the 3.0 alpha, but the results I got were essentially the same as 2.13, so I went with that since I had older data as a reference.
However, that was when the driver was not really working right! I will definitely retest. All of the early issues I had with the RTX30's are starting to resolve. I'm probably going to wait a month or so before I do more comprehensive testing. I want to do a proper MD platform comparison (might add GROMACS too).

Thanks for the encouragement. You made me more curious about v3. I had expected better results in that first testing, so I'm curious to try again.

Posted on 2020-10-30 23:00:01

What about the new RX 6800 XT ?

Posted on 2020-11-18 16:42:14
Donald Kinghorn

Who knows?? ( I think a few YouTube reviewers may have some??)

Myself, I'm looking forward to getting access to a MI100!

Posted on 2020-11-18 18:51:09
Angga Febrian Sahid

Thank you for the article.
Could you please show me how to use the RTX3080 with TensorFlow 1.15,
since a lot of people are having trouble doing it.

Posted on 2020-11-23 03:04:21
Donald Kinghorn

I had some trouble with it too when I did this post. The testing I did with the RTX3090 had a slightly newer driver and NGC TF container, and it was better. The most recent testing I did, when I looked at quad RTX3090's, went well. There was another update to the containers and driver.

I haven't tried starting up TF using conda with the new GPUs. The Anaconda build of TF 1.15 may not be linked with the new CUDA libraries yet. ... I just looked on Anaconda Cloud: TF 1.15 there was built over a year ago!

The best thing to try will be NVIDIA's build. It's available on GitHub https://github.com/NVIDIA/t...

I really like to use their docker containers, but you could try a "pip" install. For that kind of install you will need the latest CUDA and cuDNN set up on your system.

I hope this helps you! I can see that it would be good to do a new post about this. Best wishes --Don

Posted on 2020-11-24 01:37:28
Angga Febrian Sahid

Thank you so much for the response,
Looking forward for your new article about installing RTX 30 series for tensorflow 1.15

Best Wishes --Angga

Posted on 2020-11-30 03:15:14
Donald Kinghorn

Should have the post up on Monday :-) "How To Install TensorFlow 1.15 for NVIDIA RTX30 GPUs (without docker or CUDA install)"

Posted on 2020-12-05 01:30:16
Ajay Singh Panwar

This test is very close to what I was looking for, thank you! I am planning to build a small(ish) HPC with 4 compute nodes with single RTX 3080 cards on each node, where we will be running mostly molecular dynamics codes (LAMMPS, NAMD). We have a choice between 1x AMD EPYC 7402P and AMD TR 3970x, based on our current budget. Which configuration would you recommend? Your blog posts are very informative and exhaustive, many thanks!

Posted on 2020-12-23 13:35:09
Donald Kinghorn

You are welcome :-)
This is a case where more cores will be better. The RTX3080 is really fast and will need many CPU cores to balance performance with the MD apps. I would go with the 3970x. I really like that CPU. The EPYC would have better memory throughput, but I really don't think that will be a bottleneck. The extra cores on the 3970x should give better performance. ... I haven't tested LAMMPS recently, so I'm mostly thinking about NAMD, but I expect the GPU-CPU balance will be similar for it, i.e. mostly non-bonded forces going to the GPU.

Posted on 2021-01-04 22:15:00