Read this article at https://www.pugetsystems.com/guides/1551
Dr Donald Kinghorn (Scientific Computing Advisor)

2 x RTX2070 Super with NVLINK TensorFlow Performance Comparison

Written on August 14, 2019 by Dr Donald Kinghorn


This is a short post showing a performance comparison with the RTX2070 Super and several GPU configurations from recent testing. The comparison is with TensorFlow running a ResNet-50 and Big-LSTM benchmark.

I was at Puget Systems Labs when our inventory manager came in with an RTX2070 Super and said "this thing has an NVLINK connector, you might be interested in that" ... Nice! I was indeed interested. I immediately thought that two of these could be a great, relatively inexpensive setup for ML/AI work. That thought was echoed back to me by someone commenting on one of my blog posts who was thinking the same thing. I replied that I was eager to try it. I finally got some time in with two of these nice GPU's and an NVLINK bridge.

Here's the testbed:

RTX 2070 Super with NVLINK

System Configuration


  • Intel Core i9 9960X (16-core)
  • Gigabyte X299 Designare EX
  • 8x DDR4-2666 16GB (128GB total)
  • Intel 2TB 660p NVMe M.2
  • 2 x NVIDIA RTX 2070 Super + NVLINK Bridge


NOTE: I have updated my Docker-ce setup to the latest version and am now using the native "nvidia-container-runtime". I will be writing a post with detailed instructions for this new configuration soon. My older posts have used NVIDIA-Docker2 which is now deprecated.

I used the NGC TensorFlow container tagged 19.02-py3 for consistency with other results in the charts below.


The older results in these charts are from the more extensive testing in the post TensorFlow Performance with 1-4 GPUs -- RTX Titan, 2080Ti, 2080, 2070, GTX 1660Ti, 1070, 1080Ti, and Titan V. Please see that post for detailed information about the job run command-line arguments etc.

[ResNet-50 fp32] TensorFlow, Training performance (Images/second) comparison using 2 NVIDIA RTX 2070-Super GPU's

2070 Super ResNet-50

These results show the RTX2070-Super performing as well as the 2080's. The multi-GPU runs use "Horovod", i.e. MPI-based data-parallel scaling, so there is little effect from using the NVLINK bridge. An interesting thing to note is that for this job the performance with 2 x 2070-Super's is slightly better than a single RTX Titan.
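
To make the data-parallel idea concrete, here is a minimal pure-Python sketch. It is not Horovod and not the benchmark code behind these charts; the function names and the toy one-parameter model are assumptions made up for illustration. Each simulated "GPU" computes a gradient on its own shard of the batch, and an allreduce-style average combines them before the weight update.

```python
# Toy illustration of data-parallel training. Every worker holds a full copy
# of the model, computes a gradient on its own shard of the batch, and the
# gradients are averaged (the job an allreduce does) before the update.
# All names here are hypothetical; this is not the Horovod API.

def local_gradient(w, shard):
    """Gradient of mean squared error for the model y = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(values):
    """Stand-in for an MPI/NCCL allreduce that averages across workers."""
    return sum(values) / len(values)

def data_parallel_step(w, batch, n_workers, lr):
    # Split the batch into equal, contiguous shards, one per worker.
    shard_size = len(batch) // n_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(n_workers)]
    # On real hardware these gradients would be computed concurrently.
    grads = [local_gradient(w, shard) for shard in shards]
    # Every worker applies the same averaged gradient, so the copies stay in sync.
    return w - lr * allreduce_mean(grads)

batch = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]  # samples of y = 3x
w_two_workers = data_parallel_step(0.0, batch, n_workers=2, lr=0.01)
w_one_worker = data_parallel_step(0.0, batch, n_workers=1, lr=0.01)
print(w_two_workers, w_one_worker)
```

With equal-sized shards, averaging the per-worker gradients reproduces the full-batch gradient, so both calls take the same step. The only traffic between workers is one gradient exchange per step, which is a plausible reason the NVLINK bridge makes little difference for this job.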

[ResNet-50 fp16 Tensor-cores] TensorFlow, Training performance (Images/second) comparison using 2 NVIDIA RTX 2070-Super GPU's

2070 Super ResNet-50 fp16

The 2070-Super did very well at fp16.

TensorFlow LSTM: Big-LSTM 1 Billion Word Dataset

2070 Super Big LSTM

With this recurrent neural network there is more GPU-GPU communication so NVLINK has more impact. The 2 x 2070-Super + NVLINK configuration did a little better than a single RTX Titan.


It looks like the NVIDIA RTX2070 Super is indeed a great GPU! At $500 it seems like a bargain and is worth considering for an ML/AI setup. The performance was near that of the RTX 2080, and in a dual configuration it performed as well as the RTX Titan, at least for the limited testing in this post.

The RTX2070 Super cards tested were the NVIDIA editions with side fans, which I really don't care for (they blow hot air into the case instead of out the back and are easily obstructed). I expect that there will be other vendor releases of these cards that use blower-style fans, which are a much better cooling solution for GPU's that will be under heavy load.

My favorite workstation GPU for ML/AI work right now is the RTX2080Ti. The 11GB of memory on the 2080Ti is a big plus over the 8GB on the RTX2070 Super. If you have input data with a large number of features you may hit the dreaded OOM (Out of Memory) with an 8GB GPU. However, having 2 of the 2070-Super's for less cost than a single 2080Ti is very tempting and definitely worth considering if the 8GB on the 2070 Super will suffice for your work!

Happy computing! --dbk @dbkinghorn

Appendix: Peer to Peer Bandwidth and Latency results for 2 RTX 2070 Super GPU's

The RTX 2070 Super has a single sub-link (like the RTX 2080). The RTX 2080Ti and RTX Titan (as well as the RTX Quadro's) have dual sub-links. This means that the 2070 Super has half the NVLINK bandwidth of the 2080Ti and RTX Titan.

kinghorn@utest:~/projects/samples-10.0/bin/x86_64/linux/release$ ./p2pBandwidthLatencyTest 
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, Graphics Device, pciBusID: a1, pciDeviceID: 0, pciDomainID:0
Device: 1, Graphics Device, pciBusID: c1, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     DD     0     1
     0	     1     1
     1	     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   DD     0      1 
     0 386.90   5.82 
     1   5.83 387.81 
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   DD     0      1 
     0 387.93  24.23 
     1  24.25 371.77 
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   DD     0      1 
     0 388.80  11.63 
     1  11.69 387.52 
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   DD     0      1 
     0 388.58  48.38 
     1  48.38 377.05 
P2P=Disabled Latency Matrix (us)
   GPU     0      1 
     0   1.49  11.40 
     1  11.27   1.56 

   CPU     0      1 
     0   2.96   7.67 
     1   7.17   2.87 
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1 
     0   1.49   0.89 
     1   0.90   1.56 

   CPU     0      1 
     0   2.95   2.00 
     1   2.05   2.88 
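
Some quick arithmetic on the GPU 0 to GPU 1 entries above shows how much the bridge helps. This is only a sketch: the numbers are copied from this one run, and the variable names are made up for illustration.

```python
# GPU0 -> GPU1 figures copied from the p2pBandwidthLatencyTest output above.
uni_disabled_gbs, uni_enabled_gbs = 5.82, 24.23   # unidirectional bandwidth, GB/s
bi_disabled_gbs, bi_enabled_gbs = 11.63, 48.38    # bidirectional bandwidth, GB/s
lat_disabled_us, lat_enabled_us = 11.40, 0.89     # GPU-timed latency, microseconds

uni_speedup = uni_enabled_gbs / uni_disabled_gbs
bi_speedup = bi_enabled_gbs / bi_disabled_gbs
latency_ratio = lat_disabled_us / lat_enabled_us

print(f"unidirectional bandwidth: {uni_speedup:.1f}x")  # 4.2x
print(f"bidirectional bandwidth:  {bi_speedup:.1f}x")   # 4.2x
print(f"latency reduction:        {latency_ratio:.1f}x")  # 12.8x
```

So with P2P enabled, GPU-to-GPU bandwidth is roughly four times higher and latency roughly thirteen times lower than the fallback memcopy path in this test.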


Tags: GPU, Machine Learning, NVIDIA, TensorFlow, RTX2070 Super, NVLINK
Misha Engel

Looking at the price/performance/usability the $699 11 GByte GTX1080ti is still the card to beat.

Posted on 2019-08-15 11:55:03
Scoodood C

I saw some deals on the 1080Ti for $500 each.
Just wondering, should we pick the 11GB of memory on the 1080Ti over the FP16 Tensor Cores on the RTX 20-series GPUs? What are your thoughts?

Posted on 2019-08-21 19:29:50
Donald Kinghorn

Sorry I didn't see these comments earlier. The 1080Ti is a great card! It was/is a serious and affordable "workhorse" for ML/AI. NVIDIA stopped production of the 10xx cards (and my beloved Titan V ... expensive but wonderful). If you can get them for a bargain it's worth considering.

I wasn't a big fan of TensorCores (fp16) at first, but usage has gotten better and "automatic mixed precision" looks very good. Even Intel is getting in on this with BFloat https://software.intel.com/... ... BFloat looks like a better idea ...

Another thing to consider with fp16 is that it is "kind of like" having twice as much memory too. That can be really important...
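
A rough sketch of where the "twice as much memory" comes from. The tensor shape is a made-up example, and the helper name is hypothetical; the byte sizes come straight from Python's struct module.

```python
import struct

# Size in bytes of one IEEE 754 value, taken from the struct module:
# 'f' is single precision (fp32), 'e' is half precision (fp16).
BYTES_PER_VALUE = {"fp32": struct.calcsize("f"), "fp16": struct.calcsize("e")}

def tensor_bytes(shape, dtype):
    """Raw storage needed for a dense tensor of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * BYTES_PER_VALUE[dtype]

# Hypothetical activation tensor: batch of 64, 256 channels, 56 x 56 feature map.
shape = (64, 256, 56, 56)
fp32_mib = tensor_bytes(shape, "fp32") / 2**20
fp16_mib = tensor_bytes(shape, "fp16") / 2**20
print(f"fp32: {fp32_mib:.0f} MiB, fp16: {fp16_mib:.0f} MiB")  # fp32: 196 MiB, fp16: 98 MiB
```

The same tensor takes half the space in fp16. In practice mixed-precision training usually keeps some fp32 copies around (the weights, for example), so the overall saving tends to be somewhat less than a full 2x.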

So, these days I would lean toward the GPU with TensorCores. ... All things considered the RTX 2070 Super looks like a good deal to me, I may buy one for myself. (I have a 1080Ti, 1070 and 2 borrowed Titan V's in use)

Posted on 2019-09-06 16:11:56
Ernst Stavro Blofeld

Fellas, the Pascal cards are perfectly capable of doing FP16. You get a much more modest speedup WRT Turing (some 10% best case scenario), but you still get to double (or almost) the available memory.

Posted on 2020-02-27 15:15:03
Donald Kinghorn

I never even thought about fp16 on the Pascal cards ... didn't consider it on "TensorCores" at first either :-) Mixed precision seems to be getting implemented OK these days ... and of course you are right about the memory, that can be important.

Posted on 2020-02-28 01:16:49

A bit late to the party and just learning a bit of AI - Spark/TensorFlow as a side hobby with a view to monetizing the knowledge at some point:
Pascal brought with it native support for FP16 for both storage and compute. On the storage side, Pascal supports FP16 datatypes, which relative to the previous use of FP32 means that FP16 values take up less space at every level of the memory hierarchy (registers, cache, and DRAM). On the compute side, Pascal introduces a new type of FP32 CUDA core that supports a form of FP16 execution where two FP16 operations are run through the CUDA core at once (vec2). This core, which for clarity I’m going to call an FP16x2 core, allows the GPU to process 1 FP32 or 2 FP16 operations per clock cycle, essentially doubling FP16 performance relative to an identically configured Maxwell or Kepler GPU. From the AnandTech article https://www.anandtech.com/s...

Posted on 2020-09-09 17:15:16
Donald Kinghorn

see my reply below ...

Posted on 2019-09-06 16:12:05
Soldier OfHell

Hi Donald,
You are definitely right writing "MPI for data-parallel scaling so there is little effect from using the NVLINK bridge", but can we do anything towards true model-parallelism with NVLink? I mean to have only one copy of the model (in other words to use 2 GPUs as one device with "merged" resources)? Data parallelism is simple to use, but very inefficient, especially for large models where we would like to use the memory for bigger batch size and not for storing another copy of the model.


Posted on 2019-09-20 13:06:22
Donald Kinghorn

Really good question!

I think model parallelism could be tricky. By its nature, there is lots of dependency in a neural network, and there are lots of different model types. However, things like convolutions, pooling and such could be spread out over multiple devices. There are also algorithms for distributed matrix multiplication ... but it could be difficult to generalize into a framework that was simple to work with. And, even though NVLINK has amazing performance, communication is almost always a bottleneck. I'm sure there has been work done along these lines but I just haven't looked into it. I honestly don't know what the "state of the art" is. I have not written enough DNN code from scratch to have keen insight (yet). ... though, in general, data parallelism is relatively easy, algorithm parallelism is hard. (the tensor based frameworks are really amazing though!)

There is room for improvement in data parallelism too! Spreading batches out over GPU's is one thing but dealing with input data in high dimensional feature spaces is harder. Medical researchers are running into problems with high res images and 3D tomography data etc. That data often just doesn't fit in GPU memory even with a batch size of 1. Then you have input feature partitioning and stitching all of that together ...

But, in any case, if I was doing a multi-GPU system, I would "want" NVLINK, especially on the RTX cards (The GeForce cards do not have P2P enabled over PCIe) ..., the hardware nerd in me drools over the IBM Power + NVIDIA configuration with (many lane) NVLINK from GPU-to-GPU AND GPU-to-CPU ... I expect PCIe 4 or 5 will come to our rescue in the x86 world before too long.

... and then we have the specialized "AI chips" coming out!! NVIDIA has seen the light too! (so has Intel) I expect to see multi-chiplet GPU's with larger shared memory space that could make things easier for everyone. NVLINK is a stop-gap ...

This is a VERY interesting domain to be working in for both Software and Hardware folks. I hope you will be able to make great contributions!

Posted on 2019-09-20 16:26:40
Eric C. Bohn

The 2060 Super seems to be an even better value here - basically 2070 level of performance and 20% cheaper than the 2070 Super, but no NVLINK.

Posted on 2020-01-09 15:48:00

If you don't have any interest in NVLink, the 2060 SUPER is indeed a fantastic card. I was pleasantly surprised by the increased VRAM it was given, putting it on par with all the 2070 and 2080 cards (except for the Ti).

Posted on 2020-01-09 17:46:04
Donald Kinghorn

Agreed, the "supers" are a nice performance increase. The 2060 Super should be near the same as the 2070 (and it has 8GB mem!). There are rumors that there will be price reductions soon too. I don't have any real info about that, but I would expect it, since there "may" be new card announcements at GTC in March. In any case they have to keep competitive with AMD in the gaming market ... the gaming market has been wonderful for folks doing GPU compute on a budget :-)

Posted on 2020-01-09 17:59:42
Vedran Klemen

Hello, I have one 2070S and am planning to buy one more. Can I connect them with a dual-slot Quadro NVLINK bridge? Thanks.

Posted on 2020-02-27 18:59:50
Boyuan Ning

Hi Donald,

Great stuff as always.

I have a question about how to adjust the performance of NVIDIA GPUs and hope you can help. When I am using the Titan V to do some computations, nvidia-smi shows the performance level of my GPUs is always limited to the P2 state, even though the "Volatile GPU-Util" column is almost 100% for the GPUs. I wonder how to push the performance level to the P0 state? Thanks.

Again, it's so great to find a tech blog here. Salute to you!

Posted on 2020-04-28 11:24:34
Donald Kinghorn

Thank you! Interesting! I usually don't pay much attention to the p state, just the GPU Utilization percentage ... My guess is that the reported p state is not accurate because it is so dynamic, i.e. the timers may not have enough resolution, or the card may still be at a lower power state even though the computational cores are at 100% load (there are other parts of the card that would not be in use and drawing power).

I think it is possible to force a p state like p0 which essentially shuts off dynamic power management ??? Personally when I am seeing 100% on Utilization I'm pretty happy. A lot of problems have so much I/O that you never get a full load ... that is, the GPU is finishing calculations so fast that it is idle waiting on I/O and showing less than 100% load ...

I'm sure there are overclockers that have tricks for forcing the p state and you might even be able to do it directly with nvidia-smi (?) but for myself I wouldn't mess with it as long as I was happy with performance. I don't know that forcing p0 would have any positive effect on "real" performance of a job or not??? [also, I wouldn't want to risk burning out a reg or mem chip on that beautiful TitanV either :-) ] ... Good question!
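
For simply watching the p state next to utilization, nvidia-smi can print both as CSV. Below is a small sketch that wraps that query and parses the result; the helper names are hypothetical, and the sample line is made up in the shape nvidia-smi prints so the parsing can be tried without a GPU. It only reports the state, it does not try to force one.

```python
import csv
import io
import subprocess

# Fields understood by `nvidia-smi --query-gpu` (see `nvidia-smi --help-query-gpu`).
QUERY_FIELDS = ["index", "name", "pstate", "utilization.gpu"]

def query_gpus():
    """Run nvidia-smi and return its CSV text (needs an NVIDIA driver installed)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

def parse_gpus(csv_text):
    """Turn the CSV text into one dict per GPU, keyed by the query field names."""
    rows = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    return [dict(zip(QUERY_FIELDS, row)) for row in rows if row]

# Made-up sample line, not captured from a real system.
sample = "0, TITAN V, P2, 100 %\n"
gpus = parse_gpus(sample)
print(gpus[0]["pstate"], gpus[0]["utilization.gpu"])  # P2 100 %
```

On a machine with a GPU, parse_gpus(query_gpus()) would give the live values; calling it in a loop while a job runs shows whether the P2 reading ever changes.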

Posted on 2020-04-28 15:13:50
Boyuan Ning

Thanks a lot for the reply Donald! I am just curious about what the actual performance in the P0 state would be like. Anyway, we all agree that the Titan V is quite a "beautiful" card, especially considering its capability for double-precision numerical computations and its price compared to the V100.

By the way, hope you don't mind, I have another question about ECC memory on GPUs. I heard some rumors that the non-ECC memory of the GeForce or Titan cards could affect the results of numerical computations. Have you ever come across that problem? Thanks!

Posted on 2020-04-29 03:07:03
Donald Kinghorn

Two things:
1) The most common thing that I know of regarding the ECC memory on the Tesla's is that nearly everyone turns that off with nvidia-smi because it significantly slows performance. Large cluster admins should probably be leaving it on though! I'll add another thing here about ECC memory: modules with ECC are generally the higher quality parts! That's considered "server" memory by manufacturers so the tolerances and quality control are generally better. [on the CPU side, Reg ECC memory is one of the most reliable components we have seen from our failure tracking over the years]

2) The most common failure I know of with GPU compute is memory related. In the past the memory chips would be the first to degrade and generate errors. (They are usually not cooled all that well.) You would see that in corrupted results as often as with complete card failures. This was most common when using overclocked GeForce cards. The rule of thumb was if you had a suspect card then toss it and get a new one. Don't bother troubleshooting or trying to fix things ... they are cheap by comparison...

Another thing is repeatability in calculations. It is typically worse when using GPU's. It's not necessarily the GPU's fault! Programs may load memory a bit more randomly, which can lead to common associativity errors (computer math is not the same as "real" math). There is also an increased risk of precision loss from accumulation and round-off ... all of the normal numerical analysis problems come up magnified on GPU's. That's at least part of why you see complaints ... it's never the programmer's fault, right? :-)
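
A tiny CPU-only illustration of the associativity point. The values are made up to exaggerate the effect; on a GPU the order of additions in a parallel reduction can differ between runs or hardware, which is how the same effect can show up as results that don't quite repeat.

```python
import math

# Floating-point addition is not associative: the grouping changes the rounding.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left, right)  # 0.6000000000000001 0.6

def naive_sum(values):
    """Plain left-to-right accumulation, with no error compensation."""
    total = 0.0
    for v in values:
        total += v
    return total

# The same four numbers summed in two different orders.
values = [1e16, 1.0, -1e16, 1.0]
in_order = naive_sum(values)           # the first 1.0 is lost next to 1e16
reordered = naive_sum(sorted(values))  # both 1.0 values are lost
exact = math.fsum(values)              # correctly rounded sum
print(in_order, reordered, exact)  # 1.0 0.0 2.0
```

Three different answers from the same inputs, and none of it is a hardware fault.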

Those comments are mostly based on my *past* experience. NVIDIA GPU's, including GeForce, since the 900 series (Maxwell), 1000 series (Pascal), and now 2000 series have been excellent! In the "old" days of using 400, 500 and 600 series cards I would tell people putting 4 cards in a system to expect to be replacing at least one in the first year. Things are much better now. Even overclocked cards seem to do OK as long as they are not pushed too hard.

Posted on 2020-04-29 14:48:51
Boyuan Ning

Really appreciate the reply. Thank you Donald!

Posted on 2020-04-29 18:02:26