

Read this article at https://www.pugetsystems.com/guides/1551
Dr Donald Kinghorn (Scientific Computing Advisor)

2 x RTX2070 Super with NVLINK TensorFlow Performance Comparison

Written on August 14, 2019 by Dr Donald Kinghorn


This is a short post showing a performance comparison of the RTX 2070 Super against several GPU configurations from recent testing. The comparison uses TensorFlow running ResNet-50 and Big-LSTM benchmarks.

I was at Puget Systems Labs when our inventory manager came in with an RTX 2070 Super and said "this thing has an NVLINK connector, you might be interested in that" ... Nice! I was indeed interested. I immediately thought that two of these could make a great, relatively inexpensive setup for ML/AI work. That thought was echoed back to me by someone commenting on one of my blog posts who was thinking the same thing. I replied that I was eager to try it. I finally got some time in with two of these nice GPUs and an NVLINK bridge.

Here's the testbed:

RTX 2070 Super with NVLINK

System Configuration


  • Intel Core i9 9960X (16-core)
  • Gigabyte X299 Designare EX
  • 8x DDR4-2666 16GB (128GB total)
  • Intel 2TB 660p NVMe M.2
  • 2 x NVIDIA RTX 2070 Super + NVLINK Bridge


NOTE: I have updated my Docker-ce setup to the latest version and am now using the native "nvidia-container-runtime". I will be writing a post with detailed instructions for this new configuration soon. My older posts have used NVIDIA-Docker2 which is now deprecated.

I used the NGC TensorFlow container tagged 19.02-py3 for consistency with other results in the charts below.


The older results in these charts are from the more extensive testing in the post TensorFlow Performance with 1-4 GPUs -- RTX Titan, 2080Ti, 2080, 2070, GTX 1660Ti, 1070, 1080Ti, and Titan V. Please see that link for detailed information about the job run command-line arguments, etc.

[ResNet-50 fp32] TensorFlow, Training performance (Images/second) comparison using 2 NVIDIA RTX 2070-Super GPU's

2070 Super ResNet-50

These results show the RTX 2070 Super performing as well as the 2080. The multi-GPU methodology uses "Horovod", i.e. MPI for data-parallel scaling, so there is little effect from using the NVLINK bridge. An interesting thing to note is that for this job the performance with 2 x 2070 Supers is slightly better than a single RTX Titan.
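To make the data-parallel point concrete, here is a toy sketch (plain NumPy, not the actual Horovod API) of what data-parallel scaling does: each GPU computes gradients on its own slice of the batch, and the only cross-GPU traffic is the allreduce that averages those gradients. That averaging step is the only place NVLINK can help, which is why its impact is small for this workload.

```python
# Toy sketch of data-parallel gradient averaging (the allreduce step that
# Horovod/MPI performs between GPUs each training iteration).
import numpy as np

def allreduce_mean(grads_per_worker):
    """Average each parameter's gradient across all workers."""
    return [np.mean(np.stack(g), axis=0) for g in zip(*grads_per_worker)]

# Two "GPUs", each holding gradients for two parameter tensors
worker0 = [np.array([1.0, 2.0]), np.array([[1.0]])]
worker1 = [np.array([3.0, 4.0]), np.array([[3.0]])]

avg = allreduce_mean([worker0, worker1])
print(avg[0])  # [2. 3.]
```

Only these averaged gradients cross the GPU-GPU link; the forward and backward passes stay local, so the link is idle most of the time for a model like ResNet-50.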

[ResNet-50 fp16 Tensor-cores] TensorFlow, Training performance (Images/second) comparison using 2 NVIDIA RTX 2070-Super GPU's

2070 Super ResNet-50 fp16

The 2070-Super did very well at fp16.

TensorFlow LSTM: Big-LSTM 1 Billion Word Dataset

2070 Super Big LSTM

With this recurrent neural network there is more GPU-GPU communication so NVLINK has more impact. The 2 x 2070-Super + NVLINK configuration did a little better than a single RTX Titan.


It looks like the NVIDIA RTX 2070 Super is indeed a great GPU! At $500 it seems like a bargain and is worth considering for an ML/AI setup. Its performance was near that of the RTX 2080, and in a dual configuration it performed as well as an RTX Titan, at least for the limited testing in this post.

The RTX 2070 Super cards tested were the NVIDIA editions and had side fans, which I really don't care for (they blow hot air into the case instead of out the back and are easily obstructed). I expect there will be other vendor releases of these cards that use blower-style fans, which are a much better cooling solution for GPUs that will be under heavy load.

My favorite workstation GPU for ML/AI work right now is the RTX 2080 Ti. The 11GB of memory on the 2080 Ti is a big plus over the 8GB on the RTX 2070 Super. If you have input data with a large number of features you may hit the dreaded OOM (Out of Memory) error on an 8GB GPU. However, having two 2070 Supers for less than the cost of a single 2080 Ti is very tempting and definitely worth considering if 8GB will suffice for your work!
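A quick back-of-envelope calculation shows how easily large inputs eat GPU memory. The shapes below are illustrative assumptions, not from the benchmarks in this post, and this counts only the raw input tensors (activations and the model itself add considerably more):

```python
# Back-of-envelope: memory for a batch of raw input tensors at fp32.
def batch_bytes(batch, height, width, channels, bytes_per_elem=4):
    return batch * height * width * channels * bytes_per_elem

GIB = 1024 ** 3

# 64 images at 224x224x3 (typical ResNet-50 input) -- trivially small
print(batch_bytes(64, 224, 224, 3) / GIB)    # ~0.036 GiB
# 64 volumes of 512x512x512 3D data -- 32 GiB before the model even runs
print(batch_bytes(64, 512, 512, 512) / GIB)  # 32.0 GiB
```

With high-dimensional inputs like that, even an 11GB card needs small batches, and an 8GB card may OOM outright.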

Happy computing! --dbk @dbkinghorn

Appendix: Peer-to-Peer Bandwidth and Latency results for 2 RTX 2070 Super GPUs

The RTX 2070 Super has a single NVLINK sub-link (like the RTX 2080). The RTX 2080 Ti and RTX Titan (as well as the Quadro RTX cards) have dual sub-links. This means the 2070 Super has half the NVLINK bandwidth of the 2080 Ti and RTX Titan.

kinghorn@utest:~/projects/samples-10.0/bin/x86_64/linux/release$ ./p2pBandwidthLatencyTest 
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, Graphics Device, pciBusID: a1, pciDeviceID: 0, pciDomainID:0
Device: 1, Graphics Device, pciBusID: c1, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     DD     0     1
     0	     1     1
     1	     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   DD     0      1 
     0 386.90   5.82 
     1   5.83 387.81 
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   DD     0      1 
     0 387.93  24.23 
     1  24.25 371.77 
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   DD     0      1 
     0 388.80  11.63 
     1  11.69 387.52 
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   DD     0      1 
     0 388.58  48.38 
     1  48.38 377.05 
P2P=Disabled Latency Matrix (us)
   GPU     0      1 
     0   1.49  11.40 
     1  11.27   1.56 

   CPU     0      1 
     0   2.96   7.67 
     1   7.17   2.87 
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1 
     0   1.49   0.89 
     1   0.90   1.56 

   CPU     0      1 
     0   2.95   2.00 
     1   2.05   2.88 
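As a quick sanity check on the numbers above, this snippet pulls the unidirectional and bidirectional bandwidths out of the test output and computes the ratios:

```python
# Ratios from the p2pBandwidthLatencyTest output above (GB/s)
pcie_uni, nvlink_uni = 5.82, 24.23   # P2P disabled vs enabled, unidirectional
pcie_bi,  nvlink_bi  = 11.63, 48.38  # P2P disabled vs enabled, bidirectional

print(round(nvlink_uni / pcie_uni, 1))   # ~4.2x over the PCIe memcopy fallback
print(round(nvlink_bi / nvlink_uni, 1))  # bidirectional roughly doubles it
```

The ~24 GB/s unidirectional figure is consistent with a single NVLINK sub-link; cards with dual sub-links (2080 Ti, RTX Titan) measure roughly twice that.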


Tags: GPU, Machine Learning, NVIDIA, TensorFlow, RTX2070 Super, NVLINK
Misha Engel

Looking at price/performance/usability, the $699 11GB GTX 1080 Ti is still the card to beat.

Posted on 2019-08-15 11:55:03
Scoodood C

I saw some deals on the 1080 Ti for $500 each.
Just wondering, should we pick the 11GB of memory on the 1080 Ti over the FP16 Tensor Cores on the RTX 20-series GPUs? What are your thoughts?

Posted on 2019-08-21 19:29:50
Donald Kinghorn

Sorry I didn't see these comments earlier. The 1080Ti is a great card! It was/is a serious and affordable "work-horse" for ML/AI. NVIDIA stopped production of the 10xx cards (and my beloved Titan V ...expensive but wonderful). If you can get them for a bargain it's worth considering.

I wasn't a big fan of Tensor Cores (fp16) at first, but usage has gotten better and "automatic mixed precision" looks very good. Even Intel is getting in on this with BFloat https://software.intel.com/... ... BFloat looks like a better idea ...

Another thing to consider with fp16 is that it is "kind of like" having twice as much memory too. That can be really important...

So, these days I would lean toward the GPU with Tensor Cores. ... All things considered, the RTX 2070 Super looks like a good deal to me; I may buy one for myself. (I have a 1080Ti, a 1070, and 2 borrowed Titan V's in use)

Posted on 2019-09-06 16:11:56
Donald Kinghorn

see my reply below ...

Posted on 2019-09-06 16:12:05
Soldier OfHell

Hi Donald,
You are definitely right in writing "MPI for data-parallel scaling so there is little effect from using the NVLINK bridge", but can we do anything toward true model parallelism with NVLink? I mean having only one copy of the model (in other words, using 2 GPUs as one device with "merged" resources)? Data parallelism is simple to use but very inefficient, especially for large models where we would like to use the memory for a bigger batch size and not for storing another copy of the model.


Posted on 2019-09-20 13:06:22
Donald Kinghorn

Really good question!

I think model parallelism could be tricky. By its nature, there is a lot of dependency in a neural network, and there are lots of different model types. However, things like convolutions, pooling, and such could be spread out over multiple devices. There are also algorithms for distributed matrix multiplication ... but it could be difficult to generalize into a framework that is simple to work with. And, even though NVLINK has amazing performance, communication is almost always a bottleneck. I'm sure there has been work done along these lines but I just haven't looked into it. I honestly don't know what the "state of the art" is. I have not written enough DNN code from scratch to have keen insight (yet). ... though, in general, data parallelism is relatively easy; algorithm parallelism is hard. (The tensor-based frameworks are really amazing though!)

There is room for improvement in data parallelism too! Spreading batches out over GPUs is one thing, but dealing with input data in high-dimensional feature spaces is harder. Medical researchers are running into problems with high-res images and 3D tomography data etc. That data often just doesn't fit in GPU memory even with a batch size of 1. Then you have input feature partitioning and stitching all of that together ...

But, in any case, if I were building a multi-GPU system I would "want" NVLINK, especially on the RTX cards (the GeForce cards do not have P2P enabled over PCIe) ... the hardware nerd in me drools over the IBM Power + NVIDIA configuration with (many-lane) NVLINK from GPU-to-GPU AND GPU-to-CPU ... I expect PCIe 4 or 5 will come to our rescue in the x86 world before too long.

... and then we have the specialized "AI chips" coming out!! NVIDIA has seen the light too! (so has Intel) I expect to see multi-chiplet GPUs with a larger shared memory space that could make things easier for everyone. NVLINK is a stop-gap ...

This is a VERY interesting domain to be working in for both Software and Hardware folks. I hope you will be able to make great contributions!

Posted on 2019-09-20 16:26:40