2 x RTX2070 Super with NVLINK TensorFlow Performance Comparison
Written on August 14, 2019 by Dr Donald Kinghorn

Introduction
This is a short post showing a performance comparison of the RTX 2070 Super against several GPU configurations from recent testing. The comparison uses TensorFlow running ResNet-50 and Big-LSTM benchmarks.
I was at Puget Systems Labs when our inventory manager came in with an RTX 2070 Super and said "this thing has an NVLINK connector, you might be interested in that" ... Nice! I was indeed interested. I immediately thought that two of these could make a great, relatively inexpensive setup for ML/AI work. That thought was echoed back to me by someone commenting on one of my blog posts who was thinking the same thing. I replied that I was eager to try it. I finally got some time in with two of these nice GPUs and an NVLINK bridge.
Here's the testbed:
System Configuration
Hardware:
- Intel Core i9 9960X (16-core)
- Gigabyte X299 Designare EX
- 8x DDR4-2666 16GB (128GB total)
- Intel 2TB 660p NVMe M.2
- 2 x NVIDIA RTX 2070 Super + NVLINK Bridge
Software:
- Ubuntu 18.04
- NVIDIA display driver 430.40 (from Graphics-Drivers ppa)
- Docker 19.03.0-ce
- NVIDIA-Container-Toolkit 1.0
- NVIDIA NGC container registry
- Container image: nvcr.io/nvidia/tensorflow:19.02-py3 for "Big LSTM" and "CNN"
NOTE: I have updated my Docker-ce setup to the latest version and am now using the native "nvidia-container-runtime". I will be writing a post with detailed instructions for this new configuration soon. My older posts used NVIDIA-Docker2, which is now deprecated.
I used the NGC TensorFlow container tagged 19.02-py3 for consistency with other results in the charts below.
Results
The older results in these charts are from the more extensive testing in the post TensorFlow Performance with 1-4 GPUs -- RTX Titan, 2080Ti, 2080, 2070, GTX 1660Ti, 1070, 1080Ti, and Titan V. Please see that link for detailed information about the job run command-line arguments, etc.
[ResNet-50 fp32] TensorFlow, Training performance (Images/second) comparison using 2 NVIDIA RTX 2070-Super GPU's
These results show the RTX 2070 Super performing as well as the 2080s. The multi-GPU methodology uses "Horovod", i.e. MPI, for data-parallel scaling, so there is little effect from the NVLINK bridge. An interesting thing to note is that for this job the performance with 2 x 2070 Supers is slightly better than a single RTX Titan.
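For anyone curious what that Horovod data-parallel pattern looks like in code, here is a minimal, illustrative TensorFlow 1.x sketch. It is not the actual NGC benchmark script (see the linked post for the real run commands); the toy model, shapes, and hyperparameters are made up for illustration.

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each MPI rank to a single GPU (rank 0 -> GPU 0, rank 1 -> GPU 1).
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy stand-in for the real ResNet-50 graph.
x = tf.random_normal([64, 224 * 224 * 3])
y = tf.random_uniform([64], maxval=10, dtype=tf.int32)
logits = tf.layers.dense(x, 10)
loss = tf.losses.sparse_softmax_cross_entropy(labels=y, logits=logits)

# Scale the learning rate by the number of workers and wrap the optimizer so
# gradients are averaged across the GPUs (allreduce) every step.
opt = tf.train.MomentumOptimizer(0.01 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss)

# Broadcast the initial weights from rank 0 so all workers start identically,
# then run a few steps. Launch one process per GPU, e.g.:
#   mpiexec -np 2 python train_sketch.py
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    for _ in range(10):
        sess.run(train_op)

Because each rank keeps its own full copy of the model and only gradients are exchanged, the per-step GPU-to-GPU traffic is modest, which is why NVLINK has little effect here.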
[ResNet-50 fp16 Tensor-cores] TensorFlow, Training performance (Images/second) comparison using 2 NVIDIA RTX 2070-Super GPU's
The 2070-Super did very well at fp16.
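For reference, here is a minimal sketch of what "fp16 on Tensor Cores" looks like at the framework level, using TensorFlow's automatic mixed precision graph rewrite. Note this particular API appeared in TensorFlow 1.14, which is newer than the build in the 19.02 container used for these charts; the benchmarks above used the fp16 options of the NGC benchmark scripts (see the linked post for the exact commands). The tiny model and shapes below are placeholders.

import tensorflow as tf

# Toy model standing in for ResNet-50; shapes and hyperparameters are arbitrary.
x = tf.random_normal([64, 1024])
y = tf.random_uniform([64], maxval=10, dtype=tf.int32)
net = tf.layers.dense(x, 1024, activation=tf.nn.relu)
logits = tf.layers.dense(net, 10)
loss = tf.losses.sparse_softmax_cross_entropy(labels=y, logits=logits)

opt = tf.train.MomentumOptimizer(0.01, momentum=0.9)
# Rewrites the graph so eligible ops run in fp16 on Tensor Cores and adds
# automatic loss scaling to keep small gradients from underflowing.
opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
train_op = opt.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op)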
TensorFlow LSTM: Big-LSTM 1 Billion Word Dataset
With this recurrent neural network there is more GPU-GPU communication so NVLINK has more impact. The 2 x 2070-Super + NVLINK configuration did a little better than a single RTX Titan.
Conclusions
It looks like the NVIDIA RTX 2070 Super is indeed a great GPU! At $500 it seems like a bargain and is worth considering for an ML/AI setup. The performance was near that of the RTX 2080, and in a dual configuration it performed as well as the RTX Titan, at least for the limited testing in this post.
The RTX 2070 Supers tested were the NVIDIA editions and had the side fans, which I really don't care for (they blow hot air into the case instead of out the back and are easily obstructed). I expect that there will be other vendor releases of these cards that use blower-style fans, which are a much better cooling solution for GPUs that will be under heavy load.
My favorite workstation GPU for ML/AI work right now is the RTX 2080Ti. The 11GB of memory on the 2080Ti is a big plus over the 8GB on the RTX 2070 Super. If you have input data with a large number of features you may hit the dreaded OOM (Out of Memory) error with an 8GB GPU. However, having two of the 2070 Supers for less cost than a single 2080Ti is very tempting and definitely worth considering if the 8GB on the 2070 Super will suffice for your work!
Happy computing! --dbk @dbkinghorn
Appendix: Peer to Peer Bandwidth and Latency results for 2 RTX 2070 Super GPU's
The RTX 2070 Super has a single NVLINK sub-link (like the RTX 2080). The RTX 2080Ti and RTX Titan (as well as the Quadro RTX cards) have dual sub-links. This means that the 2070 Super has half the NVLINK bandwidth of the 2080Ti and RTX Titan.
kinghorn@utest:~/projects/samples-10.0/bin/x86_64/linux/release$ ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, Graphics Device, pciBusID: a1, pciDeviceID: 0, pciDomainID:0
Device: 1, Graphics Device, pciBusID: c1, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
     D\D     0     1
     0       1     1
     1       1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 386.90   5.82
     1   5.83 387.81
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1
     0 387.93  24.23
     1  24.25 371.77
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 388.80  11.63
     1  11.69 387.52
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 388.58  48.38
     1  48.38 377.05
P2P=Disabled Latency Matrix (us)
   GPU     0      1
     0   1.49  11.40
     1  11.27   1.56

   CPU     0      1
     0   2.96   7.67
     1   7.17   2.87
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1
     0   1.49   0.89
     1   0.90   1.56

   CPU     0      1
     0   2.95   2.00
     1   2.05   2.88
Comments
Looking at price/performance/usability, the $699 11 GByte GTX 1080 Ti is still the card to beat.
I saw some deals on the 1080 Ti for $500 each.
Just wondering, should we pick the 11GB of memory on the 1080 Ti over the FP16 Tensor Cores of the RTX 20-series GPUs? What are your thoughts?
Sorry I didn't see these comments earlier. The 1080Ti is a great card! It was/is a serious and affordable "workhorse" for ML/AI. NVIDIA stopped production of the 10xx cards (and my beloved Titan V ... expensive but wonderful). If you can get them for a bargain it's worth considering.
I wasn't a big fan of TensorCores (fp16) at first, but usage has gotten better and "automatic mixed precision" looks very good. Even Intel is getting in on this with BFloat https://software.intel.com/... ... BFloat looks like a better idea ...
Another thing to consider with fp16 is that it is "kind of like" having twice as much memory too. That can be really important ...
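To put a rough number on that, here's a quick back-of-the-envelope sketch (the tensor shape is arbitrary, just for the arithmetic):

import numpy as np

# A batch of 64 activation maps of shape 256 x 56 x 56:
a32 = np.zeros((64, 256, 56, 56), dtype=np.float32)
a16 = a32.astype(np.float16)
print(a32.nbytes / 2**20, "MiB in fp32")  # 196 MiB
print(a16.nbytes / 2**20, "MiB in fp16")  # 98 MiB, exactly half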
So, these days I would lean toward the GPU with TensorCores. ... All things considered, the RTX 2070 Super looks like a good deal to me, I may buy one for myself. (I have a 1080Ti, a 1070 and 2 borrowed Titan V's in use)
Fellas, the Pascal cards are perfectly capable of doing FP16. You get a much more modest speedup WRT Turing (some 10% best case scenario), but you still get to double (or almost) the available memory.
I never even thought about fp16 on the Pascal cards ... didn't consider it on "TensorCores" at first either :-) Mixed precision seems to be getting implemented OK these days ... and of course you are right about the memory, that can be important.
A bit late to the party and just learning a bit of AI - Spark/TensorFlow as a side hobby with a view to monetizing the knowledge at some point:
Pascal brought with it native support for FP16 for both storage and compute. On the storage side, Pascal supports FP16 datatypes, which, relative to the previous use of FP32, means that FP16 values take up less space at every level of the memory hierarchy (registers, cache, and DRAM). On the compute side, Pascal introduces a new type of FP32 CUDA core that supports a form of FP16 execution where two FP16 operations are run through the CUDA core at once (vec2). This core, which for clarity I'm going to call an FP16x2 core, allows the GPU to process 1 FP32 or 2 FP16 operations per clock cycle, essentially doubling FP16 performance relative to an identically configured Maxwell or Kepler GPU. From the AnandTech article https://www.anandtech.com/s...
see my reply below ...
Hi Donald,
You are definitely right in writing "MPI for data-parallel scaling so there is little effect from using the NVLINK bridge", but can we do anything towards true model parallelism with NVLink? I mean to have only one copy of the model (in other words, to use 2 GPUs as one device with "merged" resources)? Data parallelism is simple to use, but very inefficient, especially for large models where we would like to use the memory for a bigger batch size and not for storing another copy of the model.
Regards,
Really good question!
I think model parallelism could be tricky. By its nature, there are lots of dependencies in a neural network, and there are lots of different model types. However, things like convolutions, pooling and such could be spread out over multiple devices. There are also algorithms for distributed matrix multiplication ... but it could be difficult to generalize into a framework that was simple to work with. And, even though NVLINK has amazing performance, communication is almost always a bottleneck. I'm sure there has been work done along these lines but I just haven't looked into it. I honestly don't know what the "state of the art" is. I have not written enough DNN code from scratch to have keen insight (yet). ... though, in general, data parallelism is relatively easy, algorithm parallelism is hard. (the tensor based frameworks are really amazing though!)
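Just to illustrate the simplest form of the idea, here is a toy TensorFlow sketch (made-up layer sizes, not from any of the benchmarks above) that keeps a single copy of a model and places its layers on different GPUs with explicit device placement; TensorFlow inserts the GPU-to-GPU copies at the device boundary, which is exactly where NVLINK/P2P helps:

import tensorflow as tf

# One model, no replicas: the first half of the layers lives on GPU 0,
# the second half on GPU 1. Activations cross the device boundary once
# per forward pass (and gradients once per backward pass).
x = tf.random_normal([32, 4096])

with tf.device('/gpu:0'):
    h1 = tf.layers.dense(x, 4096, activation=tf.nn.relu)
    h2 = tf.layers.dense(h1, 4096, activation=tf.nn.relu)

with tf.device('/gpu:1'):
    h3 = tf.layers.dense(h2, 4096, activation=tf.nn.relu)
    logits = tf.layers.dense(h3, 10)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(logits)

Of course, with naive placement like this one GPU mostly waits on the other, which is why real model/pipeline parallelism needs a lot more machinery to keep both devices busy.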
There is room for improvement in data parallelism too! Spreading batches out over GPUs is one thing but dealing with input data in high dimensional feature spaces is harder. Medical researchers are running into problems with high-res images and 3D tomography data etc. That data often just doesn't fit in GPU memory even with a batch size of 1. Then you have input feature partitioning and stitching all of that together ...
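As a toy sketch of that input-feature-partitioning idea (a single fully connected layer with made-up shapes): each GPU holds half of the input features and the matching half of the weight matrix, so neither device ever needs the full input, and the partial products are simply summed.

import tensorflow as tf

batch, n_in, n_out = 16, 8192, 1024
x = tf.random_normal([batch, n_in])
x0, x1 = tf.split(x, 2, axis=1)  # in practice each half would be fed to its GPU directly

with tf.device('/gpu:0'):
    w0 = tf.get_variable('w0', [n_in // 2, n_out])
    y0 = tf.matmul(x0, w0)   # partial product from the first half of the features
with tf.device('/gpu:1'):
    w1 = tf.get_variable('w1', [n_in // 2, n_out])
    y1 = tf.matmul(x1, w1)   # partial product from the second half

y = y0 + y1  # same result as tf.matmul(x, tf.concat([w0, w1], axis=0))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(y)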
But, in any case, if I was doing a multi-GPU system, I would "want" NVLINK, especially on the RTX cards (The GeForce cards do not have P2P enabled over PCIe) ..., the hardware nerd in me drools over the IBM Power + NVIDIA configuration with (many lane) NVLINK from GPU-to-GPU AND GPU-to-CPU ... I expect PCIe 4 or 5 will come to our rescue in the x86 world before too long.
... and then we have the specialized "AI chips" coming out!! NVIDIA has seen the light too! (so has Intel) I expect to see multi-chiplet GPU's with larger shared memory space that could make things easier for everyone. NVLINK is a stop-gap ...
This is a VERY interesting domain to be working in for both Software and Hardware folks. I hope you will be able to make great contributions!
The 2060 Super seems to be an even better value here - basically 2070 level of performance and 20% cheaper than the 2070 Super, but no NVLINK.
If you don't have any interest in NVLink, the 2060 SUPER is indeed a fantastic card. I was pleasantly surprised by the increased VRAM it was given, putting it on par with all the 2070 and 2080 cards (except for the Ti).
Agreed, the "Supers" are a nice performance increase. The 2060 Super should be near the same as the 2070 (and it has 8GB mem!). There are rumors that there will be price reductions soon too?? I don't have any real info about that, but I would expect it, since there "may" be new card announcements at GTC in March?? In any case they have to keep competitive with AMD in the gaming market ... the gaming market has been wonderful for folks doing GPU compute on a budget :-)
Hello, I have one 2070S and am planning to buy one more. Can I connect them with a dual-slot Quadro NVLink bridge? Thanks.
Hi Donald,
Great stuff as always.
I have a question about how to adjust the performance of NVIDIA GPUs and hope you can help. When I am using the Titan V to do some computations, the nvidia-smi command shows the performance level of my GPUs always limited to the P2 state even though the "Volatile GPU-Util" column is almost 100% for the GPUs. I wonder how to push the performance level to the P0 state? Thanks.
Again, it's so great to find a tech blog here. Salute to you!
Thank you! Interesting! I usually don't pay much attention to the P-state, just the GPU utilization percentage ... My guess is that the reported P-state is not accurate because it is so dynamic, i.e. the timers may not have enough resolution, or the card may still be at a lower power state even though the computational cores are at 100% load (there are other parts of the card that would not be in use and drawing power).
I think it is possible to force a p state like p0 which essentially shuts off dynamic power management ??? Personally when I am seeing 100% on Utilization I'm pretty happy. A lot of problems have so much I/O that you never get a full load ... that is, the GPU is finishing calculations so fast that it is idle waiting on I/O and showing less than 100% load ...
I'm sure there are overclockers that have tricks for forcing the P-state and you might even be able to do it directly with nvidia-smi (?) but for myself I wouldn't mess with it as long as I was happy with performance. I don't know whether forcing P0 would have any positive effect on the "real" performance of a job or not??? [also, I wouldn't want to risk burning out a reg or mem chip on that beautiful Titan V either :-) ] ... Good question!
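A minimal sketch of how you could at least watch the reported P-state alongside utilization, power, and clocks while a job runs, just polling nvidia-smi's query interface from Python (the field names are listed by `nvidia-smi --help-query-gpu`). Newer drivers also expose `nvidia-smi -lgc <min,max>` to lock graphics clocks on supported GPUs if you really want to experiment.

import subprocess
import time

# Poll the P-state, utilization, power draw, and SM clock once per second.
QUERY = "pstate,utilization.gpu,power.draw,clocks.sm"
for _ in range(10):
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=" + QUERY, "--format=csv,noheader"])
    print(out.decode().strip())
    time.sleep(1)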
Thanks a lot for the reply Donald! I am just curious about what the actual performance in the P0 state would be like. Anyway, we all agree that the Titan V is quite a "beautiful" card, especially considering its double-precision compute capability and its price compared to the V100.
By the way, hope you don't mind, I have another question about GPU ECC memory. I heard some rumors that the non-ECC memory of GeForce or Titan cards can affect the results of numerical computations. Have you ever come across that problem? Thanks!
Two things:
1) The most common thing that I know of regarding the ECC memory on the Teslas is that nearly everyone turns it off with nvidia-smi because it significantly slows performance. Large cluster admins should probably be leaving it on though! I'll add another thing here about ECC memory: modules with ECC are generally the higher quality parts. That's considered "server" memory by manufacturers, so the tolerances and quality control are generally better. [On the CPU side, Reg ECC memory is one of the most reliable components we have seen from our failure tracking over the years.]
2) The most common failure I know of with GPU compute is memory related. In the past the memory chips would be the first to degrade and generate errors. (They are usually not cooled all that well.) You would see that in corrupted results as often as with complete card failures. This was most common when using overclocked GeForce cards. The rule of thumb was if you had a suspect card then toss it and get a new one. Don't bother troubleshooting or trying to fix things ... they are cheap by comparison...
Another thing about repeatability in calculations: it is typically worse when using GPUs. It's not necessarily the GPU's fault! Programs may load memory a bit more randomly, which can lead to common associativity errors (computer math is not the same as "real" math). There is also an increased risk of precision loss from accumulation and round-off ... all of the normal numerical analysis problems come up magnified on GPUs. That's at least part of why you see complaints ... it's never the programmer's fault, right? :-)
Those comments are mostly based on my *past* experience. NVIDIA GPUs, including GeForce cards since the 900, Pascal 1000, and now 2000 series, have been excellent! In the "old" days of using 400, 500 and 600 series cards I would tell people putting 4 cards in a system to expect to be replacing at least one in the first year. Things are much better now. Even overclocked cards seem to do OK as long as they are not pushed too hard.
Really appreciate the reply. Thank you Donald!