Read this article at https://www.pugetsystems.com/guides/1267

RTX 2080Ti with NVLINK - TensorFlow Performance (Includes Comparison with GTX 1080Ti, RTX 2070, 2080, 2080Ti and Titan V)

Written on October 26, 2018 by Dr Donald Kinghorn (Scientific Computing Advisor)



This post is a continuation of the NVIDIA RTX GPU testing I've done with TensorFlow in NVLINK on RTX 2080 TensorFlow and Peer-to-Peer Performance with Linux and NVIDIA RTX 2080 Ti vs 2080 vs 1080 Ti vs Titan V, TensorFlow Performance with CUDA 10.0. The same job runs done in those two previous posts are extended here with dual RTX 2080 Ti's. I was also able to add performance numbers for a single RTX 2070.

If you have read the earlier posts then you may want to just scroll down and check out the new result tables and plots.


Test system

Hardware

  • Puget Systems Peak Single
  • Intel Xeon-W 2175 14-core
  • 128GB Memory
  • 1TB Samsung NVMe M.2
  • GPU's
    • GTX 1080 Ti
    • RTX 2070
    • RTX 2080 (2)
    • RTX 2080 Ti (2)
    • Titan V

Software

Two TensorFlow builds were used since the latest version of the TensorFlow docker image on NGC does not support multi-GPU for the CNN ResNet-50 training test job I like to use. For the "Big LSTM billion word" model training I use the latest container with TensorFlow 1.10 linked with CUDA 10.0. Both of the test programs are from "nvidia-examples" in the container instances.

For details on how I have Docker/NVIDIA-Docker configured on my workstation, have a look at the following post along with the links it contains to the rest of that series: How-To Setup NVIDIA Docker and NGC Registry on your Workstation - Part 5 Docker Performance and Resource Tuning
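For convenience, here is a sketch of pulling the two NGC container images used later in this post (this assumes you have an NGC account and have logged in to the registry with your API key):

docker login nvcr.io
docker pull nvcr.io/nvidia/tensorflow:18.03-py2   # older image used for the multi-GPU CNN (ResNet-50) test
docker pull nvcr.io/nvidia/tensorflow:18.09-py3   # newer image (TensorFlow 1.10, CUDA 10.0) used for the Big LSTM test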


NVLINK on the RTX 2080 Ti

Checking the NVLINK capabilities of the RTX 2080 Ti shows there are two links available:

GPU 0: GeForce RTX 2080 Ti (UUID:

  • Link 0, P2P is supported: true
  • Link 0, Access to system memory supported: true
  • Link 0, P2P atomics supported: true
  • Link 0, System memory atomics supported: true
  • Link 0, SLI is supported: true
  • Link 0, Link is supported: false
  • Link 1, P2P is supported: true
  • Link 1, Access to system memory supported: true
  • Link 1, P2P atomics supported: true
  • Link 1, System memory atomics supported: true
  • Link 1, SLI is supported: true
  • Link 1, Link is supported: false

Those two links get aggregated over the NVLINK bridge!
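For reference, output like the capability listing above can be generated with nvidia-smi (a sketch; exact flags may vary slightly between driver versions):

nvidia-smi nvlink -c -i 0    # per-link capabilities (P2P, system memory access, atomics, SLI) for GPU 0
nvidia-smi nvlink -s -i 0    # per-link status and bandwidth for GPU 0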

In summary, NVLINK with two RTX 2080 Ti GPU's provides the following features and performance:

simpleP2P

  • Peer-to-Peer memory access: Yes
  • Unified Virtual Addressing (UVA): Yes

Yes!

  • cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 44.87GB/s

That is twice the unidirectional bandwidth of the RTX 2080.

p2pBandwidthLatencyTest

The terminal output below shows that two RTX 2080 Ti GPU's with NVLINK provide:

  • Unidirectional Bandwidth: 48 GB/s

  • Bidirectional Bandwidth: 96 GB/s

  • Latency (Peer-To-Peer Disabled):
    • GPU-GPU: 12 microseconds
  • Latency (Peer-To-Peer Enabled):
    • GPU-GPU: 1.3 microseconds

Bidirectional bandwidth over NVLINK with 2 2080 Ti GPU's is nearly 100 GB/sec!

P2P Connectivity Matrix
     D\D     0     1
     0	     1     1
     1	     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 528.83   5.78
     1   5.81 531.37
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1
     0 532.21  48.37
     1  48.38 532.37
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 535.76  11.31
     1  11.42 536.52
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 535.72  96.40
     1  96.40 534.63
P2P=Disabled Latency Matrix (us)
   GPU     0      1
     0   1.93  12.10
     1  12.92   1.91

   CPU     0      1
     0   3.77   8.49
     1   8.52   3.75
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1
     0   1.93   1.34
     1   1.34   1.92

   CPU     0      1
     0   3.79   3.08
     1   3.07   3.76
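Both simpleP2P and p2pBandwidthLatencyTest are from the CUDA samples. A minimal sketch of building and running them, assuming the CUDA 10.0 samples have been copied into your home directory (paths will vary with CUDA version and install method):

cuda-install-samples-10.0.sh ~
cd ~/NVIDIA_CUDA-10.0_Samples/0_Simple/simpleP2P && make && ./simpleP2P
cd ~/NVIDIA_CUDA-10.0_Samples/1_Utilities/p2pBandwidthLatencyTest && make && ./p2pBandwidthLatencyTest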

First, don't expect miracles from that 100GB/sec bidirectional bandwidth, ...

The convolution neural network (CNN) and LSTM problems I'll test will not expose much of the benefit of using NVLINK. This is because their multi-GPU algorithms achieve parallelism mostly by distributing data as independent batches of images or words across the two GPU's. There is little use of GPU-to-GPU communication. Algorithms with finer grained parallelism that need more direct data and instruction access across the GPU's would benefit more.

The TensorFlow jobs that I have run with 2 GPU's and NVLINK are giving around a 6-8% performance boost. That is right around the percentage cost increase of adding the NVLINK bridge. It looks like you get what you pay for, which is a good thing! I haven't tested anything yet where the (amazing) bandwidth will really help. You may have ideas where that would be a big help?? I have a lot more testing to do.

I am using the same benchmarks that I used in the recent post "NVLINK on RTX 2080 TensorFlow and Peer-to-Peer Performance with Linux". The CNN code I am using is from an older NGC docker image with TensorFlow 1.4 linked with CUDA 9.0 and NCCL. I'm using this in order to have multi-GPU support utilizing the NCCL communication library for the CNN code; the most recent version of that code does not support this. The LSTM "Billion Word" benchmark I'm running uses the newer container with TensorFlow 1.10 linked with CUDA 10.0.

I'll give the command-line inputs for reference.

The tables and plots are getting bigger! I've been adding to the testing data over the last 3 posts. There is now a comparison of the GTX 1080 Ti, RTX 2070, 2080, 2080 Ti and Titan V.

TensorFlow CNN: ResNet-50

Docker container image tensorflow:18.03-py2 from NGC,

docker run --runtime=nvidia --rm -it -v $HOME/projects:/projects nvcr.io/nvidia/tensorflow:18.03-py2

Example command line for job start,

NGC/tensorflow/nvidia-examples/cnn# python nvcnn.py --model=resnet50 --batch_size=64 --num_gpus=2 --fp16

Note, --fp16 means "use tensor-cores".
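The table below was filled in by varying --num_gpus and toggling --fp16. A sketch of the variants (an assumption based on the example command above, not taken verbatim from the original runs):

# single GPU, FP32
python nvcnn.py --model=resnet50 --batch_size=64 --num_gpus=1
# single GPU, FP16 (tensor-cores)
python nvcnn.py --model=resnet50 --batch_size=64 --num_gpus=1 --fp16
# two GPUs, FP16 (tensor-cores)
python nvcnn.py --model=resnet50 --batch_size=64 --num_gpus=2 --fp16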

ResNet-50 - GTX 1080Ti, RTX 2070, RTX 2080, RTX 2080Ti, Titan V - TensorFlow - Training performance (Images/second)

GPU                        FP32 (Images/sec)    FP16 Tensor-cores (Images/sec)
RTX 2070                   192                  280
GTX 1080 Ti                207                  N/A
RTX 2080                   207                  332
RTX 2080 Ti                280                  437
Titan V                    299                  547
2 x RTX 2080               364                  552
2 x RTX 2080+NVLINK        373                  566
2 x RTX 2080 Ti            470                  750
2 x RTX 2080 Ti+NVLINK     500                  776

 

[Plot: ResNet-50 with RTX GPU's - training performance (images/second)]

 


TensorFlow LSTM: Big-LSTM 1 Billion Word Dataset

Docker container image tensorflow:18.09-py3 from NGC,

docker run --runtime=nvidia --rm -it -v $HOME/projects:/projects nvcr.io/nvidia/tensorflow:18.09-py3

Example job command-line,

/NGC/tensorflow/nvidia-examples/big_lstm# python single_lm_train.py --mode=train --logdir=./logs --num_gpus=2 --datadir=./data/1-billion-word-language-modeling-benchmark-r13output/ --hpconfig run_profiler=False,max_time=90,num_steps=20,num_shards=8,num_layers=2,learning_rate=0.2,max_grad_norm=1,keep_prob=0.9,emb_size=1024,projected_size=1024,state_size=8192,num_sampled=8192,batch_size=256
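For the single-GPU rows in the table below I assume the same command with --num_gpus=1; per Note 1 after the table, batch_size=256 is what fit on the 8GB cards (a sketch, not taken verbatim from the original runs):

python single_lm_train.py --mode=train --logdir=./logs --num_gpus=1 --datadir=./data/1-billion-word-language-modeling-benchmark-r13output/ --hpconfig run_profiler=False,max_time=90,num_steps=20,num_shards=8,num_layers=2,learning_rate=0.2,max_grad_norm=1,keep_prob=0.9,emb_size=1024,projected_size=1024,state_size=8192,num_sampled=8192,batch_size=256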

"Big LSTM" - GTX 1080Ti, RTX 2070, RTX 2080, RTX 2080Ti, Titan V - TensorFlow - Training performance (words/second)

GPU                        FP32 (words/sec)
RTX 2070 (Note:1)          4740
GTX 1080 Ti                6460
RTX 2080 (Note:1)          5071
RTX 2080 Ti                8945
Titan V (Note:2)           7066
Titan V (Note:3)           8373
2 x RTX 2080               8882
2 x RTX 2080+NVLINK        9711
2 x RTX 2080 Ti            15770
2 x RTX 2080 Ti+NVLINK     16977

 

 

  • Note:1 With only 8GB memory on the RTX 2070 and 2080 I had to drop the batch size down to 256 to keep from getting "out of memory" errors. That typically has a big (downward) influence on performance.
  • Note:2 For whatever reason this result for the Titan V is worse than expected. This is TensorFlow 1.10 linked with CUDA 10 running NVIDIA's code for the LSTM model. The RTX 2080Ti performance was very good!
  • Note:3 I re-ran the "big-LSTM" job on the Titan V using TensorFlow 1.4 linked with CUDA 9.0 and got results consistent with what I have seen in the past. I have no explanation for the slowdown with the newer version of "big-LSTM".

Should you get an RTX 2080Ti (or two, or more) for machine learning work?

I've said it before ... I think that is an obvious yes! For ML/AI work using fp32 or fp16 (tensor-cores) precision the new NVIDIA RTX 2080 Ti looks really good. The RTX 2080 Ti may seem expensive, but I believe you are getting what you pay for. Two RTX 2080 Ti's with the NVLINK bridge will cost less than a single Titan V and can give double (or more) the performance in some cases. The Titan V is still the best value when you need fp64 (double precision). I would not hesitate to recommend the 2080 Ti for machine learning work.

This post includes my first testing with the RTX 2070, and I'm not yet sure whether it is a good value for ML/AI work. However, from the limited testing here it looks like it would be a better value than the RTX 2080 if you are on a tight budget.

I'm sure I will do 4 GPU testing before too long and that should be very interesting.

Happy computing! --dbk

Tags: NVLINK, RTX 2080 Ti, TensorFlow, CUDA, NVIDIA, ML/AI, Machine Learning
Jerry

Thank you, your write-ups are about the only place I've been able to get any useful information on ML using Linux, NVLINK and the RTX 2080 cards.

Posted on 2018-10-30 23:39:09
salbito

Thank you for doing this research. This is absolutely fantastic and fits the exact situation I'm in right now. I wanted to ask which of the Nvidia bridges you had used for this setup. On the Nvidia NVLink product pages they list bridges for the GP100 and GV100. The GV100 is listed at 200 GB/s (100 GB/s for a single bridge) and GP100 at 160 GB/s (80 GB/s for a single bridge). Should I be purchasing the GV100 bridge? Thanks in advance

Posted on 2018-10-31 14:14:29

Chiming in for Don, you need to use the GeForce RTX NVLink bridges that you can get from https://www.nvidia.com/en-u... . It is buried in that page, but search for "3 SLOT" and it will take you right to the purchase button.

You cannot use GV100 bridges - those are for the first generation NVLink and do not work on the RTX cards.

Posted on 2018-10-31 16:15:22

GP100 bridges won't work, but GV100 bridges actually might be okay on the GeForce RTX cards. We don't have any to test, but I've seen others online show them working. However, they are VERY expensive. Get the bridge Matt suggested if you need a 3- or 4-slot size, or wait for the upcoming Quadro RTX bridge if you need a 2-slot solution.

Posted on 2018-10-31 16:18:27
salbito

Oh interesting. The reason I had been looking into the gv was because I was looking for a 2-slot solution. I had been looking around and found no reference to Nvidia making a 2-slot version. Was this upcoming Quadro RTX bridge listed somewhere?

Posted on 2018-10-31 17:06:21

They aren't listed for sale yet, but there's a "notify me" button on this page if you scroll down a bit: https://www.nvidia.com/en-u...

Posted on 2018-10-31 17:13:33
oriad

Hi Donald,

This article is exactly what I have been looking for. Thanks a lot.

Posted on 2018-11-02 07:45:46
Mr Ocr

Donald Kinghorn, can you test Mixed-Precision training.

Posted on 2018-11-05 21:26:57
Donald Kinghorn

using fp16 (tensor-cores) is basically using mixed precision. It's 16 bit multiplies and adds with 32 bit accumulators. This is from a README for the CNN testing code.

"
With the --fp16 flag the model is trained using 16-bit floating-point operations. This provides optimized performance on Volta's TensorCores. For more information on training with FP16 arithmetic see Training with Mixed Precision.
"
The link is to here https://docs.nvidia.com/dee...

Posted on 2018-11-05 23:23:46
lemans24

Don, could you also add test result for Titan Xp??

I have been trying to find out why my Titan Xp seems to be about 20% faster than my 1080ti when running the same CUDA code.
Did not find a good explanation, but I installed MSI Afterburner and I see that the Titan Xp is running the boost clock at 1835MHz when the temp is under 75C!!!
Since I have a dual card setup, the 1080ti CUDA code runs in parallel with the Titan Xp code and it only boosts the clock to 1480mhz with the same temps.
Both of these cards are blower versions and I am running them nearly 8 hours a day. I will be testing them today to run nearly 24 hours and I will report back the results by the end of the week.

If this is the explanation then I will definitely be waiting for a Titan RTX or Titan V 32GB version as I want to run 4 cards in a well ventilated server case.
I will be eventually experimenting with tensorflow but I just got sidetracked with Infer.Net which allows you to do some probabilistic programming...you may want to look into this as it is a open source tool from Microsoft

Posted on 2018-11-05 23:49:26
Donald Kinghorn

I wont be testing GPU's again for a bit but you did give me a good idea for a post. If I spend a day at the office I could probably setup a machine and run the CNN and LSTM jobs I've been using on a big collection of cards.

I will probably be testing CPU's for a few weeks and I'm working on a new ML/AI project :-)

I do have an older post that has an interesting comparison https://www.pugetsystems.co...
That is using different testing code but still might be interesting.

You might like this too (I do!) ... we have a search bar at the top of the blog and article pages now. I'm using it quite a bit because I have so many posts I can't find things. It's at the top of the main HPC blog page https://www.pugetsystems.co... I searched there for Titan Xp to find that older post!

Posted on 2018-11-06 00:32:14
lemans24

Ok...I will start using that search bar...but thanks for the reply.

The Titan Xp is definitely faster than the 1080ti, and all the benchmarks for the RTX 2080ti leave out the Titan Xp. From a price/performance point of view when running CUDA, I think it is probably a better deal than the Titan V or 1080ti, and maybe the 2080ti. The older post confirms my thinking that staying with a Titan Xp until I can get another Titan card that is 50% faster seems to be the way to go for me, especially running a quad server under Windows Server 16, which would also allow me to run the gpu drivers in TCC mode.

Keep up the great articles and I hope you continue running examples with source code too...

Posted on 2018-11-06 04:06:10
Donald Kinghorn

... I love the search bar, I could never find stuff in my old posts :-)
You are right about the Titan Xp that is a really good card and it does definitely outperform the 1080Ti, it always seemed to get a little better memory utilization too ... I just happen to have a 1080Ti in my personal system so results from that end up in a lot of my posts :-) We have one Titan Xp in Puget-Labs so it is often in use for various testing.

I'm wondering when we will see the next Titan. A possibility for an announcement is at NIPS in December. I hope it's not pushed out until GTC in March. The Quadro's are out and shipping now so I'm hopeful for December

Posted on 2018-11-06 23:37:26
lemans24

Thanks Don

I will continue to overclock the Titan Xp to 75c and see how it performs.
I have really optimized my monte carlo CUDA app and it is running near real time now. It takes about 4 seconds to run 4 batches with each batch containing 2048 simulation runs. The real fun part is that EACH simulation run does 1 million monte carlo simulations...yep 16 Billion simulations within 4 seconds. I don't exactly know how many floating point instructions I am executing per simulation but I must be getting close to a Teraflop of instructions within 4 seconds and this is running in multi gpu mode with the Titan Xp and the 1080ti!!

When I first ran these monte carlo simulations on my 16 core threadripper, it used to take nearly an hour, which I thought was crazy slow!!
I don't know how you can run these AI training jobs in days!! Lol

Anyway, yes I really hope Nvidia soon releases a Titan RTX and/or an updated Titan V with more than 12GB memory running at full speed.
AMD has their new Radeon Instinct AI gpu which looks to be at similar speeds to the V100, but I really hope they undercut the price and Nvidia starts to drop their prices...one can only hope!!!

Posted on 2018-11-08 00:08:59
Lawrence Barras

Excellent writeup! I finally got around to installing a pair of 2080TI FE cards. They appear to run at full GPU boost with the latest nvidia Linux drivers. My results mirror yours, but I've been testing on a bit older PC and I think it is PCIe starved a bit. (2 cards run in x8 each on this one, not enough lanes on i7-6700k). The RTX2080ti-FE are honestly outrunning my TitanV on many tests with the same NGC tensorflow image. (18.10). For whatever reason, the Titan-V is still limited to 1335mhz gpu clock when running CUDA apps. The RTX 2080ti seems in all ways happy to run up to full boost until it thermal limits. I think your Xeon-W is able to do a better job keeping the cards fed with data.

If anyone is curious, the RTX NVLink is mechanically reversed from the GV-NVLink on the Volta-series cards (Tesla V and Quadro GV). They will not interchange. The card edges appear the same, but the big and small tabs are reversed. I opened the RTX link; there are no active electronics other than the LED circuits. The HB SLI bridge from the 1080ti is also completely different.

Posted on 2018-11-07 07:41:35

They seem to have just put the connector edge on "backwards" on the GeForce RTX cards, compared to the older Quadro GP100 and GV100. You can flip the bridges around and it will physically plug in just fine. In terms of functionality, we tested Quadro GP100 era bridges and they do *not* work on the GeForce RTX cards. We don't have any GV100 bridges here, but someone online showed them being used on GeForce RTX 2080 Ti cards and appearing to function. We also did test that the newer GeForce RTX bridge works on a pair of Quadro GP100 cards, though because of the larger physical size of the plastic around the bridge we could only fit one (instead of the pair that card is designed to have).

You can see more info, along with pictures, on another of our articles: https://www.pugetsystems.co...

Posted on 2018-11-07 19:33:54
lemans24

'...For whatever reason, the Titan-V is still limited to 1335mhz gpu clock when running CUDA apps'

Curious...are you running custom CUDA apps, c/cc++ or third party CUDA apps ??
Did you try something like MSI Afterburner or something similar if you are running under Linux??

I have been running my custom monte carlo simulation CUDA code on a Titan Xp and 1080ti using Windows 10 and the Titan Xp is way better for serious application deployment. Even though the 2080ti is less than half the price of the Titan V, I need to make sure my app runs rock solid during trading hours which for options are 23 hours a day and 6 days a week. I have been testing the 1080ti and it just does not seem to be as reliable as the Titan Xp. The Titan Xp on the other hand has been class A excellent and have not run into any problems running my code for extended periods of time.

The Titan V looks like a beast but I really hope that Nvidia comes out with a Titan RTX that will replace the Titan Xp with the same reliability as well.

Posted on 2018-11-10 02:24:22
Hery Ratsimihah

Please let us know when they break!

Posted on 2018-11-19 11:04:03
Nestorius

This is a great write-up and probably the first one to cover the current options (1080Ti vs 2070/2080/2080Ti) for single GPU users.

I find the 2070 to be a very interesting offer vs both the 1080Ti and the 2080 non-TI, for CNN with mixed-precision.

I ran a quick test on Cifar10 with Fastai V1 for the 2070 vs 1080Ti: if you manage to get Fp16 running, the 2070 will deliver slightly faster results indeed.
Then the 2080 non-Ti is rather "meh..." compared to the cheaper 2070.

Posted on 2018-11-16 20:06:58
Donald Kinghorn

I agree on that ... I think the sweet-spot is the 2070 and then the 2080 Ti. If the 2080 had 12GB mem it would be a different story. Having said that, I would be pretty happy if I had a 2080 too :-) I do still have my borrowed (and beloved) Titan V that I'll be getting under load with a project again soon. Even when the new Titan comes out I think the 2080 Ti will be the staple for ML/AI stuff. There are just times when I have to have 64-bit, which I hope NVIDIA keeps strong in the next Titan.

Posted on 2018-11-19 17:26:18
JerryK

I already have a 1070 and was thinking about adding a 2070. Is mixing the two a good idea? That is, will the 1070 play well with 2070, slow it up, or ???

Posted on 2018-11-20 00:54:28
Donald Kinghorn

You should be fine with doing that. The newer drivers for the 2070 work well with the 10xx cards and so does the included CUDA runtime lib. I like both of those cards. The 1070 was a lot of performance for the cost and it looks like the 2070 is continuing that. You might have some trouble if you try to run multi-GPU jobs on both cards together ... not sure. They both support a wide variety of cuda compute levels but of course the 2070 supports some newer ops in level 7.x including support for fp16 tensor cores

Posted on 2018-11-20 02:53:07