RTX 2080Ti with NVLINK - TensorFlow Performance (Includes Comparison with GTX 1080Ti, RTX 2070, 2080, 2080Ti and Titan V)

Table of Contents

Test system
- Hardware
- Software
Functionality and Peer-to-Peer Data Transfer Performance for 2 RTX 2080 Ti GPU's with NVLINK
- RTX 2080 Ti NVLINK "Capability" report from nvidia-smi nvlink -c
RTX 2080 Ti NVLINK Peer-To-Peer Performance:
- simpleP2P
  - Does NVLINK with two NVIDIA RTX 2080 Ti GPU use both Links for CUDA Memory Copy?
- p2pBandwidthLatencyTest
TensorFlow performance with 2 RTX 2080 Ti GPU's and NVLINK
TensorFlow CNN: ResNet-50
- ResNet-50 – GTX 1080Ti, RTX 2070, RTX 2080, RTX 2080Ti, Titan V – TensorFlow – Training performance (Images/second)
TensorFlow LSTM: Big-LSTM 1 Billion Word Dataset
- "Big LSTM" – GTX 1080Ti, RTX 2070, RTX 2080, RTX 2080Ti, Titan V – TensorFlow – Training performance (words/second)
Should you get an RTX 2080Ti (or two, or more) for machine learning work?

This post is a continuation of the NVIDIA RTX GPU testing I've done with TensorFlow in "NVLINK on RTX 2080 TensorFlow and Peer-to-Peer Performance with Linux" and "NVIDIA RTX 2080 Ti vs 2080 vs 1080 Ti vs Titan V, TensorFlow Performance with CUDA 10.0". The same job runs from those two posts are extended here with dual RTX 2080Ti's. I was also able to add performance numbers for a single RTX 2070.

If you have read the earlier posts then you may want to just scroll down and check out the new result tables and plots.

Test system

Hardware

Puget Systems Peak Single
Intel Xeon-W 2175 14-core
128GB Memory
1TB Samsung NVMe M.2
GPU's
GTX 1080Ti
RTX 2070
RTX 2080 (2)
RTX 2080Ti (2)
Titan V

Software

Ubuntu 18.04
NVIDIA display driver 410.66 (from CUDA install) NOTE: The 410.48 driver that I used in previous testing was causing system restarts during the big LSTM testing with 2 RTX 2080Ti's and NVLINK.
CUDA 10.0 Source builds of,
- simpleP2P
- p2pBandwidthLatencyTest
TensorFlow 1.10 and 1.4
Docker 18.06.1-ce
NVIDIA-Docker 2.0.3
NVIDIA NGC container registry
Container image: nvcr.io/nvidia/tensorflow:18.08-py3 for "Big LSTM"
Container image: nvcr.io/nvidia/tensorflow:18.03-py2 linked with NCCL and CUDA 9.0 for multi-GPU "CNN"

Two TensorFlow builds were used since the latest version of the TensorFlow docker image on NGC does not support multi-GPU for the CNN ResNet-50 training test job I like to use. For the "Big LSTM billion word" model training I use the latest container with TensorFlow 1.10 linked with CUDA 10.0. Both of the test programs are from "nvidia-examples" in the container instances.

For details on how I have Docker/NVIDIA-Docker configured on my workstation, have a look at the following post along with the links it contains to the rest of that series of posts: "How-To Setup NVIDIA Docker and NGC Registry on your Workstation – Part 5 Docker Performance and Resource Tuning".

Functionality and Peer-to-Peer Data Transfer Performance for 2 RTX 2080 Ti GPU's with NVLINK

RTX 2080 Ti NVLINK "Capability" report from `nvidia-smi nvlink -c`

There are two links available:

GPU 0: GeForce RTX 2080 Ti (UUID:

Link 0, P2P is supported: true
Link 0, Access to system memory supported: true
Link 0, P2P atomics supported: true
Link 0, System memory atomics supported: true
Link 0, SLI is supported: true
Link 0, Link is supported: false
Link 1, P2P is supported: true
Link 1, Access to system memory supported: true
Link 1, P2P atomics supported: true
Link 1, System memory atomics supported: true
Link 1, SLI is supported: true
Link 1, Link is supported: false

Those two links get aggregated over the NVLINK bridge!

RTX 2080 Ti NVLINK Peer-To-Peer Performance:

In summary, NVLINK with two RTX 2080 Ti GPU's provides the following features and performance,

`simpleP2P`

Peer-to-Peer memory access: Yes
Unified Virtual Addressing (UVA): Yes

Does NVLINK with two NVIDIA RTX 2080 Ti GPU use both Links for CUDA Memory Copy?

Yes!

cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 44.87GB/s

That is twice the unidirectional bandwidth of the RTX 2080.

`p2pBandwidthLatencyTest`

The terminal output below shows that two RTX 2080 Ti GPU's with NVLINK provide,

Unidirectional Bandwidth: 48 GB/s
Bidirectional Bandwidth: 96 GB/s
Latency (Peer-To-Peer Disabled),
- GPU-GPU: 12 microseconds
Latency (Peer-To-Peer Enabled),
- GPU-GPU: 1.3 microseconds

Bidirectional bandwidth over NVLINK with 2 2080 Ti GPU's is nearly 100 GB/sec!

P2P Connectivity Matrix
   D\D     0      1
     0     1      1
     1     1      1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 528.83   5.78
     1   5.81 531.37
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1
     0 532.21  48.37
     1  48.38 532.37
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 535.76  11.31
     1  11.42 536.52
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 535.72  96.40
     1  96.40 534.63
P2P=Disabled Latency Matrix (us)
   GPU     0      1
     0   1.93  12.10
     1  12.92   1.91

   CPU     0      1
     0   3.77   8.49
     1   8.52   3.75
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1
     0   1.93   1.34
     1   1.34   1.92

   CPU     0      1
     0   3.79   3.08
     1   3.07   3.76
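
To put the matrices above in perspective, the GPU0-to-GPU1 entries can be turned into simple ratios. The short Python sketch below only does arithmetic on numbers copied from the output above; the dictionary and variable names are my own and are not part of p2pBandwidthLatencyTest.

```python
# GPU0 -> GPU1 entries copied from the p2pBandwidthLatencyTest output above.
# The dictionary and variable names are illustrative only.
p2p_disabled = {"uni_gbs": 5.78, "bi_gbs": 11.31, "gpu_latency_us": 12.10}
p2p_enabled = {"uni_gbs": 48.37, "bi_gbs": 96.40, "gpu_latency_us": 1.34}

# How much more bandwidth a copy gets once peer-to-peer access is enabled.
uni_speedup = p2p_enabled["uni_gbs"] / p2p_disabled["uni_gbs"]
bi_speedup = p2p_enabled["bi_gbs"] / p2p_disabled["bi_gbs"]

# How much lower the GPU-to-GPU latency is with peer-to-peer enabled.
latency_reduction = p2p_disabled["gpu_latency_us"] / p2p_enabled["gpu_latency_us"]

print(f"Unidirectional bandwidth: {uni_speedup:.1f}x with P2P enabled")
print(f"Bidirectional bandwidth:  {bi_speedup:.1f}x with P2P enabled")
print(f"GPU-GPU latency:          {latency_reduction:.1f}x lower with P2P enabled")
```

With these numbers, enabling P2P gives a bit more than 8 times the bandwidth and roughly 9 times lower GPU-to-GPU latency.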

TensorFlow performance with 2 RTX 2080 Ti GPU's and NVLINK

First, don't expect miracles from that 100GB/sec bidirectional bandwidth, …

The convolutional neural network (CNN) and LSTM problems I'll test will not expose much of the benefit of using NVLINK. This is because their multi-GPU algorithms achieve parallelism mostly by distributing data as independent batches of images or words across the two GPU's. There is little use of GPU-to-GPU communication. Algorithms with finer-grained parallelism that need more direct data and instruction access across the GPU's would benefit more.
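
To make the data-parallel idea concrete, here is a toy sketch in plain Python. It is not the code in nvcnn.py or the big-LSTM example; the one-parameter model, the function names and the data are all made up for illustration. Each "worker" computes a gradient on its own shard of the batch, and the only cross-worker step is averaging those gradients, which is the part a library like NCCL would handle between real GPU's.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean squared error for the toy model y = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def data_parallel_gradient(w, xs, ys, num_workers):
    """Split the batch into equal shards, compute one gradient per shard, then average.

    Assumes len(xs) is divisible by num_workers so every shard has the same size.
    """
    shard = len(xs) // num_workers
    grads = []
    for k in range(num_workers):
        lo, hi = k * shard, (k + 1) * shard
        # Each worker only touches its own slice of the batch.
        grads.append(mse_gradient(w, xs[lo:hi], ys[lo:hi]))
    # Averaging stands in for the gradient exchange (all-reduce) between GPU's.
    return sum(grads) / num_workers


# Hypothetical batch of four samples drawn from y = 2 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

full_batch = mse_gradient(w, xs, ys)
two_workers = data_parallel_gradient(w, xs, ys, num_workers=2)
print(full_batch, two_workers)
```

With equal-sized shards the averaged gradient matches the full-batch gradient, so the workers only need to talk to each other once per step. That is why a faster GPU-to-GPU link helps less here than it would for an algorithm that exchanges data throughout the computation.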

The TensorFlow jobs that I have run with 2 GPU's and NVLINK are giving around a 6-8% performance boost. That is right around the percentage cost increase of adding the NVLINK bridge. It looks like you get what you pay for, which is a good thing! I haven't tested anything yet where the (amazing) bandwidth will really help. You may have ideas about where that would be a big help. I have a lot more testing to do.

I am using the benchmarks that I used in the recent post "NVLINK on RTX 2080 TensorFlow and Peer-to-Peer Performance with Linux". The CNN code I am using is from an older NGC docker image with TensorFlow 1.4 linked with CUDA 9.0 and NCCL. I'm using this in order to have multi-GPU support utilizing the NCCL communication library for the CNN code. The most recent version of that code does not support this. The LSTM "Billion Word" benchmark I'm running is using the newer version with TensorFlow 1.10 linked with CUDA 10.0.

I'll give the command-line inputs for reference.

The tables and plots are getting bigger! I've been adding to the testing data over the last 3 posts. There is now a comparison of GTX 1080 Ti, RTX 2070, 2080, 2080 Ti and Titan V.

TensorFlow CNN: ResNet-50

Docker container image tensorflow:18.03-py2 from NGC,

docker run --runtime=nvidia --rm -it -v $HOME/projects:/projects nvcr.io/nvidia/tensorflow:18.03-py2

Example command line for job start,

NGC/tensorflow/nvidia-examples/cnn# python nvcnn.py --model=resnet50 --batch_size=64 --num_gpus=2 --fp16

Note, --fp16 means "use tensor-cores".

ResNet-50 – GTX 1080Ti, RTX 2070, RTX 2080, RTX 2080Ti, Titan V – TensorFlow – Training performance (Images/second)

GPU	FP32 Images/sec	FP16 (Tensor-cores) Images/sec
RTX 2070	192	280
GTX 1080 Ti	207	N/A
RTX 2080	207	332
RTX 2080 Ti	280	437
Titan V	299	547
2 x RTX 2080	364	552
2 x RTX 2080+NVLINK	373	566
2 x RTX 2080 Ti	470	750
2 x RTX 2080 Ti+NVLINK	500	776

ResNet-50 with RTX GPU's
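
One way to read the table above is to ask how close each two-GPU run comes to ideal linear scaling, i.e. twice the single-GPU rate. The Python sketch below computes that from the table values; the helper name scaling_efficiency and the data layout are my own choices for illustration.

```python
# Images/sec copied from the ResNet-50 table above, as (FP32, FP16).
single_gpu = {"RTX 2080": (207, 332), "RTX 2080 Ti": (280, 437)}
dual_gpu = {
    "2 x RTX 2080": ("RTX 2080", (364, 552)),
    "2 x RTX 2080+NVLINK": ("RTX 2080", (373, 566)),
    "2 x RTX 2080 Ti": ("RTX 2080 Ti", (470, 750)),
    "2 x RTX 2080 Ti+NVLINK": ("RTX 2080 Ti", (500, 776)),
}


def scaling_efficiency(multi_rate, single_rate, num_gpus=2):
    """Percent of ideal linear scaling achieved by the multi-GPU run."""
    return 100.0 * multi_rate / (num_gpus * single_rate)


efficiency = {}
for config, (base, rates) in dual_gpu.items():
    for precision, multi, single in zip(("FP32", "FP16"), rates, single_gpu[base]):
        efficiency[(config, precision)] = scaling_efficiency(multi, single)
        print(f"{config:24s} {precision}: {efficiency[(config, precision)]:.1f}% of ideal")
```

All of the two-GPU configurations land between roughly 83% and 90% of ideal scaling, with the NVLINK runs at the upper end for each card.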

TensorFlow LSTM: Big-LSTM 1 Billion Word Dataset

Docker container image tensorflow:18.09-py3 from NGC,

docker run --runtime=nvidia --rm -it -v $HOME/projects:/projects nvcr.io/nvidia/tensorflow:18.09-py3

Example job command-line,

/NGC/tensorflow/nvidia-examples/big_lstm# python single_lm_train.py --mode=train --logdir=./logs --num_gpus=2 --datadir=./data/1-billion-word-language-modeling-benchmark-r13output/ --hpconfig run_profiler=False,max_time=90,num_steps=20,num_shards=8,num_layers=2,learning_rate=0.2,max_grad_norm=1,keep_prob=0.9,emb_size=1024,projected_size=1024,state_size=8192,num_sampled=8192,batch_size=256

"Big LSTM" – GTX 1080Ti, RTX 2070, RTX 2080, RTX 2080Ti, Titan V – TensorFlow – Training performance (words/second)

GPU	FP32 Words/sec
RTX 2070 (Note:1)	4740
GTX 1080 Ti	6460
RTX 2080 (Note:1)	5071
RTX 2080 Ti	8945
Titan V (Note:2)	7066
Titan V (Note:3)	8373
2 x RTX 2080	8882
2 x RTX 2080+NVLINK	9711
2 x RTX 2080 Ti	15770
2 x RTX 2080 Ti+NVLINK	16977

Note:1 With only 8GB memory on the RTX 2070 and 2080 I had to drop the batch size down to 256 to keep from getting "out of memory" errors. That typically has a big (downward) influence on performance.
Note:2 For whatever reason this result for the Titan V is worse than expected. This is TensorFlow 1.10 linked with CUDA 10 running NVIDIA's code for the LSTM model. The RTX 2080Ti performance was very good!
Note:3 I re-ran the "big-LSTM" job on the Titan V using TensorFlow 1.4 linked with CUDA 9.0 and got results consistent with what I have seen in the past. I have no explanation for the slowdown with the newer version of "big-LSTM".
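
The NVLINK benefit can be read straight off the two result tables by comparing each two-GPU run with and without the bridge. The following Python sketch does that arithmetic; the job labels and the helper percent_gain are names I made up for this example.

```python
# (without NVLINK, with NVLINK) throughput copied from the two result tables above.
# ResNet-50 values are images/sec, Big LSTM values are words/sec.
results = {
    "ResNet-50 FP32, 2 x RTX 2080": (364, 373),
    "ResNet-50 FP16, 2 x RTX 2080": (552, 566),
    "ResNet-50 FP32, 2 x RTX 2080 Ti": (470, 500),
    "ResNet-50 FP16, 2 x RTX 2080 Ti": (750, 776),
    "Big LSTM, 2 x RTX 2080": (8882, 9711),
    "Big LSTM, 2 x RTX 2080 Ti": (15770, 16977),
}


def percent_gain(with_nvlink, without_nvlink):
    """Relative throughput increase from adding the NVLINK bridge, in percent."""
    return 100.0 * (with_nvlink - without_nvlink) / without_nvlink


gains = {job: percent_gain(linked, plain) for job, (plain, linked) in results.items()}
for job, gain in gains.items():
    print(f"{job:34s} +{gain:.1f}%")
```

The gains range from about 2.5% to a little over 9%, and the Big LSTM job benefits more from the bridge than ResNet-50 does on both cards.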

Should you get an RTX 2080Ti (or two, or more) for machine learning work?

I've said it before … I think that is an obvious yes! For ML/AI work using fp32 or fp16 (tensor-cores) precision the new NVIDIA RTX 2080 Ti looks really good. The RTX 2080 Ti may seem expensive but I believe you are getting what you pay for. Two RTX 2080 Ti's with the NVLINK bridge will cost less than a single Titan V and can give double (or more) the performance in some cases. The Titan V is still the best value when you need fp64 (double precision). I would not hesitate to recommend the 2080 Ti for machine learning work.
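
For reference, the comparison between two RTX 2080 Ti's with NVLINK and a single Titan V can be computed from the result tables. This Python sketch uses the better of the two Titan V Big LSTM results (Note:3), which is the more conservative choice; the variable names are mine.

```python
# Throughput copied from the result tables above.
# For Big LSTM the Titan V entry is the TensorFlow 1.4 / CUDA 9.0 result (Note:3).
titan_v = {"ResNet-50 FP32": 299, "ResNet-50 FP16": 547, "Big LSTM": 8373}
dual_2080ti_nvlink = {"ResNet-50 FP32": 500, "ResNet-50 FP16": 776, "Big LSTM": 16977}

# Ratio of the dual RTX 2080 Ti + NVLINK rate to the single Titan V rate.
ratios = {job: dual_2080ti_nvlink[job] / titan_v[job] for job in titan_v}
for job, ratio in ratios.items():
    print(f"{job:15s} {ratio:.2f}x a single Titan V")
```

On these numbers the pair is about 1.4x to 1.7x a Titan V on ResNet-50 and about 2x on Big LSTM.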

I did my first testing with the RTX 2070 in this post but I'm not sure if it is a good value or not for ML/AI work. However, from the limited testing here it looks like it would be a better value than the RTX 2080 if you have a tight budget.

I'm sure I will do 4 GPU testing before too long and that should be very interesting.

Happy computing! –dbk
