
NVLINK on RTX 2080 TensorFlow and Peer-to-Peer Performance with Linux

Written on October 16, 2018 by Dr Donald Kinghorn


NVLINK is one of the more interesting features of NVIDIA's new RTX GPUs. In this post I'll take a look at the performance of NVLINK between 2 RTX 2080 GPUs, along with a comparison against the single-GPU testing I've recently done. The testing will be a simple look at the raw peer-to-peer data transfer performance and a couple of TensorFlow job runs with and without NVLINK.

For most people outside of the HPC world NVLINK is unfamiliar. The GeForce RTX Turing cards are the first "consumer" GPUs to have this high performance connection. NVLINK is the high performance GPU-to-GPU interconnect fabric that was first used on server motherboards with NVIDIA's SXM GPU modules. The Quadro GP100, GV100 and upcoming Quadro RTX cards also have NVLINK. The cards use a bridge connector similar to an SLI bridge. In fact, the NVLINK bridge on the RTX 20xx series is used to provide SLI capability. The NVLINK implementation on the RTX 2080 and 2080 Ti is a full NVLINK-2 implementation but is limited to one "link" (a.k.a. "brick") on the RTX 2080. It looks like there are two "links" on the RTX 2080 Ti but I haven't confirmed that they are aggregated yet (still waiting on a second card). The server Tesla SXM modules have six NVLINK-2 "links".

Note: on the IBM POWER8/POWER9 architecture NVLINK is also a high performance interconnect from GPU to CPU. That is the hardware used on the Oak Ridge National Laboratory Summit supercomputer -- the fastest computer in the world right now.

My colleague William George has done some testing with NVLINK on Windows 10 and at this point it doesn't appear to be fully functional on that platform. You might want to check out his post NVIDIA GeForce RTX 2080 & 2080 Ti Do NOT Support Full NVLink in Windows 10. I think you can expect that to change soon. It should be fully functional on Windows 10 after a round of updates.

NVLINK is fully functional with the RTX 2080 on Ubuntu 18.04 with driver version 410.
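
If you want to do a quick sanity check of the driver and GPUs on your own system before testing, a query like the following works (these are standard nvidia-smi query fields; "nvidia-smi --help-query-gpu" lists what your driver supports),

# List GPUs and the installed driver version
nvidia-smi --query-gpu=index,name,driver_version --format=csv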

I will be doing testing similar to what I did in the post NVIDIA RTX 2080 Ti vs 2080 vs 1080 Ti vs Titan V, TensorFlow Performance with CUDA 10.0. I'll include some results from that post for comparison.


Test system

Hardware

  • Puget Systems Peak Single
  • Intel Xeon W-2175 14-core
  • 128GB Memory
  • 1TB Samsung NVMe M.2
  • GPUs:
    • GTX 1080 Ti
    • RTX 2080 (2)
    • RTX 2080 Ti
    • Titan V

Software

Two TensorFlow builds were used since the latest version of the TensorFlow docker image on NGC does not support multi-GPU for the CNN ResNet-50 training test job I like to use. For the "Big LSTM billion word" model training I used the latest container with TensorFlow 1.10 linked with CUDA 10.0. Both of the test programs are from "nvidia-examples" in the container instances.
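
For reference, the two NGC container images used for the jobs below (the exact tags appear in the job sections later in the post) can be pulled ahead of time. This assumes you are already logged in to the nvcr.io registry,

# TensorFlow 1.4 / CUDA 9.0 image with NCCL multi-GPU support for the CNN test
docker pull nvcr.io/nvidia/tensorflow:18.03-py2
# TensorFlow 1.10 / CUDA 10.0 image for the Big LSTM test
docker pull nvcr.io/nvidia/tensorflow:18.09-py3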

For details on how I have Docker/NVIDIA-Docker configured on my workstation, have a look at the following post along with the links it contains to the rest of that series of posts: How-To Setup NVIDIA Docker and NGC Registry on your Workstation - Part 5 Docker Performance and Resource Tuning


NVLINK Peer-to-Peer on 2 x RTX 2080

Querying the NVLINK capabilities of the two RTX 2080 cards shows there is one link available:

  • Link 0, P2P is supported: true
  • Link 0, Access to system memory supported: true
  • Link 0, P2P atomics supported: true
  • Link 0, System memory atomics supported: true
  • Link 0, SLI is supported: true
  • Link 0, Link is supported: false

I believe that the "Link is supported: false" line is referring to the CPU-GPU connection on the IBM Power architecture. I'm not completely sure about that since I cannot find any information about it.
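
That per-link capability listing is the kind of output you get from the nvlink subcommand of nvidia-smi. If you want to reproduce it on your own system, something like the following should work with driver 410 (check "nvidia-smi nvlink -h" for the options your driver supports),

# Show NVLINK status (link speed and active lanes) for all GPUs
nvidia-smi nvlink -s
# Show per-link capabilities (the listing above)
nvidia-smi nvlink -c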

The next test provides some additional information along with the performance of a CUDA memory copy from GPU to GPU. (I'm listing only the additional information from the output, in the form of questions and answers.)

Does NVIDIA GeForce RTX NVLINK support Peer-To-Peer memory access?

Checking GPU(s) for support of peer to peer memory access...

  • Peer access from GeForce RTX 2080 (GPU0) -> GeForce RTX 2080 (GPU1) : Yes
  • Peer access from GeForce RTX 2080 (GPU1) -> GeForce RTX 2080 (GPU0) : Yes
    Enabling peer access between GPU0 and GPU1...

Does NVIDIA GeForce RTX NVLINK support Unified Virtual Addressing (UVA)?

Checking GPU0 and GPU1 for UVA capabilities...

  • GeForce RTX 2080 (GPU0) supports UVA: Yes
  • GeForce RTX 2080 (GPU1) supports UVA: Yes
    Both GPUs can support UVA, enabling...

How Fast is NVIDIA GeForce RTX NVLINK for CUDA Memory Copy?

  • cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 22.53GB/s
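
This question-and-answer output follows the style of the simpleP2P sample from the CUDA toolkit samples, which checks peer access and UVA and times cudaMemcpyPeer. A sketch for running it yourself is below; the paths assume you have copied the CUDA 10.0 samples into your home directory,

# Copy the CUDA samples to your home directory (version and paths assumed)
cuda-install-samples-10.0.sh ~
# Build and run the peer-to-peer sample
cd ~/NVIDIA_CUDA-10.0_Samples/0_Simple/simpleP2P
make
./simpleP2P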

The following output shows detailed information on bandwidth and latency between the 2 RTX 2080 GPUs.

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]


P2P Connectivity Matrix
     D\D     0     1
     0	     1     1
     1	     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 389.09   5.82
     1   5.82 389.35
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1
     0 386.63  24.23
     1  24.23 389.76
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 386.41  11.59
     1  11.57 391.01
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 382.58  48.37
     1  47.95 390.62
P2P=Disabled Latency Matrix (us)
   GPU     0      1
     0   1.67  20.55
     1  11.36   1.64

   CPU     0      1
     0   4.01   8.29
     1   8.37   3.65
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1
     0   1.67   0.92
     1   0.92   1.64

   CPU     0      1
     0   3.70   2.79
     1   2.95   3.68
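
This bandwidth/latency output is what the p2pBandwidthLatencyTest utility from the CUDA samples produces (note the banner at the top). Building and running it is similar to the sketch above (same assumed samples location),

# Build and run the bandwidth/latency matrix test
cd ~/NVIDIA_CUDA-10.0_Samples/1_Utilities/p2pBandwidthLatencyTest
make
./p2pBandwidthLatencyTest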

For 2 RTX 2080 GPUs with NVLINK we see,

  • Unidirectional Bandwidth: 24 GB/s

  • Bidirectional Bandwidth: 48 GB/s

  • Latency (Peer-To-Peer Disabled),
    • GPU-GPU: 11-20 microseconds

  • Latency (Peer-To-Peer Enabled),
    • GPU-GPU: about 1 microsecond

Now on to something a bit more "real-world".

The convolutional neural network (CNN) and LSTM problems I'll test will not expose much of the benefit of using NVLINK. This is because their multi-GPU algorithms achieve parallelism mostly by distributing data as independent batches of images or words across the two GPUs. There is little use of GPU-to-GPU communication. Algorithms with finer grained parallelism that need more direct data and instruction access across the GPUs would benefit more.

One of the interesting questions this part of the testing will address is "is it better to get two RTX 2080s or one of the more expensive cards?". Let's find out.

I am mostly testing with benchmarks that I used in the recent post "NVIDIA RTX 2080 Ti vs 2080 vs 1080 Ti vs Titan V, TensorFlow Performance with CUDA 10.0". However, for the CNN I am using an older NGC docker image with TensorFlow 1.4 linked with CUDA 9.0 and NCCL. I'm using this in order to have multi-GPU support utilizing the NCCL communication library for the CNN code; the most recent version of that code does not support it. The LSTM "Billion Word" benchmark I'm running uses the newer version with TensorFlow 1.10 linked with CUDA 10.0.
I'll give the command-line input and some of the output for reference.

TensorFlow CNN: ResNet-50

Docker container image tensorflow:18.03-py2 from NGC,

docker run --runtime=nvidia --rm -it -v $HOME/projects:/projects nvcr.io/nvidia/tensorflow:18.03-py2

Example job command line and truncated startup output (with NVLINK bridge),

NGC/tensorflow/nvidia-examples/cnn# python nvcnn.py --model=resnet50 --batch_size=64 --num_gpus=2 --fp16

TensorFlow:  1.4.0
This script: nvcnn.py v1.4
Cmd line args:
  --model=resnet50
  --batch_size=64
  --num_gpus=2
  --fp16
Num images:  Synthetic
Model:       resnet50
Batch size:  128 global
             64 per device
Devices:     ['/gpu:0', '/gpu:1']
Data format: NCHW
Data type:   fp16
Have NCCL:   True
Using NCCL:  True

...
2018-10-11 01:01:05.405568: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Device peer to peer matrix
2018-10-11 01:01:05.405598: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] DMA: 0 1
2018-10-11 01:01:05.405604: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 0:   Y Y
2018-10-11 01:01:05.405609: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 1:   Y Y
...

Note that --fp16 means "use Tensor-cores".

I ran that job at FP32 and FP16 (Tensor-cores), both with and without NVLINK, on the 2 RTX 2080 GPUs.
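
For reference, the multi-GPU numbers in the table below come from runs like the one above with just the precision and GPU-count flags varied (the single-GPU numbers are from my earlier post, and the with/without NVLINK cases are the same commands run with the bridge installed or removed),

# Two GPUs, FP32 then FP16 (Tensor-cores)
python nvcnn.py --model=resnet50 --batch_size=64 --num_gpus=2
python nvcnn.py --model=resnet50 --batch_size=64 --num_gpus=2 --fp16
# Single GPU for comparison
python nvcnn.py --model=resnet50 --batch_size=64 --num_gpus=1 --fp16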

GPU                   FP32 Images/sec   FP16 (Tensor-cores) Images/sec
GTX 1080 Ti           207               N/A
RTX 2080              207               332
RTX 2080 Ti           280               437
Titan V               299               547
2 x RTX 2080          364               552
2 x RTX 2080+NVLINK   373               566

[Chart: ResNet-50 with RTX NVLINK]


TensorFlow LSTM: Big-LSTM 1 Billion Word Dataset

Docker container image tensorflow:18.09-py3 from NGC,

docker run --runtime=nvidia --rm -it -v $HOME/projects:/projects nvcr.io/nvidia/tensorflow:18.09-py3

Example job command line and truncated startup output (no NVLINK bridge),

/NGC/tensorflow/nvidia-examples/big_lstm# python single_lm_train.py --mode=train --logdir=./logs --num_gpus=2 --datadir=./data/1-billion-word-language-modeling-benchmark-r13output/ --hpconfig run_profiler=False,max_time=90,num_steps=20,num_shards=8,num_layers=2,learning_rate=0.2,max_grad_norm=1,keep_prob=0.9,emb_size=1024,projected_size=1024,state_size=8192,num_sampled=8192,batch_size=256

...
2018-10-10 22:43:54.139543: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-10 22:43:54.139577: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0 1
2018-10-10 22:43:54.139583: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N N
2018-10-10 22:43:54.139603: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1:   N N
...
GPU                   FP32 words/sec
GTX 1080 Ti           6460
RTX 2080 (Note 1)     5071
RTX 2080 Ti           8945
Titan V (Note 2)      7066
Titan V (Note 3)      8373
2 x RTX 2080          8882
2 x RTX 2080+NVLINK   9711

[Chart: LSTM with RTX NVLINK]

  • Note 1: With only 8GB of memory on the RTX 2080 I had to drop the batch size down to 256 to keep from getting "out of memory" errors. That typically has a big (downward) influence on performance.
  • Note 2: For whatever reason this result for the Titan V is worse than expected. This is TensorFlow 1.10 linked with CUDA 10 running NVIDIA's code for the LSTM model. The RTX 2080 Ti performance was very good!
  • Note 3: I re-ran the "big-LSTM" job on the Titan V using TensorFlow 1.4 linked with CUDA 9.0 and got results consistent with what I have seen in the past. I have no explanation for the slowdown with the newer version of "big-LSTM".

This might not be an easy decision. I haven't done very much testing yet, and I'm still waiting for cards to do multi-GPU testing with the RTX 2080 Ti. I suspect that a multi-GPU configuration with RTX 2080 Ti's will become the new standard workstation setup for folks doing ML/AI work. I'm a little reluctant to recommend the RTX 2080 because of the 8GB memory limitation. It is, however, obviously a great card! If that is what your budget allows for, you would still be getting a solid performing card.

The NVLINK bridge is a nice option for use with two cards, but just two cards. There will be use cases where the NVLINK bridge has a significant impact, but a lot of GPU code is optimized to minimize communication. This is especially true for ML workloads, where parallelism is often achieved by distributing data across devices. Still, it is a very nice option to have! I will be looking for usages that highlight the advantages it offers in future "real-world" testing.

We have a lot of great GPUs to do computational work with right now. These RTX cards give great compute performance for the cost. And then there is the Titan V for when you need the wonderful double precision (FP64) performance of the Volta architecture.

It shouldn't be too much longer before I get a chance to look at performance with multiple RTX 2080 Ti's. Really looking forward to that!

Happy computing! --dbk

Tags: NVLINK, RTX 2080, TensorFlow, CUDA, NVIDIA, ML/AI, Machine Learning
Chris Jensen

Were you able to run with a batch size of 512 with NVLink enabled? Curious if this allows a virtual 16GB (22!! with the Ti) for training big models.

Posted on 2018-10-18 01:32:59
Donald Kinghorn

In a way ... When using multiple cards it puts 256 on each GPU so the total was 512. I do understand what you are thinking but I didn't test that. It would be interesting to see if it would auto-magically allow 1 GPU process to use memory across both cards. I could have tested that by using --num_gpus=1 and batch_size=512 (or maybe 640 ...) The situation does come up where you have to access a single variable that won't fit in memory. That is usually a show stopper. We sell quite a few CPU systems with 512GB or 1TB of mem for situations like that. OK, now I'm really curious! I'll run down to the office and see if I can borrow the cards and the bridge again. I have a couple of problems I can look at, the example above, and I can try using a larger basis set for my PyTorch QM code

Posted on 2018-10-18 15:30:35
Donald Kinghorn

Nope no magic!

People keep talking about NVLINK memory pooling or something like that. I don't know what that is/means. Each GPU has its own memory ... very fast low latency memory. NVLINK is a fast communication bus i.e. a better alternative to PCIe, but the latency is much longer than direct GDDR6 memory access on a single card ... I'm not sure how a single process on 1 GPU could allocate memory on another GPU and do writes and reads to it? If that is possible I would expect the performance to be poor ?? I'm not sure what the Quadro folks are talking about when they say it allows memory pooling, does that mean a combined memory frame buffer? I think that would be very different than a computational memory space

Posted on 2018-10-19 00:53:41
Michael Lu

Thanks for running these tests. Here are results for ResNet-50 using two 2080 Tis with and without NVLink:

https://uploads.disquscdn.c...

Posted on 2018-10-18 12:50:31
Donald Kinghorn

nice! We're still waiting on our cards to show up. That is a much better relative speedup from NVLINK than what I saw with the 2080s. That is going to end up being a great setup for machine learning!

Posted on 2018-10-18 15:28:11
lemans24

Don...excellent article once again!!!
Definitely should be using dual 2080ti as MINIMUM for realistic ML/AI development.
I am not doing any ML/AI dev YET, but no way can I wait days for training...I want results in less than a second!! LOL (I am an options trader)

The Titan RTX should really be a beast when it comes out whether it has 12GB or more
...but more and more, I hope they come out with a Titan V 2019 with 32GB for $3000!!!

I am currently running my monte carlo simulations in batches of 4096 per gpu with each batch doing 524,000+ simulations on a Titan Xp and 1080ti.
As good as they are, this is dev only, but at a minimum for a single production gpu server I need 4 Titan V 32GB cards to be able to use my simulation output in realtime, with realtime defined in my case as execution of ALL 4096 batches over all cards in under a second.

And guess what: that is just for ONE trade with ONE instrument!! love these gaming cards...

Keep up these excellent articles and it would be great to see some more example code in Python and especially direct CUDA code in c/c++

Posted on 2018-10-23 12:51:58