
NVLINK on RTX 2080 TensorFlow and Peer-to-Peer Performance with Linux

Written on October 16, 2018 by Dr Donald Kinghorn


NVLINK is one of the more interesting features of NVIDIA's new RTX GPUs. In this post I'll take a look at the performance of NVLINK between 2 RTX 2080 GPUs along with a comparison against the single-GPU testing I did recently. The testing will be a simple look at the raw peer-to-peer data transfer performance and a couple of TensorFlow job runs with and without NVLINK.

For most people outside of the HPC world NVLINK is unfamiliar. The GeForce RTX Turing cards are the first to have this high-performance connection on a "consumer" GPU. NVLINK is the high-performance GPU-to-GPU interconnect fabric that was first used on motherboards for server gear using NVIDIA's SXM GPU modules. The Quadro GP100, GV100 and upcoming Quadro RTX cards also have NVLINK. The cards use a bridge connector similar to an SLI bridge. In fact, the NVLINK bridge on the RTX 20xx series is used to provide SLI capability. The NVLINK implementation on the RTX 2080 and 2080 Ti is a full NVLINK-2 implementation but is limited to 1 "link" (a.k.a. "brick") on the RTX 2080. It looks like there are 2 "links" on the RTX 2080 Ti but I haven't confirmed that they are aggregated yet (still waiting on a second card). The server Tesla SXM modules have 6 NVLINK-2 "links".

Note: on the IBM POWER8/POWER9 architecture NVLINK is also a high-performance interconnect from GPU to CPU. That is the hardware used on the Oak Ridge National Laboratory Summit supercomputer -- the fastest computer in the world right now.

My colleague William George has done some testing with NVLINK on Windows 10 and at this point it doesn't appear to be fully functional on that platform. You might want to check out his post NVIDIA GeForce RTX 2080 & 2080 Ti Do NOT Support Full NVLink in Windows 10. I think you can expect that to change soon. It should be fully functional on Windows 10 after a round of updates.

NVLINK is fully functional with the RTX 2080 on Ubuntu 18.04 with driver version 410.
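
You can confirm that the bridge is detected with nvidia-smi. For example, the link status and GPU-to-GPU topology can be checked with,

nvidia-smi nvlink -s
nvidia-smi topo -m

With the bridge installed, the topology matrix should show NV1 (one NVLINK connection) between the two RTX 2080's rather than a PCIe/host path. (I'm showing these here as a quick check; the detailed peer-to-peer testing is below.)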

I will be doing testing similar to what I did in the post NVIDIA RTX 2080 Ti vs 2080 vs 1080 Ti vs Titan V, TensorFlow Performance with CUDA 10.0. I'll include some results from that post for comparison.


Test system

Hardware

  • Puget Systems Peak Single
  • Intel Xeon W-2175 14-core
  • 128GB Memory
  • 1TB Samsung NVMe M.2
  • GPUs tested:
    • GTX 1080 Ti
    • RTX 2080 (2)
    • RTX 2080 Ti
    • Titan V

Software

Two TensorFlow builds were used since the latest version of the TensorFlow docker image on NGC does not support multi-GPU for the CNN ResNet-50 training test job I like to use. For the "Big LSTM billion word" model training I used the latest container with TensorFlow 1.10 linked with CUDA 10.0. Both of the test programs are from "nvidia-examples" in the container instances.

For details on how I have Docker/NVIDIA-Docker configured on my workstation, have a look at the following post along with the links it contains to the rest of that series: How-To Setup NVIDIA Docker and NGC Registry on your Workstation - Part 5 Docker Performance and Resource Tuning.
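
For reference, the two NGC container images used for the jobs in this post can be pulled directly once you are logged in to the NGC registry,

docker pull nvcr.io/nvidia/tensorflow:18.03-py2
docker pull nvcr.io/nvidia/tensorflow:18.09-py3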


Checking the link capabilities (for example with nvidia-smi nvlink -c, the command mentioned in the comments at the end of this post) shows there is one link available:

  • Link 0, P2P is supported: true
  • Link 0, Access to system memory supported: true
  • Link 0, P2P atomics supported: true
  • Link 0, System memory atomics supported: true
  • Link 0, SLI is supported: true
  • Link 0, Link is supported: false

I believe that the "Link is supported: false" line is referring to the CPU-GPU connection on the IBM POWER architecture. I'm not completely sure about that since I cannot find any information about it.

This test provides some additional information and the performance of a CUDA memory copy from GPU to GPU. (I'm listing only additional information from the output in the form of questions and answers.)
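
This output appears to be from the simpleP2P example in the CUDA samples. If you want to try it yourself, something like the following should work (the path assumes a default CUDA 10.0 install; copy the samples to a writable directory first if needed),

cd /usr/local/cuda/samples/0_Simple/simpleP2P
make
./simpleP2P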

Does NVIDIA GeForce RTX NVLINK support Peer-To-Peer memory access?

Checking GPU(s) for support of peer to peer memory access...

  • Peer access from GeForce RTX 2080 (GPU0) -> GeForce RTX 2080 (GPU1) : Yes
  • Peer access from GeForce RTX 2080 (GPU1) -> GeForce RTX 2080 (GPU0) : Yes
    Enabling peer access between GPU0 and GPU1...

Does NVIDIA GeForce RTX NVLINK support Unified Virtual Addressing (UVA)?

Checking GPU0 and GPU1 for UVA capabilities...

  • GeForce RTX 2080 (GPU0) supports UVA: Yes
  • GeForce RTX 2080 (GPU1) supports UVA: Yes
    Both GPUs can support UVA, enabling...

How Fast is NVIDIA GeForce RTX NVLINK for CUDA Memory Copy?

  • cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 22.53GB/s

The following output shows detailed information on bandwidth and latency between the 2 RTX 2080 GPUs.
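
This is the p2pBandwidthLatencyTest program from the CUDA samples (the same one I suggest compiling in the comments below). To run it,

cd /usr/local/cuda/samples/1_Utilities/p2pBandwidthLatencyTest
make
./p2pBandwidthLatencyTest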

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]


P2P Connectivity Matrix
     D\D     0     1
     0	     1     1
     1	     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 389.09   5.82
     1   5.82 389.35
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1
     0 386.63  24.23
     1  24.23 389.76
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 386.41  11.59
     1  11.57 391.01
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 382.58  48.37
     1  47.95 390.62
P2P=Disabled Latency Matrix (us)
   GPU     0      1
     0   1.67  20.55
     1  11.36   1.64

   CPU     0      1
     0   4.01   8.29
     1   8.37   3.65
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1
     0   1.67   0.92
     1   0.92   1.64

   CPU     0      1
     0   3.70   2.79
     1   2.95   3.68

For 2 RTX 2080 GPUs with NVLINK we see,

  • Unidirectional Bandwidth: 24 GB/s
  • Bidirectional Bandwidth: 48 GB/s
  • Latency (Peer-To-Peer Disabled), GPU-GPU: 11-20 microseconds
  • Latency (Peer-To-Peer Enabled), GPU-GPU: approximately 1 microsecond

Now on to something a bit more "real-world".

The convolutional neural network (CNN) and LSTM problems I'll test will not expose much of the benefit of using NVLINK. This is because their multi-GPU algorithms achieve parallelism mostly by distributing data as independent batches of images or words across the two GPUs. There is little use of GPU-to-GPU communication. For example, in the ResNet-50 job below the global batch of 128 images is simply split into two independent batches of 64, one per GPU, and only the gradients are combined at the end of each step. Algorithms with finer-grained parallelism that need more direct data and instruction access across the GPUs would benefit more.

One of the interesting questions this part of the testing will address is "is it better to get 2 RTX 2080's or one of the more expensive cards?". Let's find out.

I am mostly testing with benchmarks that I used in the recent post "NVIDIA RTX 2080 Ti vs 2080 vs 1080 Ti vs Titan V, TensorFlow Performance with CUDA 10.0". However, for the CNN I am using an older NGC docker image with TensorFlow 1.4 linked with CUDA 9.0 and NCCL. I'm using this in order to have multi-GPU support utilizing the NCCL communication library for the CNN code; the most recent version of that code does not support this. The LSTM "Billion Word" benchmark I'm running is using the newer version with TensorFlow 1.10 linked with CUDA 10.0.
I'll give the command-line input and some of the output for reference.

TensorFlow CNN: ResNet-50

Docker container image tensorflow:18.03-py2 from NGC,

docker run --runtime=nvidia --rm -it -v $HOME/projects:/projects nvcr.io/nvidia/tensorflow:18.03-py2

Example job command-line and truncated startup output (with NVLINK bridge):

NGC/tensorflow/nvidia-examples/cnn# python nvcnn.py --model=resnet50 --batch_size=64 --num_gpus=2 --fp16

TensorFlow:  1.4.0
This script: nvcnn.py v1.4
Cmd line args:
  --model=resnet50
  --batch_size=64
  --num_gpus=2
  --fp16
Num images:  Synthetic
Model:       resnet50
Batch size:  128 global
             64 per device
Devices:     ['/gpu:0', '/gpu:1']
Data format: NCHW
Data type:   fp16
Have NCCL:   True
Using NCCL:  True

...
2018-10-11 01:01:05.405568: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Device peer to peer matrix
2018-10-11 01:01:05.405598: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] DMA: 0 1
2018-10-11 01:01:05.405604: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 0:   Y Y
2018-10-11 01:01:05.405609: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 1:   Y Y
...

Note that --fp16 means "use Tensor-cores".

I ran that job at FP32 and FP16 (Tensor-cores), both with and without NVLINK, on the 2 RTX 2080 GPUs.
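
The FP32 numbers below come from the same job command with the --fp16 flag dropped, and the single-GPU runs use --num_gpus=1, i.e. variations like,

python nvcnn.py --model=resnet50 --batch_size=64 --num_gpus=2
python nvcnn.py --model=resnet50 --batch_size=64 --num_gpus=1 --fp16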

GPU                      FP32 (Images/sec)    FP16 Tensor-cores (Images/sec)
GTX 1080 Ti              207                  N/A
RTX 2080                 207                  332
RTX 2080 Ti              280                  437
Titan V                  299                  547
2 x RTX 2080             364                  552
2 x RTX 2080 + NVLINK    373                  566

ResNet-50 with RTX NVLINK


TensorFlow LSTM: Big-LSTM 1 Billion Word Dataset

Docker container image tensorflow:18.09-py3 from NGC,

docker run --runtime=nvidia --rm -it -v $HOME/projects:/projects nvcr.io/nvidia/tensorflow:18.09-py3

Example job command-line and truncated startup output (no NVLINK bridge):

/NGC/tensorflow/nvidia-examples/big_lstm# python single_lm_train.py --mode=train --logdir=./logs --num_gpus=2 --datadir=./data/1-billion-word-language-modeling-benchmark-r13output/ --hpconfig run_profiler=False,max_time=90,num_steps=20,num_shards=8,num_layers=2,learning_rate=0.2,max_grad_norm=1,keep_prob=0.9,emb_size=1024,projected_size=1024,state_size=8192,num_sampled=8192,batch_size=256

...
2018-10-10 22:43:54.139543: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-10 22:43:54.139577: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0 1
2018-10-10 22:43:54.139583: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N N
2018-10-10 22:43:54.139603: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1:   N N
...

GPU                      FP32 (words/sec)
GTX 1080 Ti              6460
RTX 2080 (Note 1)        5071
RTX 2080 Ti              8945
Titan V (Note 2)         7066
Titan V (Note 3)         8373
2 x RTX 2080             8882
2 x RTX 2080 + NVLINK    9711

LSTM with RTX NVLINK

  • Note 1: With only 8GB of memory on the RTX 2080 I had to drop the batch size down to 256 to keep from getting "out of memory" errors. That typically has a big (downward) influence on performance.
  • Note 2: For whatever reason this result for the Titan V is worse than expected. This is TensorFlow 1.10 linked with CUDA 10 running NVIDIA's code for the LSTM model. The RTX 2080 Ti performance was very good!
  • Note 3: I re-ran the "big-LSTM" job on the Titan V using TensorFlow 1.4 linked with CUDA 9.0 and got results consistent with what I have seen in the past. I have no explanation for the slowdown with the newer version of "big-LSTM".

This might not be an easy decision. I haven't done very much testing and I'm still waiting for cards to do multi-GPU testing with the RTX 2080 Ti. I suspect that a multi-GPU configuration with RTX 2080 Ti's will become the new standard workstation setup for folks doing ML/AI work. I'm a little reluctant to recommend the RTX 2080 because of the 8GB memory limitation. It is, however, obviously a great card! If that is what your budget allows for, you would still be getting a solid performing card.

The NVLINK bridge is a nice option for use with two cards, but just two cards. There will be use cases where the NVLINK bridge has a significant impact, but a lot of GPU code is optimized to minimize communication. This is especially true for ML workloads where parallelism is often achieved by distributing data across devices. Still, it is a very nice option to have! I will be looking for usages that highlight the advantages it offers for future "real-world" testing.

We have a lot of great GPUs to do computational work with right now. These RTX cards give great compute performance for the cost. And then there is the Titan V for when you need that wonderful double precision (FP64) performance of the Volta architecture.

It shouldn't be too much longer before I get a chance to look at performance with multiple RTX 2080 Ti's. Really looking forward to that!

Happy computing! --dbk

Tags: NVLINK, RTX 2080, TensorFlow, CUDA, NVIDIA, ML/AI, Machine Learning
Chris Jensen

Were you able to run with a batch size of 512 with NVLink enabled? Curious if this allows a virtual 16 GB (22!! with the Ti) for training big models.

Posted on 2018-10-18 01:32:59
Donald Kinghorn

In a way ... When using multiple cards it puts 256 on each GPU so the total was 512. I do understand what you are thinking but I didn't test that. It would be interesting to see if it would auto-magically allow 1 GPU process to use memory across both cards. I could have tested that by using --num_gpus=1 and batch_size=512 (or maybe 640 ...) The situation does come up where you have to access a single variable that won't fit in memory. That is usually a show stopper. We sell quite a few CPU systems with 512GB or 1TB of mem for situations like that. OK, now I'm really curious! I'll run down to the office and see if I can borrow the cards and the bridge again. I have a couple of problems I can look at, the example above, and I can try using a larger basis set for my PyTorch QM code

Posted on 2018-10-18 15:30:35
Donald Kinghorn

Nope no magic!

People keep talking about NVLINK memory pooling or something like that. I don't know what that is/means. Each GPU has its own memory ... very fast low latency memory. NVLINK is a fast communication bus i.e. a better alternative to PCIe, but the latency is much longer than direct GDDR6 memory access on a single card ... I'm not sure how a single process on 1 GPU could allocate memory on another GPU and do writes and reads to it? If that is possible I would expect the performance to be poor ?? I'm not sure what the Quadro folks are talking about when they say it allows memory pooling, does that mean a combined memory frame buffer? I think that would be very different than a computational memory space

Posted on 2018-10-19 00:53:41
Michael Lu

Thanks for running these tests. Here are results for ResNet-50 using two 2080 Tis with and without NVLink:

https://uploads.disquscdn.c...

Posted on 2018-10-18 12:50:31
Donald Kinghorn

Nice! We're still waiting on our cards to show up. That is a much better relative speedup from NVLINK than what I saw with the 2080's. That is going to end up being a great setup for machine learning!

Posted on 2018-10-18 15:28:11
lemans24

Don...excellent article once again!!!
Definitely should be using dual 2080ti as MINIMUM for realistic ML/AI development.
I am not doing any ML/AI dev YET, but no way can I wait days for training...I want results in less than a second!! LOL (I am an options trader)

The Titan RTX should really be a beast when it comes out whether it has 12GB or more
...but more and more, I hope they come out with a Titan V 2019 with 32GB for $3000!!!

I am currently running my monte carlo simulations in batches of 4096 per gpu with each batch doing 524,000+ simulations on a Titan Xp and 1080ti.
As good as they are, this is dev only, but for a single production gpu server I need, at a minimum, 4 Titan V 32GB cards to be able to use my simulation output in realtime, with realtime defined in my case as execution of ALL 4096 batches over all cards in under a second.

And guess what: that is just for ONE trade with ONE instrument!! love these gaming cards...

Keep up these excellent articles and it would be great to see some more example code in Python and especially direct CUDA code in c/c++

Posted on 2018-10-23 12:51:58
Eri Rubin

Hi, I just got a new server with 8 2080 Tis, but I don't have any NVLink bridges yet. When I tried checking if P2P is working, I got errors that it's not supported. Did any of you verify that P2P is working without the NVLink bridge?

Posted on 2018-12-26 15:27:19
Donald Kinghorn

You shouldn't need NVLINK. It does 'support' P2P but it is not 'necessary' for it. Also, the 2080Ti only supports 2 cards with NVLINK so even if you had the bridges you could only have pairs of cards connected. I wish I had done more comparative testing to show what difference it makes. I expect it to give only a small improvement for most programs.

Try compiling the p2pBandwidthLatencyTest code from CUDA samples and see what you get. In fact I would appreciate it if you could paste the output back here in the comments if you can! I haven't had a lot of time on the new RTX cards and only a short bit of testing with 4 2080Ti's, that was in the Threadripper post testing NAMD ...

An 8 card config would be a good test! I'll try to arrange to do that sometime in the next few weeks. I should also do some testing showing the effect of NVLINK on some common job runs.

If you get errors with p2pBandwidthLatencyTest code let me know!

Posted on 2019-01-02 20:30:38
Sergei Miliaev

Hi Donald, do you have any updates on the test with 8 2080Ti's?

Posted on 2019-02-06 10:07:10
Donald Kinghorn

Hi Sergei, I have not gotten to this yet. I'm planning another round of GPU testing toward the end of the month. I hope I can get the 8-GPU testing in. It is hard to get the 2080 Ti's right now so I'll have to sneak in a test when we have a bunch in for builds. I really don't know how much of a problem memcpy for communication will be. I suppose it is going to depend on the code being run and the job. For data-parallel code with a job size big enough to keep the GPUs busy I'd guess it wouldn't have much impact. But, for code that needs all-to-all communication it could really slow things down.

Posted on 2019-02-06 23:58:19
Donald Kinghorn

... this came up on another post too. After what I had written below ... I'm going to do some testing because something doesn't look right. I checked a couple of 10xx cards and the p2p test from cuda 9.2 "says" something different from cuda 10 but the actual numbers are the same. I had some test numbers for dual 2080's and they showed bandwidth that looked like X8 PCIe. So, I need to check all of this out and confirm what is going on. (and how much difference it makes)

Posted on 2019-01-04 00:32:26
Anton Rager

Thanks for the tensorflow multi-GPU benchmarks! Based on your tests, I think I'll stick with no NVLink for a 2 x RTX 2080 ML workstation and focus on FP16 model optimization.

I am running into OOM/pinned host memory errors with the big_lstm with batch sizes of 256 and was wondering if this is something you've seen as well. Here's the setup: Ryzen7 2700 CPU, MSI X470 Gaming Plus mobo, 32GB RAM, 800W PSU and two RTX 2080s (no NVLink, Asus turbo/blower 2080, Zotac blower 2080).

Both GPUs work great for superposition, hashcat, other tensorflow work so I think GPUs are fine. I'm running "Billion Word" big_lstm in NGC's nvcr.io/nvidia/tensorflow:1... container on Ubuntu 18.04 and get errors with both the 410.93 and 415.25 Nvidia drivers (ppa installed or via run files)

Typically it hits a bunch of tensorflow errors like this first:
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7039 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080, pci bus id: 0000:1d:00.0, compute capability: 7.5)
2019-01-07 15:31:02.832396: E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to alloc 8589934592 bytes on host: CUresult(304)
2019-01-07 15:31:02.832479: W ./tensorflow/core/common_runtime/gpu/cuda_host_allocator.h:40] could not allocate pinned host memory of size: 8589934592

Eventually OOMs:
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[99184,1024] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cuda_host_bfc

I get these errors with both GPUs (--num_gpus=2) as well as running against each card individually. I'm suspecting Ryzen 7 and/or mobo chipset at this point as I've even swapped multiple sticks of DDR4.

Posted on 2019-01-09 00:56:18
Donald Kinghorn

I just saw this! There are a couple of things that could be causing the OOM. You may need to drop the batch size a little bit more than 256. That worked OK for the cards that I used but I have noticed variation between different manufacturers before. I always seem to get a little more out of EVGA cards than others for some reason. There can also be differences in the amount of memory used by your desktop. I have compositing turned off on my system for example. Another thing that I run into is memory leaks that leave allocated mem on the cards and then cause OOMs on later runs. I see this a lot when I use Jupyter notebooks. If I have an OOM I completely shut down Jupyter and restart. This can happen from the Python REPL too ... Hope this helps, sorry it's late!

Posted on 2019-01-29 21:18:14
Anton Rager

I finally figured it out thanks to VAST.AI and a variety of system/card configs. It appears to be a host memory size issue. Your system is 128GB and from testing 64GB allows big_lstm to complete w/ batch=256, but 32GB gets OOM issues. Ordering more memory for my workstation now :)

Posted on 2019-02-15 17:38:07
Donald Kinghorn

Interesting! I have been spoiled by having "large mem" on my personal system and systems I've been testing with.

I'm glad you got this sorted out and thank you for letting me know! That's a valuable "data point"

My rule of thumb for system memory when doing GPU computing is to have at least twice the system memory as you have total GPU memory. That's to allow for pinning and buffering etc. With 2 2080's (2 x 8GB = 16GB of GPU memory) and 32GB of sys mem you are right at that, which makes what you saw with the LSTM "extra interesting". More memory is always a good upgrade. I'm guessing you will definitely notice the difference in how the system "feels" too.
best wishes --Don

Posted on 2019-02-17 21:41:02
omsrisagar

Dear Donald,

I have two RTX 2080 Tis installed on my system. I recently purchased a Quadro RTX NVLink (2-slot) bridge and installed it too. May I know how you are enabling/disabling NVLink on your system when you do these experiments? (Are you physically adding or removing the NVLink bridge?) Also, how do I know if the purchased Quadro RTX NVLink bridge is working?

Thank you
Vidyasagar.

Posted on 2019-01-09 03:48:23
Donald Kinghorn

I do the testing by rebooting with or without the bridge attached.
To check the NVLINK status you can use

nvidia-smi nvlink -h

to see a bunch of options.

nvidia-smi nvlink -c will give you "capabilities". There are settings to check how much traffic went over the bridge during a job run. That can be interesting to look at; it's often not a lot of data transfer, which is why NVLINK may not help much for some jobs.

Posted on 2019-01-29 21:25:42
omsrisagar

Thank you Donald, I appreciate your helpful response.

Posted on 2019-01-29 22:40:59
Juan Nunez

How did you turn on NVLINK on an Ubuntu system? Or does it "just work" if the NVLINK Bridge is connected to the two GeForce RTXs?
* On Windows to "enable NVLINK" you have to turn on SLI.

Posted on 2019-03-11 21:59:11
Donald Kinghorn

On Linux it "just works" ... any software that is peer-to-peer aware will use it by default. For the RTX 20xx cards and Titan, if the NVLINK bridge is not there, then things fall back to memcpy to buffers in CPU memory space. On the RTX Quadro cards the fallback is to P2P over the PCIe bus ... like the older GTX cards.

I am writing up a post where I look at 4 GPUs and have testing with 2 NVLINK bridges on 4 cards. That "just works" on Linux too. On Windows you have to have a display connected to one card of each of the pairs and enable SLI.

Posted on 2019-03-12 17:47:39
Juan Nunez

Thank you for pointing out that very important characteristic regarding P2P and the different GPUs.

Non-NVLink P2P
* Quadros - P2P via PCIe
* GeForce / Titan - P2P via Host.

I look forward to your write-up on the 4 GPU setup.

It is very good of you to make all this information available, in particular on the Windows side of things. Not to undervalue Linux, but there just seems to be a lack of *public* information on the Windows side.

Thank you again, Dr. Kinghorn.

Posted on 2019-03-12 18:09:54