

Read this article at https://www.pugetsystems.com/guides/1331
Dr Donald Kinghorn (Scientific Computing Advisor)

P2P peer-to-peer on NVIDIA RTX 2080Ti vs GTX 1080Ti GPUs

Written on January 11, 2019 by Dr Donald Kinghorn


In my recent testing with NVIDIA's new RTX 20xx GPUs I had been focused on application performance. That included testing with the NVLINK bridge for direct GPU-to-GPU communication, i.e. peer-to-peer (P2P). I deliberately used an older version of NVIDIA's docker image for the "CNN" TensorFlow test application so that I would be running code linked with the NCCL library, which provides multi-GPU capability and takes advantage of P2P communication. The results on dual RTX 2080Ti showed only a modest (~7%) performance increase when using NVLINK.

NVLINK provides an impressive bi-directional bandwidth of nearly 95 GB/sec! "Normal" mem-copy provides 11.5 GB/s bidirectional bandwidth on the RTX 2080Ti. That is roughly 8 times less than with NVLINK but ...

...NVLINK didn't make as large an impact as I had expected. I assumed that was partly because "normal" P2P over the PCIe bus was fairly effective and the 4-5 fold bandwidth increase with NVLINK just didn't matter that much. What I didn't notice was that the communication bandwidth difference was actually nearly an order of magnitude! I had assumed that P2P worked as usual over PCIe with the new RTX cards, but in fact it is disabled on the new RTX GPUs unless you have the NVLINK bridge attached.

It was pointed out in the comments on one of my testing posts that P2P was disabled on RTX 20xx cards. That surprised me, since I had looked at test output that showed it but didn't notice the anomaly. I was just focused on application performance during the short time I had with the cards!

I have used the term "disabled"; however, I don't know that P2P over PCIe is actually disabled. This may just be a design trade-off of the Turing architecture(?). This is a complicated GPU with new features for ray tracing etc. I'm sure there were engineering decisions that had to be made to accommodate the many uses for these cards. P2P is only relevant for multi-GPU applications, and most applications do not specifically use it, even compute-focused ones. For a detailed discussion of the Turing architecture see the (86 page!) white paper "NVIDIA Turing GPU Architecture". (I have not read through this document yet.)

I have only tested with 2 GPUs. There may be more impact from the lack of P2P when 4 or more GPUs are being used. For code whose performance depends on GPU-to-GPU communication, more cards mean more potential for memory-access contention when using mem-copy. I will be testing with 4 cards in a few weeks and will report my findings. I suspect that "real" application performance will not be heavily impacted. The other consideration is that NVLINK on the RTX cards only supports two devices. So, if you have code that uses P2P and depends on it for performance, you will get great performance with two cards but could have difficulty with more GPUs.


What is NVIDIA CUDA Peer-to-Peer (P2P)?

OK, so what is P2P? In a very simplified description, P2P is functionality in NVIDIA GPUs that allows CUDA programs to access and transfer data from one GPU's memory to another without having to go through a shared pool of system memory attached to a CPU. It has been a feature of NVIDIA GPUs for 8 or 9 years. P2P and "Unified Virtual Addressing" (UVA) were big improvements in CUDA for GPU computing. They allow efficient use of multi-GPU and multi-node systems and simplify programming for highly parallel code.
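
To make that description concrete, here is a minimal sketch of how a CUDA program uses P2P: check that peer access is possible, enable it in both directions, and then copy (or directly address) memory on the other GPU. This is a hypothetical example for illustration (device numbers and buffer size are arbitrary), not code from the tests below.

// p2p_copy.cu -- sketch: enable peer access and copy a buffer GPU0 -> GPU1.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64ull * 1024 * 1024;   // 64MB, same size simpleP2P uses
    float *buf0 = nullptr, *buf1 = nullptr;

    // Allocate a buffer on each GPU
    cudaSetDevice(0);  cudaMalloc(&buf0, bytes);
    cudaSetDevice(1);  cudaMalloc(&buf1, bytes);

    // Enable peer access in both directions. Each call is made from the
    // "accessing" device. If P2P is not available these calls fail and
    // cudaMemcpyPeer below falls back to staging through system memory.
    cudaSetDevice(0);  cudaDeviceEnablePeerAccess(1, 0);
    cudaSetDevice(1);  cudaDeviceEnablePeerAccess(0, 0);

    // Copy directly from GPU0's memory to GPU1's memory
    cudaError_t err = cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
    cudaDeviceSynchronize();
    printf("peer copy: %s\n", cudaGetErrorString(err));

    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}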

On a workstation with 2-4 GPUs, P2P and UVA can give a modest performance improvement for some programs. For large GPU-accelerated supercomputers they allow a GPU on one node (system) to access memory on a GPU on another node by using RDMA (Remote Direct Memory Access) over a high-speed network like InfiniBand. That is very important for massively parallel supercomputing and is driving the performance of the fastest computer systems in the world. On workstations it is significant but not nearly as important, since standard mem-copy back and forth through a CPU's memory pool is reasonably efficient. Note: this is exactly why a single-CPU system is often a better choice for a GPU-accelerated workstation; a dual-CPU system can suffer slow-downs caused by memory transfers from one CPU's memory space to a GPU attached to the PCIe lanes on the other CPU!

Following are diagrams sourced from older NVIDIA developer blogs that give a good visual representation of P2P and UVA.


This diagram shows how memory is transferred using a shared pool of memory ("SysMem") that is attached to the CPU. "Chipset" would be the PCIe bus.

[Image: P2P direct]


Here is an illustration of two types of P2P communication, Access and Transfer.

[Image: P2P access]


For completeness, this diagram shows how Unified Virtual Addressing (UVA) appears to the CPU and GPUs. You should understand that UVA is largely a convenience for programmers. It makes memory management easier by providing a single memory address space and reduces the amount of code that has to be written. On a single workstation it probably doesn't improve performance. UVA is not the same thing as "memory pooling". I believe memory pooling is something that some NVIDIA Quadro cards can do to provide a large graphics frame-buffer across multiple cards. (I'm sure someone will point it out in the comments if I'm wrong.)

[Image: UVA]
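
To see why UVA is mostly a programmer convenience, consider this minimal sketch (a hypothetical example, not from this post): with UVA every allocation lives in one address space, so a generic cudaMemcpy with cudaMemcpyDefault works between any two devices and the runtime can report which GPU owns a given pointer.

// uva_copy.cu -- sketch: with UVA the runtime infers where each pointer lives.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;               // 1MB
    float *d0 = nullptr, *d1 = nullptr;

    cudaSetDevice(0);  cudaMalloc(&d0, bytes);
    cudaSetDevice(1);  cudaMalloc(&d1, bytes);

    // No cudaMemcpyDeviceToDevice / HostToDevice bookkeeping needed;
    // cudaMemcpyDefault lets the runtime work out both locations from the
    // unified address space (it uses P2P if it is enabled, otherwise it
    // stages the copy through system memory).
    cudaMemcpy(d1, d0, bytes, cudaMemcpyDefault);

    // The runtime can also tell you which device owns a pointer.
    cudaPointerAttributes attr;
    cudaPointerGetAttributes(&attr, d0);
    printf("d0 lives on device %d\n", attr.device);

    cudaFree(d1);
    cudaSetDevice(0);
    cudaFree(d0);
    return 0;
}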


Look at the diagrams above and replace PCIe with NVLINK. NVLINK provides the same functionality as PCIe, but with lower latency and higher bandwidth. NVLINK bandwidth is about 4-5 times that of PCIe v3 X16. However, keep in mind that NVLINK only comes into play during memory access or transfer between GPUs (except on IBM Power systems, where it connects to the CPUs too). Good parallel code will try to minimize "communication". NVLINK is definitely a good thing, but its high performance may not have as large an impact as you might think unless communication is the major bottleneck in your code. [NVLINK and UVA become a significant performance boost when the communication is over network fabric between nodes in a cluster.]
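
For code that wants to adapt to whatever link is actually present (NVLINK, PCIe P2P, or nothing), the CUDA runtime exposes a few per-pair attributes. A short hypothetical sketch:

// p2p_attr.cu -- sketch: query link attributes between device 0 and device 1.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int access = 0, rank = 0, atomics = 0;

    // Is any direct P2P access between the two devices supported?
    cudaDeviceGetP2PAttribute(&access,  cudaDevP2PAttrAccessSupported,       0, 1);
    // Relative performance rank of the link between the two devices
    cudaDeviceGetP2PAttribute(&rank,    cudaDevP2PAttrPerformanceRank,       0, 1);
    // Are native atomic operations supported over the link?
    cudaDeviceGetP2PAttribute(&atomics, cudaDevP2PAttrNativeAtomicSupported, 0, 1);

    printf("GPU0 <-> GPU1  access: %d  perf rank: %d  native atomics: %d\n",
           access, rank, atomics);
    return 0;
}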

Note: We should see PCIe version 4 chip-sets from Intel and AMD, along with motherboards that support it, by the end of 2019. That will be a significant improvement over the current PCIe v3.


Test setup and Results

The most important thing to keep in mind with the results that follow is that I am only looking at P2P and NVLINK with two GPUs. With 4 or more GPUs the effect of not having P2P available will possibly have more impact. NVLINK on all of the RTX cards only supports a connection between two devices.

Testing hardware and software

Hardware (My personal Workstation)

Software

Two TensorFlow builds were used, since the latest version of the TensorFlow docker image on NGC does not support multi-GPU for the CNN ResNet-50 training test job I like to use. For the "Big LSTM billion word" model training I used the latest container, with TensorFlow 1.10 linked against CUDA 10.0. Both of the test programs are from "nvidia-examples" in the container instances.

For details on how I have Docker/NVIDIA-Docker configured on my workstation have a look at the following post along with the links it contains to the rest of that series of posts. How-To Setup NVIDIA Docker and NGC Registry on your Workstation - Part 5 Docker Performance and Resource Tuning


Results Summary

The following two tables present a condensed representation of the main results of the P2P impact testing. There is an appendix with the (trimmed) raw output data.

Direct "Synthetic" Measurment of P2P Performance for 2 x RTX 2080Ti and 2 x GTX 1080Ti

Test                                               2 x RTX 2080Ti   2 x 2080Ti + NVLINK   2 x GTX 1080Ti
P2P Fabric                                         NONE             NVLINK                PCIe
Bidirectional Bandwidth, P2P Enabled               11.5 GB/s        93.6 GB/s             20.3 GB/s
Bidirectional Bandwidth, P2P Disabled ("memcopy")  11.5 GB/s        11.3 GB/s             20.2 GB/s
GPU-GPU Latency, P2P Enabled                       12.5 us          1.7 us                1.2 us
GPU-GPU Latency, P2P Disabled                      11 us            11.9 us               11 us

Notes:

  • The numbers in these results generally have a 10-15% error margin.
  • The mem-copy bandwidth for the 2080Ti's is closer to what would be expected with GPUs connected to PCIe X8 slots, yet all of the tests were run with the cards in PCIe X16 slots!
  • The bandwidth for the 1080Ti's was invariant to enabling P2P, but the latency showed a significant improvement.
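
If you want a rough home-grown version of the bidirectional bandwidth numbers above without building the CUDA samples, the following sketch times simultaneous copies in both directions. It is a simplified stand-in for NVIDIA's p2pBandwidthLatencyTest, not the sample itself; buffer size and iteration count are arbitrary choices.

// bidir_bw.cu -- sketch: estimate bidirectional GPU0 <-> GPU1 copy bandwidth.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256ull * 1024 * 1024;   // 256MB per direction
    const int    iters = 20;
    float *d0, *d1;
    cudaStream_t s0, s1;

    cudaSetDevice(0);  cudaMalloc(&d0, bytes);  cudaStreamCreate(&s0);
    cudaSetDevice(1);  cudaMalloc(&d1, bytes);  cudaStreamCreate(&s1);

    // Comment these two lines out to measure the "P2P Disabled" (mem-copy) path.
    cudaSetDevice(0);  cudaDeviceEnablePeerAccess(1, 0);
    cudaSetDevice(1);  cudaDeviceEnablePeerAccess(0, 0);

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iters; ++i) {
        cudaMemcpyPeerAsync(d1, 1, d0, 0, bytes, s0);  // GPU0 -> GPU1
        cudaMemcpyPeerAsync(d0, 0, d1, 1, bytes, s1);  // GPU1 -> GPU0
    }
    cudaSetDevice(0);  cudaDeviceSynchronize();
    cudaSetDevice(1);  cudaDeviceSynchronize();
    auto t1 = std::chrono::high_resolution_clock::now();

    double sec = std::chrono::duration<double>(t1 - t0).count();
    double gb  = 2.0 * iters * bytes / 1.0e9;    // total bytes moved, both directions
    printf("bidirectional bandwidth: %.1f GB/s\n", gb / sec);
    return 0;
}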

The next results are using the machine learning applications that I have been using recently in my other GPU performance posts.

  • CNN - A convolutional neural network job measuring training performance for the ResNet-50 model.
  • BigLSTM - A Long Short-Term Memory network training job on a billion-word corpus.

Application Measurement of P2P Performance for 2 x RTX 2080Ti and 2 x GTX 1080Ti -- CNN and LSTM Deep-Learning

Test                             2 x RTX 2080Ti   2 x 2080Ti + NVLINK   2 x GTX 1080Ti
CNN (ResNet-50, train)           476.6 images/s   490.1 images/s        366.8 images/s
CNN (ResNet-50, train) fp16      735.4 images/s   760.9 images/s        N/A
Big-LSTM (billion word, train)   15496 words/s    16753 words/s         11462 words/s

Notes:

  • fp16 means "using Tensor-cores". The GTX 10xx GPUs do not have fp16 Tensor-cores available.

Conclusions

The bottom line is that NVLINK and P2P show amazing performance when measured directly, but the impact on application performance is likely to be minimal for most multi-GPU programs running on a workstation. However, that is not always going to be the case. I don't know of any programs that are severely limited by the lack of P2P, but YOU may have code that you know is! In that case a 2-GPU system with NVLINK may give you great performance. However, if you want to use 4 or more GPUs, the RTX GPUs may not be so great if you are dependent on P2P. This is especially true since it looks like "normal" mem-copy is slower with the RTX GPUs. I will be testing a variety of GPUs in configurations up to 4 GPUs and will be particularly cognizant of the effects of P2P, or rather, the lack of it.

I also did all of the testing in this post on a pair of RTX Titan GPUs and the results were similar to what I observed with the 2080Ti. (I'll be writing an RTX Titan post soon.) There is no magic greatness with the Titan over the 20xx cards ... but the performance is pretty good! ... and 24GB of memory is nice too!

My colleague William George is doing some testing with Quadro RTX cards right now and it looks like there is no magic there either as far as P2P goes. You may see some comments on this post from him.

I think what we are seeing is just part of the design trade-offs of the Turing architecture. This is a great GPU with some very interesting and innovative features. I am impressed with the compute performance and feel they are a nice improvement over the 10xx cards. I'm looking forward to seeing what developers do with the Ray-Tracing features too!

When I get my comprehensive multi-GPU testing done I'll have more to say about Turing GPU's for compute. From what I've seen so far I'm pleased with performance.

Comments are welcome!

Happy computing --dbk

Appendix: Raw output (somewhat trimmed for space)

If you want details on what I did and what I saw during this testing the following output has my command lines and program output. Enjoy!

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 x 1080Ti ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

kinghorn@i9:~/Documents/Puget/blog-posts$ nvidia-smi
Fri Jan  4 14:32:00 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:65:00.0  On |                  N/A |
| 28%   34C    P8    12W / 250W |    146MiB / 11175MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:B3:00.0 Off |                  N/A |
| 28%   25C    P8     8W / 250W |      2MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

kinghorn@i9:~/projects/samples-10.0/bin/x86_64/linux/release$ ./simpleP2P
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2
> GPU0 = "GeForce GTX 1080 Ti" IS  capable of Peer-to-Peer (P2P)
> GPU1 = "GeForce GTX 1080 Ti" IS  capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access...
> Peer access from GeForce GTX 1080 Ti (GPU0) -> GeForce GTX 1080 Ti (GPU1) : Yes
> Peer access from GeForce GTX 1080 Ti (GPU1) -> GeForce GTX 1080 Ti (GPU0) : Yes
Enabling peer access between GPU0 and GPU1...
Checking GPU0 and GPU1 for UVA capabilities...
> GeForce GTX 1080 Ti (GPU0) supports UVA: Yes
> GeForce GTX 1080 Ti (GPU1) supports UVA: Yes
Both GPUs can support UVA, enabling...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 9.67GB/s

kinghorn@i9:~/projects/samples-10.0/bin/x86_64/linux/release$ ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, GeForce GTX 1080 Ti, pciBusID: 65, pciDeviceID: 0, pciDomainID:0
Device: 1, GeForce GTX 1080 Ti, pciBusID: b3, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0	     1     1
     1	     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 351.91  11.48
     1  11.52 354.79
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1
     0 352.55  10.40
     1  10.38 355.11
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 355.00  20.18
     1  20.14 356.41
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 354.62  20.30
     1  20.28 355.28
P2P=Disabled Latency Matrix (us)
   GPU     0      1
     0   1.46  10.85
     1  12.11   1.63

   CPU     0      1
     0   3.04   7.42
     1   7.32   2.99
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1
     0   1.46   1.17
     1   1.22   1.63

   CPU     0      1
     0   3.23   2.02
     1   2.02   3.05

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

root@ae162b83b0ff:/projects/NGC/tensorflow/nvidia-examples/cnn# python nvcnn.py --model=resnet50 --batch_size=64 --num_gpus=2
TensorFlow:  1.4.0
This script: nvcnn.py v1.4
Cmd line args:
  --model=resnet50
  --batch_size=64
  --num_gpus=2
Num images:  Synthetic
Model:       resnet50
Batch size:  128 global
             64 per device
Devices:     ['/gpu:0', '/gpu:1']
Data format: NCHW
Data type:   fp32
Have NCCL:   True
Using NCCL:  True
Using XLA:   False
Building training graph
Creating session
2019-01-04 23:13:01.204755: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:65:00.0
totalMemory: 10.91GiB freeMemory: 10.61GiB
2019-01-04 23:13:01.390387: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:b3:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2019-01-04 23:13:01.391392: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Device peer to peer matrix
2019-01-04 23:13:01.394017: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] DMA: 0 1
2019-01-04 23:13:01.394025: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 0:   Y Y
2019-01-04 23:13:01.394029: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 1:   Y Y
2019-01-04 23:13:01.394037: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0, compute capability: 6.1)
2019-01-04 23:13:01.394059: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:b3:00.0, compute capability: 6.1)
Initializing variables
Pre-filling input pipeline
Training
  Step Epoch Img/sec   Loss   LR
    ...
    47     1   365.6  12.486 0.10000
    48     1   365.7  12.407 0.10000
    49     1   368.1  12.393 0.10000
    50     1   367.4  12.333 0.10000
----------------------------------------------------------------
Images/sec: 366.8 +/- 0.3 (jitter = 1.0)
----------------------------------------------------------------

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

root@96f4797b1511:/workspace/nvidia-examples/big_lstm# python single_lm_train.py --mode=train --logdir=./logs --num_gpus=2 --datadir=/projects/NGC/tensorflow/nvidia-examples/big_lstm/data/1-billion-word-language-modeling-benchmark-r13output/ --hpconfig run_profiler=False,max_time=90,num_steps=20,num_shards=8,num_layers=2,learning_rate=0.2,max_grad_norm=1,keep_prob=0.9,emb_size=1024,projected_size=1024,state_size=8192,num_sampled=8192,batch_size=448
...
2019-01-05 02:00:25.535364: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0 1
2019-01-05 02:00:25.535371: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N Y
2019-01-05 02:00:25.535377: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1:   Y N
2019-01-05 02:00:25.538173: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10254 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0, compute capability: 6.1)
2019-01-05 02:00:25.640711: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10409 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:b3:00.0, compute capability: 6.1)
Processing file: /projects/NGC/tensorflow/nvidia-examples/big_lstm/data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00033-of-00100
Finished processing!
Iteration 1, time = 20.76s, wps = 863, train loss = 13.0247
Iteration 2, time = 17.34s, wps = 1033, train loss = 12.9863
Iteration 3, time = 1.56s, wps = 11452, train loss = 12.9114
Iteration 4, time = 1.57s, wps = 11430, train loss = 12.8277
Iteration 5, time = 1.57s, wps = 11418, train loss = 12.6549
Iteration 6, time = 1.57s, wps = 11436, train loss = 11.7663
Iteration 7, time = 1.56s, wps = 11462, train loss = 26.3362

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 x 2080Ti ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

kinghorn@i9:~$ nvidia-smi
Fri Jan  4 15:35:28 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:65:00.0  On |                  N/A |
| 41%   38C    P0    63W / 260W |    169MiB / 10986MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:B3:00.0 Off |                  N/A |
| 41%   32C    P8    11W / 260W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

kinghorn@i9:~/projects/samples-10.0/bin/x86_64/linux/release$ ./simpleP2P
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2
> GPU0 = "GeForce RTX 2080 Ti" IS  capable of Peer-to-Peer (P2P)
> GPU1 = "GeForce RTX 2080 Ti" IS  capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access...
> Peer access from GeForce RTX 2080 Ti (GPU0) -> GeForce RTX 2080 Ti (GPU1) : No
> Peer access from GeForce RTX 2080 Ti (GPU1) -> GeForce RTX 2080 Ti (GPU0) : No
Two or more GPUs with SM 2.0 or higher capability are required for ./simpleP2P.
Peer to Peer access is not available amongst GPUs in the system, waiving test.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

kinghorn@i9:~/projects/samples-10.0/bin/x86_64/linux/release$ ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, GeForce RTX 2080 Ti, pciBusID: 65, pciDeviceID: 0, pciDomainID:0
Device: 1, GeForce RTX 2080 Ti, pciBusID: b3, pciDeviceID: 0, pciDomainID:0
Device=0 CANNOT Access Peer Device=1
Device=1 CANNOT Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0	     1     0
     1	     0     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 530.74   5.79
     1   5.82 532.37
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1
     0 530.74   5.79
     1   5.81 532.32
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 535.21  11.52
     1  11.57 535.83
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 526.45  11.57
     1  11.50 527.09
P2P=Disabled Latency Matrix (us)
   GPU     0      1
     0   1.89  11.44
     1  14.94   1.31

   CPU     0      1
     0   3.04   7.11
     1   7.70   2.83
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1
     0   1.89  16.14
     1  12.53   1.33

   CPU     0      1
     0   3.05   7.19
     1   7.26   2.82

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

root@f11011faf31d:/projects/NGC/tensorflow/nvidia-examples/cnn# python nvcnn.py --model=resnet50 --batch_size=64 --num_gpus=2
TensorFlow:  1.4.0
This script: nvcnn.py v1.4
Cmd line args:
  --model=resnet50
  --batch_size=64
  --num_gpus=2
Num images:  Synthetic
Model:       resnet50
Batch size:  128 global
             64 per device
Devices:     ['/gpu:0', '/gpu:1']
Data format: NCHW
Data type:   fp32
Have NCCL:   True
Using NCCL:  True
Using XLA:   False
Building training graph
Creating session
2019-01-04 23:46:03.045773: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635
pciBusID: 0000:65:00.0
totalMemory: 10.73GiB freeMemory: 10.37GiB
2019-01-04 23:46:03.299061: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 1 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635
pciBusID: 0000:b3:00.0
totalMemory: 10.73GiB freeMemory: 10.53GiB
2019-01-04 23:46:03.299165: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Device peer to peer matrix
2019-01-04 23:46:03.301627: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] DMA: 0 1
2019-01-04 23:46:03.301636: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 0:   Y N
2019-01-04 23:46:03.301640: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 1:   N Y
2019-01-04 23:46:03.301648: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:65:00.0, compute capability: 7.5)
2019-01-04 23:46:03.301654: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:b3:00.0, compute capability: 7.5)
Initializing variables
Pre-filling input pipeline
Training
  Step Epoch Img/sec   Loss   LR
    47     1   477.9  11.907 0.10000
    48     1   472.7  11.862 0.10000
    49     1   474.7  11.885 0.10000
    50     1   474.9  11.872 0.10000
----------------------------------------------------------------
Images/sec: 476.6 +/- 0.5 (jitter = 2.7)
----------------------------------------------------------------

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

root@f11011faf31d:/projects/NGC/tensorflow/nvidia-examples/cnn# python nvcnn.py --model=resnet50 --batch_size=128 --num_gpus=2 --fp16
TensorFlow:  1.4.0
This script: nvcnn.py v1.4
Cmd line args:
  --model=resnet50
  --batch_size=128
  --num_gpus=2
  --fp16
Num images:  Synthetic
Model:       resnet50
Batch size:  256 global
             128 per device
Devices:     ['/gpu:0', '/gpu:1']
Data format: NCHW
Data type:   fp16
Have NCCL:   True
Using NCCL:  True
Using XLA:   False
Building training graph
Creating session
2019-01-04 23:56:44.953679: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635
pciBusID: 0000:65:00.0
totalMemory: 10.73GiB freeMemory: 10.37GiB
2019-01-04 23:56:45.217624: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 1 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635
pciBusID: 0000:b3:00.0
totalMemory: 10.73GiB freeMemory: 10.53GiB
2019-01-04 23:56:45.217745: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Device peer to peer matrix
2019-01-04 23:56:45.217827: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] DMA: 0 1
2019-01-04 23:56:45.217834: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 0:   Y N
2019-01-04 23:56:45.217840: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 1:   N Y
2019-01-04 23:56:45.217849: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:65:00.0, compute capability: 7.5)
2019-01-04 23:56:45.217856: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:b3:00.0, compute capability: 7.5)
Initializing variables
Pre-filling input pipeline
Training
  Step Epoch Img/sec   Loss   LR
    46     1   733.9  10.043 0.10000
    47     1   733.1  10.038 0.10000
    48     1   733.2  10.020 0.10000
    49     1   734.2  10.033 0.10000
    50     1   734.4  10.034 0.10000
----------------------------------------------------------------
Images/sec: 735.4 +/- 0.6 (jitter = 3.3)
----------------------------------------------------------------

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

root@98756a012bec:/workspace/nvidia-examples/big_lstm# python single_lm_train.py --mode=train --logdir=./logs --num_gpus=2 --datadir=/projects/NGC/tensorflow/nvidia-examples/big_lstm/data/1-billion-word-language-modeling-benchmark-r13output/ --hpconfig run_profiler=False,max_time=90,num_steps=20,num_shards=8,num_layers=2,learning_rate=0.2,max_grad_norm=1,keep_prob=0.9,emb_size=1024,projected_size=1024,state_size=8192,num_sampled=8192,batch_size=448
...
2019-01-05 01:44:32.742663: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0 1
2019-01-05 01:44:32.742672: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N N
2019-01-05 01:44:32.742677: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1:   N N
2019-01-05 01:44:32.745709: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10008 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:65:00.0, compute capability: 7.5)
2019-01-05 01:44:32.839719: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10171 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:b3:00.0, compute capability: 7.5)
Processing file: /projects/NGC/tensorflow/nvidia-examples/big_lstm/data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00097-of-00100
Finished processing!
Iteration 1, time = 20.43s, wps = 877, train loss = 13.0220
Iteration 2, time = 17.73s, wps = 1011, train loss = 12.9759
Iteration 3, time = 1.17s, wps = 15269, train loss = 12.9492
Iteration 4, time = 1.16s, wps = 15486, train loss = 12.8480
Iteration 5, time = 1.18s, wps = 15229, train loss = 12.6204
Iteration 6, time = 1.16s, wps = 15496, train loss = 11.6590
Iteration 7, time = 1.15s, wps = 15548, train loss = 29.8347
Iteration 8, time = 1.16s, wps = 15439, train loss = 55.7615

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 x 2080Ti + NVLINK +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

kinghorn@i9:~$ nvidia-smi  nvlink -c
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-10f085c9-3950-42ce-cca9-e3a7f64520d9)
	 Link 0, P2P is supported: true
	 Link 0, Access to system memory supported: true
	 Link 0, P2P atomics supported: true
	 Link 0, System memory atomics supported: true
	 Link 0, SLI is supported: true
	 Link 0, Link is supported: false
	 Link 1, P2P is supported: true
	 Link 1, Access to system memory supported: true
	 Link 1, P2P atomics supported: true
	 Link 1, System memory atomics supported: true
	 Link 1, SLI is supported: true
	 Link 1, Link is supported: false
GPU 1: GeForce RTX 2080 Ti (UUID: GPU-463f5ae1-b594-9afc-359f-def62ca73137)
	 Link 0, P2P is supported: true
	 Link 0, Access to system memory supported: true
	 Link 0, P2P atomics supported: true
	 Link 0, System memory atomics supported: true
	 Link 0, SLI is supported: true
	 Link 0, Link is supported: false
	 Link 1, P2P is supported: true
	 Link 1, Access to system memory supported: true
	 Link 1, P2P atomics supported: true
	 Link 1, System memory atomics supported: true
	 Link 1, SLI is supported: true
	 Link 1, Link is supported: false

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

kinghorn@i9:~/projects/samples-10.0/bin/x86_64/linux/release$ ./simpleP2P
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2
> GPU0 = "GeForce RTX 2080 Ti" IS  capable of Peer-to-Peer (P2P)
> GPU1 = "GeForce RTX 2080 Ti" IS  capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access...
> Peer access from GeForce RTX 2080 Ti (GPU0) -> GeForce RTX 2080 Ti (GPU1) : Yes
> Peer access from GeForce RTX 2080 Ti (GPU1) -> GeForce RTX 2080 Ti (GPU0) : Yes
Enabling peer access between GPU0 and GPU1...
Checking GPU0 and GPU1 for UVA capabilities...
> GeForce RTX 2080 Ti (GPU0) supports UVA: Yes
> GeForce RTX 2080 Ti (GPU1) supports UVA: Yes
Both GPUs can support UVA, enabling...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 43.58GB/s

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

kinghorn@i9:~/projects/samples-10.0/bin/x86_64/linux/release$ ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, GeForce RTX 2080 Ti, pciBusID: 65, pciDeviceID: 0, pciDomainID:0
Device: 1, GeForce RTX 2080 Ti, pciBusID: b3, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0	     1     1
     1	     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 530.58   5.78
     1   5.82 533.16
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1
     0 528.50  46.95
     1  46.97 532.39
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 533.91  11.32
     1  11.29 536.62
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 530.74  93.59
     1  93.73 535.10
P2P=Disabled Latency Matrix (us)
   GPU     0      1
     0   1.88  11.80
     1  12.15   1.82

   CPU     0      1
     0   2.93   7.74
     1   7.69   2.97
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1
     0   1.87   1.73
     1   1.72   1.82

   CPU     0      1
     0   3.08   2.16
     1   2.14   2.96

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

root@6e8ef3f22155:/projects/NGC/tensorflow/nvidia-examples/cnn# python nvcnn.py --model=resnet50 --batch_size=64 --num_gpus=2
TensorFlow:  1.4.0
This script: nvcnn.py v1.4
Cmd line args:
  --model=resnet50
  --batch_size=64
  --num_gpus=2
Num images:  Synthetic
Model:       resnet50
Batch size:  128 global
             64 per device
Devices:     ['/gpu:0', '/gpu:1']
Data format: NCHW
Data type:   fp32
Have NCCL:   True
Using NCCL:  True
Using XLA:   False
Building training graph
Creating session
2019-01-05 00:24:38.886533: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635
pciBusID: 0000:65:00.0
totalMemory: 10.73GiB freeMemory: 10.37GiB
2019-01-05 00:24:39.140620: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 1 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635
pciBusID: 0000:b3:00.0
totalMemory: 10.73GiB freeMemory: 10.53GiB
2019-01-05 00:24:39.140675: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Device peer to peer matrix
2019-01-05 00:24:39.140687: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] DMA: 0 1
2019-01-05 00:24:39.140693: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 0:   Y Y
2019-01-05 00:24:39.140697: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 1:   Y Y
2019-01-05 00:24:39.140705: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:65:00.0, compute capability: 7.5)
2019-01-05 00:24:39.140712: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:b3:00.0, compute capability: 7.5)
Initializing variables
Pre-filling input pipeline
Training
  Step Epoch Img/sec   Loss   LR
    46     1   489.6  11.960 0.10000
    47     1   492.0  11.914 0.10000
    48     1   499.8  11.869 0.10000
    49     1   498.0  11.888 0.10000
    50     1   499.9  11.874 0.10000
----------------------------------------------------------------
Images/sec: 490.1 +/- 0.6 (jitter = 1.2)
----------------------------------------------------------------

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

root@6e8ef3f22155:/projects/NGC/tensorflow/nvidia-examples/cnn# python nvcnn.py --model=resnet50 --batch_size=128 --num_gpus=2 --fp16
TensorFlow:  1.4.0
This script: nvcnn.py v1.4
Cmd line args:
  --model=resnet50
  --batch_size=128
  --num_gpus=2
  --fp16
Num images:  Synthetic
Model:       resnet50
Batch size:  256 global
             128 per device
Devices:     ['/gpu:0', '/gpu:1']
Data format: NCHW
Data type:   fp16
Have NCCL:   True
Using NCCL:  True
Using XLA:   False
Building training graph
Creating session
2019-01-05 00:26:04.149449: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635
pciBusID: 0000:65:00.0
totalMemory: 10.73GiB freeMemory: 10.37GiB
2019-01-05 00:26:04.400957: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 1 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635
pciBusID: 0000:b3:00.0
totalMemory: 10.73GiB freeMemory: 10.53GiB
2019-01-05 00:26:04.401010: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Device peer to peer matrix
2019-01-05 00:26:04.401019: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] DMA: 0 1
2019-01-05 00:26:04.401023: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 0:   Y Y
2019-01-05 00:26:04.401027: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 1:   Y Y
2019-01-05 00:26:04.401036: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:65:00.0, compute capability: 7.5)
2019-01-05 00:26:04.401042: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:b3:00.0, compute capability: 7.5)
Initializing variables
Pre-filling input pipeline
Training
  Step Epoch Img/sec   Loss   LR
    46     1   760.5  10.059 0.10000
    47     1   759.2  10.051 0.10000
    48     1   761.7  10.032 0.10000
    49     1   753.6  10.044 0.10000
    50     1   763.0  10.044 0.10000
----------------------------------------------------------------
Images/sec: 760.9 +/- 0.5 (jitter = 2.8)
----------------------------------------------------------------

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

root@6a2b08b4e5cd:/workspace/nvidia-examples/big_lstm# python single_lm_train.py --mode=train --logdir=./logs --num_gpus=2 --datadir=/projects/NGC/tensorflow/nvidia-examples/big_lstm/data/1-billion-word-language-modeling-benchmark-r13output/ --hpconfig run_profiler=False,max_time=90,num_steps=20,num_shards=8,num_layers=2,learning_rate=0.2,max_grad_norm=1,keep_prob=0.9,emb_size=1024,projected_size=1024,state_size=8192,num_sampled=8192,batch_size=448
...
2019-01-05 00:43:00.303275: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0 1
2019-01-05 00:43:00.303281: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N Y
2019-01-05 00:43:00.303285: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1:   Y N
2019-01-05 00:43:00.306050: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10014 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:65:00.0, compute capability: 7.5)
2019-01-05 00:43:00.400029: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10171 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:b3:00.0, compute capability: 7.5)
Processing file: /projects/NGC/tensorflow/nvidia-examples/big_lstm/data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00074-of-00100
Finished processing!
Iteration 1, time = 20.42s, wps = 877, train loss = 12.9972
Iteration 2, time = 16.79s, wps = 1067, train loss = 12.9753
Iteration 3, time = 1.10s, wps = 16339, train loss = 12.9377
Iteration 4, time = 1.10s, wps = 16297, train loss = 12.8125
Iteration 5, time = 1.07s, wps = 16732, train loss = 12.6209
Iteration 6, time = 1.10s, wps = 16316, train loss = 11.4865
Iteration 7, time = 1.07s, wps = 16753, train loss = 33.7183
Iteration 8, time = 1.08s, wps = 16561, train loss = 61.7707
Iteration 9, time = 1.07s, wps = 16673, train loss = 20.1651


Tags: P2P, RTX 2080Ti, NVIDIA, CUDA, Machine Learning
Adam Chodaba

you should test with TensorFlow 18.09+, preferably 18.12; NCCL 2.3.4, which adds support for Turing (https://docs.nvidia.com/dee...), is included starting in 18.09

Posted on 2019-01-15 08:21:25
Donald Kinghorn

Thank you! I had tried a newer version a while back and could not get it to work with multi-GPU. I used that older version because I knew it worked and I have a bunch of other testing with it for comparison. I've wanted to update all of the testing I'm doing. I'm planning a bigger test round with a bunch of different cards, 1-4 GPUs. Your comment is encouraging and you are absolutely right, I should be testing with the latest :-)

I'm doing some unrelated CPU stuff this week but I might be able to get a couple of job runs in ... If so I'll post back in the comments or, if it's significant, update this post

Posted on 2019-01-16 02:46:28
Adam Chodaba

if you do get 18.12 working make sure to try variable_update=parameter_server
local_parameter_device=cpu
and another configuration
variable_update=replicated
all_reduce_spec=nccl

as inputs to the benchmark. I've found both configurations to work well. ResNet-50 fp16 synthetic gets ~1000 images/s for me with 2 NVLinked 2080 Ti's, both in 16x slots. 4 2080 Ti's get 1930 images/s for me in the same benchmark, with 16x/8x NVLinked plus 16x/8x NVLinked. I think 8x maxes out at ~450 per GPU unless NVLinked with a 16x card, then it can approach ~500 per GPU.

Posted on 2019-01-16 16:16:25
lemans24

Not too impressed by the speedup of Nvlink and tensor cores so far...I think we would probably get much better speed up from major increase in fp32 cores as well as pcie gen 4.0. Nvlink will not be properly used unless you have access to a full speed switch bar that allows any gpu card to talk to any other gpu card attached to the switch bar at full speed which obviously Nvidia will never do as that is Tesla/DGX territory.
As for tensor cores, it seems very few apps get a 12x speed up...

Posted on 2019-01-17 00:52:48
Donald Kinghorn

That's all true. It's hard to say what NVIDIA will do in the future. They want people to do compute on Tesla but developers and experimenters are the ones that made all of this happen in the first place ... and that was on GeForce. These RTX cards really are nice for compute. The things like NVLINK and such are not so important for the most part. Mixed precision with fp16 is tricky; I feel it is best for inference and deployment with ML models that have been well trained and then "reduced" to mixed precision. In general the cards are great and I think most users are going to do well. I would love for them to maintain a serious workstation card like the Titan V at or near the current price. There are so many people that could benefit from that and they don't even know it yet! I'm giving a talk next month about scientific computing with PyTorch and I'll demo on a Titan V so I can show off the double precision. If this had been available 8 years ago the whole scientific programming community would be on board by now. I think it's critical that NVIDIA keeps an option like the Titan V available so that people don't have the barrier of going to the cloud. If they don't it will slow scientific progress.

Was really on my soap-box there for a bit :-)

Posted on 2019-01-18 00:38:33
Donald Kinghorn

that's what I needed! I did run with 18.12 yesterday and the results were better ... close to what you saw, and I saw the same thing with X8 ... I was messing around with job parameters and mpi options and felt I was seriously missing something because I haven't explored this version of the code. I did these runs while I was doing testing for a CPU comparison post. I can probably drop back into labs tomorrow and redo that with your suggestions while I still have access and setup. I seriously appreciate your suggestions! Thanks --Don

Posted on 2019-01-18 00:21:49
Jack

on the quadro rtx card did it have the same slow p2p bandwidth issue as the 2080 ti? according to nvidia GPU direct is only enabled on quadro and tesla. so quadro should have fast p2p just as the 1080ti. in 2080ti either by bug or design gpu direct does not work hence the slower 11.5GBps bandwidth. are you seeing 11.5 or the 20.3 for quadro rtx? pls advise even if prelim info as buying soon.

Posted on 2019-01-18 15:56:06
Donald Kinghorn

I just fired up 2 Quadro RTX 6000's. On PCIe everything says "Yes", i.e. P2P over PCIe is working and the numbers look the same as I have reported for the 1080Ti ... However, this is on Linux; it apparently is not enabled on Windows, i.e. everything says "No".

If this is a concern, then, unless you are looking at more than 2 cards, NVLINK will offer much better performance (5 X) and that works fine on Windows too.

I am a little disturbed by this, but from my testing so far, as long as you are using X16 PCIe for the cards, old-school pinned memory and memcpy is pretty good. This stuff only comes into play during communication, which is hopefully not a large percentage of runtime. "Real world" impact should not be too severe ... but if it is a bottleneck for your particular application and you cannot use NVLINK then, yes, it could give you trouble.

Posted on 2019-01-18 19:59:16
Donald Kinghorn

... and I just did some job runs with NVLINK ... It's the same story impact is only a small performance gain.
For example cnn ResNet-50 with large batch size (256) at fp16 (since these cards have 24GB mem)
With NVLINK 1191 images/sec
Without 1188 images/sec
less than 1% difference

Posted on 2019-01-18 20:41:18
lemans24

Yep...kinda of what I thought:

Unless you know the topology that your software library is running, then NVlink will probably have negligible speedup.
If you code directly in CUDA with c++ and are using a nearest neighbor type algorithm then maybe using Nvlink would be worth it.
But since you can only use NVlink with 2 cards, to take advantage of the communication speedup requires your software to be directly aware of which card you are talking to. Most algorithms that work via grids are much more amenable to scatter/gather batches spread out over multiple cards with host based accumulation of results. Monte Carlo simulations have a near linear speedup on Nvidia hardware when run in multiple batches with no regard for which card you are communicating with.

So basically you really should be intimate with the algorithms that your selected library uses or you code directly in c/c++/CUDA and therefore could possibly take advantage of NVlink communication speed over PCIE based communication. Nvlink really is geared to be used with a high speed switch bar which normal PC based motherboards do not have (think NVidia DGX Station which has a 4 way NVlink)

Posted on 2019-01-20 23:27:27
Adam Chodaba

that just means that your PCIe bus isn't the bottleneck at lower card counts. I guarantee you will see a difference with more cards in the setup. With 4-10 cards, or if some of them are x8, you will be more likely to see a big difference. You don't need to know the topology, NCCL will take care of that. NVLink really shines when you start to saturate the PCIe bus. NVLink is nice when you have both 8x and 16x slots and pair them together so that you can share the bandwidth and alleviate some bottlenecking in the 8x card.

Posted on 2019-01-21 03:10:58
Jack

with what software and what networks have u seen such a difference with nvlink? we have 10 gpu and 4 gpu systems and with pytorch have seen minimal scalability issues over pcie (the 10 gpu system has pcie switches and the 4 gpu actually go over QPI between the CPU's which is really slow).

Posted on 2019-01-21 05:27:18
Donald Kinghorn

This is a good question. I can see how limited bandwidth could potentially be a problem. In the early days of GPU compute it was a big deal and a fair amount of effort went into "best practices" for minimizing the overhead: double buffering, pinning etc.

I haven't run any jobs that seem to be significantly affected. I'd be curious to find a good example too.

Communication bound code certainly does exist! It's why some problems are really hard to parallelize across devices. Communication can be a show stopper if your code stalls because of it!

Posted on 2019-01-21 23:23:55
Donald Kinghorn

I did see a difference between X8 and X16 with the RTX cards (small, but still more than what I saw with the 10xx cards). Yes, I expect things will get more interesting with 4-8 cards. I will be able to test with an 8 GPU system. I'll be on that soon I hope ... a little behind with things right now!

Posted on 2019-01-21 23:01:15
lemans24

If Titan based gpu cards have dual dma engines, then I could see the bottleneck growing between x16 and x8 for Titan RTX cards as compared to the geforce 20xx cards which normally have single dma engines...

Posted on 2019-01-22 04:09:16
lemans24

Thanks for chiming in Jack!!!

A simple test would be to run your software between 2 cards in x16 PCIE slots and then run the same software between 2 cards in x8 PCIE slots. If the difference is not more than 10% then NVlink will probably show negligible or no difference.

These GPU cards are made for executing TFlops and when you are executing those TFlops, I guarantee you are NOT transferring gigabytes of data which is what NVlink is optimized for. Good algorithms should have a balance between TFlops and communication between cards.
You absolutely need a high speed cross bar between ALL of your NVlinks to make a real difference and this is currently called an NVidia DGX system. Also you need to know the topology for maximum performance of the software that you are using if you want to use 2-way NVlink cards with 4 or more cards. If you program bare metal with c++/CUDA then knock yourself out. It took me months to optimize my c++/CUDA code as I was hanging onto useless preconceptions before fully embracing the NVidia hardware design for maximum parallel tasks to ridiculously speed up my Monte Carlo simulations which by the way is an algorithm that excels on parallel hardware...

Posted on 2019-01-22 04:00:08