PCIe X16 vs X8 for GPUs when running cuDNN and Caffe

When testing GPUs for compute, the question "how important is PCIe X16 vs X8 for performance?" always seems to come up. It's often asked in connection with "I want to add a second video card to my system for compute, but I only have one X16 slot; will the GPUs be OK at X8?" The answers are: important? "It depends." OK at X8? "Yes." What about workloads like training deep neural network models on large data sets? I decided to find out for a classification model on a 1.3 million image data set using NVIDIA DIGITS with Caffe.

I looked at the performance of deep neural network training with Caffe using the GoogLeNet and AlexNet models, as well as a few tests from the CUDA samples (nbody, bandwidthTest, p2pBandwidthLatencyTest).
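
For reference, here's a minimal sketch of how these CUDA-samples benchmarks can be invoked. The sample binaries and flags are standard, but the build path shown is an assumption (it depends on where the samples were built on your system), and the nbody flags reproduce the 256000-body run reported below:

  # assumed path to the compiled CUDA 8.0 samples; adjust for your install
  cd ~/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release

  ./bandwidthTest                                      # host <-> device bandwidth (Quick Mode)
  ./p2pBandwidthLatencyTest                            # GPU-to-GPU bandwidth and latency matrices
  ./nbody -benchmark -numbodies=256000                 # single-GPU nbody
  ./nbody -benchmark -numbodies=256000 -numdevices=2   # nbody across both GPUs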

How important is X16?

Unfortunately the "it depends" answer is usually best. To really know, you have to test with what you are working on. 

What I was working on is this:

NVIDIA DIGITS with Caffe — Performance on Pascal multi-GPU

The blog post linked above has the details of the hardware system, software configuration, and job setup. However, for this X16 vs X8 testing I used only one or two GTX 1070s.

Briefly:

Software: NVIDIA DIGITS with Caffe (versions and configuration details are in the linked post)

Hardware:

  • The Peak Single ("DIGITS" GPU Workstation) 
  • CPU: Intel Core i7 6900K 8-core @ 3.2GHz (3.5GHz All-Core-Turbo)
  • Memory: 128 GB DDR4 2133MHz Reg ECC
  • PCIe: (4) X16-X16 v3
  • Motherboard: ASUS X99-E-10G WS
  • 1 or 2 NVIDIA GTX 1070 GPUs 

X16 to X8 hardware hack (sticky-note method!)

OK, I really don't recommend you try the "sticky note method" at home! However, it's simple and it works: a sticky note placed over the back half of the card's edge connector masks the contacts beyond the X8 boundary, so the link trains at X8 (if you do it right). The effect is the same as putting the card in an X8 slot, but it lets us hold everything else constant. [ Hats off to Matt Bach here at Puget for turning me on to this trick! ]
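
If you try something like this (or simply move a card to an X8 slot), it's worth confirming the link actually trained at the width you expect before trusting any benchmark numbers. One simple check, using standard nvidia-smi query fields:

  # report the current and maximum PCIe link width for each GPU
  nvidia-smi --query-gpu=index,name,pcie.link.width.current,pcie.link.width.max --format=csv

A correctly "sticky-noted" card should report a current link width of 8.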

Results

The table below shows why the "It depends" answer is most appropriate. The Caffe classification model training jobs showed only a small performance drop when run at X8. However, the nbody multi-GPU benchmark showed a significant performance degradation at X8!  

Job          GPUs        X16            X8             Comment

GoogLeNet    1 x 1070    32hr           33hr           Training time for 30 epochs, 1.3 million images.
                                                       Only a small degradation in run time with X8.

GoogLeNet    2 x 1070    19hr 47min     20hr 22min     Dual-GPU scaling is as expected, and again the impact
                                                       of using X8 is small.

AlexNet      1 x 1070    8hr 1min       8hr 27min      AlexNet with Caffe does not scale well on multiple
                                                       1070s (or Titan X) GPUs, presumably I/O bound(?).
                                                       X8 had only a small impact on run time.

nbody        1 x 1070    4085 GFLOP/s   4109 GFLOP/s   nbody benchmark from the CUDA samples (256000 bodies).
                                                       [Results typically vary as much as 10% from run to run.]

nbody        2 x 1070    7654 GFLOP/s   4968 GFLOP/s   Large performance drop at X8 when using multi-GPU!

For completeness, here's the output from the CUDA samples bandwidthTest and p2pBandwidthLatencyTest, which clearly shows the bandwidth advantage of PCIe X16.

X16

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GTX 1070
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			11708.3

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			12866.1

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			191565.2

Result = PASS
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, GeForce GTX 1070
Device: 1, GeForce GTX 1070
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

P2P Connectivity Matrix
     D\D     0     1
     0	     1     1
     1	     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 192.52  11.20 
     1  11.20 193.57 
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 192.81  10.18 
     1  10.18 193.76 
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 193.52  19.69 
     1  19.74 191.33 
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 193.62  19.68 
     1  19.70 194.58 
P2P=Disabled Latency Matrix (us)
   D\D     0      1 
     0   3.22  16.83 
     1  16.96   3.24 
P2P=Enabled Latency Matrix (us)
   D\D     0      1 
     0   3.23   8.12 
     1   7.99   2.76 

X8

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GTX 1070
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			6432.6

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			6452.7

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			191625.0

Result = PASS
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, GeForce GTX 1070
Device: 1, GeForce GTX 1070
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

P2P Connectivity Matrix
     D\D     0     1
     0	     1     1
     1	     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 192.71   6.03 
     1   5.95 193.76 
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 192.81   5.89 
     1   5.89 192.62 
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 196.04  10.30 
     1  10.43 192.57 
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 195.80  10.55 
     1  10.57 195.85 
P2P=Disabled Latency Matrix (us)
   D\D     0      1 
     0   3.24  16.63 
     1  17.47   2.96 
P2P=Enabled Latency Matrix (us)
   D\D     0      1 
     0   2.82   8.02 
     1   8.21   2.78 
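
For context, these numbers line up with what PCIe itself allows. PCIe 3.0 runs each lane at 8 GT/s with 128b/130b encoding, roughly 0.985 GB/s per lane in each direction, so the theoretical one-way limits work out to approximately:

  X16: 16 lanes x 0.985 GB/s/lane = ~15.8 GB/s
  X8 :  8 lanes x 0.985 GB/s/lane = ~7.9 GB/s

The measured host-to-device and device-to-host numbers above (roughly 11.7 to 12.9 GB/s at X16 versus about 6.4 GB/s at X8) follow the same 2:1 ratio, with protocol overhead accounting for the rest.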

Happy computing! –dbk