Dr Donald Kinghorn (Scientific Computing Advisor)

PCIe X16 vs X8 for GPUs when running cuDNN and Caffe

Written on January 16, 2017 by Dr Donald Kinghorn

When you are testing GPUs for compute, the question "How important is PCIe X16 vs X8 for performance?" seems to always come up. It's often asked in connection with "I want to add a second video card to my system for compute, but I only have one X16 slot -- will the GPUs be OK at X8?" The short answers: how important? "It depends." OK at X8? "Yes." What about workloads like training deep neural network models on large data sets? I decided to find out for a classification model on a 1.3 million image data set using NVIDIA DIGITS with Caffe.

I looked at the performance of deep neural network training with Caffe using the GoogLeNet and AlexNet models, along with a few tests from the CUDA samples (nbody, bandwidthTest, p2pBandwidthLatencyTest).
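The nbody runs reported below were of roughly the following form -- the exact path depends on where the CUDA samples were built, and the flags shown are the standard ones for that sample:

  ./nbody -benchmark -numbodies=256000                 # single-GPU benchmark run
  ./nbody -benchmark -numbodies=256000 -numdevices=2   # spread the same problem across two GPUs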

How important is X16?

Unfortunately, "it depends" is usually the best answer. To really know, you have to test with the workload you are actually running.

What I was working on is this:

NVIDIA DIGITS with Caffe -- Performance on Pascal multi-GPU

The blog post linked above has the details of the hardware, software configuration, and job setup. However, for this X16 vs X8 testing I used only one or two GTX 1070s.

Briefly:

Software: NVIDIA DIGITS with Caffe (the full software configuration is in the post linked above)

Hardware:

  • The Peak Single ("DIGITS" GPU Workstation) 
  • CPU: Intel Core i7 6900K 8-core @ 3.2GHz (3.5GHz All-Core-Turbo)
  • Memory: 128 GB DDR4 2133MHz Reg ECC
  • PCIe: (4) X16-X16 v3
  • Motherboard: ASUS X99-E-10G WS
  • 1 or 2 NVIDIA GTX 1070 GPUs 

X16 to X8 hardware hack (sticky-note method!)

OK, I really don't recommend you try the "sticky note method" at home! However, it's simple and it works, effectively turning an X16 card into an X8 one (if you do it right): a sticky note covers the contacts beyond the first eight lanes of the card's edge connector, so only eight lanes make contact with the slot. It's the same as putting the card in an X8 slot, but doing it this way lets us hold everything else constant. [Hats off to Matt Bach here at Puget for turning me on to this trick!]
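If you try this (or just want to confirm what link width a slot has negotiated), it's easy to check from Linux with standard tools. This is generic driver/PCI querying, nothing specific to the setup above:

  nvidia-smi -q | grep -A 2 -i "link width"          # current vs. max PCIe link width per GPU
  sudo lspci -vv -d 10de: | grep -E "LnkCap|LnkSta"  # LnkSta shows the negotiated width (x8 vs x16)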

Results

The table below shows why the "It depends" answer is most appropriate. The Caffe classification model training jobs showed only a small performance drop when run at X8. However, the nbody multi-GPU benchmark showed a significant performance degradation at X8!  

Job                   X16            X8             Comment
GoogLeNet, 1 x 1070   32 hr          33 hr          Training time for 30 epochs, 1.3 million images. Only a small degradation in run time at X8.
GoogLeNet, 2 x 1070   19 hr 47 min   20 hr 22 min   Dual-GPU scaling is as expected, and again the impact of X8 is small.
AlexNet, 1 x 1070     8 hr 1 min     8 hr 27 min    AlexNet with Caffe does not scale well on multiple 1070 (or Titan X) GPUs, presumably I/O bound(?). X8 had only a small impact on run time.
nbody, 1 x 1070       4085 GFLOP/s   4109 GFLOP/s   nbody benchmark from the CUDA samples (256,000 bodies). [Results typically vary as much as 10% from run to run.]
nbody, 2 x 1070       7654 GFLOP/s   4968 GFLOP/s   Large performance drop at X8 when using multiple GPUs!

For completeness, here is the output from the CUDA samples bandwidthTest and p2pBandwidthLatencyTest, which clearly shows the bandwidth advantage of PCIe X16.
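The output below is from the stock CUDA sample binaries; the invocations were essentially the defaults (quick mode, pinned memory):

  ./bandwidthTest            # host <-> device and device <-> device bandwidth
  ./p2pBandwidthLatencyTest  # GPU <-> GPU bandwidth and latency, with P2P enabled and disabled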

X16

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GTX 1070
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			11708.3

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			12866.1

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			191565.2

Result = PASS
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, GeForce GTX 1070
Device: 1, GeForce GTX 1070
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

P2P Connectivity Matrix
     D\D     0     1
     0	     1     1
     1	     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 192.52  11.20 
     1  11.20 193.57 
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 192.81  10.18 
     1  10.18 193.76 
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 193.52  19.69 
     1  19.74 191.33 
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 193.62  19.68 
     1  19.70 194.58 
P2P=Disabled Latency Matrix (us)
   D\D     0      1 
     0   3.22  16.83 
     1  16.96   3.24 
P2P=Enabled Latency Matrix (us)
   D\D     0      1 
     0   3.23   8.12 
     1   7.99   2.76 

X8

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GTX 1070
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			6432.6

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			6452.7

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			191625.0

Result = PASS
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, GeForce GTX 1070
Device: 1, GeForce GTX 1070
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

P2P Connectivity Matrix
     D\D     0     1
     0	     1     1
     1	     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 192.71   6.03 
     1   5.95 193.76 
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 192.81   5.89 
     1   5.89 192.62 
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 196.04  10.30 
     1  10.43 192.57 
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 195.80  10.55 
     1  10.57 195.85 
P2P=Disabled Latency Matrix (us)
   D\D     0      1 
     0   3.24  16.63 
     1  17.47   2.96 
P2P=Enabled Latency Matrix (us)
   D\D     0      1 
     0   2.82   8.02 
     1   8.21   2.78 

Happy computing! --dbk

Tags: PCIe X16 X8, Caffe, GPU, Machine Learning
Aman Sinha

What was the Batch size and number of iterations that you used to train AlexNet and other networks?

Posted on 2018-02-09 08:50:15
Donald Kinghorn

Good question, because that could possibly have a performance effect with X8 vs X16. I apologize for not putting more detail in this post about how I ran the tests. It was really just one of those quick "I wonder what the effect would be of dropping to X8" experiments.

I ran those jobs with a batch size of 64. I ran for 30 full epochs on a training set with 960893 images from the IMAGENET ILSVRC2012 data set.

I usually try to set a batch size that will mostly fill the GPU memory. For the 1070s on that model a batch size of 64 was OK. On a Titan Xp or 1080 Ti I would have used a batch size of 128. For testing GPUs I often just increase the batch size until the job won't start because of GPU memory, and then lower it back down a bit (but I do like powers of 2 :-).
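For anyone doing the same kind of batch-size tuning, a generic way to watch how full the card is getting while a job spins up (plain nvidia-smi, nothing specific to the DIGITS setup above):

  nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 1   # poll GPU memory use once per second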

I really like to have full X16! The performance hit from dropping to X8 can be very dependent on the specific job and application code. However, one of the early, big problems with GPU compute was hiding latency and bandwidth limitations over the PCIe bus. Developers put a lot of effort into buffering and the like to minimize the hit, so it became best practice to put thought and effort into that problem. ... but still, X16 is a safer bet!

I'm anxious to see systems come out with PCIe v4. The spec is out and IBM is implementing it on POWER. Unfortunately, Intel decided not to use it on the Purley platform ... a shame really, because I think they possibly could have included it!
--Don

Posted on 2018-02-09 18:35:31
ibmua

The other problem is when NVIDIA will use it, and whether it will at all.

Posted on 2018-04-22 20:20:22
Donald Kinghorn

I don't know when we'll see more adoption of PCIe 4 or 5, but it probably won't be until 2019. I'm disappointed with Intel on that. ... I think it will have a large effect on overall platform performance when it is finally adopted but ... I hear you ... how manufacturers will exploit it is uncertain.

As another note, I had several people ask me about PCIe bandwidth performance differences at GTC. I'm going to try to get some more testing in soon using multiple cards with variation from X1 to X16 (if I can). Specifically, I'll look at machine learning codes ... probably with TensorFlow. It will be interesting to see what effect there is. Some folks are wondering if they can use an X1 mining rig for AI research! :-)

Posted on 2018-04-23 16:38:50
maiconfaria

the "Device to Device Bandwidth" seems to be a copy to the gpu itself, or not? It's a little bit strange that this communication is faster them device to host as these hardware configuration doesn't allow direct communication between devices.

Posted on 2018-07-13 18:47:46
Donald Kinghorn

My understanding is that peer-to-peer is direct GPU to GPU through UVA (Unified Virtual Addressing) -- that's what the kernel module nvidia_uvm is for... Having said that, I really don't know the inner workings of the P2P code or ... NCCL for that matter. In the "real world" it seems there is always something interfering with performance that's hard to explain by looking at performance specs.
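A quick, generic way to see how a pair of GPUs is actually connected -- and whether P2P traffic has to cross the host bridge -- is the topology matrix from nvidia-smi:

  nvidia-smi topo -m   # connection matrix for each GPU pair (PIX, PHB, SYS, NVLink, ...)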

Here's an old webinar slide-deck that has some info,
https://developer.download....

Posted on 2018-07-16 01:09:38
maiconfaria

I thought UVA was only a software layer to avoid host copies. Seems I was wrong.

Posted on 2018-07-16 02:33:07
Donald Kinghorn

I do like your comment! It brings up something important in general. Looking at things like the P2P and bandwidth numbers can be deceiving! In this post I was stunned by the results. I had done the first test runs using a generator to pull batch-sized chunks from storage and saw there was essentially no difference from X8 to X16. That's why I decided to do a full memory load of the data, so I wasn't being bottlenecked by the batch loading. The thing is, it made very little difference. It is not what I expected!

I've been doing hardware/software performance testing and evaluation for a long time. In "the old days" it was pretty easy to make predictions based on specs and low-level performance numbers. ... now ... it's almost impossible to predict how a given code + data + hardware + firmware + OS will run. The overall systems have become very complex. Also, there are so many details to track like firmware, driver modules, ... and they are dynamic! [example: the new Intel Skylake-X/-W/-SP processors have 5 different clocks! ... and security patches can clobber performance for some jobs in unpredictable ways ...]

All we can do is report what we see and hope it makes sense :-)
Thanks --Don

Posted on 2018-07-16 16:22:39