When you are testing GPUs for compute, the question "how important is PCIe X16 vs X8 for performance?" always seems to come up. It's often connected with the question "I want to add a second video card to my system for compute, but I only have one X16 slot, so will the GPUs be OK at X8?" The answers are: how important? "It depends." OK at X8? "Yes." What about workloads like training deep neural network models with large data sets? I decided to find out for a classification model on a 1.3 million image data set using NVIDIA DIGITS with Caffe.
I looked at the performance of deep neural network training with Caffe using the GoogLeNet and AlexNet models, as well as a few tests from the CUDA samples (nbody, bandwidthTest, p2pBandwidthLatencyTest).
How important is X16?
Unfortunately the "it depends" answer is usually best. To really know, you have to test with what you are working on.
What I was working on is this:
NVIDIA DIGITS with Caffe — Performance on Pascal multi-GPU
The blog post linked above has the details of the hardware system, software configuration, and job setup. However, for this X16 vs X8 testing I used only 1 or 2 GTX 1070s.
Briefly:
Software:
- Ubuntu 14.04.5 plus updates
- NVIDIA display driver version 367.57 (from the CUDA 8 install)
- CUDA 8.0.44-1
- DIGITS 4 (ubuntu14.04-6-ga2-cuda8.0.7-1)
- caffe-nv (0.15.13-1+cuda8.0), cuDNN (5.1.5-1+cuda8.0)
- Training set from Imagenet ILSVRC2012
- 138 GB data set (1.3 million images)
- GoogLeNet and AlexNet models (I added an AlexNet job run to this X16 vs X8 testing)
Hardware:
- The Peak Single ("DIGITS" GPU Workstation)
- CPU: Intel Core i7 6900K 8-core @ 3.2GHz (3.5GHz All-Core-Turbo)
- Memory: 128 GB DDR4 2133MHz Reg ECC
- PCIe: (4) X16-X16 v3
- Motherboard: ASUS X99-E-10G WS
- 1 or 2 NVIDIA GTX 1070 GPUs
X16 to X8 hardware hack (sticky-note method!)
OK, I really don't recommend you try the "sticky-note method" at home! However, it's simple and it works, effectively turning an X16 card into an X8 one (if you do it right). It's the same as putting the card in an X8 slot, but doing it this way lets us hold everything else constant. [Hats off to Matt Bach here at Puget for turning me on to this trick!]
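Whichever way you get to X8 (sticky note or a real X8 slot), it's worth verifying that the card has actually negotiated the narrower link before you trust any benchmark numbers. `nvidia-smi -q` will report this, or, if you prefer to check it programmatically, here is a minimal sketch using NVML. This is my own illustration, not part of the testing above; it assumes the NVML header and library from the NVIDIA driver / CUDA install are available on your system.

```cuda
// check_pcie.cu -- report the current PCIe link generation and width for each GPU
// (illustrative sketch; nvml.h / libnvidia-ml come with the NVIDIA driver / CUDA install)
// build, for example:  nvcc check_pcie.cu -o check_pcie -lnvidia-ml
#include <cstdio>
#include <nvml.h>

int main()
{
    unsigned int count = 0;

    if (nvmlInit() != NVML_SUCCESS) {
        fprintf(stderr, "NVML init failed\n");
        return 1;
    }
    nvmlDeviceGetCount(&count);

    for (unsigned int i = 0; i < count; i++) {
        nvmlDevice_t dev;
        char name[NVML_DEVICE_NAME_BUFFER_SIZE];
        unsigned int gen = 0, width = 0;

        nvmlDeviceGetHandleByIndex(i, &dev);
        nvmlDeviceGetName(dev, name, sizeof(name));
        nvmlDeviceGetCurrPcieLinkGeneration(dev, &gen);
        nvmlDeviceGetCurrPcieLinkWidth(dev, &width);

        // A card running behind the sticky note should show x8 here
        // (check while the GPU is busy -- the link can downshift at idle)
        printf("GPU %u (%s): PCIe gen %u, x%u\n", i, name, gen, width);
    }

    nvmlShutdown();
    return 0;
}
```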
Results
The table below shows why the "It depends" answer is most appropriate. The Caffe classification model training jobs showed only a small performance drop when run at X8. However, the nbody multi-GPU benchmark showed a significant performance degradation at X8!
| Job | X16 | X8 | Comment |
|---|---|---|---|
| GoogLeNet 1 x 1070 | 32hr | 33hr | Training time for 30 epochs, 1.3 million images. Only a small degradation in run time with X8 |
| GoogLeNet 2 x 1070 | 19hr 47min | 20hr 22min | Dual GPU scaling is as expected and again the impact of using X8 is small |
| AlexNet 1 x 1070 | 8hr 1min | 8hr 27min | AlexNet with Caffe does not scale well on multiple 1070 (or Titan X) GPUs, presumably I/O bound(?). X8 had only a small impact on run time |
| nbody 1 x 1070 | 4085 GFLOP/s | 4109 GFLOP/s | nbody benchmark from the CUDA samples (256000 bodies) [results typically vary as much as 10% from run to run] |
| nbody 2 x 1070 | 7654 GFLOP/s | 4968 GFLOP/s | Large performance drop on X8 when using multi-GPU! |
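The multi-GPU nbody result makes sense: each time step the GPUs have to exchange the body positions they computed, and on GTX cards that traffic goes over the PCIe bus, so halving the link width hits it hard. To give a feel for the kind of device-to-device transfer that gets throttled at X8, here is a minimal peer-to-peer copy sketch. It's my own illustration using the CUDA runtime API, not the actual nbody code.

```cuda
// p2p_copy_sketch.cu -- time a GPU0 -> GPU1 buffer copy over the PCIe bus
// (illustrative sketch only; this is not the nbody benchmark code)
// build, for example:  nvcc -std=c++11 p2p_copy_sketch.cu -o p2p_copy_sketch
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 256UL * 1024 * 1024;   // 256 MB test buffer
    float *buf0 = nullptr, *buf1 = nullptr;
    int canAccess = 0;

    // Enable direct peer access if supported (otherwise cudaMemcpyPeer
    // stages the transfer through host memory -- still limited by PCIe)
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);

    cudaSetDevice(0);
    if (canAccess) cudaDeviceEnablePeerAccess(1, 0);
    cudaMalloc((void**)&buf0, bytes);

    cudaSetDevice(1);
    if (canAccess) cudaDeviceEnablePeerAccess(0, 0);
    cudaMalloc((void**)&buf1, bytes);

    // Time the device 0 -> device 1 copy
    cudaSetDevice(0);
    auto t0 = std::chrono::high_resolution_clock::now();
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
    cudaDeviceSynchronize();
    auto t1 = std::chrono::high_resolution_clock::now();

    double sec = std::chrono::duration<double>(t1 - t0).count();
    printf("GPU0 -> GPU1: %.2f GB/s\n", bytes / 1.0e9 / sec);

    cudaFree(buf0);
    cudaSetDevice(1);
    cudaFree(buf1);
    return 0;
}
```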
For completeness, here's the output from the CUDA samples bandwidthTest and p2pBandwidthLatencyTest, which clearly shows the bandwidth improvement when using PCIe X16.
X16
    [CUDA Bandwidth Test] - Starting...
    Running on...

     Device 0: GeForce GTX 1070
     Quick Mode

     Host to Device Bandwidth, 1 Device(s)
     PINNED Memory Transfers
       Transfer Size (Bytes)        Bandwidth(MB/s)
       33554432                     11708.3

     Device to Host Bandwidth, 1 Device(s)
     PINNED Memory Transfers
       Transfer Size (Bytes)        Bandwidth(MB/s)
       33554432                     12866.1

     Device to Device Bandwidth, 1 Device(s)
     PINNED Memory Transfers
       Transfer Size (Bytes)        Bandwidth(MB/s)
       33554432                     191565.2

    Result = PASS
    [P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
    Device: 0, GeForce GTX 1070
    Device: 1, GeForce GTX 1070
    Device=0 CAN Access Peer Device=1
    Device=1 CAN Access Peer Device=0

    P2P Connectivity Matrix
         D\D     0     1
         0       1     1
         1       1     1
    Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
       D\D     0      1
         0 192.52  11.20
         1  11.20 193.57
    Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
       D\D     0      1
         0 192.81  10.18
         1  10.18 193.76
    Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
       D\D     0      1
         0 193.52  19.69
         1  19.74 191.33
    Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
       D\D     0      1
         0 193.62  19.68
         1  19.70 194.58
    P2P=Disabled Latency Matrix (us)
       D\D     0      1
         0   3.22  16.83
         1  16.96   3.24
    P2P=Enabled Latency Matrix (us)
       D\D     0      1
         0   3.23   8.12
         1   7.99   2.76
X8
    [CUDA Bandwidth Test] - Starting...
    Running on...

     Device 0: GeForce GTX 1070
     Quick Mode

     Host to Device Bandwidth, 1 Device(s)
     PINNED Memory Transfers
       Transfer Size (Bytes)        Bandwidth(MB/s)
       33554432                     6432.6

     Device to Host Bandwidth, 1 Device(s)
     PINNED Memory Transfers
       Transfer Size (Bytes)        Bandwidth(MB/s)
       33554432                     6452.7

     Device to Device Bandwidth, 1 Device(s)
     PINNED Memory Transfers
       Transfer Size (Bytes)        Bandwidth(MB/s)
       33554432                     191625.0

    Result = PASS
    [P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
    Device: 0, GeForce GTX 1070
    Device: 1, GeForce GTX 1070
    Device=0 CAN Access Peer Device=1
    Device=1 CAN Access Peer Device=0

    P2P Connectivity Matrix
         D\D     0     1
         0       1     1
         1       1     1
    Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
       D\D     0      1
         0 192.71   6.03
         1   5.95 193.76
    Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
       D\D     0      1
         0 192.81   5.89
         1   5.89 192.62
    Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
       D\D     0      1
         0 196.04  10.30
         1  10.43 192.57
    Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
       D\D     0      1
         0 195.80  10.55
         1  10.57 195.85
    P2P=Disabled Latency Matrix (us)
       D\D     0      1
         0   3.24  16.63
         1  17.47   2.96
    P2P=Enabled Latency Matrix (us)
       D\D     0      1
         0   2.82   8.02
         1   8.21   2.78
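If you just want to sanity-check the host-to-device numbers above without building the full CUDA samples, a stripped-down version of the same pinned-memory transfer measurement looks roughly like this. It's my own sketch, not the actual bandwidthTest source.

```cuda
// h2d_bandwidth_sketch.cu -- rough pinned-memory host-to-device bandwidth check
// (illustrative sketch; the real bandwidthTest in the CUDA samples is more thorough)
// build, for example:  nvcc h2d_bandwidth_sketch.cu -o h2d_bandwidth_sketch
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 32UL * 1024 * 1024;  // 32 MB, same order as the samples' quick mode
    const int reps = 100;

    float *h_buf = nullptr, *d_buf = nullptr;
    cudaMallocHost((void**)&h_buf, bytes);    // pinned (page-locked) host memory
    cudaMalloc((void**)&d_buf, bytes);

    // Time repeated host -> device copies with CUDA events
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < reps; i++)
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Host to Device (pinned): %.1f GB/s\n",
           (double)bytes * reps / 1.0e9 / (ms / 1000.0));

    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}
```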
Happy computing! –dbk