PCIe X16 vs X8 with 4 x Titan V GPUs for Machine Learning



PCIe X16 vs PCIe X8 when doing GPU accelerated computing … how much difference does it make? I get asked that question a lot. The answer, of course, is "it depends". A slightly better answer is that it probably won't have much effect for real-world applications.

In this post I'll be looking at some common Machine Learning (Deep Learning) application job runs on a system with 4 Titan V GPUs. I will repeat the job runs at PCIe X16 and X8.

I looked at this issue a couple of years ago and wrote it up in this post, PCIe X16 vs X8 for GPUs when running cuDNN and Caffe. I've been doing a lot more Machine Learning work since then and the question of PCIe bandwidth limitations keeps coming up. The machines I run jobs on always have their GPUs running at X16, either from lanes provided by the CPU and chipset or with the help of PLX (PEX) PCIe switches. On current Intel single-socket systems a PCIe switch is required to get 4 full X16 slots for GPUs. The alternative for 4 GPUs on boards that don't have PCIe switches is to run the cards at X8 (assuming the motherboard has enough slots to support that). The question is how much performance you lose by doing that. Let's find out.


System configuration

The system I’m using for this testing is the same system I’ve used recently for tests with Titan V. The recent post on multi-GPU scaling has more details about the configuration and setup, Multi-GPU scaling with Titan V and TensorFlow on a 4 GPU Workstation.

Hardware

System under test:

  • Gigabyte motherboard with 4 X16 PCIe sockets (1 PLX switch on sockets 2,3)
  • Intel Xeon W-2195 18 core (Skylake-W with AVX512)
  • 256GB Reg ECC memory (up to 512GB)
  • 4 x NVIDIA Titan V GPUs
  • Samsung 256GB NVMe M.2

Software

  • Ubuntu 16.04
  • Docker 18.03.0-ce
  • NVIDIA Docker V2
  • TensorFlow 1.7 (running on NVIDIA NGC docker image)
  • Keras 2.1.5 (with TensorFlow back-end from local install with Anaconda Python).

“Sticky Note” method for switching from X16 to X8

In order to keep everything else about the system under test the same, the PCIe bandwidth was physically changed on the GPUs by blocking off half of the pins on the cards with cut-down "sticky notes". Don't try this at home! This is equivalent to putting the cards into X8 slots. The following photo shows 4 Titan V GPUs with half of the pins blocked off.

[Photo: 4 Titan V GPUs with half of the PCIe connector pins masked off with sticky notes]
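A quick way to confirm that the "sticky note" trick worked is to ask the driver what link width it sees. nvidia-smi can report the current and maximum PCIe link width per GPU; with half of the pins blocked the current width should read 8 instead of 16 (best checked while a job is running, since some systems downshift the link when the cards are idle):

nvidia-smi --query-gpu=index,name,pcie.link.width.current,pcie.link.width.max --format=csv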

Test Jobs

Performance was tested with a mixture of benchmark jobs, a "real-world" model training job and, for completeness, a peer-to-peer bandwidth and latency test.

GoogLeNet Training

This job was run using the TensorFlow docker image from NVIDIA NGC. The application is cnn in the nvidia-examples directory, and I am using synthetic data for image input. For details on this job run see, NVIDIA Titan V plus Tensor-cores Considerations and Testing of FP16 for Deep Learning. [Note: this job shows a clear benefit from using Tensor-cores, so that data is included in the results.]
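For reference, the runs look roughly like the commands below. This is a sketch, not the exact command used for the numbers in this post: it assumes the nvcnn.py script found in nvidia-examples/cnn in NGC TensorFlow images of this vintage, and that --batch_size is the per-GPU batch size. Check the script in the image you pull for the exact name and flags.

docker run --runtime=nvidia --rm -it nvcr.io/nvidia/tensorflow:&lt;release&gt;
cd nvidia-examples/cnn
python nvcnn.py --model=googlenet --batch_size=256 --num_gpus=4          # FP32
python nvcnn.py --model=googlenet --batch_size=512 --num_gpus=4 --fp16   # FP16 (Tensor-cores)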

Billion Words Benchmark LSTM Train

This job is also using the docker image mentioned above. The application is big_lstm using the billion word news-feed corpus. See, TensorFlow Scaling on 8 1080Ti GPUs – Billion Words Benchmark with LSTM on a Docker Workstation Configuration for example usage.
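Roughly how those runs are launched: big_lstm is the single_lm_train.py program in nvidia-examples/big_lstm (based on the public "lm" billion-word benchmark code). The directory and flag names below are an assumption based on that code; check the copy in your container for the exact usage.

cd nvidia-examples/big_lstm
python single_lm_train.py --mode=train --logdir=./logs --num_gpus=4 \
    --datadir=./data/1-billion-word-language-modeling-benchmark-r13output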

VGG model in Keras

For a "real-world" test I did an implementation of the VGG CNN using Keras (with the TensorFlow back-end). This was run in a Jupyter notebook with Anaconda Python installed on the machine. The data used is 25000 images from the "dogs vs cats" set downloaded from Kaggle. The source is included in Appendix B.

Peer-to-peer bandwidth and latency

The p2pBandwidthLatencyTest from the NVIDIA CUDA samples was used to show the direct bandwidth-halving effect of moving from X16 to X8. The output is in Appendix A.
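The test builds straight from the CUDA samples tree. Assuming the samples are installed in the default /usr/local/cuda location, something like this works (copy them to a writable directory first if you don't want to build in place):

cp -r /usr/local/cuda/samples ~/cuda-samples
cd ~/cuda-samples/1_Utilities/p2pBandwidthLatencyTest
make
./p2pBandwidthLatencyTest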


Results

The following tables and charts give a pretty clear indication that X16 has only a small advantage over X8. In some cases there is no apparent difference. This does not mean that this will always be the case! These jobs achieve most of their parallelism by distributing batches across the GPUs, so there is very little card-to-card communication. That means most of the effect of the lower bandwidth at X8 will show up during the transfer of data from CPU space to GPU space. How much that affects your particular job will vary. However, these results suggest that the effect may be modest in most cases.
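To put a rough number on that CPU-to-GPU transfer cost, here is a minimal back-of-envelope sketch in Python. The ~12 GB/s and ~6 GB/s figures are assumed typical practical host-to-device rates for PCIe gen3 X16 and X8, not measurements from this system:

# Size of a 256-image batch of 224x224 RGB data in FP32
batch_images = 256
bytes_per_image = 224 * 224 * 3 * 4
batch_bytes = batch_images * bytes_per_image
print(batch_bytes / 1e6, "MB per batch")           # ~154 MB

# Time to copy one batch host-to-device at the assumed rates
for label, rate in [("X16", 12e9), ("X8", 6e9)]:
    print(label, 1000 * batch_bytes / rate, "ms")  # ~13 ms vs ~26 ms

Even doubled, the copy time is small compared to the compute time of a training step at these batch sizes, and frameworks overlap the copies with GPU work, which is why the overall throughput difference stays modest.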

PCIe X16 vs X8 — GoogLeNet Training, TensorFlow FP32 and FP16 (Tensor-cores)
Titan V GPUs
[Images/second (total batch size)]

Number of GPUs   PCIe X16 FP32   PCIe X16 FP16    PCIe X8 FP32   PCIe X8 FP16
1                851.3 (256)     1370.6 (512)     838            1319
2                1525.1 (512)    2517.0 (1024)    1519           2424
3                2272.3 (768)    3661.3 (1536)    2153           3572
4                3080.2 (1024)   4969.6 (2048)    2943           4707

[Bar chart: GoogLeNet training images/second, X16 vs X8, FP32 and FP16]

The code for this job run is highly optimized for GPU and there is only a minor difference between X16 and X8. The difference does increase with more GPUs. This chart also shows the nice improvement from using Tensor-cores (FP16) on the Titan V!


PCIe X16 vs X8 — Billion Words Benchmark LSTM Train, TensorFlow
Titan V GPUs
[Words per second]

Number of GPUs   PCIe X16   PCIe X8
1                8373       8176
2                15483      14686
3                20462      19178
4                22058      20565

[Bar chart: Big LSTM training words/second, X16 vs X8]

Again, the difference between X16 and X8 is minor, and it increases with more GPUs.


PCIe X16 vs X8 — VGG, Keras (TensorFlow) Memory-streaming 25000 Images
Titan V GPUs
[Training time for 4 epochs (seconds)]

Number of GPUs   PCIe X16   PCIe X8
1                476        476
2                262        269
3                222        202
4                188        195

[Bar chart: Keras VGG memory-streaming training time, X16 vs X8]

In this Keras implementation of VGG there is even less performance difference between X16 and X8. The multi-GPU scaling beyond 2 GPUs is also not as good as in the previous jobs. The image data was loaded into memory and fed to the model through Python variables. The Python source used for this job is given in Appendix B.


PCIe X16 vs X8 — VGG in Keras (TensorFlow) Disk-streaming 25000 Images
Titan V GPUs
[Training time for 4 epochs (seconds)]

Number of GPUs   PCIe X16   PCIe X8
1                445        445
2                320        314
3                316        318
4                316        316

[Bar chart: Keras VGG disk-streaming training time, X16 vs X8]

For this job run the image data was fed to the model in batches loaded from disk and put into the proper format on-the-fly. This caused multi-GPU scaling to fall off completely after 2 GPUs. And, again, dropping to X8 had almost no effect on the performance. The real bottleneck here is the time taken to convert the image data pulled from disk. [Note: the data was stored on a fast NVMe SSD.]
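For reference, the disk-streaming variant just swaps the in-memory fit() call for fit_generator() driven by flow_from_directory with a normal batch size. A minimal sketch of that change (the full context is in Appendix B):

train_generator = train_datagen.flow_from_directory(
        train_dir,
        target_size=img_size,
        batch_size=batch_size,   # normal-sized batches, read and decoded on the fly
        class_mode='binary')

model.fit_generator(train_generator, epochs=4, verbose=1, callbacks=[tensorbrd],
                    steps_per_epoch=25000 // batch_size,
                    max_queue_size=100, workers=16)

The workers and max_queue_size values are just illustrative; the image decode and resize work on the CPU is what limits scaling here, not the PCIe link.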


There you have it, PCIe X16 vs PCIe X8. Not a lot of difference, from what I've seen in the testing above. Still, I think I would always prefer to have X16 for my GPUs!

What happens if you drop even further? I don't really know, but I suspect X4 may still be OK. If you are thinking about turning your X1 "coin mining rig" into a machine learning box, I think it would probably work OK. If you try that, let me know!

Happy computing! –dbk


Appendix A: Peer-to-peer bandwidth and latency test results

For completeness, I wanted to include the results from running p2pBandwidthLatencyTest (source available in the NVIDIA "CUDA samples").

The bandwidth and latency for this test system look very good. You do see some expected bandwidth lowering and latency increase across devices 2 and 3, which are on the PLX switch.

PCIe X16:

./p2pBandwidthLatencyTest

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, TITAN V, pciBusID: 17, pciDeviceID: 0, pciDomainID:0
Device: 1, TITAN V, pciBusID: 65, pciDeviceID: 0, pciDomainID:0
Device: 2, TITAN V, pciBusID: b5, pciDeviceID: 0, pciDomainID:0
Device: 3, TITAN V, pciBusID: b6, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2

***NOTE: In case a device doesn't have P2P access to other one,
 it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2     3
     0	     1     1     1     1
     1	     1     1     1     1
     2	     1     1     1     1
     3	     1     1     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 555.65   5.75   5.74   5.76
     1   5.86 554.87   5.72   5.74
     2   5.87   5.87 554.87   5.77
     3   5.75   5.76   5.81 555.65
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 554.87   6.03   6.02   6.02
     1   6.05 554.87   6.01   6.03
     2   4.39   4.39 554.08   4.26
     3   4.40   4.40   4.27 553.29
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 566.53  11.05  11.03  10.99
     1  11.07 564.49  11.15  10.92
     2  11.15  11.16 562.86   6.18
     3  11.05  11.01   6.19 564.49
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 564.05  10.83   8.40   8.43
     1  10.84 563.67   8.40   8.43
     2   8.40   8.39 564.90   8.07
     3   8.43   8.42   8.11 563.67
P2P=Disabled Latency Matrix (us)
   D\D     0      1      2      3
     0   2.94  16.23  16.50  16.51
     1  16.41   3.03  16.53  16.52
     2  17.40  17.54   3.74  18.90
     3  17.42  17.36  18.73   3.08
P2P=Enabled Latency Matrix (us)
   D\D     0      1      2      3
     0   3.01   5.38   5.55   5.57
     1   5.26   3.00   5.79   5.80
     2   6.81   6.91   3.00   6.64
     3   6.81   6.89   6.62   3.01

PCIe X8:

./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, TITAN V, pciBusID: 17, pciDeviceID: 0, pciDomainID:0
Device: 1, TITAN V, pciBusID: 65, pciDeviceID: 0, pciDomainID:0
Device: 2, TITAN V, pciBusID: b5, pciDeviceID: 0, pciDomainID:0
Device: 3, TITAN V, pciBusID: b6, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2

***NOTE: In case a device doesn't have P2P access to other one,
 it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2     3
     0	     1     1     1     1
     1	     1     1     1     1
     2	     1     1     1     1
     3	     1     1     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 556.45   3.07   3.04   3.04
     1   3.04 552.51   3.04   3.04
     2   3.15   3.15 554.87   3.16
     3   3.15   3.15   3.15 553.29
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 555.65   5.18   5.17   5.17
     1   5.18 554.87   5.18   5.17
     2   4.22   4.23 555.65   4.03
     3   4.26   4.26   4.07 554.08
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 563.67   5.99   5.87   5.88
     1   6.02 562.46   5.89   5.89
     2   5.91   5.86 561.24   5.98
     3   5.87   5.84   5.98 562.05
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 565.30   8.21   7.27   7.31
     1   8.21 563.27   7.26   7.30
     2   7.26   7.26 562.86   6.21
     3   7.30   7.31   6.23 562.86
P2P=Disabled Latency Matrix (us)
   D\D     0      1      2      3
     0   2.97  15.87  16.50  16.44
     1  15.86   2.99  16.42  16.42
     2  17.40  17.18   2.99  18.50
     3  17.31  17.32  18.47   2.96
P2P=Enabled Latency Matrix (us)
   D\D     0      1      2      3
     0   2.98   5.42   5.90   5.91
     1   5.46   3.02   5.57   5.62
     2   7.03   6.76   3.06   6.66
     3   6.82   6.73   6.65   3.04

Appendix B: VGG model training on the dogs vs cats image set (25000 images)

import numpy as np
np.random.seed(42)
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPooling2D
from keras.layers.normalization import BatchNormalization
from keras.callbacks import TensorBoard

from keras.preprocessing.image import ImageDataGenerator

from keras.utils import multi_gpu_model

# VGG-style network built layer by layer with the Sequential API
model = Sequential()

model.add(Conv2D(64, 3, activation='relu', input_shape=(224,224,3)))
model.add(Conv2D(64, 3, activation='relu'))
model.add(MaxPooling2D(2,2))
model.add(BatchNormalization())

model.add(Conv2D(128, 3, activation='relu'))
model.add(Conv2D(128, 3, activation='relu'))
model.add(MaxPooling2D(2,2))
model.add(BatchNormalization())

model.add(Conv2D(256, 3, activation='relu'))
model.add(Conv2D(256, 3, activation='relu'))
model.add(Conv2D(256, 3, activation='relu'))
model.add(MaxPooling2D(2,2))
model.add(BatchNormalization())

model.add(Conv2D(512, 3, activation='relu'))
model.add(Conv2D(512, 3, activation='relu'))
model.add(Conv2D(512, 3, activation='relu'))
model.add(MaxPooling2D(2,2))
model.add(BatchNormalization())

model.add(Conv2D(512, 3, activation='relu'))
model.add(Conv2D(512, 3, activation='relu'))
model.add(Conv2D(512, 3, activation='relu'))
model.add(MaxPooling2D(2,2))
model.add(BatchNormalization())

model.add(Flatten())
model.add(Dense(4096, activation='tanh'))
model.add(Dropout(0.5))
model.add(Dense(4096, activation='tanh'))
model.add(Dropout(0.5))

model.add(Dense(1, activation='sigmoid'))

model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_1 (Conv2D)            (None, 222, 222, 64)      1792      
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 220, 220, 64)      36928     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 110, 110, 64)      0         
_________________________________________________________________
batch_normalization_1 (Batch (None, 110, 110, 64)      256       
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 108, 108, 128)     73856     
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 106, 106, 128)     147584    
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 53, 53, 128)       0         
_________________________________________________________________
batch_normalization_2 (Batch (None, 53, 53, 128)       512       
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 51, 51, 256)       295168    
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 49, 49, 256)       590080    
_________________________________________________________________
conv2d_7 (Conv2D)            (None, 47, 47, 256)       590080    
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 23, 23, 256)       0         
_________________________________________________________________
batch_normalization_3 (Batch (None, 23, 23, 256)       1024      
_________________________________________________________________
conv2d_8 (Conv2D)            (None, 21, 21, 512)       1180160   
_________________________________________________________________
conv2d_9 (Conv2D)            (None, 19, 19, 512)       2359808   
_________________________________________________________________
conv2d_10 (Conv2D)           (None, 17, 17, 512)       2359808   
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 8, 8, 512)         0         
_________________________________________________________________
batch_normalization_4 (Batch (None, 8, 8, 512)         2048      
_________________________________________________________________
conv2d_11 (Conv2D)           (None, 6, 6, 512)         2359808   
_________________________________________________________________
conv2d_12 (Conv2D)           (None, 4, 4, 512)         2359808   
_________________________________________________________________
conv2d_13 (Conv2D)           (None, 2, 2, 512)         2359808   
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (None, 1, 1, 512)         0         
_________________________________________________________________
batch_normalization_5 (Batch (None, 1, 1, 512)         2048      
_________________________________________________________________
flatten_1 (Flatten)          (None, 512)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 4096)              2101248   
_________________________________________________________________
dropout_1 (Dropout)          (None, 4096)              0         
_________________________________________________________________
dense_2 (Dense)              (None, 4096)              16781312  
_________________________________________________________________
dropout_2 (Dropout)          (None, 4096)              0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 4097      
=================================================================
Total params: 33,607,233
Trainable params: 33,604,289
Non-trainable params: 2,944
_________________________________________________________________
# For the multi-GPU runs the model is wrapped with multi_gpu_model (gpus set to
# the number of cards used); the plain model is compiled for the single-GPU run.
#parallel_model = multi_gpu_model(model, gpus=2)
#parallel_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

Setup data

tensorbrd = TensorBoard('./logs/vgg-1')
batch_size = 64
train_dir= './train'
img_size = (224,224)

train_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
        train_dir,
        target_size=img_size,
        #batch_size=batch_size,
        # the following batch size loads all 25000 images at once
        batch_size=25000,
        class_mode='binary')
Found 25000 images belonging to 2 classes.
# Pull the single 25000-image "batch" out of the generator into memory
for X, Y in train_generator:
    print('data X shape:', X.shape)
    print('labels Y shape:', Y.shape)
    break
data X shape: (25000, 224, 224, 3)
labels Y shape: (25000,)
#parallel_model.fit_generator(train_generator, epochs=4, verbose=1, callbacks=[tensorbrd],
#                             steps_per_epoch=25000 // batch_size,
#                             max_queue_size=25000, workers=16)
#model.fit_generator(train_generator, epochs=4, verbose=1, callbacks=[tensorbrd])
#parallel_model.fit(X,Y, batch_size=batch_size, epochs=4, verbose=1, callbacks=[tensorbrd])
# Memory-streaming run: train directly on the in-memory arrays.
# The commented fit_generator calls are the disk-streaming variant.
model.fit(X,Y, batch_size=batch_size, epochs=4, verbose=1, callbacks=[tensorbrd])