Read this article at https://www.pugetsystems.com/guides/1152
Dr Donald Kinghorn (Scientific Computing Advisor)

Multi-GPU scaling with Titan V and TensorFlow on a 4 GPU Workstation

Written on May 4, 2018 by Dr Donald Kinghorn

I have been qualifying a 4 GPU workstation for Machine Learning and HPC use and it is looking very good! The last confirmation testing I wanted to do was running some TensorFlow benchmarks on 4 NVIDIA Titan V GPU's. I have that system up and running, and the multi-GPU scaling looks good.

System configuration

Note: At Puget Systems we build a lot of machines for Machine Learning development work and we have been anxious to have a PCIe X16 x 4 GPU configuration on a single-socket Xeon-W. There have been several delays caused by the Intel Spectre/Meltdown mess -- hardware release delays, broken BIOS updates, broken firmware updates, etc. Things are settling down now, but there are still some problems with component availability. We expect to have an "available" release of the system used in this post in a few weeks' time.


System under test,

  • Gigabyte motherboard with 4 X16 PCIe sockets (1 PLX switch on sockets 2,3)
  • Intel Xeon W-2195 18 core (Skylake-W with AVX512)
  • 256GB Reg ECC memory (up to 512GB)
  • 4 x NVIDIA Titan V GPU's
  • Samsung 256GB NVMe M.2

... Full configuration options will be available on release.


Software:

  • Ubuntu 16.04
  • Docker 18.03.0-ce
  • NVIDIA Docker V2
  • TensorFlow 1.7 (running on NVIDIA NGC docker image)

For details on the system environment setup, please see my earlier posts.

Testing Setup

This multi-GPU scaling testing will be using the same convolutional neural network models implemented with TensorFlow that I used in my recent post GPU Memory Size and Deep Learning Performance (batch size) 12GB vs 32GB -- 1080Ti vs Titan V vs GV100. The code I'm running is from the TensorFlow docker image on NVIDIA NGC. The application is the cnn example in the nvidia-examples directory, and I am using synthetic data for image input.

My command line to start the container (after doing docker login nvcr.io to access the NGC docker registry) is,

docker run --runtime=nvidia --rm -it -v $HOME/projects:/projects nvcr.io/nvidia/tensorflow:18.03-py2

Note: there was a newer image available tagged 18.04-py, but when I tried using that image all CNN jobs failed.


Results are for training the convolutional neural networks GoogLeNet, ResNet-50 and Inception-4. These are increasingly complex models. The training is using synthetic image data and measures the forward and backward propagation through the networks for 80 batches of images of the batch size given in the tables. Reported results are Images-per-Second.

The scaling looks very good, as seen in the tables, bar charts and Amdahl's Law curve-fit plots.

The parallel GPU scaling performance does decrease with the increasing complexity of the model but overall looks quite good. This reflects positively on the hardware being tested and on the quality of TensorFlow and the code implementation for these models.

I have included results for both FP32 (32-bit single precision floating point) and FP16 (16-bit half precision floating point). FP16 is what is used by Tensor-cores on the Volta based GPU's like the Titan V. Tensor-cores can significantly increase performance at the risk of numerical instability in the job runs. I believe they should be tried for real-world work since the performance gains are compelling. I have a discussion about Tensor-cores in this post, NVIDIA Titan V plus Tensor-cores Considerations and Testing of FP16 for Deep Learning.

When looking at parallel scaling I like to include a fit of the data to Amdahl's Law. These curves give a representation of the deviation from ideal linear scaling. The curve-fit also gives a parallel fraction P that is an indication of the maximum speedup achievable. Maximum speedup is unlikely to exceed 1/(1-P) with any number of compute devices in the system (GPU's in our case).

Here's the expression of Amdahl's Law that I did a regression fit of the data to,

Performance_Images_Per_Second = Images_Per_Second_For_One_GPU/((1-P)+(P/Number_of_GPU's))
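The regression fit can be sketched in a few lines of Python. This is a minimal illustration, not the exact fitting procedure used for the plots in this post: it does a simple grid-search least-squares fit of P using the GoogLeNet FP32 numbers from the tables below, avoiding any curve-fitting library dependency.

```python
# Least-squares fit of the Amdahl's Law expression above to measured
# throughput, via a grid search over the parallel fraction P.
# Data: GoogLeNet FP32 images/sec results from this post.

gpus = [1, 2, 3, 4]
imgs_per_sec = [851.3, 1525.1, 2272.3, 3080.2]

base = imgs_per_sec[0]  # single-GPU performance

def amdahl(n, p):
    """Predicted images/sec on n GPUs for parallel fraction p."""
    return base / ((1.0 - p) + p / n)

def sse(p):
    """Sum of squared errors between the model and the measurements."""
    return sum((amdahl(n, p) - y) ** 2 for n, y in zip(gpus, imgs_per_sec))

# Scan P over [0, 1) with a fine step and keep the best fit.
best_p = min((i / 10000.0 for i in range(10000)), key=sse)

# Amdahl's Law upper bound on speedup for this parallel fraction.
max_speedup = 1.0 / (1.0 - best_p)
print(f"P = {best_p:.3f}, max speedup = {max_speedup:.1f}x")
```

The fitted P comes out above 0.9 for these data, which is why the 4 GPU scaling in the tables stays close to linear.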

GoogLeNet Multi-GPU scaling with 1-4 Titan V GPU's using TensorFlow -- Training performance (Images/second)

Number of GPU's   FP32 Images/sec (total batch size)   FP16 Images/sec (total batch size)
1                 851.3 (256)                          1370.6 (512)
2                 1525.1 (512)                         2517.0 (1024)
3                 2272.3 (768)                         3661.3 (1536)
4                 3080.2 (1024)                        4969.6 (2048)

[Figure: GoogLeNet bar chart]

[Figure: GoogLeNet Amdahl's Law curve fit]

ResNet-50 Multi-GPU scaling with 1-4 Titan V GPU's using TensorFlow -- Training performance (Images/second)

Number of GPU's   FP32 Images/sec (total batch size)   FP16 Images/sec (total batch size)
1                 293.0 (64)                           571.4 (128)
2                 510.8 (128)                          978.6 (256)
3                 701.8 (192)                          1375.9 (384)
4                 923.4 (256)                          1808.9 (512)

[Figure: ResNet-50 bar chart]

[Figure: ResNet-50 Amdahl's Law curve fit]

Inception-4 Multi-GPU scaling with 1-4 Titan V GPU's using TensorFlow -- Training performance (Images/second)

Number of GPU's   FP32 Images/sec (total batch size)   FP16 Images/sec (total batch size)
1                 89.2 (32)                            189.9 (64)
2                 153.4 (64)                           321.9 (128)
3                 205.3 (96)                           442.7 (192)
4                 265.9 (128)                          585.7 (256)

[Figure: Inception-4 bar chart]

[Figure: Inception-4 Amdahl's Law curve fit]

Happy computing! --dbk

Appendix: Peer to peer bandwidth and latency test results

For completeness, I wanted to include the results from running p2pBandwidthLatencyTest (source available in the NVIDIA CUDA samples).

The bandwidth and latency for this test system look very good. You do see some expected bandwidth lowering and latency increase across devices 2 and 3, which are on the PLX switch.


[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, TITAN V, pciBusID: 17, pciDeviceID: 0, pciDomainID:0
Device: 1, TITAN V, pciBusID: 65, pciDeviceID: 0, pciDomainID:0
Device: 2, TITAN V, pciBusID: b5, pciDeviceID: 0, pciDomainID:0
Device: 3, TITAN V, pciBusID: b6, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2     3
     0	     1     1     1     1
     1	     1     1     1     1
     2	     1     1     1     1
     3	     1     1     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 555.65   5.75   5.74   5.76
     1   5.86 554.87   5.72   5.74
     2   5.87   5.87 554.87   5.77
     3   5.75   5.76   5.81 555.65
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 554.87   6.03   6.02   6.02
     1   6.05 554.87   6.01   6.03
     2   4.39   4.39 554.08   4.26
     3   4.40   4.40   4.27 553.29
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 566.53  11.05  11.03  10.99
     1  11.07 564.49  11.15  10.92
     2  11.15  11.16 562.86   6.18
     3  11.05  11.01   6.19 564.49
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 564.05  10.83   8.40   8.43
     1  10.84 563.67   8.40   8.43
     2   8.40   8.39 564.90   8.07
     3   8.43   8.42   8.11 563.67
P2P=Disabled Latency Matrix (us)
   D\D     0      1      2      3
     0   2.94  16.23  16.50  16.51
     1  16.41   3.03  16.53  16.52
     2  17.40  17.54   3.74  18.90
     3  17.42  17.36  18.73   3.08
P2P=Enabled Latency Matrix (us)
   D\D     0      1      2      3
     0   3.01   5.38   5.55   5.57
     1   5.26   3.00   5.79   5.80
     2   6.81   6.91   3.00   6.64
     3   6.81   6.89   6.62   3.01
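The PLX effect is easy to see if you average each row of the "Unidirectional P2P=Enabled Bandwidth Matrix" above. This small sketch hard-codes those measured values just to make the comparison explicit:

```python
# Average off-diagonal (device-to-peer) bandwidth per source GPU, taken
# from the Unidirectional P2P=Enabled Bandwidth Matrix output above.
# Rows 2 and 3 (the GPUs behind the PLX switch) come out noticeably lower.

matrix = [  # GB/s, row = source device, column = destination device
    [554.87, 6.03, 6.02, 6.02],
    [6.05, 554.87, 6.01, 6.03],
    [4.39, 4.39, 554.08, 4.26],
    [4.40, 4.40, 4.27, 553.29],
]

avg_p2p = [
    sum(v for j, v in enumerate(row) if j != i) / (len(row) - 1)
    for i, row in enumerate(matrix)
]

for dev, bw in enumerate(avg_p2p):
    print(f"device {dev}: {bw:.2f} GB/s average to peers")
```

Devices 0 and 1 average about 6 GB/s to their peers while devices 2 and 3 sit closer to 4.3 GB/s, consistent with the expected cost of crossing the PLX switch.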

Tags: Titan V, Multi-GPU, Tensor cores, NVIDIA, Deep Learning

Does the cooling seem sufficient with all four cards in place? Limited space between them.
Great article, the site is a good resource.

Posted on 2018-07-14 16:44:08

With this CPU the 4 GPUs don't run at 4 x X16, maybe X16/X4/X16/X4

Posted on 2019-01-08 11:31:10
Donald Kinghorn

They're all X16. It's using a PCIe switch (PLX), but that is actually pretty effective. That board had its quirks but it worked OK for multi-GPU. This is not my favorite board though! The ASUS Sage boards are a lot nicer (in my opinion). The single-socket board is using 1 PLX; the dual-socket gets all the lanes from the CPU's. On my personal workstation I originally had the board that was used in this post but I swapped it out for an ASUS WS Pro SE which has 3 X16 slots (no PLX). It's been a great board. For our builds here at Puget we are using the Sage boards and they have been very good! For a Xeon-W I would recommend either of these.

Also, for nearly everything I've ever tested X16 vs X8 is only marginally different anyway :-) Take care mate!

Posted on 2019-01-08 17:40:44

Thanks, your X16 vs X8 article is really valuable!

You are right I skipped the PCIe switch (PLX) in the article.

PLX seems a great solution, as native 4 x X16 can be really expensive (Epyc or dual Xeon, AFAIK)

Posted on 2019-01-09 11:10:30
Donald Kinghorn

PLX is surprisingly good. It works much like a network switch; in fact, PCIe bus traffic looks like network packets! You can have contention with PLX for simultaneous "actions" and it does introduce some latency ... but overall it's a good thing. And, as you've seen, that bandwidth usually doesn't make a lot of difference. I just did some testing to confirm that P2P was disabled on the new RTX 20xx GPU's ... it is ... but even though the bandwidth is 10 times less than it is with the NVLINK bridge attached, the performance difference is less than 10% and that's with code that directly uses P2P!

Posted on 2019-01-09 17:25:16