Multi-GPU scaling with Titan V and TensorFlow on a 4 GPU Workstation
Written on May 4, 2018 by Dr Donald Kinghorn
- System configuration
- Testing Setup
- Results
- GoogLeNet Multi-GPU scaling with 1-4 Titan V GPU's using TensorFlow -- Training performance (Images/second)
- ResNet-50 Multi-GPU scaling with 1-4 Titan V GPU's using TensorFlow -- Training performance (Images/second)
- Inception-4 Multi-GPU scaling with 1-4 Titan V GPU's using TensorFlow -- Training performance (Images/second)
- Appendix: Peer to peer bandwidth and latency test results
I have been qualifying a 4 GPU workstation for Machine Learning and HPC use and it is looking very good! The last confirmation testing I wanted to do was running it with some TensorFlow benchmarks on 4 NVIDIA Titan V GPU's. I have that system up and running and the multi-GPU scaling looks good.
System configuration
Note: At Puget Systems we build a lot of machines for Machine Learning development work and we have been anxious to have a PCIe X16 x 4 GPU configuration on single-socket Xeon-W. There have been several delays caused by the Intel Spectre/Meltdown mess -- hardware release delays, broken BIOS updates, broken firmware updates, etc. Things are settling down now but there are still some problems with component availability. We expect to have an "available" release of the system used in this post in a few weeks time.
Hardware
System under test,
- Gigabyte motherboard with 4 X16 PCIe sockets (1 PLX switch on sockets 2,3)
- Intel Xeon W-2195 18 core (Skylake-W with AVX512)
- 256GB Reg ECC memory (up to 512GB)
- 4 x NVIDIA Titan V GPU's
- Samsung 256GB NVMe M.2
... Full configuration options will be available on release.
Software
- Ubuntu 16.04
- Docker 18.03.0-ce
- NVIDIA Docker V2
- TensorFlow 1.7 (running on NVIDIA NGC docker image)
For details on the system environment setup please see my posts,
- How-To Setup NVIDIA Docker and NGC Registry on your Workstation - Part 1 Introduction and Base System Setup
- How-To Setup NVIDIA Docker and NGC Registry on your Workstation - Part 2 Docker and NVIDIA-Docker-v2
- How-To Setup NVIDIA Docker and NGC Registry on your Workstation - Part 3 Setup User-Namespaces
- How-To Setup NVIDIA Docker and NGC Registry on your Workstation - Part 4 Accessing the NGC Registry
- How-To Setup NVIDIA Docker and NGC Registry on your Workstation - Part 5 Docker Performance and Resource Tuning
Testing Setup
This multi-GPU scaling testing will be using the same convolutional neural network models implemented with TensorFlow that I used in my recent post GPU Memory Size and Deep Learning Performance (batch size) 12GB vs 32GB -- 1080Ti vs Titan V vs GV100. The code I'm running is from the TensorFlow docker image on NVIDIA NGC. The application is cnn
in the nvidia-examples directory, and I am using synthetic data for image input.
My command line to start the container (after doing docker login nvcr.io
to access the NGC docker registry) is,
docker run --runtime=nvidia --rm -it -v $HOME/projects:/projects nvcr.io/nvidia/tensorflow:18.03-py2
Note: there was a newer image available tagged 18.04-py but when I tried using that image all CNN jobs failed.
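To give a concrete (and purely illustrative) picture of what a synthetic-data benchmark of this kind measures, here is a minimal TensorFlow 1.x sketch -- not the NGC nvidia-examples/cnn code -- that times forward and backward passes of a tiny convnet over random image batches and reports images per second. The model, batch size, and step count are arbitrary placeholders I chose for the sketch.

```python
# Minimal synthetic-data throughput sketch (illustrative only -- NOT the NGC
# nvidia-examples/cnn benchmark). It times forward+backward passes of a tiny
# convnet on random "images" and reports images/second.
import time
import tensorflow as tf  # TensorFlow 1.x API, as in the 18.03 NGC image

BATCH, STEPS = 256, 80

images = tf.random_uniform([BATCH, 224, 224, 3])                  # synthetic input batch
labels = tf.random_uniform([BATCH], maxval=1000, dtype=tf.int32)  # synthetic labels

x = tf.layers.conv2d(images, 64, 7, strides=2, activation=tf.nn.relu)
x = tf.layers.max_pooling2d(x, 3, 2)
x = tf.layers.conv2d(x, 128, 3, activation=tf.nn.relu)
x = tf.reduce_mean(x, axis=[1, 2])                                # global average pool
logits = tf.layers.dense(x, 1000)

loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
train_op = tf.train.MomentumOptimizer(0.01, 0.9).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op)                                            # warm-up step
    t0 = time.time()
    for _ in range(STEPS):
        sess.run(train_op)
    print("%.1f images/sec" % (BATCH * STEPS / (time.time() - t0)))
```

The real benchmark models (GoogLeNet, ResNet-50, Inception-4) are of course much larger, but the measurement idea is the same: with synthetic data there is no disk I/O or JPEG decode pipeline in the way, so you are timing essentially pure GPU compute.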
Results
Results are for training the convolution neural networks GoogLeNet, ResNet-50 and Inception-4. These are increasingly complex models. The training is using synthetic image data and measures the forward and backward propagation through the networks for 80 batches of images of batch-size given in the tables. Reported results are Images-per-Second.
The scaling looks very good, as seen in the tables, bar-charts and Amdahl's Law curve-fit plots.
The parallel GPU scaling performance does decrease with the increasing complexity of the model but overall looks quite good. This reflects positively on the hardware being tested and on the quality of TensorFlow and the code implementation for these models.
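As a rough illustration of the data-parallel pattern behind these numbers, each GPU computes gradients on its own slice of the global batch and the averaged gradients update one shared set of weights. The sketch below is a generic TensorFlow 1.x "tower" version of that idea; it is not the NGC script, which uses its own, more optimized variable-update strategies (for example NCCL all-reduce).

```python
# Generic data-parallel "tower" sketch (illustrative only -- the NGC cnn benchmark
# uses its own, more optimized variable-update strategies such as NCCL all-reduce).
# Each GPU computes gradients on its own slice of the batch; the averaged gradients
# are applied once to the shared weights.
import tensorflow as tf  # TensorFlow 1.x API

NUM_GPUS = 4
PER_GPU_BATCH = 64

def tower_loss(images, labels):
    # Tiny stand-in model built with tf.get_variable so AUTO_REUSE shares the
    # weights across all towers.
    k = tf.get_variable('conv_kernel', [3, 3, 3, 64])
    x = tf.nn.relu(tf.nn.conv2d(images, k, strides=[1, 2, 2, 1], padding='SAME'))
    x = tf.reduce_mean(x, axis=[1, 2])                       # global average pool
    w = tf.get_variable('fc_weights', [64, 1000])
    logits = tf.matmul(x, w)
    return tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

optimizer = tf.train.MomentumOptimizer(0.01, 0.9)
tower_grads = []
for i in range(NUM_GPUS):
    with tf.device('/gpu:%d' % i), tf.variable_scope('model', reuse=tf.AUTO_REUSE):
        images = tf.random_uniform([PER_GPU_BATCH, 224, 224, 3])  # this GPU's slice
        labels = tf.random_uniform([PER_GPU_BATCH], maxval=1000, dtype=tf.int32)
        tower_grads.append(optimizer.compute_gradients(tower_loss(images, labels)))

# Average each variable's gradient across the towers and apply a single update.
avg_grads = [(tf.reduce_mean(tf.stack([g for g, _ in gv]), axis=0), gv[0][1])
             for gv in zip(*tower_grads)]
train_op = optimizer.apply_gradients(avg_grads)
```

With four towers the effective (total) batch size is 4 x 64, which is why the total batch sizes in the tables below scale with the number of GPU's.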
I have included results for both FP32 (32-bit single precision floating point) and FP16 (16-bit half precision floating point). FP16 is what is used by the Tensor-cores on Volta based GPU's like the Titan V. Tensor-cores can significantly increase performance at the risk of numerical instability in the job runs. I believe they should be trialled for real-world work since the performance gains are compelling. I have a discussion about Tensor-cores in this post, NVIDIA Titan V plus Tensor-cores Considerations and Testing of FP16 for Deep Learning.
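For reference, the usual FP16 recipe in TensorFlow 1.x keeps an FP32 "master" copy of the weights, runs the compute in FP16 so the Tensor-cores can be used, and scales the loss to protect small FP16 gradients. The snippet below is only a hedged sketch of that pattern (the NGC cnn script has its own FP16 option); the getter name, model, and loss-scale constant are placeholders I chose for the sketch.

```python
# Mixed-precision sketch (illustrative only; the NGC cnn benchmark has its own FP16 option).
# Weights are stored as FP32 "master" copies but the math runs in FP16 so the
# Volta Tensor-cores can be used; the loss is scaled to protect small FP16 gradients.
import tensorflow as tf  # TensorFlow 1.x API

def fp32_storage_getter(getter, name, shape=None, dtype=None, **kwargs):
    # Create/store the variable in FP32, hand back an FP16 cast to the compute graph.
    var = getter(name, shape, dtype=tf.float32, **kwargs)
    return tf.cast(var, tf.float16) if dtype == tf.float16 else var

LOSS_SCALE = 128.0   # arbitrary static scale for this sketch

images = tf.cast(tf.random_uniform([128, 224, 224, 3]), tf.float16)
labels = tf.random_uniform([128], maxval=1000, dtype=tf.int32)

with tf.variable_scope('model', custom_getter=fp32_storage_getter):
    k = tf.get_variable('conv_kernel', [3, 3, 3, 64], dtype=tf.float16)
    x = tf.nn.relu(tf.nn.conv2d(images, k, strides=[1, 2, 2, 1], padding='SAME'))
    x = tf.reduce_mean(x, axis=[1, 2])
    w = tf.get_variable('fc_weights', [64, 1000], dtype=tf.float16)
    logits = tf.cast(tf.matmul(x, w), tf.float32)   # do the loss in FP32

loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
opt = tf.train.MomentumOptimizer(0.01, 0.9)
grads_and_vars = opt.compute_gradients(loss * LOSS_SCALE)
grads_and_vars = [(g / LOSS_SCALE, v) for g, v in grads_and_vars]  # un-scale before the update
train_op = opt.apply_gradients(grads_and_vars)
```

You will also notice in the tables that the FP16 runs use twice the per-GPU batch size of the FP32 runs, which the smaller FP16 memory footprint allows.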
When looking at parallel scaling I like to include fitting the data to Amdahl's Law. These curves give a representation of the deviation from ideal linear scaling. The curve-fit also gives a parallel fraction P
that is an indication of the maximum speedup achievable. Maximum speedup is unlikely to exceed 1/(1-P) with any number of compute devices in the system (GPU's in our case).
Here's the expression of Amdahl's Law that I did a regression fit of the data to,
Performance_Images_Per_Second = Images_Per_Second_For_One_GPU/((1-P)+(P/Number_of_GPU's))
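As an illustration (not necessarily the exact regression tooling I used), here is how that fit could be done with SciPy's curve_fit using the GoogLeNet FP32 numbers from the table below; the starting guess for P is my choice.

```python
# Illustrative Amdahl's Law fit (not necessarily the exact fitting code used for the plots).
# Uses the GoogLeNet FP32 numbers from the table below.
import numpy as np
from scipy.optimize import curve_fit

gpus = np.array([1.0, 2.0, 3.0, 4.0])
img_per_sec = np.array([851.3, 1525.1, 2272.3, 3080.2])   # GoogLeNet FP32

def amdahl(n, p):
    # Performance(n) = Performance(1) / ((1 - p) + p / n)
    return img_per_sec[0] / ((1.0 - p) + p / n)

(p_fit,), _ = curve_fit(amdahl, gpus, img_per_sec, p0=[0.9])
print("parallel fraction P   = %.3f" % p_fit)
print("max speedup ~ 1/(1-P) = %.1f" % (1.0 / (1.0 - p_fit)))
```

As a quick sanity check on what a number like that means: the 4-GPU FP32 GoogLeNet point alone gives a speedup of 3080.2/851.3 ≈ 3.62, and solving Amdahl's Law for P at n = 4 implies P ≈ 0.96, i.e. a maximum possible speedup of roughly 1/(1-P) ≈ 28 no matter how many GPU's you add.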
GoogLeNet Multi-GPU scaling with 1-4 Titan V GPU's using TensorFlow -- Training performance (Images/second)
Number of GPU's | FP32 Images/sec (total batch size) | FP16 Images/sec (total batch size) |
---|---|---|
1 | 851.3 (256) | 1370.6 (512) |
2 | 1525.1 (512) | 2517.0 (1024) |
3 | 2272.3 (768) | 3661.3 (1536) |
4 | 3080.2 (1024) | 4969.6 (2048) |
ResNet-50 Multi-GPU scaling with 1-4 Titan V GPU's using TensorFlow -- Training performance (Images/second)
Number of GPU's | FP32 Images/sec (total batch size) | FP16 Images/sec (total batch size) |
---|---|---|
1 | 293.0 (64) | 571.4 (128) |
2 | 510.8 (128) | 978.6 (256) |
3 | 701.8 (192) | 1375.9 (384) |
4 | 923.4 (256) | 1808.9 (512) |
Inception-4 Multi-GPU scaling with 1-4 Titan V GPU's using TensorFlow -- Training performance (Images/second)
Number of GPU's | FP32 Images/sec (total batch size) | FP16 Images/sec (total batch size) |
---|---|---|
1 | 89.2 (32) | 189.9 (64) |
2 | 153.4 (64) | 321.9 (128) |
3 | 205.3 (96) | 442.7 (192) |
4 | 265.9 (128) | 585.7 (256) |
Happy computing! --dbk
Appendix: Peer to peer bandwidth and latency test results
For completeness, I wanted to include the results from running p2pBandwidthLatencyTest
(source available from "CUDA samples").
The bandwidth and latency for this test system look very good. You do see some expected bandwidth lowering and latency increase across devices 2 and 3, which are on the PLX switch.
./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, TITAN V, pciBusID: 17, pciDeviceID: 0, pciDomainID:0
Device: 1, TITAN V, pciBusID: 65, pciDeviceID: 0, pciDomainID:0
Device: 2, TITAN V, pciBusID: b5, pciDeviceID: 0, pciDomainID:0
Device: 3, TITAN V, pciBusID: b6, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) in those cases.
P2P Connectivity Matrix
D\D 0 1 2 3
0 1 1 1 1
1 1 1 1 1
2 1 1 1 1
3 1 1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 555.65 5.75 5.74 5.76
1 5.86 554.87 5.72 5.74
2 5.87 5.87 554.87 5.77
3 5.75 5.76 5.81 555.65
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 554.87 6.03 6.02 6.02
1 6.05 554.87 6.01 6.03
2 4.39 4.39 554.08 4.26
3 4.40 4.40 4.27 553.29
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 566.53 11.05 11.03 10.99
1 11.07 564.49 11.15 10.92
2 11.15 11.16 562.86 6.18
3 11.05 11.01 6.19 564.49
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 564.05 10.83 8.40 8.43
1 10.84 563.67 8.40 8.43
2 8.40 8.39 564.90 8.07
3 8.43 8.42 8.11 563.67
P2P=Disabled Latency Matrix (us)
D\D 0 1 2 3
0 2.94 16.23 16.50 16.51
1 16.41 3.03 16.53 16.52
2 17.40 17.54 3.74 18.90
3 17.42 17.36 18.73 3.08
P2P=Enabled Latency Matrix (us)
D\D 0 1 2 3
0 3.01 5.38 5.55 5.57
1 5.26 3.00 5.79 5.80
2 6.81 6.91 3.00 6.64
3 6.81 6.89 6.62 3.01
Comments
Does the cooling seem sufficient with all four cards in place? Limited space between them.
Great article, the site is a good resource.
With this CPU the 4 GPUs don't run at 4 x 16x, maybe 16x/4x/16x/4x?
They're all X16. It's using a PCIe switch (PLX) but that is actually pretty effective. That board had its quirks but it worked OK for multi-GPU. This is not my favorite board though! The ASUS Sage boards are a lot nicer (in my opinion). The single-socket board is using 1 PLX; the dual-socket gets all the lanes from the CPU's. On my personal workstation I originally had the board that was used in this post but I swapped it out for an ASUS WS Pro SE which has 3 X16 slots (no PLX). It's been a great board. For our builds here at Puget we are using the Sage boards and they have been very good! For a Xeon-W I would recommend either of these.
Also, for nearly everything I've ever tested X16 vs X8 is only marginally different anyway :-) Take care mate!
Thanks, your x16 x8 article is really valuable!
You are right I skipped the PCIe switch (PLX) in the article.
PLX seems a great solution as native 4 x16 can be really expensive (Epyc or dual Xeon AFAIK).
PLX is surprisingly good. It works very similarly to a network switch; in fact PCI bus traffic looks like network packets! You can have contention with PLX for simultaneous "actions" and it does introduce some latency... but overall it's a good thing. And, as you've seen, that bandwidth usually doesn't make a lot of difference. I just did some testing to confirm that P2P was disabled on the new RTX 20xx GPU's ... it is ... but even though the bandwidth is 10 times less than it is with the NVLINK bridge attached, the performance difference is less than 10%, and that's with code that directly uses P2P!