Introduction
This post is an update and expansion of much of the GPU testing I have been doing over the last several months. I am using current (as of this post's date) TensorFlow builds from NVIDIA NGC and the most recent display driver, and I have results for up to 4 GPUs, including NVLINK, for several of the cards under test. This is something that I have been promising to do!
Test system
Hardware
- Puget Systems Peak Single (I used a test-bed system with components that we typically use in the Peak Single configured for Machine Learning)
- Intel Xeon-W 2175 14-core
- 128GB Memory
- 2TB Intel 660p NVMe M.2
- RTX Titan (1-2), 2080Ti (1-4), 2080 (1-4), 2070 (1-4)
- GTX 1660Ti (1-2), 1080Ti (1-4)
- Titan V (1-2)
Software
- Ubuntu 18.04
- NVIDIA display driver 418.43 (from Graphics-Drivers ppa)
- TensorFlow 1.13.0-rc0
- Docker 18.09.3-ce
- NVIDIA-Docker 2.0.3
- NVIDIA NGC container registry
- Container image: nvcr.io/nvidia/tensorflow:19.02-py3 for “Big LSTM” and “CNN”
For details on how I have Docker/NVIDIA-Docker configured on my workstation, have a look at the following post along with the links it contains to the rest of that series of posts: How-To Setup NVIDIA Docker and NGC Registry on your Workstation – Part 5 Docker Performance and Resource Tuning
TensorFlow Multi-GPU performance with 1-4 NVIDIA RTX and GTX GPUs
This is all fresh testing using the updates and configuration described above. Hopefully it will give you a comparative snapshot of multi-GPU performance with TensorFlow in a workstation configuration.
CNN (fp32, fp16) and Big LSTM job run batch sizes for the GPUs
Batch size does affect performance, and larger sizes are usually better. The batch size is limited by the amount of memory available on the GPUs. "Reasonable" values that would run without giving "out of memory" errors were used. Multi-GPU jobs used the same batch size settings as the single-GPU jobs since they are set per process. That means the "effective" batch sizes are multiples of the per-GPU batch size since the jobs are "data parallel" (see the short sketch after the table for the arithmetic). The batch size information for the different cards and job types is in the table below.
CNN [ResNet-50] fp32, fp16 and RNN [Big LSTM] job batch sizes for the GPUs tested

| GPU | ResNet-50 FP32 batch size | ResNet-50 FP16 (Tensor-cores) batch size | Big LSTM batch size |
|---|---|---|---|
| RTX Titan | 192 | 384 | 640 |
| RTX 2080 Ti | 64 | 128 | 448 |
| RTX 2080 | 64 | 128 | 256 |
| RTX 2070 | 64 | 128 | 256 |
| GTX 1660 Ti | 32 | 64 | 128 |
| Titan V | 64 | 128 | 448 |
| GTX 1080 Ti | 64 | N/A | 448 |
| GTX 1070 | 64 | N/A | 256 |
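To make the "effective" batch size concrete, here is a small illustrative Python snippet (not part of the benchmark scripts) that multiplies the per-GPU batch sizes from the table by the GPU count. The values used are the RTX 2080 Ti entries as an example.

```python
# Illustrative only: effective (global) batch size for data-parallel runs.
# Each process/GPU uses the same per-GPU batch, so the total work per step
# is per-GPU batch x number of GPUs. Values below are the RTX 2080 Ti
# entries from the table above.
per_gpu_batch = {"ResNet-50 fp32": 64, "ResNet-50 fp16": 128, "Big LSTM": 448}

for job, batch in per_gpu_batch.items():
    for num_gpus in (1, 2, 4):
        print("{}: {} GPU(s) -> effective batch size {}".format(
            job, num_gpus, batch * num_gpus))
```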
TensorFlow CNN: ResNet-50
Docker container image tensorflow:19.02-py3 from NGC,
docker run --runtime=nvidia --rm -it -v $HOME/projects:/projects nvcr.io/nvidia/tensorflow:19.02-py3
Example command lines for starting jobs,
# For a single GPU
python resnet.py --layers=50 --batch_size=64 --precision=fp32
# For multi-GPU's
mpiexec --allow-run-as-root -np 2 python resnet.py --layers=50 --batch_size=64 --precision=fp32
Notes:
- Setting --precision=fp16 means "use tensor-cores".
- The --batch_size= values are varied to take advantage of the available memory on the GPUs.
- Multi-GPU in this version of the CNN docker image uses "Horovod" for parallel execution. That means it is using MPI, in particular the OpenMPI build included in the container image. The numbers in the charts for 1, 2 and 4 GPUs show very good parallel scaling with Horovod, in my opinion! (The sketch after these notes shows what that pattern looks like in code.)
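For those curious what the Horovod data-parallel pattern looks like, here is a minimal sketch using the TensorFlow 1.x Horovod API. It is not the resnet.py script from the NGC image; the toy variable and loss stand in for the real model, but the hvd.init(), optimizer wrapping and variable broadcast are the standard Horovod pieces, with mpiexec launching one copy per GPU.

```python
# Minimal Horovod data-parallel sketch (TensorFlow 1.x API).
# This is NOT the resnet.py from the NGC image; the toy variable/loss below
# stands in for the real model so the example is self-contained.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one MPI rank per GPU, launched with mpiexec -np <num_gpus>

# Pin each rank to its own GPU
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy "model": a single variable and a quadratic loss
global_step = tf.train.get_or_create_global_step()
w = tf.Variable(1.0)
loss = tf.reduce_mean(tf.square(w - 3.0))

# Scale the learning rate by the number of workers and wrap the optimizer;
# DistributedOptimizer averages gradients across the ranks.
opt = tf.train.MomentumOptimizer(0.01 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss, global_step=global_step)

hooks = [
    hvd.BroadcastGlobalVariablesHook(0),    # start all ranks from rank 0's weights
    tf.train.StopAtStepHook(last_step=100)  # short run for the sketch
]

with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
    while not sess.should_stop():
        sess.run(train_op)
```

A sketch like this would be launched the same way as the resnet.py example above, e.g. mpiexec --allow-run-as-root -np 2 python train_sketch.py (train_sketch.py being whatever you name the file).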
[ResNet-50 fp32] TensorFlow, Training performance (Images/second) with 1-4 NVIDIA RTX and GTX GPUs
[ResNet-50 fp16] TensorFlow, Training performance (Images/second) with 1-4 NVIDIA RTX and GTX GPUs
The charts above mostly speak for themselves. One thing to notice for these jobs is that the peer-to-peer communication advantage of using NVLINK has only a small impact. That will not be the case for the LSTM job runs.
TensorFlow LSTM: Big-LSTM 1 Billion Word Dataset
Docker container image tensorflow:19.02-py3 from NGC,
docker run --runtime=nvidia --rm -it -v $HOME/projects:/projects nvcr.io/nvidia/tensorflow:19.02-py3
Example job command-line,
python single_lm_train.py --mode=train --logdir=./logs --num_gpus=2 \
    --datadir=./data/1-billion-word-language-modeling-benchmark-r13output/ \
    --hpconfig run_profiler=False,max_time=240,num_steps=20,num_shards=8,num_layers=2,\
learning_rate=0.2,max_grad_norm=1,keep_prob=0.9,emb_size=1024,projected_size=1024,\
state_size=8192,num_sampled=8192,batch_size=448
Note: --num_gpus= and batch_size= are the only parameters changed for the different job runs.
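As a quick illustration (not part of the benchmark itself), the sketch below generates those command-line variations, changing only --num_gpus and batch_size. The (GPU count, batch size) pairs shown are just an example for one card; the per-GPU batch sizes come from the table earlier in the post.

```python
# Illustrative helper: build the single_lm_train.py command lines, varying
# only --num_gpus and batch_size as described in the note above.
HPCONFIG = ("run_profiler=False,max_time=240,num_steps=20,num_shards=8,"
            "num_layers=2,learning_rate=0.2,max_grad_norm=1,keep_prob=0.9,"
            "emb_size=1024,projected_size=1024,state_size=8192,"
            "num_sampled=8192,batch_size={batch}")

DATADIR = "./data/1-billion-word-language-modeling-benchmark-r13output/"

for num_gpus, batch in [(1, 448), (2, 448), (4, 448)]:  # e.g. RTX 2080 Ti
    print("python single_lm_train.py --mode=train --logdir=./logs"
          " --num_gpus={} --datadir={} --hpconfig {}".format(
              num_gpus, DATADIR, HPCONFIG.format(batch=batch)))
```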
[Big LSTM] TensorFlow, Training performance (words/second) with 1-4 NVIDIA RTX and GTX GPUs
Notes:
- Batch size and GPU-to-GPU (peer-to-peer) communication have a significant impact on performance with this recurrent neural network. The higher-end GPUs have the advantage of both more compute cores and larger memory spaces for data and instructions, as well as the possibility of using high-performance NVLINK for communication.
- NVLINK significantly improved performance. This improvement was apparent even when using 2 NVLINK pairs with 4 GPUs. I was a little surprised by this since I expected it to bottleneck on the "memcpy" to CPU memory space needed between the remaining non-NVLINK-connected pairs.
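If you want to confirm which of your GPUs are actually paired over NVLINK before running these jobs, something like the small check below should work. It assumes nvidia-smi is on the PATH; "topo -m" prints the GPU-to-GPU connection matrix and "nvlink --status" shows the per-link state.

```python
# Check the GPU interconnect topology before a multi-GPU run.
# Assumes nvidia-smi is installed and on the PATH.
import subprocess

# Matrix of GPU-to-GPU connection types (NV1/NV2 = NVLINK, PIX/PHB/SYS = PCIe paths)
print(subprocess.check_output(["nvidia-smi", "topo", "-m"]).decode())

# Per-link NVLINK status for each GPU
print(subprocess.check_output(["nvidia-smi", "nvlink", "--status"]).decode())
```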
Happy computing –dbk