

Read this article at https://www.pugetsystems.com/guides/1386
Dr Donald Kinghorn (Scientific Computing Advisor)

TensorFlow Performance with 1-4 GPUs -- RTX Titan, 2080Ti, 2080, 2070, GTX 1660Ti, 1070, 1080Ti, and Titan V

Written on March 14, 2019 by Dr Donald Kinghorn


This post updates and expands much of the GPU testing I have been doing over the last several months. I am using current (as of this post date) TensorFlow builds from NVIDIA NGC and the most recent display driver, and I have results for up to 4 GPU's, including NVLINK, for several of the cards under test. This is something that I have been promising to do!

Test system


  • Puget Systems Peak Single (I used a test-bed system with components that we typically use in the Peak Single configured for Machine Learning)
  • Intel Xeon-W 2175 14-core
  • 128GB Memory
  • 2TB Intel 660p NVMe M.2
  • RTX Titan (1-2), 2080Ti (1-4), 2080 (1-4), 2070 (1-4)
  • GTX 1660Ti (1-2), 1080Ti (1-4)
  • Titan V (1-2)


For details on how I have Docker/NVIDIA-Docker configured on my workstation, have a look at the following post, along with the links it contains to the rest of that series of posts: "How-To Setup NVIDIA Docker and NGC Registry on your Workstation - Part 5 Docker Performance and Resource Tuning".

TensorFlow Multi-GPU performance with 1-4 NVIDIA RTX and GTX GPU's

This is all fresh testing using the updates and configuration described above. Hopefully it will give you a comparative snapshot of multi-GPU performance with TensorFlow in a workstation configuration.

CNN (fp32, fp16) and Big LSTM job run batch sizes for the GPU's

Batch size does affect performance, and larger sizes are usually better. The batch size is limited by the amount of memory available on the GPU's. I used "reasonable" values that would run without giving "out of memory" errors. Multi-GPU jobs used the same batch size settings as the single GPU jobs, since the batch size is set per process. Because the jobs are "data parallel", the "effective" batch size is the per-GPU batch size multiplied by the number of GPU's. The batch sizes for the different cards and job types are in the table below.

CNN [ResNet-50] fp32, fp16 and RNN [Big LSTM] job Batch Sizes for the GPU's tested

GPU           ResNet-50 FP32   ResNet-50 FP16 (Tensor-cores)   Big LSTM
              batch size       batch size                      batch size
RTX Titan     192              384                             640
RTX 2080 Ti   64               128                             448
RTX 2080      64               128                             256
RTX 2070      64               128                             256
GTX 1660 Ti   32               64                              128
Titan V       64               128                             448
GTX 1080 Ti   64               N/A                             448
GTX 1070      64               N/A                             256
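
As a rough illustration of the "effective" batch size point above: in a data-parallel job each process (one per GPU) works on its own batch, so the global batch per training step is the per-GPU batch size times the number of GPU's. The function name below is made up for this sketch and is not part of the benchmark scripts.

```python
def effective_batch_size(per_gpu_batch: int, num_gpus: int) -> int:
    """Global batch per training step for a data-parallel job.

    Each process (one per GPU) handles its own batch of `per_gpu_batch`
    samples, so the totals simply multiply.
    """
    if per_gpu_batch < 1 or num_gpus < 1:
        raise ValueError("batch size and GPU count must be positive")
    return per_gpu_batch * num_gpus


# Per-GPU ResNet-50 fp32 batch size listed for the RTX 2080 Ti in the table.
for gpus in (1, 2, 4):
    print(gpus, "GPU(s):", effective_batch_size(64, gpus))
```

So a 4 GPU job started with --batch_size=64 is really training on 256 images per step, which is worth keeping in mind when comparing results across GPU counts.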

TensorFlow CNN: ResNet-50

Docker container image tensorflow:19.02-py3 from NGC,

docker run --runtime=nvidia --rm -it -v $HOME/projects:/projects nvcr.io/nvidia/tensorflow:19.02-py3

Example command lines for starting jobs,

# For a single GPU
python resnet.py  --layers=50  --batch_size=64  --precision=fp32

# For multi-GPU's 
mpiexec --allow-run-as-root -np 2 python resnet.py  --layers=50  --batch_size=64  --precision=fp32


  • Setting --precision=fp16 means "use tensor-cores".
  • --batch_size= is varied to take advantage of the available memory on the GPU's.
  • Multi-GPU in this version of the CNN docker image uses "Horovod" for parallel execution. That means it is using MPI, specifically the OpenMPI in the container image. The numbers in the charts for 1, 2 and 4 GPU's show very good parallel scaling with Horovod, in my opinion!
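
One simple way to put a number on "parallel scaling" is to compare the multi-GPU throughput against perfect linear scaling from the single GPU result. The sketch below does that; the function name is my own and the images/second values are hypothetical placeholders, not measurements from this post.

```python
def scaling_efficiency(throughput_by_gpus: dict) -> dict:
    """Return parallel efficiency relative to the single-GPU result.

    Efficiency = throughput(n) / (n * throughput(1)), so 1.0 means
    perfect linear scaling.
    """
    base = throughput_by_gpus[1]
    return {n: t / (n * base) for n, t in sorted(throughput_by_gpus.items())}


# Hypothetical images/second values, NOT measurements from this post.
example = {1: 100.0, 2: 190.0, 4: 360.0}
for n, eff in scaling_efficiency(example).items():
    print(f"{n} GPU(s): {eff:.0%} of linear scaling")
```

To use it, substitute the images/second values read off the charts for the card you are interested in.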

[ResNet-50 fp32] TensorFlow, Training performance (Images/second) with 1-4 NVIDIA RTX and GTX GPU's


[ResNet-50 fp16] TensorFlow, Training performance (Images/second) with 1-4 NVIDIA RTX and GTX GPU's


The charts above mostly speak for themselves. One thing to notice for these jobs is that the peer-to-peer communication advantage of using NVLINK has only a small impact. That will not be the case for the LSTM job runs.

TensorFlow LSTM: Big-LSTM 1 Billion Word Dataset

Docker container image tensorflow:19.02-py3 from NGC,

docker run --runtime=nvidia --rm -it -v $HOME/projects:/projects nvcr.io/nvidia/tensorflow:19.02-py3

Example job command-line,

python single_lm_train.py --mode=train --logdir=./logs --num_gpus=2 \
--datadir=./data/1-billion-word-language-modeling-benchmark-r13output/ \
--hpconfig run_profiler=False,max_time=240,num_steps=20,num_shards=8,num_layers=2,\


  • --num_gpus= and batch_size= are the only parameters changed for the different job runs.

[Big LSTM] TensorFlow, Training performance (words/second) with 1-4 NVIDIA RTX and GTX GPU's



  • Batch size and GPU-to-GPU (peer-to-peer) communication have a significant impact on performance with this recurrent neural network. The higher-end GPU's have the advantage of more compute cores and larger memory spaces for data and instructions, as well as the option of using high-performance NVLINK for communication.
  • NVLINK significantly improved performance. The improvement was apparent even when using 2 NVLINK pairs with 4 GPU's. I was a little surprised by this, since I expected it to bottleneck on the "memcpy" to CPU memory space needed between the remaining non-NVLINK-connected pairs.
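
To see why the 4 GPU result with 2 NVLINK pairs was a little surprising, it helps to count the GPU pairs. The sketch below enumerates them, assuming the bridges join GPUs 0-1 and 2-3; those device indices and the function name are hypothetical and only for illustration.

```python
from itertools import combinations


def classify_pairs(num_gpus, nvlink_pairs):
    """Split all GPU pairs into NVLINK-connected and not connected."""
    bridged = {tuple(sorted(p)) for p in nvlink_pairs}
    all_pairs = list(combinations(range(num_gpus), 2))
    linked = [p for p in all_pairs if p in bridged]
    unlinked = [p for p in all_pairs if p not in bridged]
    return linked, unlinked


# Assumed layout: bridges on GPUs 0-1 and 2-3 (device indices are hypothetical).
linked, unlinked = classify_pairs(4, [(0, 1), (2, 3)])
print("NVLINK pairs:    ", linked)
print("non-NVLINK pairs:", unlinked)
```

With that layout only 2 of the 6 possible pairs have an NVLINK bridge, so the other 4 pairs have to communicate without it.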

Happy computing --dbk

Tags: Multi-GPU, TensorFlow, RTX, GTX, Machine Learning, NVIDIA

Very nice benchmark, and at just the perfect time; I was looking for such a comparison.
The only thing I slightly miss is the Tesla K40 and K80 on those graphs.
Could you explain to me what the purpose of those two cards is? All the benchmarks I found suggest they are worse than a 1080, but the price says they should be fantastic. I'm confused.

Posted on 2019-03-15 14:35:17

The K40 and K80 are very old cards at this point. With many Quadro and Tesla models you can tell the generation by the letter at the start of the model number. K = Kepler, which was before Maxwell, Pascal, Volta, and the latest Turing (RTX series) GPUs. That puts it several generations back, and so not really able to hold its own against modern cards.

Tesla cards in general are compute-focused GPUs, which don't have video outputs since they are built for compute workloads instead of actually displaying graphics. They often share similar specs with some of the high-end Quadro cards in the same generation, but may come in passive versions designed for use in very specialized rackmount chassis. Like Quadro cards, they are usually a lot more expensive than GeForce cards with similar performance - but usually have more VRAM and may have other features like better FP64, ECC memory, etc.

Posted on 2019-03-15 16:43:16
Donald Kinghorn

like William said :-)

The K80 was a "workhorse" dual GPU card, that and the K40 (single GPU) really established the NVIDIA platform for compute. There are still big clusters running these cards, but it's debatable whether they are worth the power consumption given that the newer cards deliver so much more performance per watt.

They are out of date. Any 1070 or 2070 or higher GPU will be much faster. Also, those cards were "compute capability" 3.5 and 3.7; Volta and Turing are at 7.0 and 7.5. People are building software now that does not support anything older than 5.0 or 6.0 (Maxwell and Pascal). The latest CUDA 10.1 does still support Kepler (3.5) and TensorFlow still supports 3.5, but I don't expect this to be the case once the big legacy systems get shut down.

I would not have been able to test the K40 using the NGC docker images. NGC only supports compute capability 6.0 or greater!

The K40 is comparable to the original Titan (or Titan Black) and the K80 is like a Titan Z. Those were/are nice cards but not worth considering for any new builds.

Posted on 2019-03-15 18:41:15
Mark Johnstone

Great work as always. I would be interested to see these graphs normalized to GPU cost.

Posted on 2019-03-16 03:31:25
Donald Kinghorn

Yes! That is interesting to look at, but it can be scary because some of the GPU's are expensive! It can be hard to decide what is going to give you the best value. I do like the 2080Ti a lot. I wish it was a bit less expensive, but it is what I would recommend for most uses. However, any recent NVIDIA GPU "greater" than the 1070 is nice!

Posted on 2019-03-18 01:36:53