Read this article at https://www.pugetsystems.com/guides/1386
Dr Donald Kinghorn (Scientific Computing Advisor)

TensorFlow Performance with 1-4 GPUs -- RTX Titan, 2080Ti, 2080, 2070, GTX 1660Ti, 1070, 1080Ti, and Titan V

Written on March 14, 2019 by Dr Donald Kinghorn


This post updates and expands much of the GPU testing I have been doing over the last several months. I am using current (as of this post date) TensorFlow builds from NVIDIA NGC and the most recent display driver, and I have results for up to 4 GPUs, including NVLINK, for several of the cards under test. This is something that I have been promising to do!

Test system


  • Puget Systems Peak Single (I used a test-bed system with components that we typically use in the Peak Single configured for Machine Learning)
  • Intel Xeon-W 2175 14-core
  • 128GB Memory
  • 2TB Intel 660p NVMe M.2
  • RTX Titan (1-2), 2080Ti (1-4), 2080 (1-4), 2070 (1-4)
  • GTX 1660Ti (1-2), 1080Ti (1-4)
  • Titan V (1-2)


For details on how I have Docker/NVIDIA-Docker configured on my workstation, have a look at the following post, along with the links it contains to the rest of that series of posts: "How-To Setup NVIDIA Docker and NGC Registry on your Workstation - Part 5 Docker Performance and Resource Tuning".

TensorFlow Multi-GPU performance with 1-4 NVIDIA RTX and GTX GPU's

This is all fresh testing using the updates and configuration described above. Hopefully it will give you a comparative snapshot of multi-GPU performance with TensorFlow in a workstation configuration.

CNN (fp32, fp16) and Big LSTM job run batch sizes for the GPU's

Batch size does affect performance, and larger sizes are usually better. The batch size is limited by the amount of memory available on the GPUs. I used "reasonable" values that would run without giving "out of memory" errors. Multi-GPU jobs used the same batch size settings as the single-GPU jobs, since the batch size is set per process. That means the "effective" batch size is the per-process batch size multiplied by the number of GPUs, since the jobs are "data parallel". The batch size information for the different cards and job types is in the table below.
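To make the "effective" batch size concrete, here is a small sketch. The function name is my own and is not part of the benchmark scripts; it only assumes one process per GPU, as in the data-parallel runs described above.

```python
def effective_batch_size(per_gpu_batch: int, num_gpus: int) -> int:
    """Global batch size for a data-parallel job with one process per GPU.

    Hypothetical helper for illustration only; not part of resnet.py.
    """
    if per_gpu_batch <= 0 or num_gpus <= 0:
        raise ValueError("batch size and GPU count must be positive")
    return per_gpu_batch * num_gpus


# Example: a per-process batch size of 64 on 1, 2 and 4 GPUs
for n in (1, 2, 4):
    print(f"{n} GPU(s): effective batch size {effective_batch_size(64, n)}")
```

So a card that runs with a batch size of 64 processes 256 images per training step when four of them run together.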

CNN [ResNet-50] fp32, fp16 and RNN [Big LSTM] job Batch Sizes for the GPU's tested

GPU            ResNet-50 FP32    ResNet-50 FP16 (Tensor-cores)    Big LSTM
               batch size        batch size                       batch size
RTX Titan      192               384                              640
RTX 2080 Ti    64                128                              448
RTX 2080       64                128                              256
RTX 2070       64                128                              256
GTX 1660 Ti    32                64                               128
Titan V        64                128                              448
GTX 1080 Ti    64                N/A                              448
GTX 1070       64                N/A                              256

TensorFlow CNN: ResNet-50

Docker container image tensorflow:19.02-py3 from NGC,

docker run --runtime=nvidia --rm -it -v $HOME/projects:/projects nvcr.io/nvidia/tensorflow:19.02-py3

Example command lines for starting jobs,

# For a single GPU
python resnet.py  --layers=50  --batch_size=64  --precision=fp32

# For multi-GPU's 
mpiexec --allow-run-as-root -np 2 python resnet.py  --layers=50  --batch_size=64  --precision=fp32


  • Setting --precision=fp16 means "use tensor-cores".
  • --batch_size= batch sizes are varied to take advantage of available memory on the GPU's.
  • Multi-GPU in this version of the CNN docker image is using "Horovod" for parallel execution. That means it is using MPI and in particular OpenMPI is being used in the container image. The numbers in the charts for 1, 2 and 4 GPU's show very good parallel scaling with horovod, in my opinion!
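If you want to script runs over several GPU counts, a sketch like the following builds the same command lines shown above. The helper name is hypothetical; the flags are only the ones used in this post, and I am assuming you run it inside the NGC container next to resnet.py.

```python
def resnet_command(num_gpus, batch_size, precision="fp32", layers=50):
    """Build the argument list for a ResNet job (hypothetical helper).

    A single GPU runs the script directly; multiple GPUs are launched
    with mpiexec, one process per GPU, as in the examples above.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    train = [
        "python", "resnet.py",
        f"--layers={layers}",
        f"--batch_size={batch_size}",
        f"--precision={precision}",
    ]
    if num_gpus == 1:
        return train
    return ["mpiexec", "--allow-run-as-root", "-np", str(num_gpus)] + train


# Print the commands for 1, 2 and 4 GPUs at batch size 64
for n in (1, 2, 4):
    print(" ".join(resnet_command(n, 64)))
```

The list form can be passed straight to subprocess.run if you want to launch the jobs from Python rather than copy the printed lines.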

[ResNet-50 fp32] TensorFlow, Training performance (Images/second) with 1-4 NVIDIA RTX and GTX GPU's


[ResNet-50 fp16] TensorFlow, Training performance (Images/second) with 1-4 NVIDIA RTX and GTX GPU's

The charts above mostly speak for themselves. One thing to notice for these jobs is that the peer-to-peer communication advantage of using NVLINK has only a small impact. That will not be the case for the LSTM job runs.

TensorFlow LSTM: Big-LSTM 1 Billion Word Dataset

Docker container image tensorflow:19.02-py3 from NGC,

docker run --runtime=nvidia --rm -it -v $HOME/projects:/projects nvcr.io/nvidia/tensorflow:19.02-py3

Example job command-line,

python single_lm_train.py --mode=train --logdir=./logs --num_gpus=2 \
--datadir=./data/1-billion-word-language-modeling-benchmark-r13output/ \
--hpconfig run_profiler=False,max_time=240,num_steps=20,num_shards=8,num_layers=2,\


  • --num_gpus= and batch_size= are the only parameters changed for the different job runs.

[Big LSTM] TensorFlow, Training performance (words/second) with 1-4 NVIDIA RTX and GTX GPU's


  • Batch size and GPU-to-GPU (peer-to-peer) communication have a significant impact on the performance with this recurrent neural network. The higher end GPU's have the advantage of both increased numbers of compute cores and the availability of larger memory spaces for data and instructions as well as the possibility of using the high performance NVLINK for communication.
  • NVLINK significantly improved performance. This performance improvement was even apparent when using 2 NVLINK pairs with 4 GPU's. I was a little surprised by this since I expected it to bottleneck on the "memcpy" to CPU memory space needed between the remaining non NVLINK connected pairs.

Happy computing --dbk

Tags: Multi-GPU, TensorFlow, RTX, GTX, Machine Learning, NVIDIA

Very nice benchmark, and just in a perfect time I was looking for such comparison.
The only thing I slightly miss is Tesla K40 and K80 on those graphs.
Could you explain what the purpose of those two cards is? All the benchmarks I found suggest they are worse than a 1080, but the price says they should be fantastic. I'm confused.

Posted on 2019-03-15 14:35:17

The K40 and K80 are very old cards at this point. With many Quadro and Tesla models you can tell the generation by the letter at the start of the model number. K = Kepler, which was before Maxwell, Pascal, Volta, and the latest Turing (RTX series) GPUs. That puts it several generations back, and so not really able to hold its own against modern cards.

Tesla cards in general are compute-focused GPUs, which don't have video outputs since they are built for compute workloads instead of actually displaying graphics. They often share similar specs with some of the high-end Quadro cards in the same generation, but may come in passive versions designed for use in very specialized rackmount chassis. Like Quadro cards, they are usually a lot more expensive than GeForce cards with similar performance - but usually have more VRAM and may have other features like better FP64, ECC memory, etc.

Posted on 2019-03-15 16:43:16
Donald Kinghorn

like William said :-)

The K80 was a "workhorse" dual GPU card, that and the K40 (single GPU) really established the NVIDIA platform for compute. There are still big clusters running these cards, but it's debatable whether they are worth the power consumption given that the newer cards deliver so much more performance per watt.

They are out of date. Any 1070, 2070, or higher GPU will be much faster. Also, those cards are "compute capability" 3.5 and 3.7, while Volta and Turing are at 7.0 and 7.5. People are now building software that does not support anything older than 5.0 or 6.0 (Maxwell and Pascal). The latest CUDA 10.1 does still support Kepler (3.5), and TensorFlow still supports 3.5, but I don't expect this to be the case once the big legacy systems get shut down.

I would not have been able to test the K40 using the NGC docker images. NGC only supports compute capability 6.0 or greater!

The K40 is comparable to the original Titan (or Titan Black) and the K80 is like a Titan Z. Those were/are nice cards but not worth considering for any new builds.

Posted on 2019-03-15 18:41:15
Mark Johnstone

Great work as always. I would be interested to see these graphs normalized to GPU cost.

Posted on 2019-03-16 03:31:25
Donald Kinghorn

Yes! That is interesting to look at, but it can be scary because some of the GPUs are expensive! It can be hard to decide what is going to give you the best value. I do like the 2080Ti a lot. I wish it was a bit less expensive, but it is what I would recommend for most uses. However, any recent NVIDIA GPU "greater" than the 1070 is nice!

Posted on 2019-03-18 01:36:53

Thank you very much for sharing these informative benchmarks!
I am most interested in your results with four RTX 2080Ti GPUs in combination with NVlink adapters. Exactly which type of adapter did you use? What is the spacing of the GPUs on your motherboard, two slot distance?

Posted on 2019-03-28 12:56:11
Donald Kinghorn

Yes! That was more effective than I expected. I took a photo of the test bench I was using during that testing. Note that these are the Gigabyte cards which I like a lot for this use since they have a slight bevel at the back of the card to lower the pressure for air moving to the blower. It's a nice design touch.

Posted on 2019-03-29 15:34:06
Donald Kinghorn

Those actually look like they might be the Quadro bridges ?? My colleague William George wrote up a post with a nice table showing the compatibility of the different bridges https://www.pugetsystems.co...

Posted on 2019-03-29 15:43:33

That looks very promising, thanks for the details!

Posted on 2019-03-30 11:18:42

what model Gigabyte card is that??

Posted on 2019-04-05 15:29:49
Donald Kinghorn

... let me see if I can find it on our parts page ...

I do like that card. We gave them positive feedback on it and requested they shorten the shroud a couple mm so it will fit in a couple of other chassis easier.

Posted on 2019-04-08 16:32:23

Thanks for the update... I have been looking at the Asus 2080Ti Turbo model, as I already have a 1080Ti Turbo model. Would it be comparable to the Gigabyte card? The Asus also looks to be made especially for multi-GPU setups (blower cards).

Posted on 2019-04-08 16:39:55
Donald Kinghorn

We have been using those too (we use a variety because of supply issues). They have also been good, and I don't think they would give you any trouble. I do prefer the Gigabyte because of the air ramp and more aggressive fan, but really, as long as you have good air movement in the chassis, either of them is good ... EVGA is nice too, but we did have a bad batch of them come through near the first release ... historically they have been great.

Posted on 2019-04-09 15:26:30

Gigabyte also has an Aorus 2080ti Turbo model as well but it does not seem to be available retail yet. It looks like a more performance oriented blower version as its base and overclock clocking is much higher...

Posted on 2019-04-09 19:06:38
Donald Kinghorn

That is probably a great card, but I will caution you a bit ... Your workload will put a lot of stress on the cards, and "historically" overclocked cards have had a much higher failure rate with compute loads (often from memory overheating). That's just a caution; for the last few generations the NVIDIA cards have been amazingly solid, even overclocked. I think manufacturers get tired of supporting RMAs and add more protections on the boards, and it sounds like Gigabyte has done that with that one :-)

Posted on 2019-04-10 19:47:33

Thanks Don. I agree that the Titan Xp has been rock solid overclocked, and I would prefer the Titan V, but it is twice the money and I would like to get 4 cards. I am having a hard time deciding which GPU card to buy, as I really should get the Titan series since they run really well under Windows.
I may start calculating whether I really need 4 cards or can just use 2 like I am currently doing. Some of the Monte Carlo benchmarks I have seen with the 2080Ti indicate double the performance compared to the 1080Ti!!! I have just been reviewing a new workstation/server motherboard from Supermicro, the X11SPA-TF, which runs a single LGA3647 socket with 12 DIMMs and has been updated for Cascade Lake Xeons. It definitely fits the bill for a server, as it also includes 4 x16 PCIe slots for quad GPU cards as well as quad M.2 and 10G networking. As much as I like the AMD EPYC, I feel way more comfortable buying an Intel Xeon for server duties, and the Cascade Lake specs look much better as they include Optane DIMM capability.

...but which gpu card(s) should i buy?? Need to make up my mind within the next 6 months... 2 Titan RTX or 4 2080ti blower cards.

Posted on 2019-04-10 21:36:46

Don, just thought I would let you know that I reduced the need for 4 GPU cards and I can now get the same real-time Monte Carlo results with only 2 GPU cards. This is really fabulous, as it opens up many cheaper and/or faster computer configurations down the road. For right now, though, I will be ordering 2 2080Ti blower cards (Asus probably) by the end of May, as I am getting my Monte Carlo simulation results in around 1.3 seconds per calculation run, which is amazing!! I calculate that once I use dual 2080Ti cards, I should be able to get closer to 500ms per calculation run, which will enable me to keep up with the market in real time. (calculation run = 500,000 simulations x 4096 batches x 2 cards ~= 4 billion threads)

Will continue to keep you posted...

Posted on 2019-04-20 20:30:40
Donald Kinghorn

Nice! Getting your code optimized to 2 cards should indeed make configurations easier and you will be getting good utilization out of the hardware.

Posted on 2019-04-22 17:18:44

In the test of peer-to-peer performance on the GTX and RTX cards, it became clear that the RTX cards didn't support P2P. Fortunately the Asus WS C422 SAGE is a single-CPU board, and thus single root, so the 4 RTX 2080Ti's without NVLINK still have good results with ResNet. Would that score also be good with an EPYC CPU and an Asrock EPYCD8-2T? (It is also single CPU, thus single root, but a 4-die-in-1 CPU with Infinity Fabric in between and, if I'm not mistaken, the PCIe lanes divided into 4 groups.)

Some people are stoked about the 128 pcie lanes, but how does it perform in a real case scenario?

Posted on 2019-04-02 19:50:41
Donald Kinghorn

I like the C422 SAGE a lot. I do want to do some single vs multi-root testing on dual socket boards. Dual root can cause difficulties because of the mem access patterns but, it is becoming more common to see programs that will scale multi-GPU and multi-node. Things like Horovod (MPI) probably negate a lot of the issues that we associate with dual-root PCI complexes. ... will probably test some of that ...

That Asrock board looks nice! It has been surprising to me that we haven't seen more boards like that. (A lot of those PCIe lanes get used up in the die cluster, but it still seems like doing a board like that is an obvious thing to do!) I don't know if I'll get to test with any EPYC stuff or not. We have tried to get testing samples from AMD and never got a response from them. Maybe with the new processors coming out we will be able to get something for testing.

Posted on 2019-04-03 21:07:18

I ran these tests on the EPYCD8-2T (7441P, 8x16GB 2666 RAM) with 4x 1080ti's, ResNet50 fp32 scores 803 (vs 760 here). The additional 5% is probably due to a higher factory clock on the cards (since the single gpu test is also 5% faster), the most important part is that it's not slower.

The configuration in this post is technically also not a single root setup (and neither is the EPYC), search the article on STH "How Intel Xeon Changes Impacted Single Root Deep Learning Servers" to get an explanation. It would be interesting to see the results with an ASUS ESC 4000/8000 G4 or similar to see the performance of a true single root setup. On the newer Intel CPU's they need to split(PLX) a single 16x connection to get the best p2p bandwidth and latency (unless the cards support nvlink ofc).

A motherboard like the one in the ESC8000 with a single EPYC CPU could have 10 GPUs on a single root with 32 lanes to the CPU (32 lanes per NUMA), which would make the best possible consumer-based GPU server. Rome might remove the need for PLX chips altogether and be a true single root 128 lane(+) CPU...

Posted on 2019-04-27 07:07:30

Hi Donald, which type of NVLINK did you use in the testing? the rtx 6000, or GV100? Thank you very much!

Posted on 2019-04-05 22:12:23
Donald Kinghorn

The bridge I used is in a photo below. It's funny, we have a bunch of these sitting around in "labs". I just grabbed a couple, asked if they worked on RTX, and put them on. I think these were the newer RTX ones (double space).
My colleague William George wrote a short post on compatibility of the various types of bridges, it has a very handy table in it!

Posted on 2019-04-08 16:44:05

The bridges which work best for putting RTX cards next to each other (as is required if you are doing four GPUs) are the Quadro RTX 6000 / 8000 bridges. NVIDIA doesn't make the GeForce or Titan branded bridges in a 2-slot width, and the RTX 5000 uses a physically shorter connector (so it doesn't work with the other models). I believe the GV100 bridges would also work, but we don't have any of those on hand to test... and they cost a lot more too, so unless you've already got some there is no reason to spend more on those (unless you are specifically using GV100 cards, of course).

Posted on 2019-04-08 16:58:24
James Emmanuel Jones

Does the 1660 Ti actually have usable Tensor cores? I thought they were disabled/not included as that's the budget card for the series

Posted on 2019-07-11 16:00:24
Donald Kinghorn

That's a good question. It doesn't look very good ... on ResNet-50

fp32 119 img/sec fp16 152 img/sec which is less than 30% speed-up

2070 gives
fp32 191 fp16 338 which is closer to 80% speed-up

No Tensor cores on the 1660Ti! The RT cores and Tensor cores are gone on that arch, TU116. I didn't know that when I did the post. The speed-up with fp16 on the 1660Ti is probably just memory efficiency from the smaller "word" size.
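For anyone who wants to check those percentages, here is a quick sketch using the images/second numbers quoted above. The function and variable names are just for illustration.

```python
def speedup_percent(fp32_rate: float, fp16_rate: float) -> float:
    """Percentage gain in throughput when moving from fp32 to fp16."""
    if fp32_rate <= 0:
        raise ValueError("fp32 rate must be positive")
    return (fp16_rate / fp32_rate - 1.0) * 100.0


# (fp32 img/sec, fp16 img/sec) for ResNet-50, as quoted in this comment
results = {
    "GTX 1660 Ti": (119, 152),
    "RTX 2070": (191, 338),
}

for gpu, (fp32, fp16) in results.items():
    print(f"{gpu}: {speedup_percent(fp32, fp16):.0f}% faster with fp16")
```

That works out to roughly 28% for the 1660Ti and 77% for the 2070.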

There is a good write-up on AnandTech https://www.anandtech.com/s...

Posted on 2019-07-11 19:43:08
birbfakes - deepfaker

Do you think that the 2060 Super should compare to the 2070 in these benchmarks? 8GB VRAM and more tensor cores is a nice boost I think.
Also would the NVLink on 2070 Super be worth getting over the 2060 Super without? Thinking of making a new dedicated setup using 3 cards

Posted on 2019-08-21 17:38:57
Donald Kinghorn

I haven't tested the 2060 Super but my guess is that it started life as a 2070 :-) I would expect the performance to be close to the 2070. The first attraction of the 2070 Super for me was NVLINK but even without that, the original 2070 is a very nice card. So yes, the 2060 Super is a sweet deal. 3 of those vs 1 2080Ti ...

For your new setup the main things to consider are: is 8GB enough for your input batches? Will your code and job scale well to 3 cards? And will you take a hit because of the lack of P2P? If your application will scale data parallel (like the TF CNN using Horovod) then P2P probably won't matter much (it would be nice to have all 3 cards on X16, though, since all communication has to go through CPU space). Also, consider the power draw of 3 cards; you'll want a pretty beefy PSU. But, yes, 3 of the 2060 Supers could be really nice!

I just realized this is not my most recent post with these GPU's ... I did the same testing as this post adding in the 2070 Super
https://www.pugetsystems.co... be sure to have a look at that if you haven't already.

Posted on 2019-08-21 22:34:15