Read this article at https://www.pugetsystems.com/guides/1386
Dr Donald Kinghorn (Scientific Computing Advisor)

TensorFlow Performance with 1-4 GPUs -- RTX Titan, 2080Ti, 2080, 2070, GTX 1660Ti, 1070, 1080Ti, and Titan V

Written on March 14, 2019 by Dr Donald Kinghorn


This post is an update and expansion of much of the GPU testing I have been doing over the last several months. I am using current (as of this post date) TensorFlow builds from NVIDIA NGC and the most recent display driver, and I have results for up to 4 GPUs, including NVLINK, with several of the cards under test. This is something that I have been promising to do!

Test system


  • Puget Systems Peak Single (I used a test-bed system with components that we typically use in the Peak Single configured for Machine Learning)
  • Intel Xeon-W 2175 14-core
  • 128GB Memory
  • 2TB Intel 660p NVMe M.2
  • RTX Titan (1-2), 2080Ti (1-4), 2080 (1-4), 2070 (1-4)
  • GTX 1660Ti (1-2), 1080Ti (1-4)
  • Titan V (1-2)


For details on how I have Docker/NVIDIA-Docker configured on my workstation, have a look at the following post along with the links it contains to the rest of that series: How-To Setup NVIDIA Docker and NGC Registry on your Workstation - Part 5 Docker Performance and Resource Tuning

TensorFlow Multi-GPU performance with 1-4 NVIDIA RTX and GTX GPU's

This is all fresh testing using the updates and configuration described above. Hopefully it will give you a comparative snapshot of multi-GPU performance with TensorFlow in a workstation configuration.

CNN (fp32, fp16) and Big LSTM job run batch sizes for the GPU's

Batch size does affect performance, and larger sizes are usually better. The batch size is limited by the amount of memory available on the GPU. "Reasonable" values that would run without giving "out of memory" errors were used. Multi-GPU jobs used the same batch-size settings as single-GPU jobs, since they are set per process. That means the "effective" batch sizes are multiples of the per-GPU batch size, since the jobs are "data parallel". The batch size information for the different cards and job types is in the table below.
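As a quick sketch of that arithmetic, the effective batch size for a data-parallel run is just the per-process batch size times the number of GPUs:

```python
# Data-parallel jobs run the same per-process batch size on each GPU,
# so the effective batch size is a simple multiple.
def effective_batch_size(per_gpu_batch: int, num_gpus: int) -> int:
    return per_gpu_batch * num_gpus

# e.g. ResNet-50 fp32 on 4 x RTX 2080 Ti with batch_size=64 per GPU:
print(effective_batch_size(64, 4))  # -> 256
```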

CNN [ResNet-50] fp32, fp16 and RNN [Big LSTM] job Batch Sizes for the GPU's tested

GPU           ResNet-50 FP32   ResNet-50 FP16 (Tensor-cores)   Big LSTM
              batch size       batch size                      batch size
RTX Titan     192              384                             640
RTX 2080 Ti   64               128                             448
RTX 2080      64               128                             256
RTX 2070      64               128                             256
GTX 1660 Ti   32               64                              128
Titan V       64               128                             448
GTX 1080 Ti   64               N/A                             448
GTX 1070      64               N/A                             256

TensorFlow CNN: ResNet-50

Docker container image tensorflow:19.02-py3 from NGC,

docker run --runtime=nvidia --rm -it -v $HOME/projects:/projects nvcr.io/nvidia/tensorflow:19.02-py3

Example command lines for starting jobs,

# For a single GPU
python resnet.py  --layers=50  --batch_size=64  --precision=fp32

# For multi-GPU's 
mpiexec --allow-run-as-root -np 2 python resnet.py  --layers=50  --batch_size=64  --precision=fp32


  • Setting --precision=fp16 means "use Tensor-cores".
  • --batch_size= batch sizes are varied to take advantage of the available memory on the GPUs.
  • Multi-GPU in this version of the CNN docker image uses "Horovod" for parallel execution. That means it is using MPI, in particular the OpenMPI build in the container image. The numbers in the charts for 1, 2 and 4 GPUs show very good parallel scaling with Horovod, in my opinion!
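The "very good parallel scaling" claim can be quantified with the usual efficiency ratio. A minimal sketch (the throughput numbers here are hypothetical, for illustration only, not taken from the charts):

```python
# Parallel scaling efficiency: multi-GPU throughput divided by the
# ideal (single-GPU throughput times the number of GPUs).
def scaling_efficiency(single_gpu_rate: float, multi_gpu_rate: float,
                       num_gpus: int) -> float:
    return multi_gpu_rate / (single_gpu_rate * num_gpus)

# Hypothetical ResNet-50 throughputs (images/sec), for illustration only:
print(f"{scaling_efficiency(300.0, 1140.0, 4):.0%}")  # -> 95%
```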

[ResNet-50 fp32] TensorFlow, Training performance (Images/second) with 1-4 NVIDIA RTX and GTX GPU's


[ResNet-50 fp16] TensorFlow, Training performance (Images/second) with 1-4 NVIDIA RTX and GTX GPU's


The charts above mostly speak for themselves. One thing to notice for these jobs is that the peer-to-peer communication advantage of using NVLINK has only a small impact. That will not be the case for the LSTM job runs.

TensorFlow LSTM: Big-LSTM 1 Billion Word Dataset

Docker container image tensorflow:19.02-py3 from NGC,

docker run --runtime=nvidia --rm -it -v $HOME/projects:/projects nvcr.io/nvidia/tensorflow:19.02-py3

Example job command-line,

python single_lm_train.py --mode=train --logdir=./logs --num_gpus=2 \
    --datadir=./data/1-billion-word-language-modeling-benchmark-r13output/ \
    --hpconfig run_profiler=False,max_time=240,num_steps=20,num_shards=8,num_layers=2,...


  • --num_gpus= and batch_size= are the only parameters changed for the different job runs.

[Big LSTM] TensorFlow, Training performance (words/second) with 1-4 NVIDIA RTX and GTX GPU's



  • Batch size and GPU-to-GPU (peer-to-peer) communication have a significant impact on performance with this recurrent neural network. The higher-end GPUs have the advantage of both more compute cores and larger memory spaces for data and instructions, as well as the possibility of using high-performance NVLINK for communication.
  • NVLINK significantly improved performance. The improvement was apparent even when using 2 NVLINK pairs with 4 GPUs. I was a little surprised by this, since I expected it to bottleneck on the "memcpy" to CPU memory space needed between the remaining non-NVLINK-connected pairs.

Happy computing --dbk

Tags: Multi-GPU, TensorFlow, RTX, GTX, Machine Learning, NVIDIA

Very nice benchmark, and just in a perfect time I was looking for such comparison.
The only thing I slightly miss is Tesla K40 and K80 on those graphs.
Could you explain to me what the purpose of those two cards is? All the benchmarks I found suggest they are worse than a 1080, but the price says they should be fantastic. I'm confused.

Posted on 2019-03-15 14:35:17

The K40 and K80 are very old cards at this point. With many Quadro and Tesla models you can tell the generation by the letter at the start of the model number. K = Kepler, which was before Maxwell, Pascal, Volta, and the latest Turing (RTX series) GPUs. That puts it several generations back, and so not really able to hold its own against modern cards.

Tesla cards in general are compute-focused GPUs, which don't have video outputs since they are built for compute workloads instead of actually displaying graphics. They often share similar specs with some of the high-end Quadro cards in the same generation, but may come in passive versions designed for use in very specialized rackmount chassis. Like Quadro cards, they are usually a lot more expensive than GeForce cards with similar performance - but usually have more VRAM and may have other features like better FP64, ECC memory, etc.

Posted on 2019-03-15 16:43:16
Donald Kinghorn

like William said :-)

The K80 was a "workhorse" dual GPU card, that and the K40 (single GPU) really established the NVIDIA platform for compute. There are still big clusters running these cards, but it's debatable whether they are worth the power consumption given that the newer cards deliver so much more performance per watt.

They are out of date. Any 1070 or 2070 or higher GPU will be much faster. Also, those were "compute capability" 3.5 and 3.7; Volta and Turing are at 7.0 and 7.5. People are building software now that does not support anything older than 5.0 or 6.0 (Maxwell and Pascal). The latest CUDA 10.1 does still support Kepler (3.5), and TensorFlow still supports 3.5, but I don't expect this to be the case once the big legacy systems get shut down.

I would not have been able to test the K40 using the NGC docker images. NGC only supports compute capability 6.0 or greater!

The K40 is comparable to the original Titan (or Titan Black) and the K80 is like a Titan Z. Those were/are nice cards but not worth considering for any new builds.

Posted on 2019-03-15 18:41:15
Mark Johnstone

Great work as always. I would be interested to see these graphs normalized to GPU cost.

Posted on 2019-03-16 03:31:25
Donald Kinghorn

Yes! That is interesting to look at, but it can be scary because some of the GPUs are expensive! It can be hard to decide on what is going to give you the best value. I do like the 2080Ti a lot. I wish it was a bit less expensive, but it is what I would recommend for most uses. However, any recent NVIDIA GPU "greater" than the 1070 is nice!

Posted on 2019-03-18 01:36:53

Thank you very much for sharing these informative benchmarks!
I am most interested in your results with four RTX 2080Ti GPUs in combination with NVlink adapters. Exactly which type of adapter did you use? What is the spacing of the GPUs on your motherboard, two slot distance?

Posted on 2019-03-28 12:56:11
Donald Kinghorn

Yes! That was more effective than I expected. I took a photo of the test bench I was using during that testing. Note that these are the Gigabyte cards which I like a lot for this use since they have a slight bevel at the back of the card to lower the pressure for air moving to the blower. It's a nice design touch.

Posted on 2019-03-29 15:34:06
Donald Kinghorn

Those actually look like they might be the Quadro bridges ?? My colleague William George wrote up a post with a nice table showing the compatibility of the different bridges https://www.pugetsystems.co...

Posted on 2019-03-29 15:43:33

That looks very promising, thanks for the details!

Posted on 2019-03-30 11:18:42

what model Gigabyte card is that??

Posted on 2019-04-05 15:29:49
Donald Kinghorn

... let me see if I can find it on our parts page ...

I do like that card. We gave them positive feedback on it and requested they shorten the shroud a couple mm so it will fit in a couple of other chassis easier.

Posted on 2019-04-08 16:32:23

thanks for the update... i have been looking at the Asus 2080ti Turbo model as i have a 1080ti Turbo model already; would this be comparable to the Gigabyte card? The Asus too looks to be made especially for multi-gpu setups (blower cards).

Posted on 2019-04-08 16:39:55
Donald Kinghorn

We have been using those too (we use a variety because of supply issues). They have also been good; I don't think they would give you any trouble. I do prefer the Gigabyte because of the air ramp and more aggressive fan, but really, as long as you have good air movement in the chassis either of them is good. ...EVGA is nice too, but we did have a bad batch of them come through near the first release... historically they have been great.

Posted on 2019-04-09 15:26:30

Gigabyte also has an Aorus 2080ti Turbo model, but it does not seem to be available retail yet. It looks like a more performance-oriented blower version, as its base and boost clocks are much higher...

Posted on 2019-04-09 19:06:38
Donald Kinghorn

That is probably a great card, but I will caution you a bit ... Your workload will put a lot of stress on the cards, and "historically" overclocked cards have a much higher failure rate with compute loads (often from memory overheating). That's just a caution; for the last few generations the NVIDIA cards have been amazingly solid, even overclocked. I think manufacturers get tired of supporting RMAs and add more protections on the boards, and it sounds like Gigabyte has done that with this one :-)

Posted on 2019-04-10 19:47:33

Thanks Don. I agree that the Titan Xp has been rock solid overclocked, and I would prefer the Titan V, but it is twice the money and I would like to get 4 cards. I am having a hard time deciding which gpu card to buy, as I really should get the Titan series since they run really well under windows.
I may start calculating if I really need 4 cards and just use 2 like I am currently doing. Some of the Monte Carlo benchmarks I have seen with the 2080ti indicate double the performance compared to the 1080ti!!! I have just been reviewing a new workstation/server motherboard from Supermicro, the X11SPA-TF, which runs a single socket LGA3647 with 12 DIMMs and has been updated for Cascade Lake Xeons. It definitely fits the bill for a server, as it also includes 4 x16 PCIe slots for quad gpu cards as well as quad m.2 and 10G networking. As much as I like the AMD Epyc, I feel way more comfortable buying an Intel Xeon for server duties, and the Cascade Lake specs look much better as they include Optane DIMM capability.

...but which gpu card(s) should i buy?? Need to make up my mind within the next 6 months... 2 Titan RTX or 4 2080ti blower cards.

Posted on 2019-04-10 21:36:46

Don, just thought I would let you know that I reduced the need for 4 gpu cards and I can now get the same real time Monte Carlo results with 2 gpu cards only. This is really fabulous as it opens up many cheaper and/or faster computer configurations down the road. For right now though, I will be ordering 2 2080Ti blower cards (Asus probably) by the end of May as I am getting my Monte Carlo simulations results in around 1.3 seconds per calculation run which is amazing!! I calculate that once i use dual 2080Ti cards, i should be able to get closer to 500ms per calculation run which will enable me to keep up with the market in real time. (calculation run = 500,000 simulations x 4096 batches x 2 cards ~= 4 Billion threads)

Will continue to keep you posted...

Posted on 2019-04-20 20:30:40
Donald Kinghorn

Nice! Getting your code optimized to 2 cards should indeed make configurations easier and you will be getting good utilization out of the hardware.

Posted on 2019-04-22 17:18:44

In the test with peer-to-peer performance on the GTX and RTX cards it became clear that the RTX didn't support P2P. Fortunately the Asus WS C422 SAGE is a single CPU, thus single root, so the 4 RTX 2080ti without NVLINK still get good results with ResNet. Would that score also be good if you'd take an EPYC CPU and an Asrock EPYCD8-2T? (also single CPU, thus single root, but a 4-die-in-1 CPU with infinity fabric in between, and if I'm not mistaken, also dividing the PCIe lanes into 4 groups)

Some people are stoked about the 128 pcie lanes, but how does it perform in a real case scenario?

Posted on 2019-04-02 19:50:41
Donald Kinghorn

I like the C422 SAGE a lot. I do want to do some single vs multi-root testing on dual socket boards. Dual root can cause difficulties because of the mem access patterns but, it is becoming more common to see programs that will scale multi-GPU and multi-node. Things like Horovod (MPI) probably negate a lot of the issues that we associate with dual-root PCI complexes. ... will probably test some of that ...

That Asrock board looks nice! It has been surprising to me that we haven't seen more boards like that. ( a lot of those PCIe lanes get used up in the die cluster but still it seems like doing a board like that is an obvious thing to do!) I don't know if I'll get to test with any EPYC stuff or not. We have tried to get testing samples from AMD and never got response from them. Maybe with the new processors coming out we will be able to get something for testing.

Posted on 2019-04-03 21:07:18

I ran these tests on the EPYCD8-2T (7441P, 8x16GB 2666 RAM) with 4x 1080ti's, ResNet50 fp32 scores 803 (vs 760 here). The additional 5% is probably due to a higher factory clock on the cards (since the single gpu test is also 5% faster), the most important part is that it's not slower.

The configuration in this post is technically also not a single root setup (and neither is the EPYC), search the article on STH "How Intel Xeon Changes Impacted Single Root Deep Learning Servers" to get an explanation. It would be interesting to see the results with an ASUS ESC 4000/8000 G4 or similar to see the performance of a true single root setup. On the newer Intel CPU's they need to split(PLX) a single 16x connection to get the best p2p bandwidth and latency (unless the cards support nvlink ofc).

A motherboard like in the ESC8000 with a single EPYC CPU could have 10 GPUs on a single root with 32 lanes to the CPU (32 lanes per NUMA node), which would make the best possible consumer-based GPU server. Rome might remove the need for PLX chips altogether and be a true single-root 128-lane(+) CPU...

Posted on 2019-04-27 07:07:30

Hi Donald, which type of NVLINK did you use in the testing? the rtx 6000, or GV100? Thank you very much!

Posted on 2019-04-05 22:12:23
Donald Kinghorn

The bridge I used is in a photo below. It's funny, we have a bunch of these sitting around in "labs" I just grabbed a couple asked if they worked on RTX and put them on. I think these were the newer RTX ones (double space).
My colleague William George wrote a short post on compatibility of the various types of bridges, it has a very handy table in it!

Posted on 2019-04-08 16:44:05

The bridges which work best for putting RTX cards next to each other (as is required if you are doing four GPUs) are the Quadro RTX 6000 / 8000 bridges. NVIDIA doesn't make the GeForce or Titan branded bridges in a 2-slot width, and the RTX 5000 uses a physically shorter connector (so it doesn't work with the other models). I believe the GV100 bridges would also work, but we don't have any of those on hand to test... and they cost a lot more too, so unless you've already got some there is no reason to spend more on those (unless you are specifically using GV100 cards, of course).

Posted on 2019-04-08 16:58:24
James Emmanuel Jones

Does the 1660 Ti actually have usable Tensor cores? I thought they were disabled/not included as that's the budget card for the series

Posted on 2019-07-11 16:00:24
Donald Kinghorn

That's a good question. It doesn't look very good ... on ResNet-50

fp32 119 img/sec fp16 152 img/sec which is less than 30% speed-up

2070 gives
fp32 191 fp16 338 which is closer to 80% speed-up

No Tensor-cores on the 1660Ti! The RT cores and Tensor-cores are gone on that arch, TU116. I didn't know that when I did the post. The speed-up with fp16 on the 1660Ti is probably just memory efficiency from the smaller "word" size.
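Checking that speed-up arithmetic explicitly, with the img/sec numbers quoted above:

```python
# fp16 vs fp32 speed-up from the ResNet-50 img/sec numbers quoted above
def speedup(fp32_rate: float, fp16_rate: float) -> float:
    return fp16_rate / fp32_rate - 1.0

print(f"GTX 1660 Ti: {speedup(119, 152):.0%}")  # -> 28%, under 30%
print(f"RTX 2070:    {speedup(191, 338):.0%}")  # -> 77%, close to 80%
```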

There is a good write-up on AnandTech https://www.anandtech.com/s...

Posted on 2019-07-11 19:43:08
Apology Savarkar

Lambdalabs got a +44% speedup on average, FP16 vs FP32, on the 2080Ti.

Posted on 2020-03-08 08:48:14
birbfakes - deepfaker

Do you think that the 2060 Super should compare to the 2070 in these benchmarks? 8GB VRAM and more tensor cores is a nice boost I think.
Also would the NVLink on 2070 Super be worth getting over the 2060 Super without? Thinking of making a new dedicated setup using 3 cards

Posted on 2019-08-21 17:38:57
Donald Kinghorn

I haven't tested the 2060 Super but my guess is that it started life as a 2070 :-) I would expect the performance to be close to the 2070. The first attraction of the 2070 Super for me was NVLINK but even without that, the original 2070 is a very nice card. So yes, the 2060 Super is a sweet deal. 3 of those vs 1 2080Ti ...

For your new setup the main things to consider are: is 8GB enough for your input batches? Will your code and job scale well to 3 cards? And will you take a hit because of the lack of P2P? If your application will scale data-parallel (like the TF CNN using Horovod) then P2P probably won't matter much (it would be nice to have all 3 cards on X16 though, since all communication has to go through CPU space). ...also, consider the power draw of 3 cards; you'll want a pretty beefy PSU. ...but, yes, 3 of the 2060 Supers could be really nice!

I just realized this is not my most recent post with these GPU's ... I did the same testing as this post adding in the 2070 Super
https://www.pugetsystems.co... be sure to have a look at that if you haven't already.

Posted on 2019-08-21 22:34:15
Lê Khắc Phương

Hi Don, thank you for your benchmark, I learn a lot from that.
In this article, your CPU has 48 PCIe lanes, so 4x GPUs can run at x16/x8/x16/x8, but I see the 1TB NVMe also (which uses 4 PCIe lanes), so your 4x GPUs may run at x16/x8/x8/x8 instead.
1. Can you confirm it?
2. And does x8/x8 or x16/x8 matter to run with nvlink?
3. I have two option: AMD Threadripper 2950x (support x16/x8/x16/x8) and Intel I9 9940x (support x16/x8/x8/x8) to work with 4x RTX 2080ti. Which should i choose?

Posted on 2019-11-01 02:34:38
Donald Kinghorn

You are welcome :-) I have a long but hopefully helpful reply for you...

1) The motherboard I used was the ASUS C422 Sage; it provides X16/X16/X16/X16, which is why we use it. There are 48 lanes from the CPU and 24 lanes from the C422 chipset, but the board does also use a PCIe switch (PLX) on the last 2 X16 slots. PCIe data flow looks very similar to network traffic. All 4 slots measure bandwidth as full X16, and latency is mostly as expected.

[ PLX switches work well but are difficult to implement for board makers and do sometimes fail. Expect us to have a 4 x X16 system with no PLX switches soon :-) ]

2) NVLINK is direct GPU-GPU peer-to-peer (P2P), but with RTX only for 2 GPUs at a time.
X8 vs X16 will have some impact when traffic is between 2 pairs of GPUs or goes to CPU memory. The 20xx RTX cards do not have P2P over PCIe, so they have to go back through CPU space to communicate if you don't have NVLINK. The RTX Quadros and the older 10xx cards did have P2P over PCIe. (This is why I generally recommend NVLINK bridges with the RTX 2080Ti cards.)
However, the PCIe impact is usually small. Take a look at these posts ...

P2P peer-to-peer on NVIDIA RTX 2080Ti vs GTX 1080Ti GPUs https://www.pugetsystems.co...

PCIe X16 vs X8 with 4 x Titan V GPUs for Machine Learning https://www.pugetsystems.co...

3) You should be able to get an X299 motherboard with 2 X16. The Gigabyte board we use supports 2 x X16 but only 3 GPUs total (again, you get lanes from the chipset).

The TR 2950x is a great CPU, but we have had significantly more trouble with failures (motherboard and CPU) than we have had with Intel Core-X. (The rumor is that there will be a nice price cut on Core-X sometime too :-)

Given the choice I would go with the Intel 9940X (but wait a bit) because the platform is very solid. Also, Core-X coupled with AVX512 optimized libs like MKL is just outstanding.

If you are doing 4 x 2080Ti, go ahead and get the 2 NVLINK bridges. You will likely get a performance increase relative to the extra cost. I think it's worth it.

Since I have mentioned AMD ... I would not get "last gen" Threadripper at this time. However, just to let you know, the Ryzen Zen2 CPUs and motherboards are looking really good (there were some early MB troubles that seem to have been resolved) ... AND, I am really looking forward to testing the new Threadripper!

Another thing about the CPU's. They don't make much difference with GPU workloads unless you are doing something that has a heavy portion of CPU load along with the GPU acceleration (like molecular dynamics ). Those 4 x 2080Ti's will be wonderful on any platform :-) [be sure you have enough memory! 4 2080Ti's will be happiest if you have 128GB of CPU memory for buffering and pinning. That is a strong recommendation from me and NVIDIA]

Posted on 2019-11-01 15:54:50
Mark Palatucci

Hi Donald - I've had a little trouble replicating these multi-GPU results on Horovod using the same container. I've got 2080 Ti's (founder's edition) cards running on: Asus X299 SAGE, 9820X (10 core 20 thread), 64 GB, 1 TB 970 Pro, and 2x 2080 Ti Founder's Edition. Linux is 18.04 with 5.0.0-32 kernel and Nvidia 430 driver. I get slightly better performance single GPU on FP32, around 310 imgs/second. I suspect this might be because the cards are watercooled, and nvidia-smi dmon shows clocks around 1995 mhz since temps remain around 42 degrees C under full load.

For multi-GPU, if I run the cmd above for fp16 and 2 gpus, I get horrible scaling with around 550 imgs/second. If I add the params "-bind-to none -map-by slot" to the MPI command I get around 720 imgs/sec. And then if I increase the batch on command line to 128, then I'll get closer to the 910 img/second number that you have above for 2x 2080 Tis with no NVLink.

I didn't realize that the bind/map commands could make such a difference. Saw this on a Uber slide about horovod and gave it a try and it had a big impact.

Posted on 2019-11-04 21:49:38
Donald Kinghorn

Hi Mark, I'm pretty sure I did use batch size of 128 on the 12GB GPU's. That's "my bad" for not being more explicit in what parameters I used for the test. I some times just forget to put in the posts. It always frustrates me when I go back to check on something and realize that I didn't record it. Getting the batch sizes near their max before the dreaded OOM message usually increases performance (for benchmarking)

Different systems can respond quite differently to MPI binding flags. I didn't set any flags when I ran these jobs but it is good idea to experiment.

Thanks for the heads up on that! I'm particularly curious about the "map-by slot"

Posted on 2019-11-05 21:11:42
Eric C. Bohn

Can you share some specific instances where 8GB VRAM on a 2060/2070/2080 Super isn't enough? There don't seem to be clear examples out there. In benchmarks it seems people use the same batch size of 64 for these GPUs. I don't know the performance implications of mini-batching, but I've read it can be an effective means of managing memory on large models.

Posted on 2020-01-10 16:00:54
Donald Kinghorn

Yes, this is a great question! There are three main places where the available mem becomes limiting

- batch size
- model size (number of parameters)
- Input data feature size

A typical example is training ResNet-50 with the "normal" image size of 224x224 (x3 RGB), approx. 25 million parameters ...

Batch size is pretty obvious, and it is a meta-parameter that you optimize for model training. Bigger usually means faster convergence but may or may not give better generalization in the fit. ...but bigger is usually better.
... let me test something. OK, on a Titan V 12GB, ResNet-50 fp32 batch_size=64 uses 9452MiB (according to nvidia-smi). I think this runs on a 2070 with 8GB too, but memory is allocated a little more conservatively (TF tries to run with less memory when it has to). I can use batch 96, but 128 fails (batch_size=96 gives 318 img/sec, batch_size=64 gives 300 img/sec).

If I try to run ResNet-101 (101 layers) at fp32 with batch_size=96, it fails with Out-Of-Memory (OOM). That runs at batch_size=64, but ResNet-152 fails with OOM.
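As a very rough sketch of where the memory goes (the byte counts here are simplistic assumptions made for illustration; real allocations also include optimizer state, cuDNN workspace, and framework overhead), the parameter storage itself is comparatively small, and the activation memory that scales with batch size dominates:

```python
# Very rough training-memory sketch (illustrative assumptions, not a profiler):
# fp32 weights plus gradients (4 bytes each), and activation memory
# that grows linearly with batch size.
def rough_train_mem_gb(num_params: float, act_bytes_per_sample: float,
                       batch_size: int) -> float:
    param_bytes = num_params * 4 * 2               # weights + gradients, fp32
    act_bytes = act_bytes_per_sample * batch_size  # scales with batch size
    return (param_bytes + act_bytes) / 1024**3

# ResNet-50: ~25.6M parameters; assume ~100 MiB of activations per sample
# (an assumed figure, chosen only to show activations dominating).
print(f"{rough_train_mem_gb(25.6e6, 100 * 1024**2, 64):.1f} GB")  # -> 6.4 GB
```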

Input data feature size is a big problem! What if you want to use larger input images? 224x224 is pretty small! What if you have 3D data?

I have done some testing with larger input features and will be doing a post on input data size limits when PNY sends us a Quadro RTX 8000 48GB to finish the testing.

Not having enough GPU memory for the problem you want to solve is a "show stopper". A lot of research work goes into trying to run larger problems with limited memory. Think about the size of some medical image data! 3D MRI scans or microscope tissue images ...

Posted on 2020-01-10 17:37:59
Eric C. Bohn

Thank you Donald, your replies are much appreciated!

In your example of the Titan V 12GB on ResNet-50 with a batch size of 64, what is the memory use in fp16? 6300MiB-ish?

3D images aren't out of scope of things I could be working with. Is even 11GB memory (2080 Ti) not enough to be useful with 3D images? I imagine at some point I'm not going to be able to reasonably work on a single 4 (2080 Ti or 2060 Super) GPU box.

Posted on 2020-01-10 18:03:39
Donald Kinghorn

Interestingly, at fp16 with batch size 64 it still shows 9450MiB (I think more memory is being used because it can be) ... at fp16, batch size 128 is no problem and it still shows the same memory usage!

At batch size 192 nvidia-smi still shows 9450MiB but TF starts to complain with

2020-01-10 23:35:43.972123: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.10GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

At batch size 256 it fails with OOM

It looks like analyzing this is more complicated than I had hoped :-) I may need to increase the resolution of the smi output or use some other method of debugging.

Note: what I just did was with TF 1.13 (same as what was in this post). I'm sure things are different in 2.0, and I'm sure there are tools for working with larger inputs that I'm not familiar with. The frameworks keep getting better! Figuring out exactly what the limits are in a 4-GPU workstation will be really interesting! This testing just got bumped up in priority for me :-)

Posted on 2020-01-11 00:02:45
Eric C. Bohn

I look forward to your tests!

Posted on 2020-01-11 04:07:42

Hi guys! I need some help. I'm a newbie in Machine Learning, but I want to start learning. Can you help me with a hardware configuration for a workstation, not an expensive one, but a "smart" one? I want to try an image recognition project, and maybe train some small niche translation models. I was thinking of 4 used Nvidia CUDA GPUs (from miners), but which processor, RAM, and HDD should I use? Point me in some direction. Thank you!

Posted on 2020-01-21 23:16:12
Donald Kinghorn

Hey Alex, For a good setup to get started with and do some serious work just think of a modest "gaming rig". Work out a budget to include at least an RTX 2070super. (you could go with 2060super too to save a bit of $). It's good to start with 1 GPU to keep your dev work simple. (if you are looking for used, then a 1080Ti is a great card if you can score a good deal)

You can call our sales folks if you want to get something from us, and they will be able to put together a good quote for you. If you want to do your own build you can look at our config for a Genesis 1 https://www.pugetsystems.co... (That default sys would be really nice!) Something that is actually closer to a "gaming-rig" config that would still be very nice as an ML workstation would be a "Spirit" https://www.pugetsystems.co... again, that default would be pretty decent.

Then if you want to use Windows be sure to read https://www.pugetsystems.co... Then learn some Python: scikit learn, numpy, and then dive into TensorFlow (Keras) and try some projects

Wishing you the best for your new learning endeavor! This is a great field of study and the possibilities are endless!

Posted on 2020-01-22 01:45:17

Thank you for your reply. I think I can get a 1080ti for $370 per card. I see that the option you propose can't hold 4 cards at a time? Also, would an AMD processor be better for my purposes?

Posted on 2020-01-22 21:32:16
Donald Kinghorn

that's pretty good on the 1080Ti's those are great cards. I still recommend staying with 1 or maybe 2.

To get full 4 x X16 you currently have to move up to something like Xeon-W on a C422 chipset, and you will have a PLX switch to get the 4th X16 PCIe slot. You can use a board that will give you 4 x X8, and you honestly shouldn't see much of a performance hit.

We will be offering a (really nice!) sys with Xeon 64L that will have 4 x X16 without PLX but this is a significant increase in price.

The new AMD stuff is great. (Intel is supported better with software ... it's getting better for AMD though.) You would still unfortunately be limited to a max of 3 X16, and that's with TR. But again, you could go with a board with 4 slots at X8 and be OK. I wouldn't hesitate to recommend AMD. Check out some of my recent posts ... and expect a new one soon :-)

Posted on 2020-01-23 00:29:26
Apology Savarkar

Thanks a lot for these charts.

Posted on 2020-03-08 08:28:17