Read this article at https://www.pugetsystems.com/guides/1146
Dr Donald Kinghorn (Scientific Computing Advisor)

GPU Memory Size and Deep Learning Performance (batch size) 12GB vs 32GB -- 1080Ti vs Titan V vs GV100

Written on April 27, 2018 by Dr Donald Kinghorn

How much of a performance effect does having more GPU memory available have on Deep Learning workloads? I have an NVIDIA Quadro GV100 on hand for a few days, and this card has 32GB of HBM2 (high-bandwidth memory). I thought it would be interesting to see how that extra memory could be exploited to increase performance of some neural network "training" jobs. In particular, it allows setting an important hyper-parameter, "batch size", to larger values.

For Deep Learning training problems the "learning" part is really minimizing some cost (loss) function by optimizing the parameters (weights) of a neural network model. This is a formidable task since there can be millions of parameters to optimize. The loss function being minimized includes a sum over all of the training data. It is typical to use an optimization method that is some variation of stochastic gradient descent, applied to small batches of the input training data. The optimization is run over these batches until all of the data has been covered. One complete cycle through all of the training data is usually called an "epoch". Optimization is iterated for some number of epochs until the loss function is minimized and the model's predictions have reached an acceptable accuracy (or have just stopped improving).
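The epoch/batch structure described above can be sketched with a toy example. This is a minimal mini-batch SGD loop on a made-up linear-regression problem in pure NumPy, not the benchmark code used later in this article:

```python
import numpy as np

# Toy example: minimize a mean-squared-error loss for a linear model with
# mini-batch stochastic gradient descent. Data and constants are synthetic,
# chosen only to illustrate the epoch/batch structure.

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 8))             # 1024 training samples, 8 features
true_w = rng.normal(size=8)
y = X @ true_w + 0.01 * rng.normal(size=1024)

w = np.zeros(8)                            # model parameters to optimize
batch_size = 64
lr = 0.1

for epoch in range(50):                    # one epoch = one pass over all data
    perm = rng.permutation(len(X))         # reshuffle the data each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        # gradient of the MSE loss evaluated on this batch only
        grad = 2.0 * xb.T @ (xb @ w - yb) / len(xb)
        w -= lr * grad                     # SGD update

loss = np.mean((X @ w - y) ** 2)
print(f"final loss: {loss:.6f}")
```

With batch_size = 1 each step would be driven by a single sample; larger batches average the gradient over more data per step.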

This "batch size" can range from values as small as 1 up to 1024 or higher. A value of 1 means that only 1 input feature vector of the training data affects the optimization parameters during that iteration. It is usually better to have more of the data loaded for each iteration to improve the learning quality and convergence rate. A key limiting factor on how many data points can be included in each optimization step is the available memory of the compute hardware. Batch size is an important hyper-parameter of the model training. Larger batch sizes may (often) converge faster and give better performance.
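As a rough illustration of how memory bounds batch size: activation memory grows approximately linearly with batch size, so a card with about three times the memory can fit roughly three times the batch. The per-image and model-memory figures below are made-up numbers for illustration, not measurements of these cards or models:

```python
# Back-of-envelope estimate of the largest batch that fits in GPU memory.
# model_mb (weights, optimizer state, framework buffers) and per_image_mb
# (activations per sample) are hypothetical placeholder values.

def max_batch_size(gpu_mem_gb, model_mb=500, per_image_mb=90):
    """Largest batch whose activations fit after fixed model overhead."""
    free_mb = gpu_mem_gb * 1024 - model_mb
    return int(free_mb // per_image_mb)

print(max_batch_size(11))   # an 11GB card like the 1080Ti
print(max_batch_size(32))   # a 32GB card like the GV100
```

The exact numbers depend entirely on the model and framework; the point is the near-linear scaling of feasible batch size with memory.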

There are two main reasons a larger batch size might improve performance.

  • A larger batch size "may" improve the effectiveness of the optimization steps resulting in more rapid convergence of the model parameters.
  • A larger batch size can also improve performance by reducing the communication overhead caused by moving the training data to the GPU. Each iteration then spends more of its time doing compute on the card.
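The second bullet can be made concrete with a toy cost model: each optimization step pays a fixed per-batch overhead (host-to-GPU transfer, kernel launches) plus a per-image compute cost, so images/second rises with batch size and then flattens toward the pure-compute limit. The constants are invented for illustration, not measurements:

```python
# Toy throughput model: step_time = fixed overhead + per-image compute.
# Larger batches amortize the fixed overhead over more images.

def images_per_second(batch_size, overhead_s=0.010, per_image_s=0.001):
    step_time = overhead_s + batch_size * per_image_s
    return batch_size / step_time

for bs in (32, 64, 128, 256, 512):
    print(f"batch {bs:4d}: {images_per_second(bs):7.1f} images/sec")
```

Throughput approaches 1/per_image_s (here 1000 images/sec) asymptotically, which matches the flattening curves in the tables below.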

The first reason listed above is perhaps the more interesting of the two. However, it can depend heavily on other hyper-parameters, the characteristics of the dataset being used, and the network model and computational framework. I simply do not have a good way to test this at the moment. Batch size is an adjustable hyper-parameter: if you have more GPU memory available you can try larger sizes! The effects of batch size are mostly an open question, although there is some interesting work that has been done. For example, "AdaBatch: Adaptive Batch Sizes for Training Deep Neural Networks".

The second reason is also important and can be measured relatively easily. It also gives a good indication of the potential gains from larger GPU memory. This is more of a direct measure of per iteration impact from hardware performance. The tables that follow will show the impact on "Images Per Second" being processed by three Deep Neural Network models.

I'm going to look at three CNN benchmark examples (with synthetic data) using TensorFlow: GoogLeNet, ResNet-50, and Inception-4. I ran TensorFlow with GPU acceleration in a docker container on my local system using the TensorFlow image from NGC (see my posts on doing this, beginning with this post).

Following is my command line to start the docker container.

docker run --runtime=nvidia --rm -it -v $HOME/projects:/projects nvcr.io/nvidia/tensorflow:18.04-py2

That docker image is built with TensorFlow 1.7.0.

The test code used is in the nvidia-examples/cnn directory of the above image.
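As a concrete example, the benchmarks can be launched from inside the running container. The script name, path, and flags below are assumptions for this image version and may differ in other NGC releases, so treat this as a sketch and check the nvidia-examples/cnn directory in your own image:

```shell
# Inside the container: run the ResNet-50 synthetic-data benchmark.
# Path, script name, and flags are assumed for the 18.04-py2 image and
# may need adjusting for other NGC TensorFlow releases.
cd /workspace/nvidia-examples/cnn
python nvcnn.py --model=resnet50 --batch_size=64 --num_gpus=1          # FP32 run
python nvcnn.py --model=resnet50 --batch_size=64 --num_gpus=1 --fp16   # Tensor-core run
```

Varying --batch_size (and toggling the FP16 option) is what produces the tables below.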


The tables below show the performance effect of increasing batch size for three convolutional neural network models of increasing complexity. The batch size is limited by the available GPU memory and, in general, performance increases with larger batch size.

The numbers in parentheses are results using Tensor-cores (FP16).

"fail" indicates that the batch size was too large for the available memory.

The use of Tensor-cores/FP16 on the Volta GPUs increases performance and also allows for larger batch sizes than "standard" FP32.

Note: For the Inception-4 model the loss function overflowed when using Tensor-cores/FP16. The values increased to INF (infinity) and then became NaNs (not a number).
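The overflow behavior noted above can be shown in miniature: FP16 has a maximum finite value of 65504, and very small gradients underflow to zero. "Loss scaling", the standard mixed-precision workaround, multiplies the loss (and hence the gradients) by a constant before the FP16 backward pass, then divides it back out in FP32 before the weight update. This is a plain NumPy illustration of the numerics, not the TensorFlow benchmark code:

```python
import numpy as np

# FP16 overflow: anything past 65504 becomes infinity, which is how a
# growing loss turns into INF and then NaN.
big = np.float16(70000.0)
print(big)

# FP16 underflow: a gradient this small is flushed to zero and lost.
tiny_grad = np.float32(1e-8)
print(np.float16(tiny_grad))

# Loss scaling: scale up before casting to fp16, unscale in fp32 afterward.
scale = np.float32(1024.0)
scaled = np.float16(tiny_grad * scale)    # now representable in fp16
recovered = np.float32(scaled) / scale    # close to the original 1e-8
print(recovered)
```

A well-chosen (or dynamically adjusted) scale keeps gradients inside the FP16 range without changing the mathematics of the update.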

GoogLeNet with varying Batch Size using 1080Ti, Titan V, GV100 -- Training performance (Images/second)

Batch Size | 1080Ti | Titan V (FP16) | GV100 (FP16)
        32 |    478 |      605 (752) |    621 (759)
        64 |    539 |     720 (1027) |   758 (1059)
       128 |    566 |     803 (1236) |   856 (1284)
       256 |    576 |     874 (1350) |   901 (1409)
       512 |   fail |    fail (1467) |   918 (1500)
      1024 |   fail |           fail |   882 (1559)
      2048 |   fail |           fail |  fail (1490)

[Figure: GoogLeNet training performance (images/second) vs batch size]

ResNet-50 with varying Batch Size using 1080Ti, Titan V, GV100 -- Training performance (Images/second)

Batch Size | 1080Ti | Titan V (FP16) | GV100 (FP16)
        16 |    178 |      228 (318) |    240 (326)
        32 |    203 |      257 (441) |    286 (451)
        64 |    216 |      301 (534) |    298 (551)
       128 |   fail |     fail (595) |    334 (611)
       256 |   fail |           fail |    335 (627)
       512 |   fail |           fail |   fail (630)

[Figure: ResNet-50 training performance (images/second) vs batch size]

Inception-4 with varying Batch Size using 1080Ti, Titan V, GV100 -- Training performance (Images/second)

Batch Size | 1080Ti | Titan V (FP16) | GV100 (FP16)
        16 |     61 |       78 (124) |     83 (126)
        32 |     68 |       89 (170) |     93 (175)
        64 |   fail |     fail (198) |    105 (203)
       128 |   fail |           fail |    109 (223)
       256 |   fail |           fail |   fail (229)

[Figure: Inception-4 training performance (images/second) vs batch size]

I think the "winner" in this testing is the Titan V. It's a great workstation GPU for Machine/Deep Learning, and the option to use Tensor-cores is compelling. The 1080Ti is also a great card and probably the overall best price/performance value for machine learning. The GV100 is very expensive and probably best suited for its intended use, real-time ray-tracing.

I wish I had more time to spend with the GV100. There is no doubt it is a great card. It is not really intended for use in Deep Learning. However, it provided a great opportunity to test the effect of increasing batch size on model training performance. The actual intended use for the GV100 is together with a second card connected with NVLINK 2. In that configuration it is capable of high resolution real-time ray-tracing. There were some stunning examples of that at GTC18!

Happy computing --dbk

Tags: Deep Learning, Tensor-cores, NVIDIA, 1080Ti, Titan V, GV100

This is very helpful, thank you!

Posted on 2018-12-17 23:40:06
Benjamin Lynch

This is largely an outdated question as it relates to the Titan V due to the release of the Titan RTX, but do you remember if there was a dramatic improvement in convergence with batch size increases?

The Titan V is clearly the more affordable option, but if there is a big improvement in either model performance (better absolute results) or training time due to fewer epochs needing to be run, then the GV100 might actually be the better option and would arguably be worth the price premium.

On more current hardware, this might be a question of whether the Quadro RTX cards are a good option for this workload with the ability to use NVBridge across multiple cards and scale up memory capacity way beyond what consumer cards can offer. The mid range Quadro RTX cards look really appealing if this is the case.

Posted on 2019-01-26 21:38:13
Donald Kinghorn

In general a larger batch will/should increase the convergence rate and reduce the total run-time because there is more cross correlation and fewer iterations, BUT optimization can be tricky! It's hard to predict what will work best in any given situation. I've seen jobs where simple fast algorithms with lots of steps outperformed more sophisticated routines that made larger/better steps. However, in general more memory is a good thing, and if the input dimensions are large it can be the difference between being able to do a calculation or not.

I love the Titan V. I think it is the best computing device ever made for a workstation. Having great double precision available on Volta is wonderful (and more scientific programmers should know this). I'll be writing up a post on the RTX Titan this week ... the memory is nice, but the cooling is bad, the fp64 is bad, and the P2P bandwidth without NVLINK is bad, and using more than 2 cards is not feasible (because of cooling ... in fact, even using 2 cards is questionable).

I've done some testing with the RTX Quadro 6000, very nice cards and they could be used in 4 GPU workstations easily. I am pretty sure the RTX Quadro cards only allow 2 cards to be connected with NVLINK (same as the GeForce cards) but they do have full P2P over PCIe (none of the RTX GeForce cards do, including the RTX Titan) ... So, the RTX Quadros will be very good (but very expensive) workstation compute devices. For multi-GPU server chassis systems a V100 Tesla is probably a better way to go for around the same cost.

The RTX 2080Ti is a great card, and there are blower fan versions available. I don't think too many applications are going to suffer from lack of bandwidth over the PCIe bus (I'll be doing 1-4 card testing soon).

Posted on 2019-01-28 18:06:47
Christopher Steel

Great article. I would like to see it updated to include the Titan RTX.

Posted on 2019-06-28 23:55:14
Donald Kinghorn

thanks! I enjoyed this one too ... I do have some general performance testing with newer cards ranging from 1660-Ti and 1070 up to RTX Titan and 2080Ti, 1-4 way multi-GPU. I used basically the same ResNet-50 testing along with an LSTM problem. But I didn't vary the batch size other than to take advantage of the capacity of the cards I was using ...

Posted on 2019-07-01 15:40:58