How much does performance improve when more GPU memory is available for Deep Learning workloads? I have had an NVIDIA Quadro GV100 for a few days, and this card has 32GB of HBM2 (high-bandwidth memory). I thought it would be interesting to see how this could be exploited to increase the performance of some neural network “training” jobs. In particular, the extra memory allows an important hyper-parameter, “batch size”, to be set to larger values.
For Deep Learning training problems the “learning” part is really minimizing some cost (loss) function by optimizing the parameters (weights) of a neural network model. This is a formidable task since there can be millions of parameters to optimize. The loss function that is being minimized includes a sum over all of the training data. It is typical to use an optimization method that is a variation of stochastic gradient descent over small batches of the input training data. The optimization is done over these batches until all of the data has been covered. One complete cycle through all of the training data is usually called an “epoch”. Optimization is iterated for some number of epochs until the loss function is minimized and the model’s predictions have reached an acceptable accuracy (or the accuracy has simply stopped improving).
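The batch/epoch loop described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the model is a toy linear fit (y = w*x + b) rather than a neural network, and the function name `train`, the learning rate, and the synthetic data are all made up for the example.

```python
import random

def train(data, batch_size, epochs, lr=0.1, seed=0):
    """Fit y = w*x + b by mini-batch SGD on mean squared error (toy example)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(data)
    for _ in range(epochs):                      # one epoch = one pass over all data
        rng.shuffle(data)                        # new batch composition every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the mean squared error over this batch only.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w                     # one optimization step per batch
            b -= lr * grad_b
    return w, b

# Synthetic, noise-free data drawn from y = 3x + 1 (an assumption for the demo).
samples = [(i / 100, 3 * (i / 100) + 1) for i in range(100)]
w, b = train(samples, batch_size=10, epochs=200)
print(round(w, 2), round(b, 2))
```

Changing `batch_size` changes how many samples contribute to each gradient and how many steps make up an epoch, which is exactly the knob discussed in the rest of this post.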
This “batch size” can range from values as small as 1 to 1024 or higher. A value of 1 means that only one input feature vector of the training data affects the parameters being optimized during that iteration. It is usually better to have more of the data loaded for each iteration to improve the learning quality and convergence rate. A key limiting factor in how many data points can be included in each optimization step is the available memory of the compute hardware. Batch size is an important hyper-parameter of the model training. Larger batch sizes often, though not always, converge faster and give better performance.
There are two main reasons a larger batch size might improve performance.
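As a rough sketch of how memory caps the batch size, the helper below assumes that memory use is a fixed part (weights, workspace) plus a part that grows linearly with the batch. That linear model, the function names, and every number passed in are hypothetical; real footprints depend on the network and framework and have to be measured.

```python
import math

def iterations_per_epoch(num_samples, batch_size):
    """Optimizer steps needed to cover every training sample once."""
    return math.ceil(num_samples / batch_size)

def largest_batch_that_fits(total_gb, fixed_gb, per_sample_gb):
    """Largest power-of-two batch whose ESTIMATED footprint fits in memory.

    fixed_gb      -- memory that does not scale with the batch (hypothetical)
    per_sample_gb -- memory needed per input sample (hypothetical, must be > 0)
    """
    best, candidate = 0, 1
    while fixed_gb + candidate * per_sample_gb <= total_gb:
        best = candidate
        candidate *= 2
    return best

# Hypothetical model: 2 GB fixed, 0.05 GB per sample.
for total_gb in (11, 32):
    print(total_gb, "GB ->", largest_batch_that_fits(total_gb, 2.0, 0.05))
print(iterations_per_epoch(1000, 32), "iterations per epoch")
```

Under these made-up numbers the larger card fits a batch four times bigger, and a bigger batch means fewer iterations per epoch.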
- A larger batch size “may” improve the effectiveness of the optimization steps resulting in more rapid convergence of the model parameters.
- A larger batch size can also improve performance by reducing the communication overhead caused by moving the training data to the GPU. This causes more compute cycles to run on the card with each iteration.
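The second point can be illustrated with a simple cost model: assume each iteration pays a fixed overhead (data transfer, kernel launches) plus a per-image compute cost. Both constants below are hypothetical and chosen only to show the shape of the curve; they are not measurements from any of the cards tested here.

```python
def images_per_second(batch_size, fixed_overhead_s, per_image_s):
    """Throughput when an iteration costs a fixed overhead plus per-image work."""
    iteration_time = fixed_overhead_s + batch_size * per_image_s
    return batch_size / iteration_time

# HYPOTHETICAL costs, for illustration only.
FIXED_OVERHEAD_S = 0.020   # per-iteration overhead in seconds
PER_IMAGE_S = 0.001        # compute time per image in seconds

rates = []
for batch_size in (32, 64, 128, 256, 512):
    rate = images_per_second(batch_size, FIXED_OVERHEAD_S, PER_IMAGE_S)
    rates.append(rate)
    print(f"batch {batch_size:4d}: {rate:7.1f} images/s")
```

In this model the throughput keeps rising with batch size but flattens out as it approaches `1 / PER_IMAGE_S`, because the fixed overhead is spread over more images in each iteration.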
The first reason listed above is perhaps the more interesting of the two. However, it can depend heavily on other hyper-parameters and the characteristics of the dataset being used, as well as the network model and computational framework. I simply do not have a good way to test this at the moment. Batch size is an adjustable hyper-parameter. If you have more GPU memory available you can try larger sizes! The effect of batch size is mostly an open question, and some interesting work has been done on it. For example, “AdaBatch: Adaptive Batch Sizes for Training Deep Neural Networks”.
The second reason is also important and can be measured relatively easily. It also gives a good indication of the potential gains from larger GPU memory. This is a more direct measure of the per-iteration impact of hardware performance. The tables that follow show the impact on “Images Per Second” being processed by three Deep Neural Network models.
I’m going to look at three CNN benchmark examples (with synthetic data) using TensorFlow: GoogLeNet, ResNet-50 and Inception-4. I ran TensorFlow with GPU acceleration in a docker container on my local system using the TensorFlow image from NGC (see my earlier posts on doing this).
Following is my command line to start the docker container.
docker run --runtime=nvidia --rm -it -v $HOME/projects:/projects nvcr.io/nvidia/tensorflow:18.04-py2
That docker image is built with TensorFlow 1.7.0.
The test code used is in the nvidia-examples/cnn directory of the above image.
The tables below show the performance effect of increasing batch size for three convolutional neural network models of increasing complexity. The batch size is limited by the available GPU memory and, in general, performance increases with larger batch size.
The numbers in parentheses are using Tensor-cores (FP16)
(fail) indicates that the batch size was too large for available memory
The use of Tensor-cores/FP16 on the Volta GPUs increases performance and also allows for larger batch sizes than “standard” FP32.
Note: For the Inception-4 model the loss function overflowed when using Tensor-cores/FP16. The values increased to INF (infinity) and then became NaNs (not a number).
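The INF-then-NaN pattern in the note above is what you would expect from half-precision arithmetic, where the largest finite value is 65504. The sketch below emulates FP16 rounding with Python's standard `struct` module (format code `"e"`, Python 3.6+). The helper `to_fp16` is my own illustration, not anything TensorFlow does internally, and the loss value is made up.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value

def to_fp16(x):
    """Round a Python float to half precision; overflow becomes +/-infinity."""
    if math.isnan(x) or math.isinf(x):
        return x
    try:
        return struct.unpack("e", struct.pack("e", x))[0]
    except (OverflowError, struct.error):
        # struct refuses to pack values beyond the FP16 range; real FP16
        # hardware arithmetic would produce infinity instead.
        return math.copysign(math.inf, x)

loss = to_fp16(40000.0)          # hypothetical large loss, still finite in FP16
doubled = to_fp16(loss * 2)      # exceeds FP16_MAX, so it overflows to inf
difference = doubled - doubled   # inf - inf is NaN

print(loss, doubled, difference)
```

Once a single infinity appears, ordinary operations such as subtraction or multiplication by zero turn it into NaN, and NaN then propagates through every later update.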
GoogLeNet with varying Batch Size using 1080Ti, Titan V, GV100 — Training performance (Images/second)
| Batch Size | 1080Ti | Titan V (FP16) | GV100 (FP16) |
|---|---|---|---|
| 32 | 478 | 605 (752) | 621 (759) |
| 64 | 539 | 720 (1027) | 758 (1059) |
| 128 | 566 | 803 (1236) | 856 (1284) |
| 256 | 576 | 874 (1350) | 901 (1409) |
| 512 | fail | fail (1467) | 918 (1500) |
ResNet-50 with varying Batch Size using 1080Ti, Titan V, GV100 — Training performance (Images/second)
| Batch Size | 1080Ti | Titan V (FP16) | GV100 (FP16) |
|---|---|---|---|
| 16 | 178 | 228 (318) | 240 (326) |
| 32 | 203 | 257 (441) | 286 (451) |
| 64 | 216 | 301 (534) | 298 (551) |
| 128 | fail | fail (595) | 334 (611) |
Inception-4 with varying Batch Size using 1080Ti, Titan V, GV100 — Training performance (Images/second)
| Batch Size | 1080Ti | Titan V (FP16) | GV100 (FP16) |
|---|---|---|---|
| 16 | 61 | 78 (124) | 83 (126) |
| 32 | 68 | 89 (170) | 93 (175) |
| 64 | fail | fail (198) | 105 (203) |
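To put a number on the batch-size effect, the short script below takes the GV100 results from the three tables above and compares the largest batch size against the smallest for each model. The figures are copied from the tables; the helper name `batch_speedup` and the data layout are just for this example.

```python
# Images/second on the GV100, copied from the tables above.
# Each entry maps batch size -> (FP32, FP16) for the smallest and largest batch.
gv100 = {
    "GoogLeNet":   {32: (621, 759), 512: (918, 1500)},
    "ResNet-50":   {16: (240, 326), 128: (334, 611)},
    "Inception-4": {16: (83, 126),  64: (105, 203)},
}

def batch_speedup(results, precision_index):
    """Throughput at the largest batch divided by throughput at the smallest."""
    small, large = min(results), max(results)
    return results[large][precision_index] / results[small][precision_index]

speedups = {
    model: (batch_speedup(results, 0), batch_speedup(results, 1))
    for model, results in gv100.items()
}
for model, (fp32, fp16) in speedups.items():
    print(f"{model:12s} FP32 x{fp32:.2f}   FP16 x{fp16:.2f}")
```

On these numbers the gain from the larger batch is consistently bigger with FP16 than with FP32.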
I think the “winner” in this testing is the Titan V. It’s a great workstation GPU for Machine/Deep Learning and the option to use Tensor-cores is compelling. The 1080Ti is also a great card and probably the overall best price/performance value for machine learning. The GV100 is very expensive and probably best suited for its intended use, real-time ray-tracing.
I wish I had more time to spend with the GV100. There is no doubt it is a great card. It is not really intended for use in Deep Learning. However, it provided a great opportunity to test the effect of increasing batch size on model training performance. The actual intended use for the GV100 is together with a second card connected with NVLINK 2. In that configuration it is capable of high resolution real-time ray-tracing. There were some stunning examples of that at GTC18!
Happy computing –dbk