NVIDIA Titan V plus Tensor-cores Considerations and Testing of FP16 for Deep Learning

I finally have my hands on a Titan V again. It’s been since December! I haven’t been able to check any out of inventory here at Puget Systems since for some reason customers keep buying them. Our testing cards have been traveling around to various meetings and trade shows. I’ve had one this week for testing and next week I should have 4 for multi-GPU scaling evaluation on Machine Learning workloads.

In the post I did in December I didn’t get good results with Tensor-cores from the naive testing I did at that time. I really liked the Titan V then and gave it a recommendation. I like it even more now. I did promise that I would revisit Tensor-cores and report back. I’m reporting back, and the results look good!

Tensor-cores are one of the unique new features of the NVIDIA Volta architecture. They are available in the Tesla V100, the new Quadro GV100 and the Titan V. I’m going to quote from the Wikipedia Volta page since it is a good, concise definition of what a Tensor-core is.

“Tensor cores: A tensor core is a unit that multiplies two 4×4 FP16 matrices, and then adds a third FP16 or FP32 matrix to the result by using fused multiply–add operations, and obtains an FP32 result that could be optionally demoted to an FP16 result. Tensor cores are intended to speed up the training of neural networks.”
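To make that definition concrete, here is a minimal sketch in plain Python that emulates the arithmetic of D = A x B + C on 4×4 matrices, with the inputs rounded to FP16 and the running sum kept in FP32. The helper names (to_fp16, to_fp32, tensor_core_fma) are my own for illustration. This only imitates the rounding behavior in software; it is not how the hardware or the CUDA libraries are implemented.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest FP16 value (returned as a float)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest FP32 value (returned as a float)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def tensor_core_fma(a, b, c):
    """Emulate D = A x B + C on 4x4 matrices: FP16 inputs, FP32 accumulation.

    Illustrative only: real Tensor-cores do this in hardware and their exact
    internal rounding may differ from this software imitation.
    """
    n = 4
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = to_fp32(c[i][j])  # the accumulator starts from C, held in FP32
            for k in range(n):
                prod = to_fp16(a[i][k]) * to_fp16(b[k][j])  # FP16 operands
                acc = to_fp32(acc + prod)  # running sum rounded to FP32
            d[i][j] = acc
    return d

# Small made-up inputs: A times the identity, plus a matrix of ones.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
a = [[0.1 * (i + j) for j in range(4)] for i in range(4)]
c = [[1.0] * 4 for _ in range(4)]
d = tensor_core_fma(a, identity, c)
print(d[1][2])  # close to 1.3, limited by the FP16 rounding of the input 0.3
```

The point of the sketch is the split: the multiply operands are low precision, while the sum that collects the products is kept in a wider type.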

Much of Machine Learning/AI computation comes down to simple numerical linear algebra operations of which matrix (tensor) multiplication is fundamental. Algorithms can be designed to use partitioning to smaller block sizes to take advantage of special hardware units and memory architectures. NVIDIA has implemented some of these operations in the CUDA libraries to take advantage of the Tensor-core structure. This means that frameworks like TensorFlow that leverage these libraries can take advantage of the potential speedup from this hardware.
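The partitioning idea can be sketched in a few lines. The following is a toy blocked matrix multiply in plain Python that walks over an 8×8 problem in 4×4 tiles, so that each innermost group of loops is one tile-sized multiply-accumulate of the kind a Tensor-core handles. The function names and the tile size parameter are assumptions for illustration; the CUDA libraries use their own, far more sophisticated, tiling schemes.

```python
def matmul(a, b):
    """Plain triple-loop matrix multiply, used here as the reference result."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def blocked_matmul(a, b, tile=4):
    """Multiply square matrices tile by tile.

    Assumes n x n inputs with n a multiple of `tile` (an illustrative
    restriction to keep the sketch short).
    """
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i0 in range(0, n, tile):          # tile row of C
        for j0 in range(0, n, tile):      # tile column of C
            for k0 in range(0, n, tile):  # tiles along the shared dimension
                # One tile-sized multiply-accumulate: C_tile += A_tile x B_tile
                for i in range(i0, i0 + tile):
                    for j in range(j0, j0 + tile):
                        acc = c[i][j]
                        for k in range(k0, k0 + tile):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c

n = 8
a = [[i + j for j in range(n)] for i in range(n)]
b = [[i - j for j in range(n)] for i in range(n)]
same = blocked_matmul(a, b) == matmul(a, b)
print(same)  # integer inputs, so the blocked and plain results match exactly
```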

What is precision? What does FP16 mean?

FP16 means 16-bit floating point numbers. This is also known as half-precision. On GPUs, calculations are most commonly done with FP32, the 32-bit single precision data type, because that is where GPUs have offered the highest performance. Scientific computation on CPUs is traditionally done with 64-bit FP64 double precision. What does that mean for numerical accuracy? Here’s an example:

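  • Half precision, FP16: 14239812.438711809 is the same as 14240000 (in terms of digits kept; real FP16 also tops out at 65504, so a value this large would have to be scaled down first)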
  • Single precision, FP32: 14239812.438711809 is the same as 14239812.
  • Double precision FP64: 14239812.438711809 is the same as 14239812.43871181

In other words, FP16 is good for around 4 meaningful digits, FP32 for about 8 digits and FP64 for about 16 digits. There are other “precisions” too. “Extended precision” will usually give about 19 or 20 good digits, and most CPUs and compilers will use this internally for intermediate steps in operations like sin, cos, exp, log, bessel, erf etc. so that they don’t lose precision while they are being computed. It is occasionally necessary to use quad precision (FP128) for calculations that are numerically unstable. That gives about 34 good digits.
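You can check the digit counts yourself without a GPU. This is a small sketch using only the Python standard library, where the struct module's "e" and "f" formats round a value to FP16 and FP32. The helper names are mine, and I use a scaled-down version of the example number so that it fits inside the FP16 range.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest FP16 value (returned as a float)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest FP32 value (returned as a float)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

x = 1423.9812438711809  # the example value scaled down to fit in FP16

print(to_fp16(x))  # nearest FP16 value: about 4 good digits
print(to_fp32(x))  # nearest FP32 value: about 8 good digits
print(x)           # Python floats are already FP64

# FP16 has a small range as well as few digits: its largest finite value
# is 65504, so the original, unscaled number cannot be stored at all.
try:
    to_fp16(14239812.438711809)
    overflowed = False
except (OverflowError, struct.error):
    overflowed = True
print(overflowed)
```

The range limit is worth keeping in mind alongside the digit count, since both come into play when gradients get very large or very small.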

If you want to use Tensor-cores it means that you will be using FP16 half-precision for at least part of your calculations. Is that OK?

Does precision matter? (… How about for Deep Learning?)

Well, the answer to that is “it depends”. There is a whole discipline devoted to answering questions about precision and errors in calculations. Numerical Analysis is the field of study that takes a careful look at how computational algorithms are affected by approximation in numbers. On a computer, most real numbers cannot be stored exactly in a finite number of binary digits (ones and zeros), so they are approximations. The “precision” is an indicator of the “goodness” of that approximation.

In the definition of Tensor-cores they talk about things like saving the FP16 operation to an FP32 result. That’s important. Here’s why: when you multiply numbers together the number of “good” digits is preserved, but when you add (or subtract) numbers you can lose significant digits. This is commonly called accumulation error. Most calculations come down to multiply and add type operations. You may be able to get away with lower precision for the multiply part if you are careful about the accumulation (add) part. You usually want higher precision for accumulation. That’s why they say “and obtains an FP32 result” in the definition of Tensor-cores. In general that idea is called mixed precision, and it can work OK if you are careful with your algorithms.
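Here is a small experiment that shows accumulation error directly. It adds the same FP16 value ten thousand times, once with the running sum rounded to FP16 after every add and once with the running sum rounded to FP32. The term value and loop count are arbitrary choices of mine, and the rounding helpers are a software stand-in for what the hardware does.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest FP16 value (returned as a float)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest FP32 value (returned as a float)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

term = to_fp16(0.001)  # the value being added is FP16 in both cases

fp16_total = 0.0  # running sum kept in FP16
fp32_total = 0.0  # running sum kept in FP32 (the mixed precision approach)
for _ in range(10000):
    fp16_total = to_fp16(fp16_total + term)
    fp32_total = to_fp32(fp32_total + term)

print(fp16_total)  # stalls far short of 10
print(fp32_total)  # close to 10
```

The FP16 sum stops growing once the gap between neighboring FP16 values becomes larger than the term being added, because each new add then rounds straight back to the old total. Keeping only the accumulator in FP32 avoids that, even though every term is still an FP16 number.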

If you are an engineer computing the stresses on components for a bridge you would never even consider FP16. But what if you are trying to “learn” a million parameters for some Deep Learning neural network model … maybe that’s OK.

Why FP16 and Tensor-cores are likely OK for Deep Learning

Notice that I said likely. I haven’t done much testing myself and I have been doing numerical computing for long enough to see that there could be problems from using such low precision. Computer “Learning” is really just numerical optimization. That is, adjusting some set of parameters so that they will minimize a “loss function” that is a measure of how close predicted values from your model are to known true values (that’s supervised learning). Optimization problems can be unstable, but here are a few reasons I think it is probably OK to use FP16 mixed precision for training Deep Neural Networks (DNNs):

Low parameter sensitivity

A DNN with many layers may have millions of parameters (weights) that need to be optimized. The models are so large and complex that small changes to optimization parameters are unlikely to have much impact on the overall predictive capability of the model. For example, what difference do you think it would make for parameter 1,240,037 to have a value of 1.238 vs 1.241? Probably not much! I have not studied sensitivity of optimization parameters for DNNs but I feel that, given the huge number of parameters, FP16 may be adequate to represent significance in the model.

A problem related to parameter sensitivity that could be exacerbated by FP16 is that of “Vanishing” or “Exploding” gradients. The gradient is a vector of partial derivatives that is used to calculate a direction of descent to minimize a loss function. It’s the key ingredient for “learning”. With the limited range and precision of FP16, gradient values can underflow or overflow, which causes numerical instability and can make the learning fail. However, this can be addressed by dynamic “clipping” or re-scaling of the gradient to keep its values in a representable range. This is not unreasonable to do and it is good practice in any case.
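One common form of re-scaling is usually called loss scaling: multiply the values by a constant before they are stored in FP16, then divide it back out in higher precision. The sketch below shows the effect on a single made-up gradient value. The gradient value and the scale factor of 1024 are hypothetical numbers I picked for illustration, not values from any framework.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest FP16 value (returned as a float)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

true_grad = 1e-8  # hypothetical tiny gradient value
scale = 1024.0    # hypothetical scale factor (a power of two)

# Stored directly in FP16 the value is too small and is flushed to zero.
unscaled = to_fp16(true_grad)

# Scaled up first, it lands in the representable FP16 range.
scaled = to_fp16(true_grad * scale)

# Dividing the scale back out in higher precision (a Python float here)
# recovers a usable approximation of the original gradient.
recovered = scaled / scale

print(unscaled)   # 0.0, the gradient has vanished
print(recovered)  # close to 1e-8
```

Using a power of two for the scale factor means the scaling itself only changes the exponent and adds no rounding error of its own.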

Avoidance of over-fitting

One of the problems when training neural networks is over-fitting of parameters. That happens when the optimization finds (learns) parameters that fit the training data almost exactly but then don’t generalize well to test samples. This is a big problem because of the large number of parameters. You see methods employed like regularization, dropout, early stopping etc. These methods are used to avoid over-fitting and give a “smoother fit” in the model. In a general sense these methods are trying to constrain the optimization from getting “too good”. It may be that using lower precision when computing parameter updates actually helps the overall fit of the models, although that is speculation on my part.

Those are my intuitive feelings for why FP16 is probably OK for DNNs. It may be perfectly fine for “inference” too. That’s when you have developed a model and are using it for predictions. You often need to reduce the size and complexity of your trained models so they can be deployed for real-time use, possibly on simple hardware. Precision reduction may be useful for this too.

I have used the words “likely”, “probably” and “may” a lot. That’s because I’m an inherent skeptic. However, I think it is definitely worthwhile to make an effort to explore FP16 and mixed precision with Tensor-cores if you have access to Volta architecture GPUs. I have some results from some testing that may encourage you to try this yourself.

Tensor-core performance results for a CNN benchmark

The system I’m testing on may become the new Puget Systems primary Machine Learning workstation. I won’t commit to that until I have more multi-GPU testing done. The system is based on an Intel Xeon-W single socket CPU using Registered ECC memory and having 4 X16 slots for GPU acceleration. I should be testing the system with 4 Titan V GPUs over the next couple of weeks.

The results in the table below are from running various convolutional neural network models implemented with TensorFlow and using synthetic training data. The code I’m running is from the TensorFlow docker image on NVIDIA NGC. The application is cnn in the nvidia-examples directory.

See the following post for details of how to set up your workstation to use NGC.

My command line to start the container is below (after doing docker login nvcr.io to access the NGC docker registry):

docker run --runtime=nvidia --rm -it -v $HOME/projects:/projects nvcr.io/nvidia/tensorflow:18.03-py2

The following table shows why you might want to look at Tensor-cores!

CNN benchmarks with 1080Ti, Titan V, Titan V + Tensor-cores: training performance (images/second)

CNN Model     1080Ti    Titan V    Titan V + Tensor-cores
GoogLeNet     515       696        960
Resnet-50     206       274        505
VGG-13        155       220        377
Inception3    132       195        346
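To put those numbers in terms of speedup, here is a short script that divides the Tensor-core result by the other two columns. The images/second values are copied from the table above; the variable names are just my own.

```python
# Images/second from the table: (1080Ti, Titan V, Titan V + Tensor-cores)
results = {
    "GoogLeNet": (515, 696, 960),
    "Resnet-50": (206, 274, 505),
    "VGG-13": (155, 220, 377),
    "Inception3": (132, 195, 346),
}

# Speedup of Titan V + Tensor-cores relative to the Titan V without them
speedups = {model: round(tc / tv, 2) for model, (_, tv, tc) in results.items()}

# Speedup of Titan V + Tensor-cores relative to the 1080Ti
speedups_vs_1080ti = {model: round(tc / gtx, 2)
                      for model, (gtx, _, tc) in results.items()}

for model in results:
    print(f"{model:<11} {speedups[model]:.2f}x vs Titan V   "
          f"{speedups_vs_1080ti[model]:.2f}x vs 1080Ti")
```

For these four models the Tensor-core runs come out at roughly 1.4 to 1.8 times the Titan V without Tensor-cores, and roughly 1.9 to 2.6 times the 1080Ti.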

This is obviously not a comprehensive evaluation of Tensor-cores! However, it does motivate further investigation. I am curious to do more testing. Expect to see more on Titan V in the next few weeks. I hope to be testing 4 cards for scaling soon and I certainly will include job runs that utilize Tensor-cores.

Happy computing! –dbk