Read this article at https://www.pugetsystems.com/guides/1141
Dr Donald Kinghorn (Scientific Computing Advisor)

NVIDIA Titan V plus Tensor-cores Considerations and Testing of FP16 for Deep Learning

Written on April 20, 2018 by Dr Donald Kinghorn

I finally have my hands on a Titan V again. It's been since December! I haven't been able to check any out of inventory here at Puget Systems since for some reason customers keep buying them. Our testing cards have been traveling around to various meetings and trade shows. I've had one this week for testing and next week I should have 4 for multi-GPU scaling evaluation on Machine Learning workloads.

In the post I did in December I didn't get good results with Tensor-cores from the naive testing I did at that time. I really liked the Titan V then and gave it a recommendation. I like it even more now. I did promise that I would revisit Tensor-cores and report back. I'm reporting back, and the results look good!

Tensor-cores are one of the unique new features of the NVIDIA Volta architecture. They are available in the Tesla V100, the new Quadro GV100 and the Titan V. I'm going to quote from the Wikipedia Volta page since it is a good, concise definition of what a Tensor-core is.

"Tensor cores: A tensor core is a unit that multiplies two 4×4 FP16 matrices, and then adds a third FP16 or FP32 matrix to the result by using fused multiply–add operations, and obtains an FP32 result that could be optionally demoted to an FP16 result. Tensor cores are intended to speed up the training of neural networks."

Much of Machine Learning/AI computation comes down to simple numerical linear algebra operations of which matrix (tensor) multiplication is fundamental. Algorithms can be designed to use partitioning to smaller block sizes to take advantage of special hardware units and memory architectures. NVIDIA has implemented some of these operations in the CUDA libraries to take advantage of the Tensor-core structure. This means that frameworks like TensorFlow that leverage these libraries can take advantage of the potential speedup from this hardware.
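To make the quoted definition concrete, here is a minimal sketch of the arithmetic pattern D = A×B + C in plain Python. It only imitates the number formats (FP16 inputs, FP32 running sum) using the standard `struct` module; it says nothing about how the hardware schedules the work, and the function name `tensor_core_like_fma` and the test matrices are made up for illustration.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest FP16 value (returned as a float)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest FP32 value (returned as a float)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def tensor_core_like_fma(a, b, c):
    """Return D = A x B + C for 4x4 lists: FP16 inputs, FP32 accumulation."""
    n = 4
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = to_fp32(c[i][j])  # the accumulator starts from C and is kept in FP32
            for k in range(n):
                # Inputs are rounded to FP16; the product of two FP16 values fits exactly in FP32.
                prod = to_fp16(a[i][k]) * to_fp16(b[k][j])
                acc = to_fp32(acc + prod)  # the running sum is rounded to FP32, not FP16
            d[i][j] = acc
    return d

# Hypothetical inputs: A holds multiples of 0.1, B is the identity, C is all ones.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
a = [[0.1 * (i + j) for j in range(4)] for i in range(4)]
c = [[1.0] * 4 for _ in range(4)]

d = tensor_core_like_fma(a, identity, c)
print(d[1][2])  # close to 1.3; the small difference comes from rounding 0.3 to FP16
```

The point of the sketch is the split: the inputs lose digits when they are rounded to FP16, but the sum that combines them does not lose any more.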

What is precision? What does FP16 mean?

FP16 means 16-bit floating point numbers, also known as half precision. On GPUs, calculations are most commonly done with FP32 (32-bit, single precision) data types, because that is where GPUs offer the highest performance. Traditionally, scientific computation on CPUs is done with 64-bit FP64 double precision. What does that mean for numerical accuracy? Here's an example:

  • Half precision, FP16: 14239812.438711809 is the same as 14240000 (counting digits only; FP16 tops out at 65504, so a value this large would actually overflow)
  • Single precision, FP32: 14239812.438711809 is the same as 14239812.
  • Double precision FP64: 14239812.438711809 is the same as 14239812.43871181

In other words, FP16 is good for around 3 to 4 meaningful digits, FP32 is good for 7 to 8 digits and FP64 is good for about 16 digits. There are other "precisions" too. "Extended precision" will usually give about 19 good digits, and most CPUs and compilers will use this internally for intermediate steps in operations like sin, cos, exp, log, bessel, erf etc. so that they don't lose precision while they are being computed. It is occasionally necessary to use quad precision (FP128) for calculations that are numerically unstable; that gives about 34 good digits.
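If you want to see these digit counts for yourself, the following is a small sketch using only Python's standard `struct` module, which can store a number in 16, 32 or 64 bits and read it back. The helper name `round_trip` is made up for illustration, and the test value is the number from the list above divided by 10000 so that it fits inside FP16's range.

```python
import struct

def round_trip(fmt, x):
    """Store x in the given struct format and read it back as a Python float."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

x = 1423.9812438711809  # same digits as the example above, scaled to fit FP16's range

fp16 = round_trip("<e", x)  # half precision, 16 bits
fp32 = round_trip("<f", x)  # single precision, 32 bits
fp64 = round_trip("<d", x)  # double precision, 64 bits (Python's native float)

print("FP16:", fp16)
print("FP32:", fp32)
print("FP64:", fp64)

# FP16 cannot hold the original 14239812.438711809 at all: its largest
# finite value is 65504, and struct refuses to pack anything bigger.
try:
    round_trip("<e", 14239812.438711809)
    overflowed = False
except OverflowError:
    overflowed = True
print("original value overflows FP16:", overflowed)
```

Limited range is as much a part of the FP16 story as limited digits, which is why the overflow check is included.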

If you want to use Tensor-cores it means that you will be using FP16 half-precision for at least part of your calculations. Is that OK?

Does precision matter? (... How about for Deep Learning?)

Well, the answer to that is "it depends". There is a whole discipline devoted to answering questions about precision and errors in calculations. Numerical Analysis is the field of study that takes a careful look at how computational algorithms are affected by approximation in numbers. On a computer, most real numbers can only be stored as approximations in a finite number of binary digits (ones and zeros). The "precision" is an indicator of the "goodness" of that approximation.

In the definition of Tensor-cores they talk about things like saving the FP16 operation to an FP32 result. That's important. Here's why: when you multiply numbers together the number of "good" digits is preserved, but when you add (or subtract) numbers you can lose significant digits. This is commonly called accumulation error. Most calculations come down to multiply-and-add types of operations. You may be able to get away with lower precision for the multiply part if you are careful about the accumulation (add) part. You usually want higher precision for accumulation. That's why they say "and obtains an FP32 result" in the definition of Tensor-cores. In general that idea is called mixed precision, and it can work OK if you are careful with your algorithms.
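A quick way to see accumulation error is to add the same small FP16 number many times, once with a running sum kept in FP16 and once with a running sum kept in FP32. This is only a sketch using Python's standard `struct` module to do the rounding; the term 0.001 and the count of 10000 are arbitrary values chosen for illustration.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

term = to_fp16(0.001)  # every term is an FP16 value
n = 10000              # the exact sum is n * term, a little over 10

sum_fp16 = 0.0
sum_fp32 = 0.0
for _ in range(n):
    sum_fp16 = to_fp16(sum_fp16 + term)  # accumulator rounded to FP16 after each add
    sum_fp32 = to_fp32(sum_fp32 + term)  # accumulator rounded to FP32 after each add

exact = n * term
print("exact sum       :", exact)
print("FP16 accumulator:", sum_fp16)
print("FP32 accumulator:", sum_fp32)
```

The FP16 accumulator stalls well short of the exact sum: once the running total is large enough that one term is less than half the gap between neighboring FP16 values, each add rounds straight back to the old total. The FP32 accumulator has enough digits that the same FP16 terms add up correctly.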

If you are an engineer computing the stresses on components for a bridge you would never even consider FP16. But what if you are trying to "learn" a million parameters for some Deep Learning neural network model ... maybe that's OK.

Why FP16 and Tensor-cores are likely OK for Deep Learning

Notice that I said likely. I haven't done much testing myself and I have been doing numerical computing for long enough to see that there could be problems from using such low precision. Computer "Learning" is really just numerical optimization. That is, adjusting some set of parameters so that they will minimize a "loss function" that is a measure of how close predicted values from your model are to known true values (that's supervised learning). Optimization problems can be unstable, but here are a few reasons I think it is probably OK to use FP16 mixed precision for training Deep Neural Networks (DNNs):

Low parameter sensitivity

A DNN with many layers may have millions of parameters (weights) that need to be optimized. The models are so large and complex that small changes to optimization parameters are unlikely to have much impact on the overall predictive capability of the model. For example, what difference do you think it would make for parameter 1,240,037 to have a value of 1.238 vs 1.241? Probably not much! I have not studied sensitivity of optimization parameters for DNNs but I feel that, given the huge number of parameters, FP16 may be adequate to represent significance in the model.

A problem related to parameter sensitivity that could be exacerbated by FP16 is that of "Vanishing" or "Exploding" gradients. The gradient is a vector of partial derivatives that is used to calculate a direction of descent to minimize a loss function. It's the key ingredient for "learning". Loss of precision in the gradient can lead to underflow or overflow of the values and cause numerical instability and cause the learning to fail. However, this can be addressed by dynamic "clipping" or re-scaling of the gradient to maintain precision. This is not unreasonable to do and it is good practice in any case.
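As a rough illustration of the re-scaling idea, the sketch below stores a very small gradient value in FP16 directly and then again after multiplying it by a scale factor. It uses only Python's standard `struct` module; the gradient value 1e-8 and the fixed scale of 1024 are hypothetical numbers picked to show the effect, not values taken from any framework.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

grad = 1e-8     # hypothetical tiny gradient value, below FP16's smallest nonzero number
scale = 1024.0  # hypothetical fixed scale factor (a power of two, so scaling itself is exact)

naive = to_fp16(grad)           # stored directly in FP16
scaled = to_fp16(grad * scale)  # scaled up before the FP16 store
recovered = scaled / scale      # divided back out in higher precision

print("direct FP16 store:", naive)
print("scaled FP16 store:", scaled)
print("recovered value  :", recovered)
```

Stored directly, the gradient underflows to zero and its contribution to learning is lost; scaled first, it survives the FP16 store and comes back close to its original value. In practice the scale is usually applied to the loss so that every gradient inherits it, and a "dynamic" scheme adjusts the factor during training to stay clear of overflow at the other end of the range.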

Avoidance of over-fitting

One of the problems when training neural networks is over-fitting of parameters. That happens when the optimization finds (learns) parameters that fit the training data almost exactly but then doesn't generalize well to test samples. This is a big problem because of the large number of parameters. You see methods employed like regularization, dropout, early stopping, etc. These methods are used to avoid over-fitting and give a "smoother fit" in the model. In a general sense these methods are trying to constrain the optimization from getting "too good". It may be that using lower precision when computing parameter updates actually helps the overall fit of the models. (??)

Those are my intuitive feelings for why FP16 is probably OK for DNNs. It may be perfectly fine for "inference" too. That's when you have developed a model and are using it for predictions. You often need to reduce the size and complexity of your trained models so they can be deployed for real-time use, possibly on simple hardware. Precision reduction may be useful for this too.

I have used the words "likely", "probably" and "may" a lot. That's because I'm an inherent skeptic. However, I think it is definitely worthwhile to make an effort to explore FP16 and mixed precision with Tensor-cores if you have access to Volta architecture GPUs. I have some results from some testing that may encourage you to try this yourself.

Tensor-core performance results for a CNN benchmark

The system I'm testing on may become the new Puget Systems primary Machine Learning workstation. I won't commit to that until I have more multi-GPU testing done, which should happen in the next few weeks. The system is based on a single-socket Intel Xeon-W CPU using Registered ECC memory and having 4 x16 slots for GPU acceleration. I should be testing the system with 4 Titan V GPUs over the next couple of weeks.

The results in the table below are from running various convolutional neural network models implemented with TensorFlow and using synthetic training data. The code I'm running is from the TensorFlow docker image on NVIDIA NGC. The application is cnn in the nvidia-examples directory.

See the following post for details of how to set up your workstation to use NGC.

My command line to start the container (after doing docker login nvcr.io to access the NGC docker registry) is:

docker run --runtime=nvidia --rm -it -v $HOME/projects:/projects nvcr.io/nvidia/tensorflow:18.03-py2

The following table shows why you might want to look at Tensor-cores!

CNN benchmarks with 1080Ti, Titan V, Titan V + Tensor-cores -- Training performance (Images/second)

CNN Model    1080Ti    Titan V    Titan V + Tensor-cores

This is obviously not a comprehensive evaluation of Tensor-cores! However, it does motivate further investigation. I am curious to do more testing. Expect to see more on Titan V in the next few weeks. I hope to be testing 4 cards for scaling soon and I certainly will include job runs that utilize Tensor-cores.

Happy computing! --dbk

Tags: Titan V, Tensor cores, NVIDIA, Deep Learning, Machine Learning

Can you tell me what motherboard you are using that takes a single Intel Xeon-W CPU and allows me to use 4 x16 PCIe GPU cards?
Very interested as I want to do some near real-time Monte Carlo simulations using 4 gpu cards and I know of only 2 motherboards that are both dual processor intel SP (Tyan s7106 and Asus 621 SAGE).
Keep up the good work

Posted on 2018-05-05 14:27:46
Donald Kinghorn

Thanks! That was the Gigabyte MW51-HP0. I have been qualifying that board for use in our systems (and had one in my personal system for several months). It looked good at first and then the Spectre/Meltdown issues hit and we had lots of problems with it. The testing in this post was using an early release BIOS. We did get a good BIOS update from them recently and I have been using it in my latest tests. (This one in particular might be of interest: https://www.pugetsystems.co... )

There will be some other options in a couple of months but I'm OK with this Gigabyte board now. I still have a bit more testing to do looking for issues (Ubuntu 14.04 and 18.04). But at this point it looks like a good workstation board. Right at the moment it is in very short supply but at the end of May there should be another run of the board delivered. We will likely have it offered in a workstation build.

Posted on 2018-05-08 15:52:12

The Gigabyte MW51-HP0 does look like a good workstation board for AI development. Using Nvidia GPUs for Monte Carlo simulation development is nothing short of amazing!!! Now that I have successfully developed a solution, I am moving towards a prototype production environment scalable to multiple GPU cards. My current hardware environment is an Asus X399 motherboard with 64GB overclocked to 3.9GHz. An excellent workstation, but I really like the idea of using a cheap server board as a GPU server and keeping my X399 as my development/visualization workstation. Since I will be using at least up to 4 Titan Xp GPU cards in TCC mode for my GPU server, I do not initially need more than 64GB of memory as I am able to fully use almost all of the 12GB of video memory for my simulations. The key part of your system I am interested in is the x16 PCIe transfer rate, as I am definitely passing gigabytes worth of 32-bit floating point values back to my host program in less than a second.

The Titan V looks very impressive but from a cost performance point of view for my Monte Carlo simulations in 32-bit mode, it is cost prohibitive. However, I do like the idea of 16-bit processing on the Titan V, which should theoretically at least double my 32-bit performance, but since I am calculating option prices from my Monte Carlo simulations, the decreased precision of 16-bit vs 32-bit fp may not be a good idea. I am really hoping that Nvidia will release a Titan Xv by the end of the year that is the same price as the current Titan Xp without the Tensor cores but includes full speed 16-bit processing and at least 16GB of memory. I am also certain that they will release a faster Titan V with at least 16GB as well. I currently have a 1080Ti and a Titan Xp and I definitely will be using the Titan series for production as I can run them in TCC mode and use all of the GPU memory for my CUDA application.

I definitely would be interested, if you could do it, in a simple CUDA C/C++ application that directly compares 32-bit floating point processing with 16-bit on the Titan V using single/multiple GPU cards...

Posted on 2018-05-10 09:58:15
Donald Kinghorn

like this one? :-) https://www.pugetsystems.co...
I have scaling comparison with fp16 and fp32 in there with 4 Titan V's.

TensorFlow is a C++ application but it is definitely not simple :-)

You can hit instability with fp16. That's not a lot of good digits! In the testing I've done using the convolution models with TensorFlow the big models like Inception4 produced NAN's from the loss function. I think it is always worth trying (if you have it available) but you have to be careful with algorithms and things like gradient scaling etc...

You mentioned PCIe bandwidth. I was pretty happy with the performance on that board. I never know what to expect from a PLX switch but so far most of the systems I've tested have worked pretty well. The effect of X16 vs X8 is a question that comes up a lot! It's hard to test because it depends heavily on how the code being run was designed and the bus communication requirements of the actual job being run. Things like card-to-card communication and data transfer, card to CPU space data and com ... I'm going to do some testing on this system with the 4 Titan V's and block half the PCIe lanes and see what happens. I want to use the CNN benchmarks I've been running but with a real data set (not the synthetic data I've been using). I'll probably have that posted within the next couple of weeks ... should be interesting!

Posted on 2018-05-10 19:44:02

YES!!! Excellent...looks like approx 80% speedup going from 32bit to 16bit fp.
Basically one Titan V using 16-bit fp is almost twice as fast as 2 Titan Xp's in 32-bit fp!!! Quite good, but I will wait until the Titan Xv comes out later in the year or maybe when the faster Titan V comes out later this year (...just speculating!!!) I think I need to do some cloud CUDA development to test 16-bit fp vs 32-bit fp for the custom app that I have developed.

The PCIe bandwidth definitely needs to be explored if possible as my simulations on average take close to 2 seconds to run per card and therefore I would like to know how much of that time is taken up by transferring data to the host from the GPU card...transferring up to 8 gigabytes of fp32 results in near real time is the goal.

I am looking at various motherboards capable of running 4 x 16 lanes PCIe GPU cards including Intel Xeon E4 v4 series as well as their newer Scalable Processors. I do all my accumulation on the host cpu which in this case could be the one I use for my workstation via a network if I use a GPU server which would allow me to use low core cpu's. Otherwise I may need high core count cpu's on the GPU server to do my accumulation functions for the Monte Carlo simulations. Therefore VERY important which mother board I use that still allows high bandwidth regardless of whether the PCIe slots are 16 lanes or 8 lanes.

Lots of decisions and I hope you can help me out!!!
Thanks again...

Posted on 2018-05-10 20:09:03
Donald Kinghorn

... I'll add a couple more things ... Yes, the fp16 (Tensor-core) results are impressive. The thing I don't know is how good they are. I'm still a bit skeptical about how much tolerance there is for numerical error when optimization code is running. Testing your code in the cloud is a good thing. It won't cost too much to see if it will work OK or not. [ it doesn't take long in the cloud to ring up the cost of your own new hardware ... but for testing it's great]

I'm hopeful that by the end of the year we will see PCIe v4. Hoping that will be on the Intel Ice Lake chipsets ... that may mostly eliminate bandwidth and latency problems with multi-GPU ??? not sure, ... but it could be a big step forward. I was disappointed that they didn't use it with the now-current Xeons ... they could have but didn't

Posted on 2018-05-11 18:23:04

For my production system, the minimum number of Titan V cards I need in fp32 mode for near real time simulations under 5 seconds is 8.
I am only willing to use server motherboards with 4 x 16 PCIe slots with NO plx chips. PCIe Gen 4 would be great but I don't think any affordable gen 4 GPU cards will come out until late 2019/2020 as most GPU apps are not taxed to transfer multi gigabytes of data between the GPU and host.

For absolute real time, I need to do my current simulations, which take approx 2 seconds, in under 1/60 of a second. This will not realistically happen for quite a while as that means I need a GPU card that is at least an order of magnitude faster than Titan V!!! I have made my CUDA algorithm as fast as possible for now but I may need to look at ways to make it even faster, but I don't think that will be easy unless I take more advantage of the underlying hardware architecture of CUDA.

I am hopeful that a single CPU AMD EPYC motherboard will come out with 4 x16 PCIe slots very soon as I would definitely jump on them. The nearest one that I know of is a Supermicro board with 3 x16 PCIe slots with NO PLX chips. Still do not understand why there are currently no AMD EPYC motherboards that support 4 x16 PCIe slots natively considering the CPU supports 128 PCIe lanes directly!!

I am really enjoying your info on deep learning and AI using TensorFlow on an Nvidia GPU, as I will definitely be moving to add more intelligence once my main Monte Carlo app is complete. I do want to move to automated trading using AI/machine learning as soon as possible...

Posted on 2018-05-11 20:09:04