NVIDIA announced availability of the Titan V card on Friday, December 8th. We had a couple in hand for testing on Monday, December 11th, nice! I ran through many of the machine learning and simulation testing problems that I have done on Titan cards in the past. Results are not the near doubling in performance of past generations… but read on.
NVIDIA has set high expectations with its incredible performance leaps in past generation updates to the Titan. My initial tests with the Titan V are not showing the large performance gains of the past. I am optimistic that I’ll be able to better exploit the unique features of the card after I spend more time with it. There is no doubt that this is the best desktop “video card” available, and it has already shown surprisingly good results on the desktop in areas where it wasn’t really expected to excel. I feel that it is a bargain for $3000 since it is basically an actively cooled, desktop system compatible video card that has most of the same features as the Tesla V100. However, it is no secret that the GeForce 1080Ti and the Titan Xp cards offer fantastic (single precision) compute performance at a significantly lower cost.
I’ll start with a description of the test setup followed by testing results and comments about the performance.
Testing the Titan V against Titan Xp
I’m not going to go over all of the spec details and press release info. There is plenty of that scattered all over the web. I’m just going to run some compute jobs on the Titan V and report the results.
Test set-up (Puget Systems Peak Mini)
NVIDIA Titan V and Titan Xp (mostly run with one card in the system at a time)
Intel Skylake-X 7960X 16-core CPU.
EVGA X299 microATX motherboard
64GB DDR4 2666MHz memory
CUDA 9.1 installed on the system
NVIDIA GPU Cloud (NGC) docker registry
I did most of my testing using NVIDIA built docker images. The NGC docker registry is now available for local desktop use. You will need to be a registered NVIDIA developer to receive an API key to access the NGC docker registry. I highly recommend it!
Note: I will be doing an updated series of posts in early 2018 on how to install, configure, and use docker and NVIDIA docker on your desktop system. That will be a refresh of my earlier series on this topic. It will use version 2 of NVIDIA docker and will discuss use of the NGC docker registry.
NVIDIA CUDA nbody
NVIDIA DIGITS with Caffe
CUDA nbody and NAMD Molecular Dynamics Simulation.
The first thing I do when I get a new NVIDIA card is to set up a CUDA development environment and compile the included samples. Then I run nbody as a benchmark. This is a classical physics, many-body force (gravity) calculation. The second test job I like to run is a molecular dynamics calculation on the million atom “satellite tobacco mosaic virus” (a.k.a. “stmv”) using NAMD. These are both challenging mathematical compute applications and have excellent GPU acceleration.
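To make the nbody workload concrete, here is a minimal pure-Python sketch of the kind of all-pairs gravity calculation it performs, plus the arithmetic that turns a timing into a GFLOP/s figure. This is not the CUDA sample's code. The function names, the softening value, and the figure of 20 floating point operations per interaction (which I believe is the convention the sample uses for single precision) are all assumptions for illustration.

```python
import math

# Assumption: 20 floating point operations per body-body interaction, the
# convention I believe the CUDA nbody sample uses for single precision.
FLOPS_PER_INTERACTION = 20


def accelerations(positions, masses, softening=1e-3):
    """Direct all-pairs gravitational acceleration with G = 1 (O(n^2) work)."""
    eps2 = softening * softening
    result = []
    for xi, yi, zi in positions:
        ax = ay = az = 0.0
        for (xj, yj, zj), mj in zip(positions, masses):
            dx, dy, dz = xj - xi, yj - yi, zj - zi
            # The softening term keeps the self-interaction (dx = dy = dz = 0)
            # finite; that pair then contributes exactly zero acceleration.
            dist2 = dx * dx + dy * dy + dz * dz + eps2
            scale = mj / (dist2 * math.sqrt(dist2))
            ax += dx * scale
            ay += dy * scale
            az += dz * scale
        result.append((ax, ay, az))
    return result


def nbody_gflops(num_bodies, iterations, seconds):
    """GFLOP/s estimate: every body interacts with every body each iteration."""
    interactions = num_bodies * num_bodies * iterations
    return interactions * FLOPS_PER_INTERACTION / seconds / 1e9


if __name__ == "__main__":
    # Two unit masses one unit apart pull on each other with |a| close to 1.
    print(accelerations([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [1.0, 1.0]))
    # Hypothetical timing: 10 iterations of 81920 bodies in 1.0 second.
    print(nbody_gflops(81920, 10, 1.0))
```

The work grows with the square of the body count, which is why the body count chosen for a run can change how well the GPU is kept busy.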
CUDA nbody and NAMD on Titan V and Titan Xp
|Card||nbody single precision GFLOP/s||NAMD run time (sec)||NAMD day/ns|
|Titan V (Volta)||7159 (9107)*||24.6||0.480|
|Titan Xp||8409 (7600)*||25.8||0.525|
* The results in parentheses were obtained when I did not specify “-numbodies”! That was 81920 bodies for the Titan V case. That’s significantly better for the Titan V and I’m not sure why.
For NAMD the number that is most meaningful is day/ns; smaller is better. (That’s how much of a day it takes to simulate 1 nanosecond.)
For nbody I did runs using CUDA 9.1 installed directly on the machine I was using and compared them with job runs using the NVIDIA CUDA 9.0 dev docker image from the NGC repository. Results were essentially the same for both.
NAMD was run using the NVIDIA docker image in the NGC repository.
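Since day/ns is a "smaller is better" number, it can help to flip it into ns/day and a speedup factor. The following is a small sketch using the day/ns values from the table above; the function names are my own, chosen only for illustration.

```python
def ns_per_day(days_per_ns):
    """Invert NAMD's day/ns figure into simulated nanoseconds per day."""
    return 1.0 / days_per_ns


def speedup_lower_is_better(baseline, candidate):
    """Speedup factor for metrics where a smaller value is better."""
    return baseline / candidate


if __name__ == "__main__":
    titan_v = 0.480   # day/ns, from the table above
    titan_xp = 0.525  # day/ns, from the table above
    print(f"Titan V : {ns_per_day(titan_v):.2f} ns/day")
    print(f"Titan Xp: {ns_per_day(titan_xp):.2f} ns/day")
    print(f"Speedup : {speedup_lower_is_better(titan_xp, titan_v):.3f}x")
```

With these numbers the Titan V comes out a little over 9% faster than the Titan Xp on this job.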
Old Results from my “NVIDIA Titan GPUs (3 generations)” post
|Card||nbody single precision GFLOP/s||NAMD run time (sec)||NAMD day/ns|
|Titan X (Pascal)||7507||41||0.570|
|TITAN X (Maxwell)||4292||55||0.889|
I included this table of results from a post I did a bit over a year ago. That was with CUDA 8.0rc and an older version of NAMD. The table illustrates the remarkable performance gains of past generations of Titan. I’m not completely sure what to think of the nbody result for the Titan V. nbody is not necessarily a “good” benchmark but it served well to show relative performance in the past. The Titan V results for NAMD are very good but not dramatically so. NAMD can be CPU bound; however, I did check results with fewer CPU cores and got nearly the same results.
The Titan V did well on these tests but it was not the astoundingly better performance that we have seen in the past.
nbody Double Precision (fp64)
One of the very strong features of the Titan V is its terrific double precision floating point compute performance. This is something not typically exploited on GPUs because in the past single precision on GPUs has been much faster. GeForce cards have nearly all had “crippled” fp64 performance (with the exception of the original Titan). The Titan V has the full double precision (fp64) performance of the Tesla V100. Volta has the highest ratio of double (fp64) to single (fp32) performance of any architecture NVIDIA has produced. The ratio is 1:2, which means fp64 runs at half the performance of fp32. That is really good! Here’s what happens when you run the nbody simulation with fp64 and compare with the Titan Xp.
Double precision (fp64) nbody results with Titan V and Titan Xp
|Card||nbody double precision (fp64) GFLOP/s|
|Titan V (Volta)||4456|
That’s about 1280% of the Titan Xp’s performance, i.e. almost 13 times faster!
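As a rough sanity check on the 1:2 ratio, the fp64 nbody number can be compared with the two fp32 nbody numbers reported earlier for the Titan V. This is only a sketch: it assumes the sample's GFLOP/s figures are directly comparable between precisions, which may not hold if the sample counts operations differently for fp64.

```python
def fp64_to_fp32_ratio(fp64_gflops, fp32_gflops):
    """Observed double precision throughput as a fraction of single precision."""
    return fp64_gflops / fp32_gflops


if __name__ == "__main__":
    titan_v_fp64 = 4456.0          # GFLOP/s, fp64 nbody result above
    titan_v_fp32_specified = 7159  # GFLOP/s, fp32 with "-numbodies" specified
    titan_v_fp32_default = 9107    # GFLOP/s, fp32 without "-numbodies"

    for label, fp32 in (("specified", titan_v_fp32_specified),
                        ("default", titan_v_fp32_default)):
        ratio = fp64_to_fp32_ratio(titan_v_fp64, fp32)
        print(f"fp64/fp32 with {label} body count: {ratio:.2f}")
```

Against the default-body-count fp32 result the observed ratio lands close to one half, in line with the 1:2 figure; against the other fp32 result it is somewhat higher.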
Machine Learning Tests with Titan V and Titan Xp
The Titan V was launched at the 2017 NIPS conference (Conference and Workshop on Neural Information Processing Systems). NIPS is an important conference for the machine learning crowd. NVIDIA GPUs are half of the reason that we have seen such an explosion of interest and activity in machine learning and AI. The other half is that we now have mountains of data to work with. The heavy compute end of machine learning is largely driven by NVIDIA GPUs. (They play a very important role on the deployment, i.e. inference, end of things too!)
Since this is probably why you are reading this post, let’s get to some results.
These are preliminary results of some “standard” machine learning test job runs. They are not exploiting Tensor-cores and half-precision. Tensor-cores are one of the unique hardware features of the Volta architecture and have the largest potential for dramatic performance gains. However, they utilize half-precision and will “require” code tuning and possibly complete rethinking of algorithm implementation. I was not able to find anything that benefited from Tensor-cores “out-of-the-box”. I’ll let you know when I do!
Convnet benchmark with Tensorflow
Convnet is a convenient “convolution neural network” (CNN) benchmark set that can be run on many machine learning frameworks. I count 19 frameworks on the GitHub repo. You can get the source and run scripts (Python) from the convnet-benchmarks GitHub page. I ran forward and backward propagation steps for the GoogleNet V1 CNN using Tensorflow, with 100 steps and a batch size of 128.
(Note: I’m using NVIDIA docker V2, in case you are wondering about docker run --runtime=nvidia. You can use the old plugin syntax, nvidia-docker run, in both version 1 and 2. … I’ll write about setting up NVIDIA docker V2 soon.)
Convnet benchmark, GoogleNetV1 with Tensorflow on Titan V and Titan Xp
|Titan V (Volta)||0.164 sec/batch|
|Titan Xp||0.201 sec/batch|
Approx. 20% speedup with Titan V
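The sec/batch numbers can also be read as image throughput. Here is a small sketch of that conversion using the batch size of 128 from the run described above; the function name is just an illustrative choice.

```python
def images_per_second(batch_size, seconds_per_batch):
    """Convert a time-per-batch measurement into images processed per second."""
    return batch_size / seconds_per_batch


if __name__ == "__main__":
    batch_size = 128
    results = {"Titan V": 0.164, "Titan Xp": 0.201}  # sec/batch, from the table

    for card, sec_per_batch in results.items():
        rate = images_per_second(batch_size, sec_per_batch)
        print(f"{card}: {rate:.0f} images/sec")

    # For a time-per-batch metric the speedup is baseline time / new time.
    print(f"Speedup: {results['Titan Xp'] / results['Titan V']:.2f}x")
```

That works out to roughly 780 versus 640 images per second for the forward plus backward pass.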
Tensorflow LSTM Language Model Training
This is an LSTM language modeling training run using a very large word corpus. It is included in the “nvidia-examples” directory of the Tensorflow docker image in the NGC repository.
Tensorflow LSTM (Train) on 1 Billion Word Benchmark Dataset on Titan V and Titan Xp
|Card||Tensorflow LSTM (Train), 1 billion word dataset|
|Titan V (Volta)||8227 words per second|
|Titan Xp||7541 words per second|
Approx. 10% speedup with Titan V
DIGITS v6.0 with Caffe ImageNet Model Training
I have tested with NVIDIA DIGITS in the past, for example, in “NVIDIA DIGITS with Caffe – Performance on Pascal multi-GPU”. NVIDIA DIGITS has a nicely done browser based interface and the new version 6.0 now includes Tensorflow in addition to the Caffe and Torch frameworks. I used a training image set from the IMAGENET Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) and ran 1 epoch using GoogleNet with a batch size of 64.
DIGITS 6.0 with Caffe, GoogLeNet Model Training on 1.3 Million Image Dataset on Titan V and Titan Xp
|Card||Caffe GoogleNet CNN, ImageNet 1.3 million images, 1 epoch|
|Titan V (Volta)||31 min.|
|Titan Xp||35 min.|
Approx. 12% speedup with Titan V
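For a feel of what those epoch times mean per image, here is a quick sketch that turns minutes per epoch into training images per second. It assumes the image count is exactly 1.3 million, which is only the rounded figure quoted above, so treat the output as approximate.

```python
def epoch_images_per_second(num_images, epoch_minutes):
    """Average training throughput over one epoch."""
    return num_images / (epoch_minutes * 60.0)


if __name__ == "__main__":
    num_images = 1_300_000  # assumption: rounded dataset size from the heading
    epoch_minutes = {"Titan V": 31, "Titan Xp": 35}  # from the table above

    for card, minutes in epoch_minutes.items():
        rate = epoch_images_per_second(num_images, minutes)
        print(f"{card}: about {rate:.0f} images/sec")

    ratio = epoch_minutes["Titan Xp"] / epoch_minutes["Titan V"]
    print(f"Speedup: {ratio:.2f}x")
```

Because the epoch times are only reported to the minute, the speedup from this arithmetic carries a few percent of uncertainty.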
Conclusions and Recommendations
The Titan V is a great card and it has better compute performance than the Titan Xp. However, I didn’t see large performance improvements with the limited testing that I did. It’s not like the (phenomenal) performance gains we have seen in the past. I used programs that I would consider to be common for use with GPU acceleration, but they did not exploit the new hardware features of the Titan V.
The double precision performance of the Volta architecture is outstanding and it has been fully enabled on the Titan V. However, developers have long used single precision for GPU acceleration since it has offered the best performance and it was relatively easy to adapt algorithms to it.
The most intriguing new hardware feature of the Titan V is the Tensor-cores. These hardware units could possibly give an order of magnitude performance increase to algorithms that can exploit them. However, this requires the use of half-precision (fp16) and could be challenging for developers to exploit. I did briefly try to get some jobs running with Caffe2 and TensorRT that would utilize the Tensor-cores, but in the short time that I have been working with the card I was not able to get useful results. Support for Tensor-cores is available in NVIDIA’s cuDNN and cuBLAS libraries, so I expect to see more programs using this feature soon. I will continue to work on that and will certainly write about it in the future.
Personally I think the idea of Tensor-cores is brilliant. However, I’m not too excited about half-precision (fp16). That’s only 3 to 4 decimal digits of precision … “what could possibly go wrong”. I can see how you could get away with that in some cases but I’m still waiting to be convinced that it’s a “good thing”.
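To show what that limited precision looks like in practice, here is a small CPU-only sketch that rounds values to IEEE half precision using the “e” format of Python’s standard struct module. It has nothing to do with how Tensor-cores are programmed; the helper names are made up for illustration, and it simply rounds after every addition the way a pure fp16 accumulation would.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fp16_running_sum(increment, count):
    """Add `increment` `count` times, rounding to fp16 after every addition."""
    total = 0.0
    step = to_fp16(increment)
    for _ in range(count):
        total = to_fp16(total + step)
    return total


if __name__ == "__main__":
    # fp16 has an 11 bit significand, so not every integer above 2048 exists.
    print(to_fp16(2049.0))
    # Everyday decimal values are only held to a few digits.
    print(to_fp16(0.1))
    # A long accumulation stalls once the increment drops below half the
    # spacing between neighboring fp16 values; the exact answer here is 10.
    print(fp16_running_sum(0.001, 10000))
```

The running sum stops growing at about 4 instead of reaching 10, which is the kind of surprise that accumulating in a wider type is meant to avoid.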
Recommendation: I feel that the Titan V is a bargain at $3000. It has most of the performance and features of the Tesla V100 in a desktop workstation friendly design. For developers working on new CUDA code I would certainly recommend it. For developers on tighter budgets, and those mostly interested in using existing programs, the Titan Xp and 1080Ti offer very good performance for a more modest cost, especially the 1080Ti.
Secondary Recommendation: The docker images available in the NVIDIA NGC repository are very good. This is another example of NVIDIA’s excellent support of the ecosystem around GPU accelerated computing. Highly recommended! I will be writing about how to set up and utilize NVIDIA docker and this repository soon.
Happy Computing! –dbk