
Read this article at https://www.pugetsystems.com/guides/1345
Dr Donald Kinghorn (Scientific Computing Advisor)

RTX Titan TensorFlow performance with 1-2 GPUs (Comparison with GTX 1080Ti, RTX 2070, 2080, 2080Ti, and Titan V)

Written on January 30, 2019 by Dr Donald Kinghorn

The RTX Titan is NVIDIA's latest release of the venerable Titan line. Does it live up to expectations? How is the compute performance for machine learning? Is the peer-to-peer performance the same as with the other new RTX 20xx cards? I'll answer these questions and give some recommendations for using the new RTX Titan.

I was able to get some testing time with 2 of the new RTX Titan GPUs. There is good news and bad news.

The good news

The RTX Titan has good fp32 and fp16 compute performance. It is similar in characteristics to the RTX 2080Ti but it has twice the memory and better performance.

Having 24GB of memory opens some new possibilities:

  • Larger batch sizes for deep learning jobs. That can increase performance and improve convergence in some circumstances.
  • Input data with a very large number of features, for example bigger images.
  • Having more memory helps to avoid the dreaded OOM (out-of-memory) messages that can come up in a variety of situations.

The larger amount of memory on the RTX Titan is probably its best distinguishing feature for compute. Sometimes not-enough-memory is a "show-stopper". GPU memory is expensive, so I feel the RTX Titan is quite reasonably priced for a 24GB card. The similar (but better) Quadro RTX 6000 with 24GB of memory costs more than twice as much as the RTX Titan.
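
To make the batch-size point concrete, here is a minimal sketch (my own illustration, not part of the benchmark code used later in this post) that probes for the largest ResNet-50 batch size that will complete a training step on a card by catching TensorFlow's out-of-memory error. It uses the current tf.keras API with random data, and the candidate batch sizes are arbitrary.

import tensorflow as tf

def largest_fitting_batch(candidate_sizes=(64, 128, 192, 256, 384)):
    # Try progressively larger batch sizes on a ResNet-50 with random data and
    # return the biggest one that completes a training step without OOM.
    best = None
    for bs in candidate_sizes:
        try:
            model = tf.keras.applications.ResNet50(weights=None)
            model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
            x = tf.random.uniform((bs, 224, 224, 3))
            y = tf.random.uniform((bs,), maxval=1000, dtype=tf.int32)
            model.train_on_batch(x, y)
            best = bs
        except tf.errors.ResourceExhaustedError:
            break
        finally:
            tf.keras.backend.clear_session()
    return best

print("Largest batch size that fit:", largest_fitting_batch())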

The bad news

OK, now for the "bad" ... one thing that limits some configuration options, and a couple of other minor disappointments.

The worst thing about the RTX Titan is the cooling solution for the card.

At this time the RTX Titan is only available with a dual side-fan cooler. That design requires at least one empty slot next to the card to allow for proper cooling. That means you will be limited to two RTX Titans even on a motherboard with four X16 slots and good chassis cooling.

Our production engineers have worked up a good chassis and cooling setup that will work well for 2 RTX Titans. This can be configured with an NVLINK bridge using the wide spacing bridge. This configuration may not be listed on our usual configuration pages so please contact sales@pugetsystems.com or 1 888 784-3872 extension 1 for more information.

A minor disappointment with the RTX Titan is the lack of peer-to-peer over PCIe, i.e. you have to have an NVLINK bridge to get "P2P". This is not that important in my opinion since, in practice, it does not seem to have much impact on most multi-GPU workloads. Note that the Quadro RTX 6000 does have P2P over PCIe when not using the NVLINK bridge (the same as the Pascal-based cards). The other reason this is not a big concern is that you can't use more than 2 RTX Titans in a workstation system anyway (because of cooling). So, if you do use 2 cards you may as well get the NVLINK bridge too.
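
If you want to check what a particular system reports, one quick way (my own sketch, not part of the benchmarks in this post) is to print the interconnect topology matrix from nvidia-smi: NVLINK-connected GPU pairs show up as NV# entries while PCIe-only paths show up as PHB, PXB, NODE or SYS.

import subprocess

# Print the GPU interconnect topology matrix. NVLINK links appear as "NV#"
# entries; PCIe-only paths appear as PHB/PXB/NODE/SYS.
topo = subprocess.run(["nvidia-smi", "topo", "-m"],
                      capture_output=True, text=True, check=True)
print(topo.stdout)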

The last disappointment is really not an issue with the RTX Titan, it's just part of the design of Turing GPUs. I'm referring to the lack of good fp64 performance. The RTX GPUs are a lot like the older GTX NVIDIA cards where the double precision performance is only a small fraction of the single precision (fp32) performance. This is not important for the majority of applications since, typically, fp32 is used for GPU compute. It's only a disappointment to me because I love the Titan V for its great fp64 performance, which is important to me personally for my general scientific computing work. The Titan V is basically a desktop version of the high performance Tesla V100 compute accelerators. I'm looking forward to the next Titan card that is based on a Tesla class GPU.


Test system

Hardware

  • Puget Systems Peak Single (I used my personal system that is similar to this, but not quite as nice!)
  • Intel Xeon-W 2175 14-core
  • 128GB Memory
  • 1TB Samsung NVMe M.2
  • RTX Titan (2)

Software

I used the NVIDIA NGC docker images that I have used in several recent posts. There are newer versions of these container images and I will be using them in future posts. I have worked with these newer docker images and they do give better performance with the Turing RTX GPUs. However, the relative performance is similar to the older versions, and I want to include my older testing results for other GPUs in this post.

For details on how I have Docker/NVIDIA-Docker configured on my workstation, have a look at the following post along with the links it contains to the rest of that series of posts: How-To Setup NVIDIA Docker and NGC Registry on your Workstation - Part 5 Docker Performance and Resource Tuning


TensorFlow performance with 1-2 RTX Titan GPUs

I am including relevant results from all of my recent testing with the RTX GPUs. The two most recent posts are P2P peer-to-peer on NVIDIA RTX 2080Ti vs GTX 1080Ti GPUs and RTX 2080Ti with NVLINK - TensorFlow Performance (Includes Comparison with GTX 1080Ti, RTX 2070, 2080, 2080Ti and Titan V). Both of these posts may be of interest. After this post I will be updating my testing benchmarks and plan on doing a large multi-GPU comparison post.

The CNN code I am using is from an older NGC docker image with TensorFlow 1.4 linked with CUDA 9.0 and NCCL. The LSTM "Billion Word" benchmark I'm running uses a newer version (but not the newest) with TensorFlow 1.10 linked with CUDA 10.0. There are more recent versions of these docker images that use "horovod" on top of MPI for multi-GPU parallelism; I will be using those in future posts.
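
For reference, the Horovod pattern those newer images use looks roughly like the sketch below. This is my own minimal illustration of the horovod.tensorflow API with a toy TensorFlow 1.x model, not the actual NGC benchmark code, and it assumes you launch one process per GPU with something like mpirun -np 2 python this_script.py.

import numpy as np
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each MPI rank to a single GPU
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

x = tf.placeholder(tf.float32, [None, 32])
y = tf.placeholder(tf.float32, [None, 1])
pred = tf.layers.dense(x, 1)
loss = tf.losses.mean_squared_error(y, pred)

# Scale the learning rate by the number of workers and wrap the optimizer so
# gradients are averaged across GPUs (NCCL/MPI allreduce)
opt = hvd.DistributedOptimizer(tf.train.GradientDescentOptimizer(0.01 * hvd.size()))
train_op = opt.minimize(loss)

# Rank 0 broadcasts its initial weights so all workers start in sync
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
    for _ in range(100):
        xb = np.random.rand(64, 32).astype(np.float32)
        yb = np.random.rand(64, 1).astype(np.float32)
        sess.run(train_op, feed_dict={x: xb, y: yb})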

I'll give the command-line inputs for reference.

With the addition of the RTX Titan test results the tables and plots are getting bigger and now include GTX 1080 Ti, RTX 2070, 2080, 2080 Ti, Titan V and RTX Titan.

TensorFlow CNN: ResNet-50

Docker container image tensorflow:18.03-py2 from NGC,

docker run --runtime=nvidia --rm -it -v $HOME/projects:/projects nvcr.io/nvidia/tensorflow:18.03-py2

Example command line for job start,

NGC/tensorflow/nvidia-examples/cnn# python nvcnn.py --model=resnet50 --batch_size=64 --num_gpus=2 --fp16

Note, --fp16 means "use tensor-cores". I used batch sizes of 64 for most fp32 tests and 128 for fp16. On the RTX Titan I used 192 for fp32 and 384 for fp16, taking advantage of the 24GB of memory on the RTX Titan.
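
The --fp16 flag does the heavy math in half precision so the tensor-cores get used (typically with fp32 master copies of the weights). If you want to try the same idea in your own code, here is a minimal sketch using the Keras mixed-precision API from current TensorFlow releases (a newer mechanism than what nvcnn.py uses internally, but the same idea), with random data as a placeholder.

import tensorflow as tf

# Compute in fp16 on the tensor-cores while keeping fp32 master weights.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.applications.ResNet50(weights=None)
# compile() automatically wraps the optimizer with loss scaling under
# mixed_float16, which protects small fp16 gradients from underflow.
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")

x = tf.random.uniform((128, 224, 224, 3))
y = tf.random.uniform((128,), maxval=1000, dtype=tf.int32)
model.fit(x, y, batch_size=128, epochs=1)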

[ResNet-50] - GTX 1080Ti, RTX 2070, 2080, 2080Ti, Titan V and RTX Titan - using TensorFlow, Training performance (Images/second)

GPU                      FP32 (Images/sec)   FP16 Tensor-cores (Images/sec)
RTX 2070                 192                 280
GTX 1080 Ti              207                 N/A
RTX 2080                 207                 332
RTX 2080 Ti              280                 437
RTX Titan                294                 481
Titan V                  299                 547
2 x RTX 2080             364                 552
2 x GTX 1080 Ti          367                 N/A
2 x RTX 2080+NVLINK      373                 566
2 x RTX 2080 Ti          470                 750
2 x RTX 2080 Ti+NVLINK   500                 776
2 x RTX Titan            572                 941
2 x RTX Titan+NVLINK     577                 958

ResNet-50 with RTX GPUs


TensorFlow LSTM: Big-LSTM 1 Billion Word Dataset

Docker container image tensorflow:18.09-py3 from NGC,

docker run --runtime=nvidia --rm -it -v $HOME/projects:/projects nvcr.io/nvidia/tensorflow:18.09-py3

Example job command-line,

/NGC/tensorflow/nvidia-examples/big_lstm# python single_lm_train.py --mode=train --logdir=./logs --num_gpus=2 --datadir=./data/1-billion-word-language-modeling-benchmark-r13output/ --hpconfig run_profiler=False,max_time=90,num_steps=20,num_shards=8,num_layers=2,learning_rate=0.2,max_grad_norm=1,keep_prob=0.9,emb_size=1024,projected_size=1024,state_size=8192,num_sampled=8192,batch_size=256

"Big LSTM" - GTX 1080Ti, RTX 2070, RTX 2080, RTX 2080Ti, Titan V and RTX Titan - TensorFlow - Training performance (words/second)

GPU                      words/second
RTX 2070                 4740
RTX 2080                 5071
GTX 1080 Ti              6460
Titan V (Note 1)         7066
Titan V (Note 2)         8373
2 x RTX 2080             8882
RTX 2080 Ti              8945
RTX Titan                9095
2 x RTX 2080+NVLINK      9711
2 x GTX 1080 Ti          11462
2 x RTX 2080 Ti          15770
2 x RTX 2080 Ti+NVLINK   16977
2 x RTX Titan            17863
2 x RTX Titan+NVLINK     18118

LSTM with RTX GPUs

  • Note: With only 8GB of memory on the RTX 2070 and 2080 I had to drop the batch size down to 256 to keep from getting "out of memory" errors. Batch size 448 was used for the GTX 1080 Ti and RTX 2080 Ti. Batch size 640 was used for the RTX Titan.
  • Note 1: For whatever reason this result for the Titan V is worse than expected. This is TensorFlow 1.10 linked with CUDA 10 running NVIDIA's code for the LSTM model. The RTX 2080 Ti performance was very good!
  • Note 2: I re-ran the "big-LSTM" job on the Titan V using TensorFlow 1.4 linked with CUDA 9.0 and got results consistent with what I have seen in the past. I have no explanation for the slowdown with the newer version of "big-LSTM".

Should you get an RTX Titan for machine learning work?

It was pretty easy to say "yes" to the RTX 2080 Ti as a compute accelerator, especially since it is available with good blower-based coolers. Therein lies the problem with the RTX Titan: the cooling is not very good for a GPU at this performance level. It generates a lot of heat under load and dumps that heat right into your chassis. It takes extra care to design chassis cooling adequate to remove that heat and keep the GPU from throttling its clocks down under heavy load. (We have solved that cooling problem.)

The RTX Titan would be excellent for a single GPU setup. The 24GB of memory will allow for development work on problems that would be difficult or impossible without it. Using two RTX Titans in a system along with a wide-spaced NVLINK bridge is a good high performance configuration as long as the overall system is designed to provide sufficient cooling.

For a multi-GPU (more than 2 GPU) system that needs this level of capability and performance I would recommend the Quadro RTX 6000. That Quadro card has the same amount of memory, it has P2P over PCIe enabled, and it has a great cooling design. The only downside of the Quadro is the cost. (I have done testing with 2 Quadro RTX 6000s.)

Overall, all of the RTX GPUs are very good as compute devices. For machine learning workloads they are worthy updates to the (wonderful) "Pascal"-based GTX GPUs, with better performance and the addition of "tensor-cores". The RTX GPUs are also innovative! Outside of compute I am looking forward to seeing what developers do with the ray-tracing capabilities of these cards.

Happy computing! --dbk

Tags: RTX Titan, Machine Learning, NVIDIA
Lawrence Barras

Hey Dr. Don! Another great post. I've been running the RTX Titans in pairs, as well as the RTX 2080 Ti. Good point on the cooling, but I have been able to run them in pairs. The gaming NVLink bridges all come in 3- and 4-slot spacings because of the cooling issue. They do work OK with 3- and 4-slot NVLink bridges, with the 4-slot spacing being the best.

Most of the gaming boards are slotted with this in mind now, with "sli" having 3 or 4 slot spacing for 2x GPU's.

The bummer is most of the workstation class boards are 2-slot spacing for GPUs. That is limiting for shroud-cooled cards like these.

My Titans are running in a pair on a X99E WS Xeon board with a 4-slot gaming bridge, in a high airflow case. The cooling is fine there and I haven't seen an issue with cooling or performance. I just got a 2-slot Quadro RTX link and the Titans will work with it, presumably the 2080ti will too. This WS board is designed for 4-way GPU and it previously had 4x 1080ti in it. So I'm wasting slots, but they are productive with the extra memory and overall performance. Just not able to run as many processes simultaneously as I'd like.

My 2080ti are running on a desktop (ie, gaming) x299 board with 3-slot spacing. Cooling is adequate.

So, we're pretty much stuck at 2x RTX Titans or 2080ti no matter what. I have in the past added EVGA's hybrid cooling kits to 1080ti cards with good success. The Titan RTX is physically identical to the 2080ti, for which there are open-loop water blocks, so waterblocks exist for it as well.

But, without any 3x or 4x NVLink's, the question is "why"? Only if you need to run them in a WS class motherboard which can only run them in 2-slot spacing, or you wish to run 2x2 pairs in a 4-way chassis.

I don't know if it is possible to construct a 4-way NVLink bridge. I'd be tempted if I had the schematic and a source for the connectors and parts, as there are some unknown active components on even the no-LED quadro NVLink.

Posted on 2019-02-01 22:58:04
Donald Kinghorn

Hi Lawrence, I forgot to enable comment notifications (again :-) I'm glad your Titan's are rock'n

Posted on 2019-02-11 19:29:17
Jasper Vicenti

I mentioned this in a previous post but thought I'd draw your attention to the slow Titan V. I was able to see a 43% speedup in a compute-heavy CUDA test.

Regarding your note about Titan V performance on Ubuntu, you may want to look into updating your driver to 415.25 or greater and re-test. See here for more details: TITAN V max clock speed locked to 1,335 Mhz. You need to keep an eye on cooling to limit thermal throttling as well.

Running sudo nvidia-smi --cuda-clocks=OVERRIDE after updating to nvidia-415 did indeed improve performance; however, you also may need to override the maximum fan setting (60% by default), which requires setting the cool-bits configuration.

As root:

nvidia-xconfig -a --cool-bits=31 --allow-empty-initial-configuration
# now reboot

# set persistent mode
nvidia-smi -pm 1
# increase power limit from 250 watts to 300 watts
nvidia-smi -pl 300
# increase clocks when running CUDA
nvidia-smi --cuda-clocks=OVERRIDE
# start x to allow fan control
startx &
DISPLAY=:0 XAUTHORITY=/var/run/lightdm/root/:0 nvidia-settings -a "[gpu:0]/GPUFanControlState=1" -a "[fan:0]/GPUTargetFanSpeed=80" # or whatever you can tolerate

# to revert all changes:
DISPLAY=:0 XAUTHORITY=/var/run/lightdm/root/:0 nvidia-settings -a "[gpu:0]/GPUFanControlState=0"
nvidia-smi --cuda-clocks=RESTORE_DEFAULT
nvidia-smi -pl 250
nvidia-smi -pm 0

Posted on 2019-02-06 11:46:05
lemans24

That is quite the speedup for Titan V!!!
What kind of cuda test was it?? fp32 or fp64 or mix of both?

I have a Titan Xp that overclocks out of the box all day around 1820 Mhz.
The Titan V is around twice as expensive but I doubt it is more than 20% faster than my current Xp.
My mind tells me to suck it up and get quad Titan V but I think I may be better off getting 3 more Titan Xp's.
Not in a rush yet, but future proofing makes me think saving for the Titan V makes more sense, though the price/performance for fp32 CUDA compute loads is definitely in the wheelhouse with the Titan Xp...

Posted on 2019-02-06 22:54:22
Donald Kinghorn

Hey Jasper, Nice performance tweaks! That's a significant performance boost. It could be a little hard on the cards so I would add the "standard disclaimer" ... still that's pretty interesting!

I have to add a sad note. It looks like the Titan V is EOL. We have already dropped them and I don't know how long there will be any available. I think the Titan V was the best workstation class compute device ever created! (so far)

I'm hoping NVIDIA has some announcements about new Tesla GPU's at GTC ... and then does another Titan based on that. ... it might be a while before anything like that happens though.

Posted on 2019-02-11 19:39:15
lemans24

So both Titan Xp and V are EOL ??
Not good...looks like quad 2080ti cards are the ticket then.
Disappointed, as running the cards in TCC mode is way more efficient than just using the Windows drivers.
The good news is that the 2080ti in CUDA code seems to be way faster than the Titan Xp, so this may be a benefit as I can get a blower version and overclock it for maybe 15% cheaper than the list Titan Xp price in Canada.
I think Nvidia will probably wait until 7nm chip production is more stable before bringing out their next generation Tesla based architecture...

Posted on 2019-02-14 10:55:59
Donald Kinghorn

I'm afraid it's true about the Pascal cards and Titan V :-(

The RTX 2080Ti seems to be a really good card (in a blower version). There have been some failures in one batch but I think that has been sorted out now. It's going to be the workhorse like the 1080Ti was ... it is quite a bit faster. I still haven't tested in a 4-8 GPU setup but hope to do that near the end of the month.

I'm hoping to hear good news from NVIDIA at the GTC but I don't have any real info.

Posted on 2019-02-15 00:46:25
lemans24

I finally made up my mind to at least get a single 2080ti blower card as a replacement for my 1080ti and do some performance testing.
I looked at a performance review of 2080ti doing pure CUDA tests and one of them was for Monte Carlo simulations. This particular test was over twice as fast as on the 1080ti and I can't sit on the fence any longer.

If indeed it is twice as fast as the 1080ti for my Monte Carlo simulations then I may just get 2 of them and up my machine to a Threadripper 2990WX instead of buying another server with quad GPU cards.

Do you think you guys will be able to test an Intel 28 core w3175x based system in the near future??

Posted on 2019-02-20 11:50:03
Donald Kinghorn

Everything I've done on the 2080Ti has impressed me. I'll be curious to hear how it works with your code.

I did run a quick test with the 3175X but haven't written a post (yet, may or may not do that). I tweeted the results here
https://twitter.com/dbkingh...

Posted on 2019-02-20 19:36:52
lemans24

definitely do a post on the w3175x!!

Looks to be twice as fast as the 14-core Intel i9, which puts it way ahead of the Threadripper 2990WX, even though that's mainly from AVX-512 code.
I think a fantastic workstation would be a w3175x with dual Titan RTX cards and 192GB ECC memory which would be at least twice as expensive as a threadripper equivalent...

Posted on 2019-02-22 00:15:55
Donald Kinghorn

Yes, it would be over 3 times faster for stuff where the AVX-512 vector unit is cranking. It would make for a very interesting "workstation" platform but there are a couple of serious problems: 1) the board is huge and would likely require custom chassis work, 2) it's very hot so serious cooling is needed and that is fairly loud (no matter how you do it), 3) it's not actually supported by Intel and that's the biggest problem. There is no "real" supply, no warranty, no road-map ... it's just not really viable as a product but, yes!, it is tempting!

If it was more of a serious product it could be great. I just don't expect Intel to commit to it though. I think it was mostly marketing ... They are working on new platform hardware so it would have a short lifetime in any case ... unless it's a "test case"

Posted on 2019-02-22 18:33:02
Donald Kinghorn

I made a couple of edits in the post to better reflect my feelings about the RTX Titan. I didn't want it to look like the cooling was a showstopper. It's a great card but I am disappointed with the cooling. I do recommend it! I'm only sad that it's only usable with at most 2 cards in a workstation. Still, 2 of those with an NVLINK bridge in a good system design is pretty sweet. Our production folks have a system worked out that works well.

Posted on 2019-02-11 19:23:30
Ivan

Hello Sir Donald, I want to ask some questions regarding combining 2 RTX 2080 8GB Card
I'm using Z370G motherboard (miniATX) series --> Link: https://www.asus.com/Mother...
After putting 1 RTX 2080 8GB OC card in the top PCIe slot, it seems the space for the second card is really tight, and I am not sure whether the 2nd card will fit, even though the spec says the mobo supports 2-way SLI.
The second question is, even if the 2nd card fits, I can't find an NVLINK bridge for 2 tightly spaced RTX 2080 cards (2 slots) https://www.pugetsystems.co..., so is it possible to use another NVLINK bridge?

Posted on 2019-07-22 18:05:15
Donald Kinghorn

The 2 slot RTX Quadro NVLINK bridge should work

I need to ask you about the fans on your GPU's ... If you are using GPU's with blower style fans you should be OK like this, https://www.gigabyte.com/us... (The Gigabyte blower cards are great because they have an airflow ramp at the ends of the cards)

If you have the dual (or triple) side fan type then they won't work. They would overheat when placed next to each other. (Those cards just blow heat into the case, not out the rear of the card.)

Posted on 2019-07-22 22:14:34
Ivan

Thank you for the reply Sir.
My first card is a PALIT RTX 2080 8GB OC (dual fan) in the first PCIe 3.0 x16 (top) slot.
If I add a Gigabyte RTX 2080 blower style below it (2nd slot), is it OK? I mean, will there be a compatibility issue with 2 different brands, with the 1st card OC + 2nd card non-OC, and 1st card dual fan + 2nd card blower?

Posted on 2019-07-23 00:47:51
Donald Kinghorn

They should be OK but I would switch the order of the cards. Get something like that Gigabyte card with a good blower (the Gigabyte is actually my favorite). Use that one in the first slot, then put the one with the dual fans next to it. There will be more open space for air to get to those dual fans that way. Keep in mind that the dual fan card blows hot air into the case (the blower card blows the hot air out the back of the case). You want to be sure you have good air flow in your case. It might be a good idea to leave the side panel off of the case so you don't get heat build up.

The differences in the cards shouldn't really matter too much. Most of the multi-GPU scaling will be from data parallelism, i.e. you will have the same batch size of data going to each card with some gradient accumulation after that. Should work without trouble.
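
If it helps, here is a rough sketch of that data-parallel pattern using TensorFlow's tf.distribute API, just an illustration with random data, not the exact code you would run:

import tensorflow as tf

# Data parallelism: the model is replicated on each visible GPU, each replica
# gets a slice of the global batch, and gradients are averaged across cards.
strategy = tf.distribute.MirroredStrategy()
print("replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(224, 224, 3)),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")

# A global batch of 128 is split 64/64 across two cards.
x = tf.random.uniform((128, 224, 224, 3))
y = tf.random.uniform((128,), maxval=10, dtype=tf.int32)
model.fit(x, y, batch_size=128, epochs=1)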

Posted on 2019-07-23 02:10:30
Ivan

I see, thank you very much for the explanation Sir Donald

Posted on 2019-07-23 02:24:45