Read this article at https://www.pugetsystems.com/guides/1267
Dr Donald Kinghorn (Scientific Computing Advisor )

RTX 2080Ti with NVLINK - TensorFlow Performance (Includes Comparison with GTX 1080Ti, RTX 2070, 2080, 2080Ti and Titan V)

Written on October 26, 2018 by Dr Donald Kinghorn

This post is a continuation of the NVIDIA RTX GPU testing I've done with TensorFlow in "NVLINK on RTX 2080 TensorFlow and Peer-to-Peer Performance with Linux" and "NVIDIA RTX 2080 Ti vs 2080 vs 1080 Ti vs Titan V - TensorFlow Performance with CUDA 10.0". The same job runs done in those two previous posts are extended here with dual RTX 2080 Ti's. I was also able to add performance numbers for a single RTX 2070.

If you have read the earlier posts then you may want to just scroll down and check out the new result tables and plots.

Test system


  • Puget Systems Peak Single
  • Intel Xeon-W 2175 14-core
  • 128GB Memory
  • 1TB Samsung NVMe M.2
  • GPU's tested:
    • GTX 1080 Ti
    • RTX 2070
    • RTX 2080 (2)
    • RTX 2080 Ti (2)
    • Titan V


Two TensorFlow builds were used since the latest version of the TensorFlow docker image on NGC does not support multi-GPU for the CNN ResNet-50 training test job I like to use. For the "Big LSTM billion word" model training I use the latest container with TensorFlow 1.10 linked with CUDA 10.0. Both of the test programs are from "nvidia-examples" in the container instances.

For details on how I have Docker/NVIDIA-Docker configured on my workstation, have a look at the following post (along with the links it contains to the rest of that series of posts): "How-To Setup NVIDIA Docker and NGC Registry on your Workstation - Part 5 Docker Performance and Resource Tuning".

There are two links available:

GPU 0: GeForce RTX 2080 Ti (UUID:

  • Link 0, P2P is supported: true
  • Link 0, Access to system memory supported: true
  • Link 0, P2P atomics supported: true
  • Link 0, System memory atomics supported: true
  • Link 0, SLI is supported: true
  • Link 0, Link is supported: false
  • Link 1, P2P is supported: true
  • Link 1, Access to system memory supported: true
  • Link 1, P2P atomics supported: true
  • Link 1, System memory atomics supported: true
  • Link 1, SLI is supported: true
  • Link 1, Link is supported: false

Those two links get aggregated over the NVLINK bridge!

In summary, NVLINK with two RTX 2080 Ti GPU's provides the following features and performance:


  • Peer-to-Peer memory access: Yes
  • Unified Virtual Addressing (UVA): Yes


  • cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 44.87 GB/s

That is twice the unidirectional bandwidth of the RTX 2080.


The terminal output below shows that two RTX 2080 Ti GPU's with NVLINK provide:

  • Unidirectional Bandwidth: 48 GB/s

  • Bidirectional Bandwidth: 96 GB/s

  • Latency (Peer-To-Peer Disabled):

    • GPU-GPU: 12 microseconds
  • Latency (Peer-To-Peer Enabled):

    • GPU-GPU: 1.3 microseconds

Bidirectional bandwidth over NVLINK with two 2080 Ti GPU's is nearly 100 GB/sec!

P2P Connectivity Matrix
     D\D     0     1
     0	     1     1
     1	     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 528.83   5.78
     1   5.81 531.37
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1
     0 532.21  48.37
     1  48.38 532.37
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 535.76  11.31
     1  11.42 536.52
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 535.72  96.40
     1  96.40 534.63
P2P=Disabled Latency Matrix (us)
   GPU     0      1
     0   1.93  12.10
     1  12.92   1.91

   CPU     0      1
     0   3.77   8.49
     1   8.52   3.75
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1
     0   1.93   1.34
     1   1.34   1.92

   CPU     0      1
     0   3.79   3.08
     1   3.07   3.76

First, don't expect miracles from that 100GB/sec bidirectional bandwidth, ...

The convolution neural network (CNN) and LSTM problems I'll test will not expose much of the benefit of using NVLINK. This is because their multi-GPU algorithms achieve parallelism mostly by distributing data as independent batches of images or words across the two GPU's. There is little use of GPU-to-GPU communication. Algorithms with finer grained parallelism that need more direct data and instruction access across the GPU's would benefit more.
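The batch-splitting scheme described above can be sketched in a few lines of framework-free Python. The function names (`local_gradients`, `all_reduce_mean`) and the toy squared-error model are hypothetical, just to show the pattern: each "GPU" does its heavy work on its own shard of the batch, and only the small gradient vectors cross the GPU-to-GPU link.

```python
def local_gradients(weights, batch):
    # Stand-in for a per-GPU backward pass: gradient of a squared-error
    # loss sum((w*x - y)^2) with respect to each weight, averaged over
    # the samples in this device's shard.
    return [sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            for w in weights]

def all_reduce_mean(grad_lists):
    # The role NCCL plays over NVLINK/PCIe: average the per-GPU gradients.
    n = len(grad_lists)
    return [sum(g) / n for g in zip(*grad_lists)]

weights = [0.5, -1.0]
full_batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
# Split the batch across two "devices" -- the bulk of the compute is
# independent; only the averaged gradients need communication.
shards = [full_batch[:2], full_batch[2:]]
grads = all_reduce_mean([local_gradients(weights, s) for s in shards])
```

With equal-sized shards the averaged result matches the full-batch gradient exactly, which is why the inter-GPU traffic stays small relative to the compute.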

The TensorFlow jobs that I have run with 2 GPU's and NVLINK are giving around a 6-8% performance boost. That is right around the percentage cost increase of adding the NVLINK bridge. It looks like you get what you pay for, which is a good thing! I haven't tested anything yet where that (amazing) bandwidth will really help. You may have ideas where it would be a big help? I have a lot more testing to do.

I am using the same benchmarks that I used in the recent post "NVLINK on RTX 2080 TensorFlow and Peer-to-Peer Performance with Linux". The CNN code I am using is from an older NGC docker image with TensorFlow 1.4 linked with CUDA 9.0 and NCCL. I'm using this in order to have multi-GPU support utilizing the NCCL communication library for the CNN code; the most recent version of that code does not support this. The LSTM "Billion Word" benchmark I'm running uses the newer image with TensorFlow 1.10 linked with CUDA 10.0.

I'll give the command-line inputs for reference.

The tables and plots are getting bigger! I've been adding to the testing data over the last 3 posts. There is now a comparison of the GTX 1080 Ti, RTX 2070, 2080, 2080 Ti and Titan V.

TensorFlow CNN: ResNet-50

Docker container image tensorflow:18.03-py2 from NGC,

docker run --runtime=nvidia --rm -it -v $HOME/projects:/projects nvcr.io/nvidia/tensorflow:18.03-py2

Example command line for job start,

NGC/tensorflow/nvidia-examples/cnn# python nvcnn.py --model=resnet50 --batch_size=64 --num_gpus=2 --fp16

Note, --fp16 means "use tensor-cores".

ResNet-50 - GTX 1080Ti, RTX 2070, RTX 2080, RTX 2080Ti, Titan V - TensorFlow - Training performance (Images/second)

GPU                      FP32   FP16 (Tensor-cores)
RTX 2070                  192    280
GTX 1080 Ti               207    N/A
RTX 2080                  207    332
RTX 2080 Ti               280    437
Titan V                   299    547
2 x RTX 2080              364    552
2 x RTX 2080+NVLINK       373    566
2 x RTX 2080 Ti           470    750
2 x RTX 2080 Ti+NVLINK    500    776


[Plot: ResNet-50 training performance with RTX GPU's]
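A couple of derived numbers from the ResNet-50 table above are worth calling out: the two-card scaling factor and the increment attributable to the NVLINK bridge. A quick sketch, with the values copied from the RTX 2080 Ti rows:

```python
# Images/second from the ResNet-50 table (RTX 2080 Ti rows).
single  = {"fp32": 280, "fp16": 437}   # 1 x RTX 2080 Ti
dual    = {"fp32": 470, "fp16": 750}   # 2 x RTX 2080 Ti
dual_nv = {"fp32": 500, "fp16": 776}   # 2 x RTX 2080 Ti + NVLINK

for p in ("fp32", "fp16"):
    scaling = dual_nv[p] / single[p]            # ideal would be 2.0
    boost = (dual_nv[p] / dual[p] - 1) * 100    # NVLINK increment
    print(f"{p}: {scaling:.2f}x two-card scaling, NVLINK adds {boost:.1f}%")
```

So roughly 1.8x scaling from the second card, with NVLINK adding about 6.4% at fp32 and 3.5% at fp16 for this job.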


TensorFlow LSTM: Big-LSTM 1 Billion Word Dataset

Docker container image tensorflow:18.09-py3 from NGC,

docker run --runtime=nvidia --rm -it -v $HOME/projects:/projects nvcr.io/nvidia/tensorflow:18.09-py3

Example job command-line,

/NGC/tensorflow/nvidia-examples/big_lstm# python single_lm_train.py --mode=train --logdir=./logs --num_gpus=2 --datadir=./data/1-billion-word-language-modeling-benchmark-r13output/ --hpconfig run_profiler=False,max_time=90,num_steps=20,num_shards=8,num_layers=2,learning_rate=0.2,max_grad_norm=1,keep_prob=0.9,emb_size=1024,projected_size=1024,state_size=8192,num_sampled=8192,batch_size=256

"Big LSTM" - GTX 1080Ti, RTX 2070, RTX 2080, RTX 2080Ti, Titan V - TensorFlow - Training performance (words/second)

GPU                        Words/second
RTX 2070 (Note:1)                  4740
GTX 1080 Ti                        6460
RTX 2080 (Note:1)                  5071
RTX 2080 Ti                        8945
Titan V (Note:2)                   7066
Titan V (Note:3)                   8373
2 x RTX 2080                       8882
2 x RTX 2080+NVLINK                9711
2 x RTX 2080 Ti                   15770
2 x RTX 2080 Ti+NVLINK            16977



  • Note:1 With only 8GB memory on the RTX 2070 and 2080 I had to drop the batch size down to 256 to keep from getting "out of memory" errors. That typically has a big (downward) influence on performance.
  • Note:2 For whatever reason this result for the Titan V is worse than expected. This is TensorFlow 1.10 linked with CUDA 10 running NVIDIA's code for the LSTM model. The RTX 2080Ti performance was very good!
  • Note:3 I re-ran the "big-LSTM" job on the Titan V using TensorFlow 1.4 linked with CUDA 9.0 and got results consistent with what I have seen in the past. I have no explanation for the slowdown with the newer version of "big-LSTM".
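The same arithmetic applied to the "Big LSTM" table above shows the NVLINK increment running a bit higher for this workload (values copied from the RTX 2080 and 2080 Ti rows; keep Note:1 in mind for the single-GPU 2080 number):

```python
# Words/second from the "Big LSTM" table above.
results = {
    "RTX 2080":    {"one": 5071, "two": 8882,  "two_nv": 9711},
    "RTX 2080 Ti": {"one": 8945, "two": 15770, "two_nv": 16977},
}
for card, r in results.items():
    scaling = r["two_nv"] / r["one"]             # rough, see Note:1
    boost = (r["two_nv"] / r["two"] - 1) * 100   # NVLINK increment
    print(f"{card}: {scaling:.2f}x scaling, NVLINK adds {boost:.1f}%")
```

That works out to NVLINK adding about 9.3% for the 2080 pair and 7.7% for the 2080 Ti pair, consistent with the 6-8% figure mentioned earlier.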

Should you get an RTX 2080Ti (or two, or more) for machine learning work?

I've said it before ... I think that is an obvious yes! For ML/AI work using fp32 or fp16 (tensor-cores) precision the new NVIDIA RTX 2080 Ti looks really good. The RTX 2080 Ti may seem expensive but I believe you are getting what you pay for. Two RTX 2080 Ti's with the NVLINK bridge will cost less than a single Titan V and can give double (or more) the performance in some cases. The Titan V is still the best value when you need fp64 (double precision). I would not hesitate to recommend the 2080 Ti for machine learning work.

This post includes my first testing with the RTX 2070, so I'm not yet sure how good a value it is for ML/AI work. However, from the limited testing here it looks like it would be a better value than the RTX 2080 if you have a tight budget.

I'm sure I will do 4 GPU testing before too long and that should be very interesting.

Happy computing! --dbk

Tags: NVLINK, RTX 2080 Ti, TensorFlow, CUDA, NVIDIA, ML/AI, Machine Learning

Thank you, your write-ups are about the only place I've been able to get any useful information on ML using Linux, NVLINK and the RTX 2080 cards.

Posted on 2018-10-30 23:39:09

Thank you for doing this research. This is absolutely fantastic and fits the exact situation I'm in right now. I wanted to ask which of the NVIDIA bridges you had used for this setup. On the NVIDIA NVLink product pages they list bridges for the GP100 and GV100. The GV100 is listed at 200 GB/s (100 GB/s for a single bridge) and GP100 at 160 GB/s (80 GB/s for a single bridge). Should I be purchasing the GV100 bridge? Thanks in advance

Posted on 2018-10-31 14:14:29

Chiming in for Don, you need to use the GeForce RTX NVLink bridges that you can get from https://www.nvidia.com/en-u... . It is buried in that page, but search for "3 SLOT" and it will take you right to the purchase button.

You cannot use GV100 bridges - those are for the first generation NVLink and do not work on the RTX cards.

Posted on 2018-10-31 16:15:22

GP100 bridges won't work, but GV100 bridges actually might be okay on the GeForce RTX cards. We don't have any to test, but I've seen others online show them working. However, they are VERY expensive. Get the bridge Matt suggested if you need a 3- or 4-slot size, or wait for the upcoming Quadro RTX bridge if you need a 2-slot solution.

Posted on 2018-10-31 16:18:27

Oh interesting. The reason I had been looking into the GV100 bridge was that I was looking for a 2-slot solution. I had looked around and found no reference to NVIDIA making a 2-slot version. Was this upcoming Quadro RTX bridge listed somewhere?

Posted on 2018-10-31 17:06:21

They aren't listed for sale yet, but there's a "notify me" button on this page if you scroll down a bit: https://www.nvidia.com/en-u...

Posted on 2018-10-31 17:13:33
Maxi H

Hi, so I see the Quadro RTX bridges on online stores now, have you had a chance to test it (compute and gaming)? I am trying to find a 2-slot solution.

Posted on 2018-11-26 02:02:05

We have not gotten any of the new Quadro RTX bridges yet, and they still show a "notify me" button on NVIDIA's own product page. What online stores have you seen selling them already?

Posted on 2018-11-26 19:50:15
Maxi H

Yeah, they are all Japanese sites. https://nttxstore.jp/_II_EA...
Multiple shops are listing it, and they all claim the manufacturer is ELSA with the following product number
2 slot: P3396
3 slot: P3397
2 slot HB (for RTX 6000/8000): P3394
3 slot HB: P3395
I do not know the exact ETAs for these but I can ask.

Posted on 2018-11-29 02:47:25
Maxi H

Just got answers from one of the retailers and confirmed that it's not coming before the end of this year. :(
Why on Earth these were listed is a complete mystery.

Posted on 2018-11-29 08:25:57

We got the Quadro RTX NVLink bridges in last week, and they work just fine on the GeForce RTX cards. We still need to get a pair of Quadro RTX cards (we have one, but need a second) in order to test whether the GeForce branded bridges will work on them or not. Once we have the full set of data, we will likely publish a brief overview article charting the compatibility.

Posted on 2019-01-09 17:18:06
Maxi H

Great news!
I just got word from our distributor in Japan also. :)

Posted on 2019-01-16 06:44:17

Hi Donald,

This article is exactly what I have been looking for. Thanks a lot.

Posted on 2018-11-02 07:45:46
Mr Ocr

Donald Kinghorn, can you test Mixed-Precision training.

Posted on 2018-11-05 21:26:57
Donald Kinghorn

Using fp16 (tensor-cores) is basically using mixed precision. It's 16-bit multiplies and adds with 32-bit accumulators. This is from a README for the CNN testing code.

With the --fp16 flag the model is trained using 16-bit floating-point operations. This provides optimized performance on Volta's TensorCores. For more information on training with FP16 arithmetic see Training with Mixed Precision.
The link is to here https://docs.nvidia.com/dee...
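A quick pure-Python illustration of why the 32-bit accumulator matters. This only simulates the rounding behavior (using `struct`'s half- and single-precision round-trips), not actual tensor-core hardware: with an fp16 accumulator, repeatedly adding a small value stalls once the fp16 spacing around the running sum exceeds the increment.

```python
import struct

def round_fp16(x):
    # Round a Python float to the nearest representable fp16 value.
    return struct.unpack('e', struct.pack('e', x))[0]

def round_fp32(x):
    # Round a Python float to the nearest representable fp32 value.
    return struct.unpack('f', struct.pack('f', x))[0]

acc16 = acc32 = 0.0
for _ in range(10000):
    acc16 = round_fp16(acc16 + 0.01)   # fp16 accumulator
    acc32 = round_fp32(acc32 + 0.01)   # fp32 accumulator

print(acc16)  # stalls at 32.0 -- adding 0.01 no longer changes the value
print(acc32)  # ~100.0, close to the exact sum
```

Once the fp16 accumulator reaches 32.0, the gap between adjacent fp16 values (0.03125) is more than twice the 0.01 increment, so every further add rounds back to 32.0. The fp32 accumulator stays near the true sum of 100.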

Posted on 2018-11-05 23:23:46

Don, could you also add test result for Titan Xp??

I have been trying to find out why my Titan Xp seems to be about 20% faster than my 1080ti when running the same CUDA code.
Did not find a good explanation, but I installed MSI Afterburner and I see that the Titan Xp is running the boost clock at 1835MHz when the temp is under 75C!!!
Since I have a dual card setup, the 1080ti CUDA code runs in parallel with the Titan Xp code and it only boosts the clock to 1480mhz with the same temps.
Both of these cards are blower versions and I am running them nearly 8 hours a day. I will be testing them today to run nearly 24 hours and I will report back the results by the end of the week.

If this is the explanation then I will definitely be waiting for a Titan RTX or Titan V 32GB version as I want to run 4 cards in a well ventilated server case.
I will be eventually experimenting with tensorflow but I just got sidetracked with Infer.Net which allows you to do some probabilistic programming...you may want to look into this as it is a open source tool from Microsoft

Posted on 2018-11-05 23:49:26
Donald Kinghorn

I won't be testing GPU's again for a bit but you did give me a good idea for a post. If I spend a day at the office I could probably set up a machine and run the CNN and LSTM jobs I've been using on a big collection of cards.

I will probably be testing CPU's for a few weeks and I'm working on a new ML/AI project :-)

I do have an older post that has an interesting comparison https://www.pugetsystems.co...
That is using different testing code but still might be interesting.

You might like this too (I do!) ... we have a search bar at the top of the blog and article pages now. I'm using it quite a bit because I have so many posts I can't find things. It's at the top of the main HPC blog page https://www.pugetsystems.co... I searched there for Titan Xp to find that older post!

Posted on 2018-11-06 00:32:14

Ok...I will start using that search bar...but thanks for the reply.

The Titan Xp is definitely faster than the 1080ti, and all the benchmarks for the RTX 2080ti do not include the Titan Xp. I think this card, from a price/performance point of view when running CUDA, is probably a better deal than the Titan V or 1080ti and maybe the 2080ti. The older post confirms my thinking that staying with a Titan Xp until I can get another Titan card that is 50% faster seems to be the way to go for me, especially running a quad server under Windows Server 16, which would also allow me to run the GPU drivers in TCC mode.

Keep up the great articles and I hope you continue running examples with source code too...

Posted on 2018-11-06 04:06:10
Donald Kinghorn

... I love the search bar, I could never find stuff in my old posts :-)
You are right about the Titan Xp that is a really good card and it does definitely outperform the 1080Ti, it always seemed to get a little better memory utilization too ... I just happen to have a 1080Ti in my personal system so results from that end up in a lot of my posts :-) We have one Titan Xp in Puget-Labs so it is often in use for various testing.

I'm wondering when we will see the next Titan. A possibility for an announcement is at NIPS in December. I hope it's not pushed out until GTC in March. The Quadro's are out and shipping now so I'm hopeful for December

Posted on 2018-11-06 23:37:26

Thanks Don

I will continue to overclock the Titan Xp to 75c and see how it performs.
I have really optimized my monte carlo CUDA app and it is running near real time now. It takes about 4 seconds to run 4 batches with each batch containing 2048 simulation runs. The real fun part is that EACH simulation run does 1 million monte carlo simulations...yep 16 Billion simulations within 4 seconds. I don't exactly know how many floating point instructions I am executing per simulation but I must be getting close to a Teraflop of instructions within 4 seconds and this is running in multi gpu mode with the Titan Xp and the 1080ti!!

When I first ran these monte carlo simulations on my 16 core threadripper, it used to take nearly an hour, which I thought was crazy slow!!
I don't know how you can run these AI training jobs in days!! Lol

Anyway, yes I really hope Nvidia soon releases a Titan RTX and/or an updated Titan V with more than 12GB memory running at full speed.
Amd has their new Radeon Instrinct AI gpu which looks to be similar speeds as the V100 but I really hope they undercut the price and Nvidia will starts to drop their prices...one can only hope!!!

Posted on 2018-11-08 00:08:59
Lawrence Barras

Excellent writeup! I finally got around to installing a pair of 2080TI FE cards. They appear to run at full GPU boost with the latest nvidia Linux drivers. My results mirror yours, but I've been testing on a bit older PC and I think it is PCIe starved a bit. (2 cards run in x8 each on this one, not enough lanes on i7-6700k). The RTX2080ti-FE are honestly outrunning my TitanV on many tests with the same NGC tensorflow image. (18.10). For whatever reason, the Titan-V is still limited to 1335mhz gpu clock when running CUDA apps. The RTX 2080ti seems in all ways happy to run up to full boost until it thermal limits. I think your Xeon-W is able to do a better job keeping the cards fed with data.

If anyone is curious, the RTX NVlink is mechanically reversed from the GV-NVink on the Volta-series cards (Tesla V and Quadro GV). They will not interchange. The card edges appear the same, but the big and small tabs are reversed. I opened the RTX Link, no active electronics other than the LED circuits. The HB SLI bridge from the 1080ti is also completely different.

Posted on 2018-11-07 07:41:35

They seem to have just put the connector edge on "backwards" on the GeForce RTX cards, compared to the older Quadro GP100 and GV100. You can flip the bridges around and it will physically plug in just fine. In terms of functionality, we tested Quadro GP100 era bridges and they do *not* work on the GeForce RTX cards. We don't have any GV100 bridges here, but someone online showed them being used on GeForce RTX 2080 Ti cards and appearing to function. We also did test that the newer GeForce RTX bridge works on a pair of Quadro GP100 cards, though because of the larger physical size of the plastic around the bridge we could only fit one (instead of the pair that card is designed to have).

You can see more info, along with pictures, on another of our articles: https://www.pugetsystems.co...

Posted on 2018-11-07 19:33:54

'...For whatever reason, the Titan-V is still limited to 1335mhz gpu clock when running CUDA apps'

Curious...are you running custom CUDA apps, c/cc++ or third party CUDA apps ??
Did you try something like MSI Afterburner or something similar if you are running under Linux??

I have been running my custom monte carlo simulation CUDA code on a Titan Xp and 1080ti using Windows 10 and the Titan Xp is way better for serious application deployment. Even though the 2080ti is less than half the price of the Titan V, I need to make sure my app runs rock solid during trading hours which for options are 23 hours a day and 6 days a week. I have been testing the 1080ti and it just does not seem to be as reliable as the Titan Xp. The Titan Xp on the other hand has been class A excellent and have not run into any problems running my code for extended periods of time.

The Titan V looks like a beast but I really hope that Nvidia comes out with a Titan RTX that will replace the Titan Xp with the same reliability as well.

Posted on 2018-11-10 02:24:22
Hery Ratsimihah

Please let us know when they break!

Posted on 2018-11-19 11:04:03

This is a great write-up and probably the first one to cover the current options (1080Ti vs 2070/2080/2080Ti) for single GPU users.

I find the 2070 to be a very interesting offer vs both the 1080Ti and the 2080 non-TI, for CNN with mixed-precision.

I ran a quick test on Cifar10 with Fastai V1 for the 2070 vs 1080Ti: if you manage to get Fp16 running, the 2070 will deliver slightly faster results indeed.
Then the 2080 non-Ti is rather "meh..." compared to the cheaper 2070.

Posted on 2018-11-16 20:06:58
Donald Kinghorn

I agree on that ... I think the sweet spot is the 2070 and then the 2080Ti. If the 2080 had 12GB mem it would be a different story. Having said that, I would be pretty happy if I had a 2080 too :-) I do still have my borrowed (and beloved) Titan V that I'll be getting under load with a project again soon. Even when the new Titan comes out I think the 2080Ti will be the staple for ML/AI stuff. There are just times when I have to have 64-bit, which I hope NVIDIA keeps strong in the next Titan.

Posted on 2018-11-19 17:26:18

I already have a 1070 and was thinking about adding a 2070. Is mixing the two a good idea? That is, will the 1070 play well with 2070, slow it up, or ???

Posted on 2018-11-20 00:54:28
Donald Kinghorn

You should be fine with doing that. The newer drivers for the 2070 work well with the 10xx cards and so does the included CUDA runtime lib. I like both of those cards. The 1070 was a lot of performance for the cost and it looks like the 2070 is continuing that. You might have some trouble if you try to run multi-GPU jobs on both cards together ... not sure. They both support a wide variety of cuda compute levels but of course the 2070 supports some newer ops in level 7.x including support for fp16 tensor cores

Posted on 2018-11-20 02:53:07

Can't tell you how useful your posts have been.

Posted on 2018-11-24 23:15:25
Frédéric Precioso

Hello Donald,
I can only join to thank you for the very useful tests you have run and shared here.
I have one tricky question on your test with NVLINK: does your process see the doubled RAM? Like 24 GB of GPU RAM?
Because one of my PhD students is working on video analysis and she has to deal with batches so large that we run tests with batch size = 2, which is definitely not big enough for good statistical stability of the training.
I just got, on research projects, 2 Tesla V100 32 GB which are actually much cheaper than the Quadro GV100 32 GB (which is just out).
If with NVLINK there's a way to make the process see double the RAM as one, it would increase our potential amazingly.

I went through the post here: https://www.pugetsystems.co...
but I could not find a clear answer in it.

If anyone here has an idea if it is possible and if yes how difficult it is, we'll be grateful.

Posted on 2018-11-28 00:19:45
Donald Kinghorn

There are some misconceptions about what is often referred to as "memory pooling". I don't fully understand it myself, but for compute there is no magic. You do, however, get good communication latency and bandwidth with NVLINK ... When you have multiple GPU's, most code that does any kind of batching can utilize all of the GPU's AND all of their memory. I have run some jobs that worked by splitting a given batch size across the GPU's, and some that run multiples of a given batch size on each GPU ... hope that makes sense!

For example: I've had a problem that I could run with a max batch size of 128 on 1 GPU but then be able to increase that parameter to 256 with 2 GPU's. I've also done job runs that would still require a batch size of 128 but then run 2 batches simultaneously on 2 GPU's

Bottom line is that you will probably be able to get good utilization of your hardware. It's not necessarily bad to only have a batch size of 2 other than long run times. The bad thing is when you can't even load 1! I've been pretty impressed with how well batched stochastic optimization works. There are lots of variations on the methods and it is worth trying different ones. I had surprising luck with a not-often-used method called Rprop ...

Posted on 2018-11-28 01:34:19
Donald Kinghorn

Just saw NVIDIA's announcement of Titan RTX !! I'll be testing as soon as we can get one. Will probably have to borrow from YouTube reviewer :-)

Posted on 2018-12-03 17:09:51
Lawrence Barras

Yes indeed these look exactly what I need, signed up on the notify list ASAP. I only wish the cards were blower-style instead of side fan. There's a new NVLink for the Titan RTX, but I think it is just a color-matching plastics. I can only wish for 2-slot bridge.

I have a workstation with an X99-E motherboard in it, which would love to have 4x of these in it. Provided they can stay cool. I put a pair of the RTX2080ti -FE cards in it with the 4-space NVLink. It is a monster and at least for the work I'm doing, is doing better than the Titan V's.

Posted on 2018-12-14 21:11:29
Donald Kinghorn

Hi Lawrence, I'm still waiting for a card here ... We do have the 2080Ti and 2070's with blower fans now. The last post I put up had some testing with 4 2080Ti's (that was NAMD testing on a Threadripper system). 4 of them together do still get hot but, as far as I could tell there was no throttling. The larger memory on the RTX Titan will be nice! I only wish it had the good fp64 performance of Volta since I often need that.

Posted on 2018-12-17 17:08:57
Lawrence Barras

Two Titan RTX cards ordered this morning. I looked at the Titan RTX NVLinks, but near as I can tell, they're the same 3 or 4-slot NVLinks as the RTX product, just different color plastics. I sure wish they'd offer a two-slot bridge. Could go 4x cards and 2x linked in the X99 workstation. In any case, the extra memory will be good for me.

Agree on the Volta - king of fp64. But had to make a decision to let the Titan-V's go in favor of Turing. Hope the Titan RTX isn't disappointing.

Posted on 2018-12-19 01:10:28
Donald Kinghorn

Woo hoo! congrats, I'll be curious to see what you think of them ... and yes I think you are right about the difference being the color on the bridge

Posted on 2018-12-20 01:55:39
Lawrence Barras

Well, I got them! So, they're gold and the "TITAN" lights up in white. I set up some tests in a Xeon E5-2697 V3 (older 14 core) with 128gb RDIMM ECC, Asus X99-E USB3.1 WS board, ubuntu 18.04, driver 415.25, tensorflow:18.12-py3. Big LSTM with NVLink on the 2080ti FE cards would net me just over 15,000 WPS, and just under as the cards heated up and began to throttle. GTX1080ti pair turns in just over 10,000 wps. This workstation always lags behind your reported numbers, likely due to the older E5-V3 Xeon and registered memory.

The Titan RTX pair with NVLink report just over 16,000 wps and throttle slightly down as they warm up to just under 16,000.

The big plus is with the extra VRAM (24gb each), I can up the command to batch size of 512 and see 18,000 wps, throttling slightly back to just under 17,000 after several minutes.

Also, they don't heat as quickly, or to as high of a temperature, despite the factory power limit being set to 280w vs. the FE cards at 260 watts. I'd suspect these GPU chips are top of the binning process, in addition to having all the cores turned on.

So basically, you get 24gb VRAM (big plus) and 7% better speed and cooler running for about twice as much money. ($1199 vs $2499). So you could have 2x RTX-Titan or 4x RTX2080TI cards. Tough call on which option is better.

We sold all of our Titan-V, partly because of the clock-throttling imposed on them. But it looks like 415.25 and later drivers allow the Titan-V to overclock if you want them to. I'd be curious to see the impact of that. The T-Rex isn't clock-throttled.

Docker image tensorflow-18.03-py2 gives an error stating the Titan RTX isn't supported in that version of the container.

Posted on 2018-12-28 03:50:14
Donald Kinghorn

Happy New Year Lawrence! Thanks for posting your testing! I think your suspicion about the binning is probably correct, these should be the best chips. I'm looking forward to getting my hands on them. I think we have gotten some in now but we have had a lot of customers waiting on them so I might have a little trouble getting some under test until supply is better.

Posted on 2019-01-02 20:47:14

Excellent write up!!!
These Titan RTX's would only be a better deal if you use up to 24GB it seems??
I am looking into water cooling the Titan RTX as quad non-blower cards are a no go in my setup.
I am hoping the Titan RTX will over clock and still allow me to run 24/7 if i go with a quad water cooled card setup.
I do options trading and time is money!!!
I am currently running Titan xp and the price/performance ratio is amazing compared to these new cards.
Hoping Nvidia gets smart and release a 32GB version of the Titan V sometime this year...

Posted on 2019-01-02 22:38:11
Youngsuk Jung

I heard there is a peer-to-peer speed issue between GPUs on the TITAN V in multi-GPU systems (1/3 the speed of previous GPUs such as the TITAN Xp).
Did you hear about this issue? And will it affect machine learning performance?

Posted on 2018-12-04 09:33:51
Donald Kinghorn

That's not what I saw in this post https://www.pugetsystems.co...
Scaling was good and the p2p testing looked great with 4 Titan V's

I've also looked at the effect of PCIe X16 vs X8 in general a couple of times it really doesn't make much difference for ML workloads
https://www.pugetsystems.co... This post has p2p testing with Titan V's at X8 an X16

Have a look ... enjoy!

Posted on 2018-12-04 20:00:06

This is an excellent comparison, thank you Donald! However, this left me wondering: would you say two RTX 2070s could outperform an RTX 2080 Ti (since two 2070s are approximately the same price as one 2080 Ti)?

Posted on 2018-12-10 22:47:41
Donald Kinghorn

Good point! I just did a little testing with 2 x 2070's for molecular dynamics, NAMD and did a quick ResNet-50 run too. 2 x 2070's did a little better than 1 2080Ti for NAMD.

For ResNet-50 with 2 x 2070 I got 349 images/sec at fp32 and 506 images/sec at fp16 so it was indeed better performance than a single 2080Ti for that job too.

With 4 x 2080Ti I got 941 images/sec at fp32 and 1310 images/sec at fp16

This was from some Threadripper 2990WX testing that I will be writing up this week. I'm getting a bunch of data points so I'll probably do a full GPU ML/AI performance post after the new year when the blower style 2070 cards are available so I can use 4 of them :-) ... and I'll be back on an Intel platform too ...

Posted on 2018-12-11 19:45:13

Thanks Donald for the quick reply! I await your new piece on the full GPU performance. It seems to me that if I could scale the 2070s up (since for every 2080 Ti I could get two 2070s), it would be better value than the 2080 Tis: I would get more VRAM and performance for the same price, that is if my setup can hold enough video cards.

Posted on 2018-12-11 21:55:20
Donald Kinghorn

it is a trade off ... performance usually drops off with more cards and it is extra load on the system etc. And, not everything will scale to multiple cards. I usually favor "few over many" ... but, everything you just said is right on! I especially like having more total mem ... and you could run multiple jobs rather than multi-GPU for a single job ... :-)

Posted on 2018-12-12 23:21:41
Arline Perkins

Donald, will you be releasing any results for the Titan RTX? I just learned today that FP32 accumulation in mixed-precision training on the 20xx series is "crippled" at half speed, but not so for the Titan RTX. So I'm curious if there would be a much bigger speed gain in mixed-precision training performance on that particular card.

Posted on 2018-12-20 07:11:05
Donald Kinghorn

I will be doing testing as soon as I can get my hands on a couple of cards. But I might have to wait because we've had a lot of customers waiting on them. I'm going to check Monday to see if I can get a test rig setup going. You have given me something interesting to check out. If you have a link with some info about the accumulate issue I'd like to check it out. Thanks!

Posted on 2019-01-02 20:52:13
Arline Perkins

Gotcha. Customers first, we want you to stay in business! The first "skeleton in the closet" I saw was post #12 on this Nvidia discussion board. It doesn't mention the Titan, which hadn't been released yet.

This article confirms the half-speed cripple on the RTX 20's, but adds that the Titan RTX does not have its tensor FP32 accumulate crippled. Quote: "NVIDIA's GeForce Turing cards are limited to half-speed throughput in FP16 mode, even though Turing can accumulate at FP32 for still greater precision. This limitation doesn't exist on the Titan RTX, so it can handle full-speed FP32 accumulation throughput on its tensor cores."

I did some back of the envelope calculations from one of your earlier RTX articles (which had a nice Titan V in there for comparison) and my guess is we'd see about 15-20% lowered performance due to the half-speed thing. But I'd be curious to see what the real results are. Thanks!

Posted on 2019-01-04 16:09:37
Donald Kinghorn

Thanks! Expect to see a couple of new posts soon.

Posted on 2019-01-07 22:06:52
Armillariella Mellea

What would you recommend for a budget in the 300-600 USD price range for ML and for fp64 workloads? I'm building a machine just for my humble beginnings and have noticed that second hand Titan Z or Tesla K80 are also great value considering their fp64 performance (also the K80 has ECC).

Posted on 2018-12-27 21:42:13
Donald Kinghorn

If you can grab a Titan Z for a good price then that is a reasonable thing to do if you need fp64. An active cooled K80 would be nice too. Those cards are still very useful. On the other hand you might be able to get a 1080Ti for a reasonable price. Those are great cards especially for ML work with fp32. The fp64 is crippled but it still works and is OK for code development and testing work. You can get a V100 instance on AWS for around $0.92/hr with spot pricing if you have a job that you need to run with high performance using fp64. It's a hassle to set all of that up, and to use it, but it is an option as long as you don't need too much compute time (it can add up quickly).

Unless you really really need the fp64 performance I would go with the 1080Ti (that's the main card I use personally ... but I do have access to a Titan V for fp64 stuff :-) ... if you do want the fp64, and I can understand that! ... then you are on the right track with what you are considering. Best of luck with your efforts, GPU computing is wonderful!

Posted on 2019-01-02 21:29:23

Great article! Lots of useful info.
Did you test if peer-to-peer was available without the NVLink bridge? If not, can you try?
I can't get it to work on my system with 2080tis; no problem with older cards ...


Posted on 2018-12-30 07:36:04
Donald Kinghorn

Wow, you are the second person to say that ... I need to check this out. In another post a fellow asked about this on his new 8 x 2080Ti system. My suggestion was to try running the p2pBandwidthLatencyTest from the CUDA samples. I think I need to see what is happening. P2P should just go direct card-to-card over the PCIe bus if the NVLINK bridge is not there; NVLINK should just be a faster alternative to that as far as functionality goes. I'll try to get a test system going as soon as I can arrange it. Thanks!

Posted on 2019-01-02 21:02:31

I'm using 2 RTX 2080's on an ASRock X399. P2P does not work without NVLINK; it does with NVLINK. Here is what I just got:

[simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2
> GPU0 = "GeForce RTX 2080" IS capable of Peer-to-Peer (P2P)
> GPU1 = "GeForce RTX 2080" IS capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access...
> Peer access from GeForce RTX 2080 (GPU0) -> GeForce RTX 2080 (GPU1) : No
> Peer access from GeForce RTX 2080 (GPU1) -> GeForce RTX 2080 (GPU0) : No
Two or more GPUs with SM 2.0 or higher capability are required for simpleP2P.
Peer to Peer access is not available amongst GPUs in the system, waiving test.

[simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2
> GPU0 = "GeForce RTX 2080" IS capable of Peer-to-Peer (P2P)
> GPU1 = "GeForce RTX 2080" IS capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access...
> Peer access from GeForce RTX 2080 (GPU0) -> GeForce RTX 2080 (GPU1) : Yes
> Peer access from GeForce RTX 2080 (GPU1) -> GeForce RTX 2080 (GPU0) : Yes
Enabling peer access between GPU0 and GPU1...
Checking GPU0 and GPU1 for UVA capabilities...
> GeForce RTX 2080 (GPU0) supports UVA: Yes
> GeForce RTX 2080 (GPU1) supports UVA: Yes
Both GPUs can support UVA, enabling...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 22.52GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Disabling peer access...
Shutting down...
Test passed

Posted on 2019-01-03 00:09:32

I feel like I ran into this with these RTX cards under Windows as well, or something similar. The SimpleP2P test was not giving the full picture, though. If you have a chance, try the P2PBandwidthLatencyTest instead. I know I got better results with that under Windows.

Posted on 2019-01-03 00:14:37

Here is the P2PBandwidthLatencyTest.
I'm using driver NVIDIA-Linux-x86_64-415.25.run
and cuda_10.0.130_410.48_linux.run.
The 2080Ti's were out of my budget range;
the 2080's work fine for my needs.

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, GeForce RTX 2080, pciBusID: 8, pciDeviceID: 0, pciDomainID:0
Device: 1, GeForce RTX 2080, pciBusID: 43, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
D\D 0 1
0 1 1
1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 388.20 3.31
1 2.68 400.00
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1
0 389.82 24.23
1 24.22 396.44
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 387.78 3.55
1 3.50 392.10
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1
0 384.42 48.00
1 48.37 389.78
P2P=Disabled Latency Matrix (us)
GPU 0 1
0 1.25 12.94
1 12.37 1.29

CPU 0 1
0 3.96 9.00
1 9.28 3.70
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1
0 1.24 0.94
1 1.05 1.29

CPU 0 1
0 3.91 2.95
1 2.79 3.71
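For anyone who wants to compare those matrices numerically, a quick parse like this works (my own sketch; the matrix text is copied verbatim from the output above):

```python
# Sketch: parse a 2x2 bandwidth matrix from the p2pBandwidthLatencyTest
# output above and compare P2P-enabled vs P2P-disabled transfer rates.
def parse_matrix(text):
    """Parse lines like '0 389.82 24.23' into a dict {(row, col): GB/s}."""
    vals = {}
    for line in text.strip().splitlines()[1:]:  # skip the 'D\D 0 1' header
        parts = line.split()
        row = int(parts[0])
        for col, v in enumerate(parts[1:]):
            vals[(row, col)] = float(v)
    return vals

disabled = parse_matrix("""D\\D 0 1
0 388.20 3.31
1 2.68 400.00""")

enabled = parse_matrix("""D\\D 0 1
0 389.82 24.23
1 24.22 396.44""")

# Off-diagonal entries are GPU0<->GPU1 transfers; with P2P enabled the
# posted numbers jump from ~3 GB/s (through host memory) to ~24 GB/s.
speedup = enabled[(0, 1)] / disabled[(0, 1)]
print(f"GPU0->GPU1: {disabled[(0,1)]} -> {enabled[(0,1)]} GB/s ({speedup:.1f}x)")
```

The diagonal entries are on-card copies (~390 GB/s, the card's own memory bandwidth), which is why only the off-diagonal numbers matter for the P2P question.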

Posted on 2019-01-03 01:57:44

Was this test with or without NVLINK? It looks like maybe with it, because of the ~24GB/s bandwidth listed, which is close to what you got with NVLINK enabled on the SimpleP2P test.

Posted on 2019-01-03 16:34:53


I did not want to remove the NVLINK bridge again; they are rather a pain to take on/off, and I'm afraid of damaging something.
After putting the NVLINK back on I was getting this microcode error: "w1p5s0 failed to remove Key from hardware"
and the system would not boot. When you install the NVLINK bridge you need to push down hard and hear it click into place; it seems I did not have it seated all the way. One final heavy push and it clicked in, then everything worked.

Posted on 2019-01-03 22:05:26

Gotcha. Yeah, it always feels like something is going to bend / break when I take them off - particularly the connector on the top edge of the GPU's PCB. But so far, I've done it a couple dozen times with no harmful effect on any of the cards :)

Posted on 2019-01-03 22:10:26
Donald Kinghorn

Hey Jerry, thanks for running that and posting back. I see I have some serious testing to do! Especially after peeking below at erdoom's comment!

Posted on 2019-01-03 22:11:35

Yep, got a reply from some guys at NVIDIA: they disabled P2P on the GeForce line for the 20 series when used without NVLink. It is supposed to be supported on the Quadro RTX line, but I don't have access to those cards so I can't verify. Anyway, this is a big blow to all the guys buying 8-GPU rigs for deep learning. I am getting worse results with 8 x 2080ti than with 8 x 1080ti, nasty :/

Posted on 2019-01-03 14:52:53
Donald Kinghorn

WOW! That is dreadful! I'll get on this as soon as I can to see if I can sort anything out.

Posted on 2019-01-03 22:29:04

So what happens when you use 8 2080ti GPUs with each pair connected by NVLink? Does that pass the P2P bandwidth test?
Sounds like this should still work... except you have to buy an NVLink bridge for each pair of 2080ti cards.
Am I missing any info?

Posted on 2019-01-05 21:30:53

The question is, will there be peer-to-peer between cards that are not directly linked by an NVLink bridge, i.e. will the NVLink bridges "magically" open peer-to-peer between the rest of the cards via PCIe? I am supposed to get some bridges next week to test with; I will update when I do. Just for reference, this is what my system is spitting out at the moment:

nvidia-smi topo -p2p r



X = Self
OK = Status Ok
CNS = Chipset not supported
GNS = GPU not supported
TNS = Topology not supported
NS = Not supported
U = Unknown

Posted on 2019-01-05 21:50:29

Hmmm... do not like where Nvidia is going with this, but it definitely looks like if you are serious about multi-GPU setups for deep learning, Nvidia wants you to use Titan cards like the Titan V even though it is 3 times more expensive. I think Nvidia's reasoning is that the Titan V is a third of the price of a Tesla V100!!! I too am looking into multi-GPU setups and I have basically drawn the conclusion that the Titan will be better supported in the long run.
The big drawback so far with the Titan RTX is that it is NOT a blower-type card, and you would probably have to water cool it to have even a minimal quad-GPU setup.

Definitely get back with some test results once you have 4 NVLink bridges (you probably hate the idea of spending money on 4 bridges... I would!!! LOL)
My gut feeling is that unless your deep learning framework knows about different P2P topologies, performance will probably suck!!!
I write CUDA code and I definitely get great performance from every byte because I know exactly what my multi-GPU topology is at all times, which is something that I would not assume a lot of these tensorflow-type libraries know about...

Posted on 2019-01-05 23:57:03

Actually it might be worse: from what I heard, currently only the Quadro cards from the RTX line support it. There are blower versions of the 2080 and 2080 Ti; that is what we have in our rig. https://www.pny.com/GeForce...

Posted on 2019-01-06 05:28:47

I am currently running a Titan Xp and 1080Ti in my workstation as a prototype system for options trading. I have estimated that I need the fastest quad-GPU setup to be able to do real-time options pricing in under a second. I have written the CUDA code in bare-metal C/C++ and it runs in just under 3 seconds, which is quite good considering I started with 2-minute runs!!! My problem now is to spend the big bucks for a production system with the lowest price / highest performance. My gut tells me to stay away from the aftermarket cards as I need to run the system 24 hours a day, 5 days a week, and in my mind the Titan Xp blows away the 1080Ti for performance, as it is at least 20-30% faster running my Monte Carlo calculations. So my options are: Titan Xp, 2080Ti, Titan V, and Titan RTX. If I was to buy this today, I would lean towards a quad Titan Xp setup based on price/performance, but I won't be buying until the second half of the year, so the Titan RTX looks to be it; but for a quad setup, I can't see using them unless they are water cooled!!! So the cost difference between a water-cooled Titan RTX and the blower Titan V may be negligible, but the estimated performance gain of the Titan V over the Titan Xp is approx 20-25% per card at double the price!!! So yes, I have NOT totally ruled out a 2080ti blower system yet, but I am VERY impressed with the Titan Xp so far... decisions!!! Please comment some more regarding performance testing on your 8-GPU rig, as once I build my production system, I will turn to a nice deep learning system as well like yours... eventually

Posted on 2019-01-06 17:28:26

The Titan V is great as a single card, and if you don't need NVLink it works well in multi-GPU as well (rendering, for example). However, it does not support NVLink at all! Technically it has a connector on the PCB, but the heat sink and backplate block its use. It is also very costly, even considering the performance it offers. Where it most shines is FP64, for which it is a great value even considering the price.

Posted on 2019-01-06 05:37:27

Very true!! Hoping Nvidia will bring out a 32GB Titan V version for $3000 and drop the current 12GB Titan V price to $1200 as a replacement for the Titan xp!!! I know wishful thinking but hope springs eternal....

Posted on 2019-01-06 18:25:25
Donald Kinghorn

Hey All, I got some testing in over the weekend. I did confirm the P2P issue. However, I did not see much of a performance hit. I used the older CNN code that I've used in other posts; I use that because it is linked with NCCL, so it will do multi-GPU with P2P. There is < 10% improvement in performance with an NVLINK bridge, even though the reported bandwidth is almost 10 times more than the fallback mem-copy on the 2080Ti. However, that was with just 2 GPU's! I would expect it to get worse with more cards ... I had 1 RTX Titan to test ... expect a couple of posts soon! Still planning on doing a 1-4 GPU test with a bunch of cards ...

Posted on 2019-01-07 22:40:06

Excellent as always!!!

Definitely... please test a QUAD of every card you can get your hands on.
We already know Nvidia will be fast for one card... it's when you go to 4+ cards that price/performance really starts to dominate, and
thank god we have you to be the guinea pig!!! LOL

Posted on 2019-01-10 00:41:21
Donald Kinghorn

I hear ya ... I am planning a BIG test but I'll have to invade Puget Labs for several days to do that :-) I'm really only 10 min away, but pretty comfy working from home. It can be hard to coordinate cards because we are all testing and only have a limited number of dedicated test units. We have to pull from inventory and that is hit or miss depending on supply and demand for builds ... and the testing can take time ... and ... OK OK I'll just do it :-)

... yes I'm really curious about 4-card testing! 8 is tough but I can occasionally swing that too

I was able to take home 2 RTX Titans yesterday and will include the testing from last night in a simple post (they were overloading my 1kW power supply on longer-running heavy loads, but I got some results). Also, William will have a couple of Quadros today and will confirm P2P behavior.

We've hit the ground running for the New Year!

Posted on 2019-01-10 16:51:33

Great research! Thank you. Did you think about adding 2 x GTX 1080 Ti into comparison? GTX 1080 Ti are dirt cheap now. You can buy two GTX 1080 Ti cards for the price of one RTX 2080 Ti. I think this setup would be still better than one RTX 2080 Ti.

Posted on 2019-01-24 14:54:48
Donald Kinghorn

Thanks! I am still planning a larger GPU test with 1-4 cards of most everything I can get my hands on. The 1080Ti has been a great card. Probably the best compute performance per dollar, ever. I may try to get another one personally too while they are still available. My personal favorite card is the Titan V because I like to have the (great!) fp64 performance, expensive though... But, yes, a lot of really great work has been accomplished running on 1080Ti's ... probably more than NVIDIA realizes!

Posted on 2019-01-25 18:15:17
Jasper Vicenti

Thanks for the detailed reports! The info about NVLink compatibility has been especially helpful; based on it, I'm looking to build up a 4x 2080Ti box with dual 2-slot Quadro RTX NVLinks.

Regarding your note about Titan V performance on Ubuntu, you may want to look into updating your driver to 415.25 or greater and re-test. See here for more details: TITAN V max clock speed locked to 1,335 Mhz on Ubuntu. You may want to keep an eye on cooling performance to limit thermal throttling. I haven't had a chance to test this theory yet but wanted to make you aware of it.

Posted on 2019-02-05 17:38:50
Donald Kinghorn

... yes! I have the driver from the CUDA install. I'll have to mess with the sys a bit to upgrade from that. I do get amazing performance with the Titan V; just love that card for fp64

Those Quadros are nice! I did test with 2 of them. They have P2P over PCIe enabled so you may not need the NVLINK .. just be sure you are on a good board like our Peak Single workstation ASUS board ... you want 4 x X16 which means having a PLX chip on-board ...

Posted on 2019-02-06 23:37:56

I have 4x 2080 Ti arriving for a new ML build. I'm coming from a single Titan Xp. This is my first time running multi-GPU configuration for ML (I had 2x 980 in the past, but for gaming). I am considering setting each *pair* up in NVLink. Do you know if this is possible and would it make sense?

Posted on 2019-01-30 15:12:47
Donald Kinghorn

see my comment above about 4 x X16. I will be doing testing with 4 2080Ti's around the end of the month (I'll use as many different cards as I can get my hands on). I'll test with paired NVLINK too. This stuff usually doesn't make that much difference from what I've seen in the past but you never know until you try it. It can really depend on the code being run and the job characteristics

Posted on 2019-02-06 23:43:45

I am trying to configure my system but I am not seeing all of my GPUs in the BIOS. I have a 1600W power supply and the Asus Sage X299. Are there any specific BIOS settings you can recommend to get all 4 GPUs to post?

Posted on 2019-02-12 08:19:18

Would like to see testing of 2 RTX 2080Ti 11GB +/- NVlink with molecular dynamics (MD) simulations using NAMD (CUDA acceleration) and YASARA (OpenCL acceleration). Thanks.

Posted on 2019-04-14 20:24:44
Donald Kinghorn

I do test with NAMD quite a bit ... and I see from one of your comments on another post that you found some of that :-) The biggest thing with NAMD is that it's mostly just the non-bonded forces that get run on the GPU, i.e. there is still a large CPU component in the calculations (this is not just NAMD; there are lots of programs that are not "completely" GPU accelerated). What I have seen for a few years is that the GPU's have gotten so fast that the CPU cores cannot keep up. You need a pretty good base CPU workstation. That's why that Threadripper testing I did was impressive with NAMD.

I have not tested with YASARA but that looks like it could be good ...

Posted on 2019-04-15 22:21:55

@Dr. Kinghorn:

What are your current hardware recommendations for MD simulations using either YASARA and OpenCL or NAMD and CUDA? In particular, are there problems with using "consumer" components such as i9 X CPUs and GeForce RTX GPUs as opposed to Xeon CPUs and Quadro GPUs?

Some people argue that consumer components are prone to errors and can only do single-precision computations, whereas workstation-class components have error correction and can do double-precision computations. However, the workstation-class components are much more expensive.

In addition, what are your current recommendations for single vs. dual CPUs and single vs dual GPUs (especially when the GPUs are coupled using NVlink)?

On the Puget Systems site, I have configured an X299 system with an i9-9920X CPU, 64 GB RAM and two RTX 2080Ti 11GB GPUs linked with NVlink. Is this a good system for MD simulation or would I need a dual CPU to take advantage of the dual GPUs even if they are linked with NVlink?

Posted on 2019-04-15 22:26:06
Donald Kinghorn

First, modern consumer components are wonderfully good and not really any more prone to errors than data center components (maybe somewhat more, but not much). It is a very rare event for us to see a failed Intel CPU; it does happen, but it is rare, and that includes "Core" and Xeon parts. The Core-X and Xeon-W and -SP base cores are nearly the same, including AVX512 with dual FMA and great reliability. The Xeon can be a better "platform" because you have server-grade motherboards and somewhat better memory performance. But for a single-socket workstation Core-X is really good (better), and an incredible value in my opinion.

On the GPU side it's the same story. In the early days of GPU computing it was pretty common to burn up "gamer" cards. That's not really the case any more. This has been true for NVIDIA GeForce 9xx, 10xx, and so far 20xx cards. They have been really impressive for compute, with very low failure rates even under heavy compute load. Performance with FP32 on the 2080Ti is fantastic.

For double precision on GPU there is "Volta". I have been using a couple of Titan V cards with double precision on some QM code (my own). (Sadly, NVIDIA stopped production of the Titan V.) The Titan V was the only "consumer" card with good full double precision (same basic chip as the V100 Tesla). The only thing other than the Tesla V100 now is the Quadro GV100, which is basically a V100 with video-out hardware on it (but it's expensive).

However, nearly all GPU-accelerated scientific programs assume FP32 single precision because that is where the best performance (with usable accuracy) is on the GPU. The MD codes don't really need double precision for calculating forces.

What you have spec'd out for a system is a pretty easy recommendation! The 9920X is very good (so good that it is often in shortage) and the 2080Ti's will give great performance. However, for NAMD (and probably the same for YASARA) the GPU's will out-perform the CPU during your job runs. You could probably get away with 1 card but I really like the idea of having 2. You will get some performance increase, but it won't be anywhere near double because the calculations will be limited by the CPU. You probably won't see much benefit from the NVLINK bridge, but again, I would still recommend it. The MD codes will be doing mostly non-bonded forces on the GPU (lots of them); there won't be much GPU-GPU communication. It will go back to CPU space for that. However, given that it's not a big expense I would get the NVLINK bridge. You may end up running code where it makes a significant difference at some point, so having it is a good thing.

Look at https://www.pugetsystems.co...
That 8-core 9800X went from 7 day/ns down to 0.46 day/ns with the help of one 2080Ti (the million-atom STMV job run). Adding a second 2080Ti to that wouldn't help much. Your 9920X will get some benefit, but not double the performance.
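As a quick sanity check, those quoted STMV numbers (day/ns, so lower is better) work out to roughly a 15x speedup:

```python
# Quick arithmetic check on the STMV numbers quoted above (days per
# simulated nanosecond, lower is better).
cpu_only = 7.0    # day/ns, 8-core 9800X alone
with_gpu = 0.46   # day/ns, same CPU plus one 2080 Ti
speedup = cpu_only / with_gpu
print(f"{speedup:.1f}x faster with one 2080 Ti")  # ~15.2x
```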

My bottom line evaluation of what you spec'd out for your workstation is that it will give excellent performance for what you are planning on doing, AND it has more versatility and capability beyond that! It is really close to what I would do for myself for a new system! (I would probably try to go to 128GB mem, but you probably don't need it :-) It would be a great system for nearly any scientific workload and ML/AI work too. [It's stunning how good that hardware is for the cost, from a historical perspective on scientific workstations.]

The only practical improvements you could make would be more memory and storage and more CPU cores. Beyond that you would be looking at dual Xeon and that would be a big jump in cost. If you wanted to tighten up the cost on what you spec'd you could drop the second GPU and not lose a lot of performance.

[ I should note that NAMD is not going take much advantage of the (great) AVX512 vector unit. That's why the Threadripper with lots of cores did well with NAMD and multi-GPU's. However, for overall system reliability and versatility I still recommend the Intel platform you have spec'd ... at least right now! We will test the next round of AMD hardware heavily. ]

I write a long comment like this and it makes me want to do more thorough testing and benchmarking. I'm working on some "how-to's" right now that I've wanted to do for some time. But the next time we get new release hardware in, I will do testing again, and I am automating that, so I might be able to get more programs and a wider variety of hardware under test. MD codes are definitely something I'll test.

Posted on 2019-04-16 17:03:29
James Thompson

Thanks so much, Donald. This is amazingly useful work!

Posted on 2019-04-16 06:15:23

@Donald: Thanks for the very detailed response! I am glad to see that the system I have configured looks quite good to you, and it could be even better with some relatively minor adjustments.

However, something I am still unsure about is the question of thermal issues under the heavy CPU/GPU loads that I anticipate with my MD simulations. Some vendors have told me that I would need to use a custom liquid cooling system that includes both GPUs and the CPU in the loop, and that I would need a larger case than my favorite (Fractal R5). They say that under heavy loads, the RTX2080TI cards are likely to throttle down to lower clock speeds, resulting in performance decrements of 25% or so.

Yet, other vendors are saying that a custom liquid-cooling system for CPU and GPUs would take care of the thermal load and that this could be accomplished in a mid-tower such as the fractal R5.

Still other companies are not commenting on my configuration at all, which seems the most problematic!

Recommendations on cases and thermal solutions most welcome!

Posted on 2019-04-16 17:30:51

Chiming in here, I hope you don't mind :)

Fractal makes great cases, and the Define R5 can be set up with a perfectly fine airflow layout for cooling one CPU and two video cards without needing to resort to liquid-cooling. We do it all the time, though now we use the R6 instead of the R5.

However, regarding the video cards, I would advise using single-fan / rear-exhaust style cards. We've got an article about why, if you are curious:


With two cards you are on the border of it being an issue, and if you pump enough airflow through the system you could likely get away with the more popular multi-fan cards (like the NVIDIA Founders Edition)... but going with models that exhaust their heat out of the chassis instead will keep temps down, and allow you to be a little less zealous about maxing out the chassis fans.

Posted on 2019-04-16 17:37:01

William, good to hear from you again! You helped me put together my first Puget Systems machine, which is sitting next to me and still going strong -- it's still my fastest system out of the 12 computers in my office and lab. However, my Puget Systems machine has liquid cooling (CPU only), back when you were still supplying that option.

Thanks for the link to your article on fan design in GPUs. I intend to put 2 RTX2080Ti 11GB cards in the machine that I am now configuring. I had naively assumed that two fans in each card was better than one! Your article was quite enlightening.

I am still drawn to a liquid cooling solution, on the theory that liquid cooling should achieve lower temperatures than air cooling, and all components ought to run faster at lower temperatures. On the other hand, liquid cooling is more complicated in some respects, and I do worry about leaks creating havoc.

Posted on 2019-04-16 17:59:26

I'm glad to hear that system is still serving you well, and that you have been happy with it :)

Going for closed-loop liquid-cooling for the CPU is an okay option, if you want, and doesn't add much cost or complexity over a more traditional heatsink and fan combo. However, going "full", "open-loop", or "custom" liquid-cooling (lots of names for the same thing) adds a lot of cost and complexity - especially if you have the video cards included in that cooling loop. If you ever want to upgrade the cards down the road, or if a card fails and needs to be replaced, having it in a cooling loop means hours of work to swap it instead of minutes (with a normal, fan-cooled card). Liquid-cooling also brings in the risk of a leak that can damage other components in the system. Unless you are planning to push clock speeds way beyond default specs (overclocking - which I also don't recommend) or value the way liquid-cooling looks more than cost and convenience, then I don't think it is a good idea :)

Posted on 2019-04-16 18:08:56

Excellent points, William! I have to admit, the custom liquid-cooling systems do look, well, cool. However, in such moments of weakness, I have to remind myself that the purpose of the system is to generate spectacular results, not merely look spectacular.

Posted on 2019-04-16 19:18:27
Donald Kinghorn

I think William has you covered on the cooling. We have a couple of good options for the 2080Ti that will be great for compute. I personally like the Gigabyte blower card. It has an aggressive fan profile and a small ramp at the back to lower air pressure to the fan. I would recommend that for a 4-GPU setup. For your 2-GPU setup the ramp won't matter and your blower will ramp a little lower (quieter) ... but you definitely want the blower cards! We take a lot of care and do a lot of testing with our cooling design. Air can be really good!

William is so good that Puget Labs stole him from sales consulting :-)

Posted on 2019-04-17 17:17:37

Thanks for the further information -- I appreciate it very much.
Oscar is handling my current configuration for sales at Puget. I trust that he can coordinate with William and look at your comments to help with optimizing components.

Posted on 2019-04-17 17:48:57

TensorFlow performance on the Radeon VII got a massive boost and now it can match the RTX 2080Ti:
Maybe for a future RTX/Radeon comparison?

Posted on 2019-04-21 16:15:32
Donald Kinghorn

Those results are starting to look pretty good! I've been curious about ROCm for some time. I probably will try it.

Posted on 2019-04-22 17:46:36

A non-scientific measurement I've done suggests an MSI Duke 2080ti is a few percentage points faster than a Radeon VII.

Posted on 2019-06-14 21:04:54
Donald Kinghorn

Thanks for the data point! I still haven't done any of the AMD testing, but it looks like that will be happening within the next few weeks

Posted on 2019-06-17 17:33:24

FP32, mind you. I've not tried FP16.

Posted on 2019-06-17 19:15:10

Hi, can you please add the RTX 2070 Super and 2 x RTX 2070 Super + NVLink to the comparison table?

Posted on 2019-09-09 11:03:24
Donald Kinghorn

Did that a couple of weeks ago ... "2 x RTX2070 Super with NVLINK TensorFlow Performance Comparison"

I really like the 2070 Super ... Enjoy!

Posted on 2019-09-09 19:59:45

Hi Donald or William, I'm considering a workstation with 4x RTX 6000 cards, and I'm wondering whether Nvlink is worth it. Something that does not make sense to me is that many people say there's not much benefit going from PCIe x8 to x16 for deep learning, if this is true, then why would Nvlink help at all?

Also, do two Nvlinked cards look like a single large card, or still as two separate cards to Pytorch or TF? Do I still need to modify my code to add data or model parallelism when I intend to run it on two cards connected with Nvlink?

Posted on 2019-12-09 20:27:40
Donald Kinghorn

... a couple of things...
First, RTX 6000's are Quadro cards, so they will have Peer-to-Peer (P2P) enabled across the PCIe bus (the GeForce RTX cards do not, and have to go through CPU memory for communication).

x8 vs x16 PCIe usually doesn't have much performance impact for GPU-CPU communication (a few %) because it is heavily buffered at both ends.

NVLINK has 5-10 times the P2P bandwidth and lower latency compared to P2P over PCIe or across CPU memory. Does that matter? It depends. A lot of DL programs/models use distributed batches for parallelism, so there is not that much P2P communication beyond gradient accumulation, etc. But recurrent networks like LSTMs, for instance, CAN have significant P2P communication. For problems like that, NVLINK does make a significant difference.
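A rough back-of-the-envelope sketch of why that exchange time can matter. This is illustrative Python only; the payload size is based on ResNet-50's roughly 25M parameters, and the effective bandwidth figures are assumed round numbers, not measured values:

```python
# Estimate per-step gradient-exchange time for a 2-GPU data-parallel job,
# comparing an assumed effective P2P-over-PCIe bandwidth with an assumed
# effective NVLINK bandwidth. All numbers are illustrative assumptions.

def transfer_time_ms(params_millions, bytes_per_param=4, bandwidth_gb_s=10.0):
    """Time (ms) to move one full gradient copy at a given effective bandwidth."""
    payload_gb = params_millions * 1e6 * bytes_per_param / 1e9
    return payload_gb / bandwidth_gb_s * 1000.0

# ResNet-50 has ~25M parameters; FP32 gradients are 4 bytes each (0.1 GB total).
pcie_ms = transfer_time_ms(25, bandwidth_gb_s=10.0)    # assumed PCIe P2P rate
nvlink_ms = transfer_time_ms(25, bandwidth_gb_s=45.0)  # assumed NVLINK rate

print(f"PCIe P2P: {pcie_ms:.2f} ms per gradient exchange")
print(f"NVLINK:   {nvlink_ms:.2f} ms per gradient exchange")
```

Whether those few milliseconds matter depends on how often the exchange happens relative to compute time per step, which is exactly why the answer is "it depends" on the model.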

Guessing whether NVLINK will have an impact requires thinking about the code you are running. ... It may or may not ...

NVLINK is just a (fast) communication bridge. The cards are still treated as separate GPUs. (Both data and model parallelism can benefit.)

My personal feeling is, go ahead and get NVLINK, just in case ... It's a small cost relative to the cost of the system and it "might" speed things up beyond its fraction of the total system cost.

I should also note that on RTX cards NVLINK only bridges pairs, so with 4 cards you get 2 NVLINKed pairs. Communication between the pairs is still over the PCIe bus. When I tested, I was a little surprised at how well it worked with this arrangement.

Posted on 2019-12-10 15:48:50

Ok, so Quadro cards with P2P enabled should see 64 GB/s GPU-GPU bandwidth on a PCIe v4 bus, and NVLink gets them 100 GB/s. Why do you say it's 5-10x faster? Have you tested Quadro cards with NVLink? In any case, I'm still confused about this: if going from x8 to x16 for P2P over PCIe does not help much for GPU-GPU communication in some task, why would NVLink be more beneficial for the same task? Surely we should see more benefit going from 32 GB/s (x8) to 64 GB/s (x16) than going from 64 GB/s (x16) to 100 GB/s (NVLink), because the former is more likely to be the bottleneck. This should be true even for older PCIe versions: in any scenario where GPU-GPU bandwidth B is a bottleneck, going from B to 2B would be more beneficial than going from 2B to 4B, or even from 2B to 20B, as long as the bottleneck occurs below 4B. Does this make sense?
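The bottleneck argument above can be sketched numerically. This is a simplified model (compute and transfer assumed not to overlap, all numbers illustrative assumptions) showing how the marginal benefit of extra bandwidth shrinks once transfer time stops dominating:

```python
# Simple non-overlapping model of per-step time: compute + GPU-GPU transfer.
# compute_ms and payload_gb are assumed illustrative values, not measurements.

def step_time_ms(compute_ms, payload_gb, bandwidth_gb_s):
    """Per-step time (ms) if compute and transfer happen sequentially."""
    return compute_ms + payload_gb / bandwidth_gb_s * 1000.0

compute_ms, payload_gb = 100.0, 1.0   # assumed: 100 ms compute, 1 GB exchanged
for bw in (32, 64, 100):              # x8 PCIe, x16 PCIe, NVLink (GB/s)
    t = step_time_ms(compute_ms, payload_gb, bw)
    print(f"{bw:>3} GB/s -> {t:.2f} ms/step")
```

With these numbers, doubling bandwidth from 32 to 64 GB/s saves more time per step than going from 64 to 100 GB/s, which is the commenter's point; in practice latency and whether the code overlaps communication with compute also matter.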

Posted on 2019-12-11 05:20:21
Donald Kinghorn

... you won't see that much bandwidth since they are PCIe v3 devices ... and in general it's more complicated than what you are thinking ... the main variables are (in approximate order of importance): how the program you are using was written (does it use or need P2P), the job you are running, and then the number of cards and the bandwidth - and latency - of any interconnect (PCIe, NVLINK, network interface...)

You should read this post for a better idea of what is going on and to see some testing output ...
"P2P peer-to-peer on NVIDIA RTX 2080Ti vs GTX 1080Ti GPUs"

Also, don't overthink it :-) It is essentially impossible to "predict" performance for any particular job run from specs and isolated synthetic component benchmarks. ... But given the low cost of a couple of NVLINK bridges compared to the cost of the Quadro cards, I would get them just in case they ever do help ... they certainly won't hurt performance!

Posted on 2019-12-11 18:39:52