
NVIDIA GeForce RTX 2080 & 2080 Ti Do NOT Support Full NVLink in Windows 10

Written on October 5, 2018 by William George

Introduction

When NVIDIA announced the GeForce RTX product line in August 2018, one of the things they pointed out was that the old SLI connector used for linking multiple video cards had been dropped. Instead, RTX 2080 and 2080 Ti cards would use the NVLink connector found on the high-end Quadro GP100 and GV100 cards. This caused much excitement, since one of the features of NVLink on Quadros is the ability to combine the video memory on both cards and share it between them. That is extremely helpful in applications that can be memory-limited, like GPU-based rendering, so having it available on GeForce cards seemed like a great boon. Afterward, though, NVIDIA only spoke of it using terms like "SLI over NVLink" - leading many to surmise that the GeForce RTX cards would not support the full NVLink feature set, and thus might not be able to pool memory at all. To clear this up, we decided to investigate...

What is NVLink?

At its core, NVLink is a high-speed interconnect designed to allow multiple video cards (GPUs) to communicate directly with each other - rather than having to send data over the slower PCI-Express bus. It debuted on the Quadro GP100 and has been featured on a few other professional NVIDIA cards like the Quadro GV100 and Tesla V100.

What Can NVLink on Quadro Cards Do?

As originally implemented on the Quadro GP100, NVLink allows bi-directional communication between two identical video cards, including sharing access to the memory built onto each card. This allows video cards in such configurations to tackle larger projects than they could alone. In a larger implementation, it could connect multiple GPUs in a mesh network, with similar capabilities.

What Are the Requirements to Use NVLink on Quadros?

Special setup is necessary to use NVLink on compatible Quadro cards. Two NVLink bridges are required to connect them, and a third video card is needed to handle actual display output. Linked GPUs are then put in TCC mode, which turns off their outputs (hence the third card). Application-level support is also needed to enable memory pooling.

TCC Mode Being Enabled on Quadro GP100 Video Cards

This is how TCC is enabled on Quadro GP100s via the command line in Windows 10.
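For reference, TCC mode is toggled with nvidia-smi. The following is a minimal sketch, assuming the two Quadro cards sit at indices 0 and 1 and the commands are run from an elevated command prompt - confirm the indices on your own system before changing driver models.

REM List the installed GPUs and their indices
nvidia-smi -L

REM Switch each Quadro to the TCC driver model (1 = TCC, 0 = WDDM)
nvidia-smi -i 0 -dm 1
nvidia-smi -i 1 -dm 1

REM A reboot is required before the driver model change takes effect

Switching back to WDDM later is the same command with -dm 0.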

Do GeForce RTX 2080 and 2080 Ti Video Cards Have NVLink Connectors?

Technically, yes: there is a single NVLink connector on both the RTX 2080 and 2080 Ti cards (compared to two on the Quadro GP100 and GV100). If you look closely, though, you will see that the connectors on the RTX cards face the opposite direction of those on the Quadro cards. Check out the pictures below:

NVIDIA GeForce RTX 2080 and Quadro GP100 Side by Side

NVIDIA GeForce RTX 2080 and Quadro GP100 NVLink Connector Comparison

 

Are the GeForce RTX and Quadro NVLink Bridges the Same?

No, there are several differences between the NVLink bridges sold for the GeForce RTX cards and the older ones built for Quadro GPUs. They differ in both appearance and size: the Quadro bridges are designed to connect adjacent cards, while the GeForce RTX bridges require leaving a slot or two between the connected video cards.

NVIDIA Quadro NVLink Bridge vs GeForce RTX NVLink Bridge (View From Top)

NVIDIA Quadro NVLink Bridge vs GeForce RTX NVLink Bridge (View From Bottom)

 

Are GeForce RTX and Quadro NVLink Bridges Interchangeable?

In our testing, the Quadro bridges physically fit but would not work on GeForce RTX 2080s. The GeForce bridge did work on a pair of Quadro GP100 cards, with some caveats. Due to its larger size, only one GeForce bridge could be installed on the pair of GP100s - meaning only half the potential bandwidth was available between them.

Dual NVIDIA Quadro GP100 Cards with Dual Quadro NVLink Bridges Installed

Dual NVIDIA Quadro GP100 Cards with Single GeForce RTX NVLink Bridge Installed

Dual NVIDIA GeForce RTX 2080 Cards with a Quadro NVLink Bridge Installed - Which Does Not Function

Dual NVIDIA GeForce RTX 2080 Cards with a GeForce RTX NVLink Bridge Installed

 

How Does NVLink on the GeForce RTX 2080 Compare to NVLink on the Quadro GP100?

After testing many different combinations of cards and NVLink bridges, we were unable to find any way to turn on TCC mode for the GeForce RTX cards. That means they cannot handle the "peer-to-peer" communication which is needed for full NVLink functionality. Even if they could, bandwidth would be much more limited due to fewer links.

Chart of NVIDIA Quadro GP100 and GeForce RTX 2080 NVLink Configurations and Capabilities

The chart above shows the results we found when using different combinations of video cards and NVLink bridges, including which combinations supported SLI and whether TCC could be enabled. Click to expand and see additional notes about each configuration.

Dual Quadro GP100 Video Cards Without NVLink Bridge in Peer-to-Peer Bandwidth Test

Dual Quadro GP100 Video Cards With Dual Quadro NVLink Bridges in Peer-to-Peer Bandwidth Test

Dual Quadro GP100 Video Cards With Single GeForce RTX NVLink Bridge in Peer-to-Peer Bandwidth Test

Dual GeForce RTX 2080 Video Cards With NVLink Bridge Failing Peer-to-Peer Bandwidth Test

These screenshots from the Windows command line show peer-to-peer bandwidth across cards with different types of NVLink bridges installed. The first three are pairs of GP100s with no bridge, dual Quadro bridges, and then the single GeForce RTX bridge - while the last screenshot shows that the RTX 2080 cards did not support peer-to-peer communication at all (regardless of what bridge was installed).
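Measurements like these can be reproduced with the simpleP2P and p2pBandwidthLatencyTest utilities from NVIDIA's CUDA samples. A rough outline, assuming you have built the CUDA 10.0 samples on Windows (the path below is only an example - adjust it to wherever your build landed):

REM Change to the folder containing the built sample binaries (example path)
cd "C:\ProgramData\NVIDIA Corporation\CUDA Samples\v10.0\bin\win64\Release"

REM Quick pass/fail check for peer-to-peer access between two GPUs
simpleP2P.exe

REM Unidirectional and bidirectional P2P bandwidth and latency matrices
p2pBandwidthLatencyTest.exe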

Will GeForce RTX Cards Support Memory Pooling in Windows?

At this point, all the data we have points to "No". Between the limited bandwidth of a single link (compared to four links on the Quadro GP100) and the fact that TCC mode cannot be enabled - meaning peer-to-peer communication is not functional - it looks like GeForce RTX cards will not support memory pooling in Windows as many people hoped.

GeForce RTX 2080 Video Cards Do Not Support TCC Mode

TCC mode cannot be enabled on the GeForce RTX 2080 video cards in Windows.

Does NVLink Work on GeForce RTX Cards in Linux?

My colleague Dr. Don Kinghorn conducted similar tests in Ubuntu 18.04, and he found that peer-to-peer communication over NVLink did work on RTX 2080 cards in that operating system. This functionality in Linux does not depend on a different driver mode like TCC, so with that hurdle removed the hardware link itself seems to work properly.
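If you want to run a quick check of your own on Linux, something along these lines should work with a recent driver and the CUDA samples built - the sample path is just an example, and Don's full numbers are in the comments below.

# Show per-link NVLink capabilities as the driver reports them
nvidia-smi nvlink -c

# Verify peer-to-peer access and measure P2P bandwidth with the CUDA samples
cd ~/projects/samples-10.0/bin/x86_64/linux/release    # example path
./simpleP2P
./p2pBandwidthLatencyTest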

So What is NVLink on GeForce RTX Cards Good For?

While they do not appear to support the full NVLink feature set in Windows, even the single link that the RTX 2080 and 2080 Ti have is far faster than the older SLI interconnect. That seems to be the focus for these mainstream, gaming-oriented cards: implementing SLI over a faster NVLink connection. That goal is accomplished, as shown in benchmarks elsewhere.

Will GeForce RTX Cards Gain More NVLink Functionality in the Future?

Future driver updates from NVIDIA could change the situation dramatically, and might unlock capabilities that are currently unavailable. Additionally, the 2.5 Geeks Webcast interviewed an NVIDIA engineer who indicated that NVLink capabilities on these cards will be exposed via DirectX APIs - a different approach than the TCC mode that Quadro cards have used, which is what we tested here.

Tags: NVIDIA, GeForce, RTX, 2080, 2080 Ti, NVLink, SLI, Bridge, Quadro, GP100, GPU, Memory, Pooling
Padi

Memory pooling is possible for GeForce RTX according to Nvidia's Director of Technical Marketing, Tom Petersen, during the HotHardware 2.5 Geeks podcast:

"Petersen explained that this would not be the case for GeForce RTX cards. The NVLink interface would allow such a use case, but developers would need to build their software around that function. 'While it's true this is a memory to memory link; I don't think of it as magically doubling the frame buffer. It's more nuanced than that today,' said Petersen. 'It's going to take time for people to understand how people think of mGPU setup and maybe they will look at new techniques. NVLink is laying a foundation for future mGPU setup.'"

edit: link fixed
https://www.tomshardware.co...

Posted on 2018-10-06 08:17:01

The link in your comment seems to have been cut off, but I found the podcast episode you are referring to. Do you happen to know what time stamp this particular quote is from? I'd like to go through and listen to the context around it, but I was hoping to avoid listening to the whole hour-long podcast :-)

Posted on 2018-10-06 14:59:27
ryan o'connor

https://youtu.be/YNnDRtZ_OD...

Posted on 2018-10-07 04:01:34

Yeah, talk of NVLink starts just before the 38:00 mark and goes until about 46:30. I ended up watching all of it earlier today, but thank you for the direct link :)

I am going to listen to just that ~8 minute portion again tomorrow, and then write some thoughts.

Posted on 2018-10-07 07:12:46
ryan o'connor

No problem! Interested to hear what you think

Posted on 2018-10-07 18:30:36

Okay, here is the section that I think bears most closely on what our article above covers. It goes from about 41:56 to 44:15 in the video above and addresses two questions:

Interviewer: "NVIDIA collective communications library, the NCCL library, for developing atop NVLink, will GeForce users get access to that for playing with communications and buffers?"

NVIDIA Engineer: "I expect that the answer is 'yes' to that. So NVLink is a software visible capability, and its gonna be exposed primarily through the DX [DirectX] APIs. I'm not sure exactly... NCCL, I'm not super familiar with that but the DX APIs will expose NVLink."

Interviewer: "I had a question, generally speaking, in terms of when you were talking about "hey, what's in your frame buffer?" - in the way I understand the way NVLink works in machine learning and supercomputers, you know, high performance computing - you now have, let's say in the case of two 8GB frame buffer cards, you now have a contiguous 16GB frame buffer. Is that too simplified, simplifying it too much?"

NVIDIA Engineer: "I think that sets the wrong expectation, right? When people say that they're trying to say I can now game with 16GB textures. And its really that style of memory scaling will require app work, right? Its not just gonna magically share that memory. Now its true that you could set it up to do that, right? You could set it up a memory map so that, you know, effectively it looked like a giant frame buffer - but it would be a terrible performance thing. Because the game would really need to know that there is latency to access that second chunk of memory, and its not at all the same. So think of it as it is true that this is a memory to memory kind of link, but I don't just think of it as magically doubling the frame buffer. Its much more nuanced than that today, and its going to really take time for people to understand "hey, NVLink is changing the way I should think about my multi-GPU setup and, effectively, maybe I should start looking at new techniques", right? And that's why we did NVLink. NVLink is not really to make SLI a little bit better, its to lay a foundation for the future of multi-GPU."

So it sounds to me like what is going on, for these GeForce cards, is that they are going to expose NVLink capabilities in a different way than Quadro cards have. That makes sense, in a way, since GeForce cards are aimed at a different audience (mainstream, largely gamers) and need to be accessible to game developers in ways that they are already somewhat familiar with. However, if NVIDIA only allows access to NVLink on GeForce cards through DirectX APIs, then that may interfere with using it in applications that are more focused on GPU computation.

I think I will add one more section onto the article above, talking about how the fact that the traditional way to test NVLink GPU communication doesn't work on the GeForce cards does not mean they will never be able to work together in a similar way. We are, of course, very early in the release of this RTX / Turing GPU generation - and both other APIs / approaches to the issue and future driver updates could change the situation :)

Posted on 2018-10-08 18:39:50
Padi

Amazing summary. Thank you for taking the effort to transcribe it. This was the way I understood it when OTOY talked about their implementation. It does not automagically double the VRAM, and there will be a speed hit every time assets need to be exchanged over NVLink. The hope is that this penalty is much smaller than going out-of-core to system memory to fetch assets which don't fit in the 11 GB of VRAM but might fit in a 22 GB pool of VRAM.

The good news is that GPU render engines are already actively working on using the new API as described in the post I linked on an older article here:

https://www.reddit.com/r/Re...

The odds are good that we will benefit from GeForce NVLink as well, and that the 2080 Ti will have better bandwidth than the 2080 cards.

Posted on 2018-10-08 19:01:55

Between the potential of NVLink and RT cores, I think there will be a lot of growth room for GPU rendering on this generation of cards. I am excited to see where it goes, and to test Octane, Redshift, and V-Ray as they release updates that utilize Turing's capabilities. It may also be interesting to replicate the testing above once we have a pair of RTX 2080 Ti cards (we have only one at the moment) to see if they report a different number of links than the vanilla 2080 cards.

Posted on 2018-10-08 19:21:38
Padi

Sorry for the cut off link. You will find the context in the article. I did listen to the interview a month ago but I don‘t remember specific timestamps:

https://www.tomshardware.co...

Posted on 2018-10-07 05:53:00

Thank you for sharing that! I am going to re-listen to the applicable part of the interview tomorrow, and write some of my thoughts on it here in the comments.

Posted on 2018-10-07 07:13:58

I just posted a reply above, to Ryan O'Connor, addressing the video interview you brought up.

Posted on 2018-10-08 18:40:19
Michael

Any chance you can test this on linux, where TCC mode is not an issue?

Posted on 2018-10-07 03:41:59

That may be a little bit outside my area of expertise, but it would certainly be interesting to see if there is any different behavior on Linux.

Posted on 2018-10-07 07:54:54
Donald Kinghorn

Hi Michael, William asked if I could comment ... I've just done a bunch of NVLINK testing in Linux (Ubuntu 18.04, CUDA 10.0, and driver 410). It looks like full NVLINK, but with a bit lower performance than you would see on V100 server hardware. I'll have a full post up at https://www.pugetsystems.co... in a couple of days. I'll be looking at TensorFlow performance along with general performance tests like the following... this is from 2 x RTX 2080 Founders Edition cards:

kinghorn@i9:~/projects/samples-10.0/bin/x86_64/linux/release$ nvidia-smi nvlink -c
GPU 0: GeForce RTX 2080 (UUID: GPU-2cac9708-1ed8-0312-ada8-ce3fb52a556c)
Link 0, P2P is supported: true
Link 0, Access to system memory supported: true
Link 0, P2P atomics supported: true
Link 0, System memory atomics supported: true
Link 0, SLI is supported: true
Link 0, Link is supported: false

cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 22.53GB/s

Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 389.09 5.82
1 5.82 389.35
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1
0 386.63 24.23
1 24.23 389.76
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 386.41 11.59
1 11.57 391.01
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1
0 382.58 48.37
1 47.95 390.62

Posted on 2018-10-11 15:26:25

Thank you for doing that testing, Don! It is looking like the issue with NVLink on these GeForce RTX cards is purely because NVIDIA is not allowing TCC mode in the current Windows drivers. I will update the article text (and maybe the title too) to better reflect that.

Posted on 2018-10-11 16:58:52
Michael

Great, that's terrific news. The 2080 Ti, as a TU102 chip, should support twice the bandwidth of the 2080. I'm curious whether this results in training speedups with memory pooling. Will look forward to your writeup.

Posted on 2018-10-12 18:54:46

I couldn't test P2P bandwidth on Windows, of course, but I was able to see that the 2080 Ti cards have two links available - compared to just one on the vanilla 2080 (and none on the upcoming 2070s, as I understand it). So assuming you are using an OS and software setup that works properly with NVLink, a pair of 2080 Tis should indeed have double the P2P bandwidth of the 2080s :)
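For anyone who wants to check their own cards, nvidia-smi can report the link count directly; something like the following, run from a command prompt, should show one link per 2080 and two per 2080 Ti.

REM Show the status of each NVLink link on every GPU in the system
nvidia-smi nvlink -s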

Posted on 2018-10-12 18:57:26
Lee aste

Oh...
so can't I use a GP100 or GV100 NVLink bridge for two 2080 Ti cards?
I use an mATX motherboard, so I need to do SLI with a 2-slot NVLink bridge... but there is no 2-slot NVLink bridge except the Quadro NVLink bridges.
Is there a way?

Posted on 2018-10-08 10:26:42

The Quadro bridge did not work on GeForce RTX cards for us, so I would not expect it to work for you either. Moreover, I would be concerned about using two of these dual-fan cards right next to each other. The heatsink configuration on the NVIDIA Founders Edition cards in this generation is not built for having cards adjacent to each other without at least one slot in-between for airflow. I think that may be why they don't offer a 2-slot NVLink SLI bridge.

Posted on 2018-10-08 18:11:30

I did some testing under Win10 1809 with 416.16 drivers, and while monitoring VRAM usage for a single application I hit 11.7GB - 700MB over (keep in mind this is a single app, not combined OS + app usage). This was an "SLI"-aware app that does indeed use both GPUs, with a supporting NVIDIA profile under DX11. If the 700MB was swapping to main system RAM, then I would have expected to see a sharp decline in FPS - but no such decline happened at the point the app exceeded 11GB of usage; FPS was very consistent. So in my real-world test case, and not the "discussion" case, it seems that memory pooling is happening. In my case the application was a flight simulator (Lockheed Martin's Prepar3D V4.x). I can probably run more tests by increasing the shadow map size to sharpen shadow quality, as this will use more VRAM and should push me further past 11GB.
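For anyone who wants to watch VRAM usage the same way from the command line while an application runs, nvidia-smi can log it once per second - a simple sketch, not tied to any particular app (Ctrl+C stops the logging):

REM Log per-GPU memory use once per second while the application runs
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 1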

Posted on 2018-10-14 16:59:55

Thank you for sharing your experience! If all that is needed is enabling SLI, in order to have memory pooling, that would be nice... but it is definitely a change from how NVLink and memory pooling worked in the past (on Quadro cards). I hope NVIDIA puts out some more official information about this, and it would be nice if they also put more details in their control panel - especially showing memory pooling and usage.

Posted on 2018-10-15 19:22:49
-V-

VRay apparently got it to work.

Posted on 2018-10-15 01:26:31

Chaos Group (V-Ray) and OTOY (OctaneRender) have both talked about it, but I haven't seen anything published with detailed information directly showing NVLink at work on GeForce RTX cards in either of those rendering engines. I would love to know more about what they have actually tested and how they got memory pooling working, if indeed they have. It would also be great if they would update their benchmarks to utilize it - both V-Ray Benchmark and OctaneBench are lagging behind their latest releases :(

Posted on 2018-10-15 19:11:22
nejck

You guys should be aware that you probably need to enable "SLI" in order for the NVLink to work on the RTX series. Memory pooling also works if implemented in the application. I'd recommend taking a look at this post:
https://www.facebook.com/gr...

Posted on 2018-10-15 05:58:16

That is really weird - they used bridges from Quadro cards and it worked for them, when that definitely did not work for us (not even SLI mode was available when trying to use those bridges).

Hmm... GV100 bridges? We used ones from the GP100. Maybe the bridges themselves have been updated over the Quadro GP100 -> GV100 generation change? The coloring is different - the bridges in that Facebook post look golden in color, rather than silver like the Quadro GP100 bridges we have.

It is good to see that they are using blower-style RTX cards, though - looks like the same Asus series we tested recently, except they have the 2080 Ti variants (lucky!).

I still am unsure how software like this is functioning with P2P over NVLink without being able to put the cards into TCC mode... but maybe memory pooling in this generation somehow doesn't need that? I'll play with this some more when I have a chance.

Posted on 2018-10-15 19:20:26