Read this article at https://www.pugetsystems.com/guides/1239

NVIDIA Dual-Fan GeForce RTX Coolers Ruining Multi-GPU Performance

Written on September 28, 2018 by William George

Introduction

The new GeForce RTX series cards perform well in GPU-based rendering as individual cards, and have great potential for the future thanks to their new RT cores. However, when we stacked them together to measure multi-GPU scaling, we ran into some serious problems.

After wrapping up our testing of the new RTX 2080 and 2080 Ti as single cards, we wanted to see how well they scale in popular rendering engines like OctaneRender, Redshift, and V-Ray. We have looked at that in the past and found that multi-GPU scaling is quite good in these applications, and many of our customers use 2, 3, or even 4 GPUs to get the fastest possible render times. We only have one 2080 Ti at the moment, so we had to go with a set of four of the Founders Edition RTX 2080 cards for this test.

NVIDIA GeForce RTX 2080 Founders Edition

UPDATE: Single-fan versions of the GeForce RTX cards are now available, which don't have the problems described in this article.

Test Setup

Normally we run each test three times, make sure the results are pretty close, and then take the best one (highest score or lowest time, depending on the situation) as the final result. This helps ensure that something going on in the background doesn't throw things off and that each hardware configuration we test gets a fair shake. Over the course of our normal three runs of the Octane Benchmark and Redshift Demo with quad RTX 2080s, though, we found very odd behavior: each of the three runs was substantially slower than the one before it. We ran it again, this time with 5 runs for OctaneRender and 8 runs for Redshift - and not only did we see the same pattern, but it continued with slower performance the longer the tests were running. This is very different from what we have seen in past multi-GPU scaling tests, so we knew something was amiss.
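The best-of-three protocol described above can be sketched in a few lines of Python. Here, `run_benchmark` is a hypothetical stand-in for launching one pass of OctaneBench or the Redshift Demo and returning its score:

```python
# Best-of-N benchmark protocol: run several times, sanity-check the spread,
# and keep the best result (here, the highest score).

def best_of_n(run_benchmark, n=3, max_spread=0.05):
    """run_benchmark() is a placeholder for one benchmark pass;
    it returns a score where higher is better."""
    scores = [run_benchmark() for _ in range(n)]
    spread = (max(scores) - min(scores)) / max(scores)
    if spread > max_spread:
        # Runs that disagree by more than ~5% suggest throttling or
        # background interference - exactly what this article uncovered.
        print(f"warning: results vary by {spread:.0%} across runs: {scores}")
    return max(scores)
```

With healthy hardware the three scores land within a few percent of each other; the quad RTX 2080 runs described here would instead trip the spread warning, since each run came in well below the one before it.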

If you would like full details on the hardware configuration we ran these tests on, just .

Benchmark Results & Analysis

To determine what was going on, we ran GPUz 2.11.0 with one instance per video card and logged the results. What we found was that over time, as the tests continued, the temperatures on three of the four cards were getting very high, and the clock speeds were throttling down dramatically. Modern video cards (and CPUs too) are designed to do this as a safety precaution, to avoid damage from overheating or software errors/crashes. Fan speeds start out slow, to stay quiet, and then ramp up as the cards get warmer - and eventually, if the fans cannot keep temperatures in check even as they approach full speed, the GPU clock speed throttles down to reduce heat output. On these cards, though, we weren't seeing just a little downclocking: toward the end of our test runs, some cards were running at less than half the speed NVIDIA lists in their specs! That is a huge difference in clock speed, and it translated into a massive drop in overall rendering performance. Here is what we found, both in terms of raw results and performance drop over time, with average GPU clock speeds included so that you can see the correlation. Let's start with OctaneRender 3.08:

OctaneBench 3.08 Showing Scores and Performance Dropping Over Time on Quad NVIDIA GeForce RTX 2080 GPUs

OctaneBench 3.08 Showing Performance Degradation Over Time on Quad NVIDIA GeForce RTX 2080 GPUs

That second chart shows it best: over the course of five benchmark runs, Octane performance drops 30% from what we would expect based on the performance of a single RTX 2080 and typical GPU scaling in OctaneRender. Even during the very first run, there was a big enough clock speed drop that we didn't get even one benchmark score in the range we expected. We found a single RTX 2080 to score ~179.5, and in past tests OctaneBench has demonstrated near-perfect scaling - hence our expectation of a score around 718 from four of them. The "expected" clock speed is based on NVIDIA's claimed 1800MHz boost clock for the Founders Edition GeForce RTX 2080 cards we used. GPUz reported the cards starting off that high, and in fact a bit higher: of the four cards in our testbed, two started these runs at 1890MHz, one at 1905MHz, and another at 1950MHz. They didn't stay that fast for long, though.
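The expected score is simple arithmetic: the single-card result times the card count, given OctaneBench's near-perfect scaling in past tests. A quick sketch of that math, using the numbers above:

```python
# Expected quad-GPU OctaneBench score, assuming the near-perfect scaling
# seen in past tests, and the percent shortfall of a measured run.
single_score = 179.5           # one RTX 2080, measured above
expected_quad = single_score * 4
print(expected_quad)           # 718.0

def shortfall(measured, expected):
    """Percent by which a measured score falls below the expectation."""
    return (expected - measured) / expected * 100

# A run that lands 30% low, as the fifth OctaneBench pass did:
print(round(shortfall(0.70 * expected_quad, expected_quad)))  # 30
```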

After wrapping up five runs in OctaneBench, we let the cards cool down a bit before starting Redshift. We had already updated to the latest release of Redshift (2.6.22) in order to get the new Turing-based GeForce RTX cards to work, so once the system had cooled down we ran the Redshift Demo on a loop for about 30 minutes - resulting in eight runs, with the same GPUz clock speed tracking:

Redshift 2.6.22 Demo Showing Scores and Performance Dropping Over Time on Quad NVIDIA GeForce RTX 2080 GPUs

Redshift 2.6.22 Demo Showing Performance Degradation Over Time on Quad NVIDIA GeForce RTX 2080 GPUs

Here, again, we see a big drop in rendering speed over time, with results evening out after six runs. At that point, performance stabilizes at about 26% lower than expected. Redshift hasn't scaled as well as Octane in our past tests, so the expectation here was based on the 374% speedup we saw going from one to four GTX 1080 Ti cards in one of our recent articles. The expected GPU clock speed was again based on NVIDIA's listed boost clock spec, though as mentioned above, GPUz reported clock speeds even higher than that. In fact, the first run averaged even higher clock speeds, and performance was almost exactly what we had expected - but the second run took a big hit, and render times kept getting longer as the tests went on and the cards got hotter.

I should also point out here how we got the average GPU frequency. GPUz was used to track each card individually, with measurements taken of the clock speed every second. We then took those clock speeds over the course of each test run, averaged them per card, and then averaged the four cards' individual results together. This doesn't tell the whole story, as we found that the bottom card - the one with its fans exposed to the air, instead of being next to another card - actually managed to keep its temps in check and its clock speed high during the entirety of both test runs. The next two cards, the ones in the middle, fared the poorest in terms of temperatures and throttling... and then the top card, which had its backside (with a thermal cooling plate) exposed to open air did better than the middle two - but still throttled pretty heavily after a while.
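The two-stage averaging described above - per-second samples averaged per card, then the per-card averages averaged together - can be sketched as follows. The clock samples here are made-up illustrative numbers, not our measured data, chosen to mirror the pattern we saw (bottom card holding its clocks, middle cards throttling hard):

```python
# Two-stage average: per-second clock samples are averaged per card first,
# then the per-card averages are averaged across all cards.
def average_clock(samples_per_card):
    """samples_per_card: one list of MHz readings per card,
    one reading per second over a test run."""
    per_card = [sum(s) / len(s) for s in samples_per_card]
    return sum(per_card) / len(per_card)

# Illustrative (made-up) samples for four stacked cards:
cards = [
    [1890, 1880, 1875],   # bottom card, fans in open air: holds its clocks
    [1890, 1400, 900],    # middle card: throttling hard
    [1905, 1350, 870],    # middle card: throttling hard
    [1950, 1600, 1200],   # top card, backplate exposed: throttles later
]
print(round(average_clock(cards)))  # 1559
```

As the example shows, one healthy card can't rescue the average when the middle cards are running at half speed.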

We also tested V-Ray, which has a GPU rendering component in its public benchmark, but with this many GPUs it only takes 20 seconds to complete. Running it repeatedly involves 10 to 20 seconds of downtime between tests, though, so the cards never get the chance to heat up as much as in the longer OctaneBench and Redshift Demo runs (which have only about 2 seconds between repeats).
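For anyone who wants to reproduce this kind of per-second clock and temperature logging without GPUz, nvidia-smi's query interface exposes the same data. A minimal sketch in Python - the query fields and flags are standard nvidia-smi options, while `parse_gpu_rows` and `poll_gpus` are our own illustrative helpers:

```python
import subprocess

def parse_gpu_rows(text):
    """Parse 'index, clock_mhz, temp_c' CSV lines as emitted by
    nvidia-smi with --format=csv,noheader,nounits."""
    rows = []
    for line in text.strip().splitlines():
        idx, clock, temp = [field.strip() for field in line.split(",")]
        rows.append((int(idx), int(clock), int(temp)))
    return rows

def poll_gpus():
    """One snapshot of (index, core clock MHz, temperature C) per GPU."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,clocks.gr,temperature.gpu",
         "--format=csv,noheader,nounits"],
        text=True)
    return parse_gpu_rows(out)

# Example: call poll_gpus() once a second during a render and log the
# rows - a throttling card shows climbing temps and falling clocks.
```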

What is Causing this Throttling?

So what is going on here? Why are these cards having such problems with heat and downclocking, when past NVIDIA GeForce cards we have tested - including former Founders Edition models - have always done so well with multi-GPU rendering?

The answer lies in the heatsink and fan layout on these new cards. Past Founders Edition and "reference" style GeForce cards have had a single fan, near the front of the card, blowing back across the heatsink and exhausting the bulk of the hot air out the rear of the case. These new models, however, have dual fans - and the fins on the heatsink are arranged vertically, rather than horizontally. That means that they do not push hot air out the back of the system, but instead vent it up into the computer. Even in our open-air testbed system, this ends up being a far poorer cooling setup when cards are installed back-to-back, which is required when putting three or four of them on a single motherboard. One card by itself does okay this way, though it will heat up the inside of a chassis more than the blower-style cooling system, and even two cards - if separated by a slot or two - seem to do fine... but when they are put next to each other, this dual fan cooling layout proves to be a huge problem for GPU intensive workloads like rendering.

Dual Fan GPU Coolers are Bad for Multi-GPU Workstations

Thankfully, it looks like some OEMs are going to produce single-fan, blower-style GeForce RTX cards. When we can get our hands on a set of four of them we will try this again, and hopefully be able to publish proper multi-GPU scaling results for the 2080 and 2080 Ti.

Why Are Dual Fan NVIDIA RTX GPUs Throttling?

The new dual-fan cooling setup on NVIDIA's GeForce RTX 2080 and 2080 Ti Founders Edition cards vents heat upward, into the computer, rather than out the back of the system like previous single-fan designs. This means that putting multiple cards in a computer, especially back-to-back, leads to overheating and performance throttling under heavy GPU load.

Should You Buy GeForce RTX Video Cards for OctaneRender or Redshift?

A single RTX 2080 or 2080 Ti is a fine choice for GPU-based rendering now, and may be even better in the future if engines take advantage of the new RT cores in these GPUs. However, the dual-fan Founders Edition cards are NOT good for multi-GPU configurations. Wait for single-fan, blower-style cards instead.

Recommended Systems for OctaneRender

Recommended Systems for Redshift

Tags: Multi, GPU, Scaling, Rendering, Octane, Render, Redshift, Benchmark, NVIDIA, GeForce, RTX, 2080, Performance, Heat, Cooling, Fans, Video, Card
Håkon Broder Lund

With these cards having such large dies, it's not unexpected that they produce more heat. From what I have been reading, the single-fan blower cards are clocked lower than the reference-cooler cards to limit heat. They would probably run better and throttle less, even though they are clocked lower.

Also, awesome work guys. You are killing it with tons of insightful articles at a rapid pace. Hats off!

Posted on 2018-09-29 00:46:20

That is one of the things we'll be able to see, once we get the blower style cards in, since we've already got the data on these Founders Edition cards. There may also be differently clocked blower cards from different manufacturers. I think the brands we've seen advertise such cards - so far - are EVGA, Asus, and PNY.

Posted on 2018-10-01 15:57:10
Håkon Broder Lund

That would be a very interesting article to see. Water cooling would be best performance-wise, but it is just too risky for a production workstation in terms of leaks and maintenance. I don't expect you guys to test that, as it is outside your business scope.

Posted on 2018-10-01 22:38:20
Jakub Badełek

Great analysis Matt (I mean this one and all the other RTX tests)! I think the best solution would be a custom water loop for cooling a multi-GPU setup like this, but then again that is kind of a pain in the... back, and rather difficult to sell after a few years. Anyway, I hope you'll throw in Premiere Pro tests too ;)

Posted on 2018-10-01 12:49:44

This was actually William's testing, not mine :) He does most of our 3D modeling/Rendering/Photogrammetry work, which is where this issue mostly comes up.

Water cooling would probably fix the issue, but that is really complex and expensive. I think just using blower-style coolers should be all that is necessary to resolve the throttling, but we'll have to see for sure once those cards are actually available.

As for Premiere Pro, that shouldn't be affected by this issue at all. Premiere can technically use more than one card, but unless you are using very low-end cards you won't actually see any performance gain doing so.

Posted on 2018-10-01 16:14:40
obrad

I can see some 3-fan versions, like Zotac's. Has anyone tested these in a dual configuration?

Posted on 2018-12-05 13:08:34

We haven't tested any triple-fan video cards, but I suspect the main issue will remain with those, just like with the two-fan cards. If they exhaust most of their heat back into the computer, rather than out the back of the card (and thus outside of the system), then stacking them together means the hot air they exhaust will feed right back into the fans and lead to very poor cooling. High-end video cards generate a lot of heat, and it is important to get that heat out of the system rather than just recirculating it.

Posted on 2018-12-05 17:28:30